Funding: Project supported by the National Natural Science Foundation of China (No. 14BXW028)
Abstract: Named entity recognition (NER) is a core component of many natural language processing applications. Most NER systems rely on supervised machine learning methods, which depend on time-consuming and expensive annotation across languages and domains. This paper presents a method for automatically building silver-standard NER corpora from Chinese Wikipedia. We refine novel, language-dependent features by exploiting the text and structure of Chinese Wikipedia. To reduce tagging errors caused by entity classification, we design four types of heuristic rules based on the characteristics of Chinese Wikipedia and train a supervised NE classifier; the two are combined to improve precision and coverage. We then identify the types of implicit mentions using the boundary information of outgoing links. Selecting sentences related to the domains of the test data lets us train better NER models. In the experiments, large-scale NER corpora containing 2.3 million sentences are built from Chinese Wikipedia. The results show the effectiveness of the automatically annotated corpora, and the trained NER models perform best when our silver-standard corpora are combined with gold-standard corpora.
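The idea of deriving silver-standard annotations from outgoing links can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `silver_tags`, the `TITLE_TYPES` lookup table, the character-level BIO scheme, and the example sentence are all assumptions made for illustration. In the paper, entity types come from heuristic rules combined with a supervised NE classifier, which the lookup table merely stands in for.

```python
import re

# Hypothetical article-title -> entity-type table. It stands in for the
# paper's heuristic rules and supervised NE classifier (an assumption).
TITLE_TYPES = {"北京市": "LOC", "清华大学": "ORG"}

# Matches wiki links of the form [[Title]] or [[Title|anchor text]].
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def silver_tags(wikitext, title_types):
    """Turn one wiki-markup sentence into (character, BIO tag) pairs."""
    pairs = []
    pos = 0
    for match in LINK.finditer(wikitext):
        # Plain text before the link lies outside any entity.
        pairs.extend((ch, "O") for ch in wikitext[pos:match.start()])
        title = match.group(1)
        anchor = match.group(2) or title  # visible text of the link
        etype = title_types.get(title)
        for i, ch in enumerate(anchor):
            if etype is None:
                # Linked article with no known type: leave untagged.
                pairs.append((ch, "O"))
            else:
                prefix = "B-" if i == 0 else "I-"
                pairs.append((ch, prefix + etype))
        pos = match.end()
    # Trailing text after the last link.
    pairs.extend((ch, "O") for ch in wikitext[pos:])
    return pairs


if __name__ == "__main__":
    for ch, tag in silver_tags("[[清华大学]]位于[[北京市|北京]]。", TITLE_TYPES):
        print(ch, tag)
```

The link boundaries give the entity span and the linked article gives the type, so no manual annotation is needed for explicitly linked mentions. Unlinked (implicit) mentions are not handled in this sketch.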
Funding: Supported by the Sichuan Science and Technology Program under Grant No. 2021YFQ0009.
Abstract: Extracting and understanding knowledge from text is increasingly crucial in the age of big data. Because Chinese vocabulary is diverse and ambiguous, one current research area in natural language processing (NLP) is how to understand text accurately and collect accurate linguistic information from it. This paper mainly studies the candidate entity generation module of an entity linking system. The module uses an entity mention expansion algorithm to improve the recall of candidate entities. To improve the efficiency of the system's linking algorithm while preserving candidate recall, we design a graph model filtering algorithm that fuses shallow semantic information to filter the list of candidate entities, and we verify and analyze the efficiency of the algorithm through experiments. Drawing on an analysis of existing techniques for candidate entity generation and entity disambiguation, we improve the traditional entity linking algorithm and present an innovative and practical entity linking model. The recall rate exceeds 82%, and the linking accuracy exceeds 73%. Efficient and accurate entity linking can help machines better understand text semantics, further promoting the development of NLP and improving users' experience of acquiring knowledge from text.
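How mention expansion can raise candidate recall is easier to see in a small sketch. Everything below is hypothetical: the `ALIASES` and `EXPANSIONS` tables, the function names, and the toy data are assumptions for illustration, not the paper's algorithm or results. The graph model filtering step is omitted.

```python
# Hypothetical alias table: surface form -> candidate entities.
ALIASES = {
    "北京大学": ["Peking University"],
    "北大": ["Hokkaido University"],
}

# Hypothetical expansion table: short mention -> fuller surface forms.
EXPANSIONS = {"北大": ["北京大学"]}


def generate_candidates(mention, aliases, expansions):
    """Collect candidates for the mention and each of its expanded forms."""
    forms = [mention] + expansions.get(mention, [])
    candidates = []
    for form in forms:
        for entity in aliases.get(form, []):
            if entity not in candidates:  # keep order, drop duplicates
                candidates.append(entity)
    return candidates


def candidate_recall(gold, aliases, expansions):
    """Fraction of (mention, gold entity) pairs whose gold entity
    appears in the generated candidate list."""
    if not gold:
        return 0.0
    hits = sum(
        1
        for mention, entity in gold
        if entity in generate_candidates(mention, aliases, expansions)
    )
    return hits / len(gold)


if __name__ == "__main__":
    gold = [("北大", "Peking University")]
    print(candidate_recall(gold, ALIASES, {}))          # without expansion
    print(candidate_recall(gold, ALIASES, EXPANSIONS))  # with expansion
```

On this toy data the gold entity is reachable only through the expanded form, so recall improves once expansion is enabled. The cost is a longer candidate list, which is the motivation for a subsequent filtering step.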