Annotating named entity recognition (NER) training corpora is a costly but necessary process for supervised NER approaches. This paper presents a general framework to generate large-scale NER training data from para...Annotating named entity recognition (NER) training corpora is a costly but necessary process for supervised NER approaches. This paper presents a general framework to generate large-scale NER training data from parallel corpora. In our method, we first employ a high performance NER system on one side of a bilingual corpus. Then, we project the named entity (NE) labels to the other side according to the word level alignments. Finally, we propose several strategies to select high-quality auto-labeled NER training data. We apply our approach to Chinese NER using an English-Chinese parallel corpus. Experimental results show that our approach can collect high-quality labeled data and can help improve Chinese NER.展开更多
It is of great significance to guarantee the efficient statistics of high-speed railway on-board equipment fault information,which also improves the efficiency of fault analysis. Considering this background, this pape...It is of great significance to guarantee the efficient statistics of high-speed railway on-board equipment fault information,which also improves the efficiency of fault analysis. Considering this background, this paper presents an empirical exploration of named entity recognition(NER) of on-board equipment fault information. Based on the historical fault records of on-board equipment, a fault information recognition model based on multi-neural network collaboration is proposed. First, considering Chinese recorded data characteristics, a method of constructing semantic features and additional features based on character granularity is proposed. Then, the two feature representations are concatenated and passed into the gated convolutional layer to extract the dependencies from multiple different subspaces and adjacent characters in parallel. Next, the local features are transmitted to the bidirectional long short-term memory(BiLSTM) to learn long-term dependency information. On top of BiLSTM, the sequential conditional random field(CRF) is used to jointly decode the optimized tag sequence of the whole sentence. The model is tested and compared with other representative baseline models. The results show that the proposed model not only considers the language characteristics of on-board fault records, but also has obvious advantages on the performance of fault information recognition.展开更多
基金This work was supported by the National Natural Science Foundation of China (Grant Nos. 61133012, 61273321) and the National 863 Leading Technology Research Project (2012AA011102). Special thanks to Wanxiang Che, Yanyan Zhao, Wei He, Fikadu Gemechu, Yuhang Guo, Zhenghua Li, Meishan Zhang and the anonymous reviewers for insightful comments and suggestions.
文摘Annotating named entity recognition (NER) training corpora is a costly but necessary process for supervised NER approaches. This paper presents a general framework to generate large-scale NER training data from parallel corpora. In our method, we first employ a high performance NER system on one side of a bilingual corpus. Then, we project the named entity (NE) labels to the other side according to the word level alignments. Finally, we propose several strategies to select high-quality auto-labeled NER training data. We apply our approach to Chinese NER using an English-Chinese parallel corpus. Experimental results show that our approach can collect high-quality labeled data and can help improve Chinese NER.
基金supported by National Natural Science Foundation of China(No.61763025)Gansu Science and Technology Program Project(No.18JR3RA104)+1 种基金Industrial Support Program for Colleges and Universities in Gansu Province(No.2020C-19)Lanzhou Science and Technology Project(No.2019-4-49)。
文摘It is of great significance to guarantee the efficient statistics of high-speed railway on-board equipment fault information,which also improves the efficiency of fault analysis. Considering this background, this paper presents an empirical exploration of named entity recognition(NER) of on-board equipment fault information. Based on the historical fault records of on-board equipment, a fault information recognition model based on multi-neural network collaboration is proposed. First, considering Chinese recorded data characteristics, a method of constructing semantic features and additional features based on character granularity is proposed. Then, the two feature representations are concatenated and passed into the gated convolutional layer to extract the dependencies from multiple different subspaces and adjacent characters in parallel. Next, the local features are transmitted to the bidirectional long short-term memory(BiLSTM) to learn long-term dependency information. On top of BiLSTM, the sequential conditional random field(CRF) is used to jointly decode the optimized tag sequence of the whole sentence. The model is tested and compared with other representative baseline models. The results show that the proposed model not only considers the language characteristics of on-board fault records, but also has obvious advantages on the performance of fault information recognition.