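Abstract: Named entity recognition (NER) is an important foundational task in natural language processing, and Chinese named entity recognition (CNER) is especially difficult because of problems such as word segmentation ambiguity and polysemy. To address these problems, a Chinese named entity recognition model (CWA-CNER) is proposed that combines a multi-head attention mechanism (Multi-Attention) with character-word fusion. The character vector of each character in the Chinese text is concatenated with the word vectors of the words it may form in the sentence, and the result is fed into a bidirectional long short-term memory neural network (BiLSTM) to extract contextual semantic information. The multi-head attention mechanism is then used to capture how closely the elements of the sentence are related, and finally a conditional random field (CRF) performs the entity labeling. The model was evaluated on three corpora, the Boson dataset and the 1998 and 2014 People's Daily corpora, and its F1 score exceeded 90% on each, which demonstrates the effectiveness of the model.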
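The character-word fusion step depends on knowing, for each character, which words it may form in the sentence. The following is a minimal sketch of that lookup only, assuming a plain lexicon match. The function name `words_covering_each_char`, the toy lexicon, and the `max_len` limit are illustrative assumptions and are not taken from the paper; the embedding, BiLSTM, attention, and CRF layers are not shown.

```python
def words_covering_each_char(sentence, lexicon, max_len=4):
    """For every character position, list the lexicon words (length >= 2)
    that occur in the sentence and cover that position."""
    covering = [[] for _ in sentence]
    for start in range(len(sentence)):
        # Try every substring of 2..max_len characters starting here.
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 2, last_end + 1):
            word = sentence[start:end]
            if word in lexicon:
                for pos in range(start, end):
                    covering[pos].append(word)
    return covering


# Hypothetical toy lexicon; a real system would load a large dictionary.
lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
sentence = "南京市长江大桥"

for char, words in zip(sentence, words_covering_each_char(sentence, lexicon)):
    print(char, words)
```

In this example the character 市 is covered by both 南京市 and 市长, which is the kind of segmentation ambiguity the abstract mentions. In a model of this type, the vectors of the matched words would be combined with the character vector before the sequence encoder.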
Funding: Supported by the National Natural Science Foundation of China under Grants No. 61173100 and No. 61173101, and by the Fundamental Research Funds for the Central Universities under Grant No. DUT10RW202.
Abstract: A new joint decoding strategy that combines character-based and word-based conditional random field models is proposed. In this segmentation framework, fragments are used to generate candidate out-of-vocabulary words (OOVs). After the initial segmentation, the fragments are divided into two classes: "combination" (combining several fragments into one unknown word) and "segregation" (segregating them into separate words), so that more OOVs can be recalled. Moreover, to suit the characteristics of cross-domain segmentation, context information is used to guide Chinese word segmentation (CWS). Experiments on test data from the SIGHAN Bakeoff 2007 and Bakeoff 2010 show that the method is effective: OOV recall improves and overall segmentation performance is good.
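OOV recall, the measure reported above, is commonly computed as the share of gold-standard words absent from the training vocabulary that the segmenter reproduces with exactly the same boundaries. The sketch below follows that common definition; it is not the official Bakeoff scoring script, and the function names and the toy sentence are illustrative assumptions.

```python
def word_spans(words):
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def oov_recall(gold, predicted, vocabulary):
    """Fraction of gold OOV words whose exact span also appears in the prediction."""
    predicted_spans = word_spans(predicted)
    total = hits = 0
    start = 0
    for word in gold:
        end = start + len(word)
        if word not in vocabulary:  # OOV with respect to the training vocabulary
            total += 1
            if (start, end) in predicted_spans:
                hits += 1
        start = end
    return hits / total if total else 0.0


# Hypothetical example: two OOV words, one segmented correctly.
vocabulary = {"我", "喜欢", "和"}
gold = ["我", "喜欢", "奥特曼", "和", "皮卡丘"]
predicted = ["我", "喜欢", "奥特曼", "和", "皮卡", "丘"]
print(oov_recall(gold, predicted, vocabulary))  # 0.5
```

Here the prediction leaves 皮卡丘 split into two fragments, which is the kind of case a "combination" decision would aim to repair.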
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61173100 and by the Fundamental Research Funds for the Central Universities under Grant No. GDUT10RW202.
Abstract: A hybrid approach to English part-of-speech (PoS) tagging is presented, with English-Chinese machine translation in the business domain as its target application. It demonstrates how an existing tagger can be adapted to learn from a small amount of data and to handle unknown words for the purpose of machine translation. A small annotated English corpus of 998 k in the business domain is built semi-automatically based on a new tagset; the maximum entropy model is adopted, and a rule-based approach is used in post-processing. The tagger is further applied to noun phrase (NP) chunking. Experiments show that the tagger achieves an accuracy of 98.14%. In the NP chunking application, it yields a 2.21% increase in F-score compared with the results obtained using the Stanford tagger.
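Rule-based post-processing of this kind can be pictured as a pass over the statistical tagger's output that overrides tags matching hand-written patterns. The sketch below is only an illustration of that idea. The two rules, the Penn Treebank style tags (`CD`, `DT`, `NN`, `VB`), and the function name `post_process` are assumptions; the abstract does not list the paper's actual rules or its new tagset.

```python
import re

# Tokens such as "2,500", "98.14%" or "7" are treated as numbers (assumed rule).
NUMBER = re.compile(r"^\d+(?:[.,]\d+)*%?$")


def post_process(tagged):
    """Apply hand-written correction rules to (token, tag) pairs."""
    fixed = []
    for i, (token, tag) in enumerate(tagged):
        if NUMBER.match(token):
            # Rule 1: numeric tokens are always cardinal numbers.
            tag = "CD"
        elif tag == "VB" and i > 0 and fixed[i - 1][1] == "DT":
            # Rule 2: a base-form verb directly after a determiner is
            # far more likely to be a noun ("the contract").
            tag = "NN"
        fixed.append((token, tag))
    return fixed


# Hypothetical tagger output containing two errors.
tagged = [("the", "DT"), ("contract", "VB"), ("covers", "VBZ"),
          ("2,500", "NN"), ("units", "NNS")]
print(post_process(tagged))
```

Rules like these are cheap to add for domain-specific tokens such as amounts and percentages, which is one plausible reason to pair them with a model trained on a small corpus.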