An N-gram Chinese language model that incorporates linguistic rules is presented. Rule information is incorporated into the statistical framework by constructing an element lattice. To facilitate the hybrid modeling, novel methods are presented, including MI-based rule evaluation, weighted rule quantification, and element-based n-gram probability approximation. A dynamic Viterbi algorithm is adopted to search for the best path in the lattice. To strengthen the model, transformation-based error-driven rule learning is adopted. Applied to Chinese Pinyin-to-character conversion, the proposed model achieves high accuracy, flexibility, and robustness simultaneously. Tests show that the correct rate reaches 94.81%, compared with 90.53% using a bi-gram Markov model alone. Many long-distance dependencies and recursive structures in language can be processed effectively.
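To illustrate the lattice search mentioned in the abstract above, here is a minimal sketch of Viterbi decoding for Pinyin-to-character conversion with a plain bi-gram model. It is not the paper's hybrid model: the rule-based lattice elements are omitted, and the candidate table `CANDIDATES`, the probabilities in `BIGRAM`, and the smoothing constant `FLOOR` are all hypothetical toy values chosen for the example.

```python
import math

# Hypothetical toy lattice: candidate characters for each pinyin syllable.
CANDIDATES = {
    "zhong": ["中", "种"],
    "guo": ["国", "果"],
}

# Hypothetical bi-gram probabilities P(current | previous).
# "<s>" marks the start of the sentence.
BIGRAM = {
    ("<s>", "中"): 0.6, ("<s>", "种"): 0.4,
    ("中", "国"): 0.9, ("中", "果"): 0.1,
    ("种", "国"): 0.2, ("种", "果"): 0.8,
}

FLOOR = 1e-6  # assumed probability for unseen bi-grams


def viterbi(syllables):
    """Return the most probable character string for a pinyin sequence."""
    # best maps the last character of a partial path to
    # (log-probability of the best path ending there, that path).
    best = {"<s>": (0.0, [])}
    for syllable in syllables:
        step = {}
        for cur in CANDIDATES[syllable]:
            # Keep only the best-scoring predecessor for each candidate.
            score, path = max(
                (
                    (prev_score + math.log(BIGRAM.get((prev, cur), FLOOR)), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            step[cur] = (score, path + [cur])
        best = step
    _, final_path = max(best.values(), key=lambda item: item[0])
    return "".join(final_path)


print(viterbi(["zhong", "guo"]))
```

With these toy numbers the path 中国 scores 0.6 × 0.9 = 0.54, which beats 种果 at 0.4 × 0.8 = 0.32, so the decoder returns 中国.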
This paper applies a Maximum Entropy (ME) model to Pinyin-To-Character (PTC) conversion instead of a Hidden Markov Model (HMM), which cannot include complicated and long-distance lexical information. Two ME models were built, based on simple and complex templates respectively, and the complex one gave the better conversion result. Furthermore, a conversion trigger pair of the form y A → y B cB was proposed to extract long-distance constraint features from the corpus, and Average Mutual Information (AMI) was then used to select the conversion trigger pair features added to the ME model. The experiment shows that the conversion error of the ME model with conversion trigger pairs is reduced by 4% on a small training corpus, compared with an HMM smoothed by absolute smoothing.
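The AMI-based feature selection described above can be sketched as follows. This is a generic average mutual information computation over a 2×2 co-occurrence table for a trigger A and a target B. The abstract does not give the paper's exact formulation or counting window, so the function name, its arguments, and the example counts are assumptions for illustration.

```python
import math


def average_mutual_information(n_ab, n_a_only, n_b_only, n_neither):
    """AMI in bits between events A and B from co-occurrence counts.

    n_ab: contexts containing both A and B
    n_a_only: contexts containing A but not B
    n_b_only: contexts containing B but not A
    n_neither: contexts containing neither
    Assumes the total count is greater than zero.
    """
    total = n_ab + n_a_only + n_b_only + n_neither
    # Joint distribution over (A present?, B present?).
    joint = {
        (1, 1): n_ab / total,
        (1, 0): n_a_only / total,
        (0, 1): n_b_only / total,
        (0, 0): n_neither / total,
    }
    # Marginal distributions of A and B.
    p_a = {1: joint[1, 1] + joint[1, 0], 0: joint[0, 1] + joint[0, 0]}
    p_b = {1: joint[1, 1] + joint[0, 1], 0: joint[1, 0] + joint[0, 0]}
    ami = 0.0
    for (a, b), p in joint.items():
        if p > 0:  # empty cells contribute nothing
            ami += p * math.log2(p / (p_a[a] * p_b[b]))
    return ami


# Hypothetical counts: a strongly associated pair and an independent pair.
print(average_mutual_information(50, 0, 0, 50))
print(average_mutual_information(25, 25, 25, 25))
```

Candidate trigger pairs would then be ranked by this score and only the top-scoring ones kept as features. The cutoff is a design choice that the abstract does not specify.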
With international exchanges and gatherings becoming more common, Chinese written in pinyin, mainly the names of people and places, has become increasingly widespread. Pinyin is a system for transliterating Chinese characters into the Roman alphabet and was officially adopted by the People's Republic of China in 1979.
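New word identification, one of the basic tasks of natural language processing, supports the construction of Chinese dictionaries and the analysis of word sentiment orientation. However, current new word identification methods do not consider homophonic neologisms, so the accuracy of identifying them is low. To solve this problem, a Chinese homophonic neologism discovery method based on Pinyin similarity is proposed, which compares the Pinyin of new and old words to improve identification accuracy. First, the text is preprocessed, Average Mutual Information (AMI) is calculated to determine the internal cohesion of candidate words, and an improved adjacency entropy is used to determine the boundaries of candidate new words. Then, the retained words are converted into similar-sounding Chinese Pinyin and compared for similarity with the Pinyin of old words in a Chinese dictionary, and the most similar comparison result is kept. Finally, if the comparison result exceeds a threshold, the new word in the result is taken as a homophonic neologism, and the corresponding old word is its original word. Experimental results on a self-built Weibo dataset show that, compared with BNshCNs (Blended Numeric and symbolic homophony Chinese Neologisms) and DSSCNN, a similarity calculation model combining dependency syntax and semantic information, the proposed method improves accuracy, recall, and F1 score by 0.51 and 5.27 percentage points, 2.91 and 6.31 percentage points, and 1.75 and 5.81 percentage points, respectively. The proposed method therefore identifies Chinese homophonic neologisms more effectively.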
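The comparison and threshold steps described above can be sketched as follows. The abstract does not state which similarity measure the method uses, so this example substitutes a normalized Levenshtein distance over Pinyin strings. The dictionary `OLD_WORDS`, the function names, and the threshold values are hypothetical, and the Pinyin is supplied by hand rather than produced by a conversion step.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                   # deletion
                cur[j - 1] + 1,                # insertion
                prev[j - 1] + (ch_a != ch_b),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def similarity(p1, p2):
    """Similarity in [0, 1]; 1.0 means identical Pinyin strings."""
    longest = max(len(p1), len(p2))
    return 1.0 if longest == 0 else 1.0 - edit_distance(p1, p2) / longest


# Hypothetical dictionary of existing words and their toneless Pinyin.
OLD_WORDS = {"喜欢": "xihuan", "什么": "shenme", "同学": "tongxue"}


def find_original(candidate_pinyin, threshold=0.8):
    """Return (old_word, score) for the closest old word, or None if not above threshold."""
    word, score = max(
        ((w, similarity(candidate_pinyin, p)) for w, p in OLD_WORDS.items()),
        key=lambda item: item[1],
    )
    return (word, score) if score > threshold else None


# "xifan" is the Pinyin of 稀饭, used online as a homophonic stand-in for 喜欢.
print(find_original("xifan", threshold=0.6))
print(find_original("xifan", threshold=0.9))
```

A character-level distance is only a rough proxy for phonetic closeness. A measure that weights initials and finals separately would likely track pronunciation better, but that goes beyond what the abstract specifies.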
Guangzhou and Foshan have a relatively mature metro network. However, some metro station names are over-transliterated in Pinyin. This translation method is applied to general names, nouns of locality, and some names of tourist destinations. Drawing on translation landscape and linguistic landscape theories, the reasons for and impacts of over-transliteration in the Guangzhou and Foshan metro are discussed from the perspective of symbolic function. English names of metro stations in other cities serve as a reference for finding appropriate solutions.
Funding: This research was partially supported by the National Natural Science Foundation of China (No. 60435020 and No. 90612005), the Goal-oriented Lessons from the National 863 Program of China (No. 2006AA01Z197), and a Project of Microsoft Research Asia.
Funding: Supported by the National Natural Science Foundation of China as a key program (No. 60435020) and the High Technology Research and Development Programme of China (2002AA117010-09).