Abstract
Dialect research comprises three main components: the study of phonetics, of vocabulary, and of grammar. Identifying dialect words is the first step in dialect vocabulary research. At present, corpora of Chinese dialect vocabulary are collected and organized mainly by hand by experts, which is time-consuming and labor-intensive. With the development of information technology, much communication now takes place over the network, and input method data contains a vast amount of language material together with geographical information, which can support the automatic discovery of dialect vocabulary. However, little to no prior work has examined how to use Pinyin input method data to analyze dialect vocabulary systematically. This paper therefore explores methods for automatically discovering the dialect vocabulary of each region from the behavior of Chinese input method users. Specifically, we identify two classes of features that characterize dialect words in input method data, and we recognize dialect words using different combinations of these features. Finally, we evaluate experimentally how the different combinations of the two feature classes affect dialect word recognition.
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journals (北大核心)
2013, No. 5, pp. 22-28 (7 pages)
Funding
Key Program of the National Natural Science Foundation of China (61133012)
National 863 Program of China (2012AA011102)
Keywords
dialect detection
Chinese Pinyin input method
feature combination