Abstract
Character-based tagging is currently one of the more effective approaches to Chinese word segmentation. However, Chinese characters carry semantic information of their own, and a character's meaning and function vary with context, so characters differ in how they correlate with their context and therefore in their word-formation rules. To address this problem, this paper proposes a multi-model Chinese word segmentation method based on character clusters. The method first builds a separate model for each character, then clusters the learned model parameters to form character clusters, and finally retrains the model parameters on the basis of those clusters. Experimental results show that the method effectively finds character clusters with the same or similar word-formation rules and distinguishes well how strongly the same kind of feature affects different characters.
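The clustering step described in the abstract can be illustrated with a minimal sketch. It assumes that each character's model has already been trained and reduced to a fixed-length parameter vector. The abstract names neither the model type nor the clustering algorithm, so plain k-means is used here purely as a stand-in. The names `kmeans`, `cluster_characters`, and `char_params`, as well as the toy parameter values, are hypothetical and not taken from the paper.

```python
import math
import random


def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means over equal-length lists of floats.

    Returns one cluster id per input vector. This is a stand-in for
    whatever clustering algorithm the paper actually uses.
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        assign = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_characters(char_params, k):
    """Map each character to a cluster id from its per-character parameters."""
    chars = sorted(char_params)
    ids = kmeans([char_params[ch] for ch in chars], k)
    return dict(zip(chars, ids))


# Toy, made-up parameter vectors (assumption: NOT values from the paper).
char_params = {
    "的": [0.9, 0.1, 0.0],
    "了": [0.8, 0.2, 0.1],
    "国": [0.1, 0.9, 0.7],
    "家": [0.2, 0.8, 0.9],
}
clusters = cluster_characters(char_params, k=2)

# Characters that share a cluster id would share one retrained model.
shared = {}
for ch, cid in clusters.items():
    shared.setdefault(cid, []).append(ch)
```

In this sketch, characters with similar parameter vectors end up under the same cluster id, and the `shared` mapping lists which characters would be pooled when one model per cluster is retrained. Training the per-character and per-cluster models themselves is outside the scope of the example.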
Authors
Li Duihong (李对红)
Wang Peiyan (王裴岩)
Zhang Guiping (张桂平)
Zhang Shaoyang (张少阳)
Human-Computer Intelligence Research Center, Shenyang Aerospace University, Shenyang 110136, China
Source
Application Research of Computers (《计算机应用研究》)
CSCD
PKU Core (北大核心)
2020, Issue 2, pp. 355-359, 374 (6 pages)
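Funding
Key Project of the Natural Science Foundation of Liaoning Province (20170540705)
Youth Fund Project for Humanities and Social Sciences Research of the Ministry of Education of China (17YJC740087)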
Keywords
Chinese word segmentation
word-formation rules
model parameters
clustering