Abstract
To improve information-processing efficiency, text information retrieval systems usually filter out stop words as noise, which affects the results of text processing. To address this problem, a stop word extraction method for the Uyghur language is proposed. Based on an analysis of the characteristics of Uyghur stop words, a large corpus is analyzed statistically using Document Frequency (DF), Term Frequency (TF), and Entropy (EN), and the part-of-speech distribution of the candidate stop words is examined. The stop word threshold is determined through text classification experiments. Experimental results show that filtering stop words with the proposed method reduces the computational complexity of text classification, and the classification precision reaches 80.8%.
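The three corpus statistics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulas, so the definitions below are assumptions: DF is the fraction of documents containing a term, TF is the term's share of all tokens in the corpus, and entropy is the Shannon entropy of the term's distribution across documents. The function names `term_statistics` and `candidate_stop_words`, and the ranking rule, are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Return {term: (df, tf, entropy)} for a list of tokenized documents.

    Assumed definitions (not confirmed by the source):
      df      = fraction of documents that contain the term
      tf      = term count divided by total token count in the corpus
      entropy = Shannon entropy (base 2) of the term's counts over documents
    """
    n_docs = len(docs)
    doc_counts = [Counter(doc) for doc in docs]
    total_tokens = sum(len(doc) for doc in docs)

    corpus_counts = Counter()
    for counts in doc_counts:
        corpus_counts.update(counts)

    stats = {}
    for term, total in corpus_counts.items():
        df = sum(1 for counts in doc_counts if term in counts) / n_docs
        tf = total / total_tokens
        entropy = 0.0
        for counts in doc_counts:
            if term in counts:
                p = counts[term] / total
                entropy -= p * math.log2(p)
        stats[term] = (df, tf, entropy)
    return stats


def candidate_stop_words(docs, top_k):
    """Rank terms by (df, tf, entropy), highest first, and keep the top_k.

    The lexicographic ranking is an illustrative assumption; the paper
    determines its threshold through text classification experiments.
    """
    stats = term_statistics(docs)
    ranked = sorted(stats, key=lambda term: stats[term], reverse=True)
    return ranked[:top_k]
```

A term that appears evenly across many documents scores high on all three measures, which is the usual profile of a stop word. In practice the cutoff `top_k` would be tuned against a downstream task, as the abstract describes for text classification.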
Authors
SAIMAITI Maimaitimin (塞麦提·麦麦提敏)
ESMAEL Abdurehim (司马义·阿不都热依木)
Affiliations: Chinese Languages School, Xinjiang University, Urumqi 830046, China; Xinjiang Research Center for Chinese-Ethnic Languages Translation, Urumqi 830046, China
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journals (北大核心)
2019, Issue 10, pp. 288-292, 300 (6 pages)
Funding
National Social Science Fund of China (17XYY034)
Ministry of Education Humanities and Social Sciences Research Youth Project (16XJJC740001)
Keywords
information retrieval
stop words
Uyghur
text classification
corpus statistics