
Uyghur Text Classification Based on Naive Bayes and Its Performance Analysis (cited by: 7)
Abstract: Against the background of research on automatic classification of large-scale Uyghur web text, this paper designs a modular Uyghur text classification system. After a thorough survey, the Naive Bayes algorithm was chosen as the classification engine, and the system was implemented in C#. In preprocessing, a stem extraction method that draws on the lexical characteristics of Uyghur is introduced, which greatly reduces the feature dimensionality. Classification results are reported on a relatively large corpus of more than 3,000 documents in 10 categories, followed by results for different numbers of features selected with the χ² statistic. The results show that only 1% to 3% of the features in the preprocessed Uyghur feature space are the best ones, so it is feasible to identify the best features or to further reduce the dimensionality of the feature space.
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, Peking University Core Journal, 2012, No. 12, pp. 27-29 (3 pages)
Funding: National Natural Science Foundation of China (61063022, 61163033)
Keywords: Uyghur; text classification; Naive Bayes; stem extraction; stop words
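
The pipeline summarized in the abstract (χ² feature selection followed by Naive Bayes classification) can be illustrated with a minimal sketch. This is not the authors' implementation: their system was written in C#, while the sketch below uses Python with only the standard library. The function names, the multinomial model with add-one (Laplace) smoothing, the max-over-classes χ² aggregation, and the toy English tokens (standing in for Uyghur stems after stem extraction and stop-word removal) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def chi_square(n11, n10, n01, n00):
    """Chi-square score from a 2x2 term/class document-count table.

    n11: docs in the class with the term, n10: docs outside the class with it,
    n01: docs in the class without it,    n00: docs outside the class without it.
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score (max over classes).

    Taking the maximum over classes is an assumption; averaging is also common.
    """
    n_docs = len(docs)
    class_size = Counter(labels)
    df = Counter()                      # document frequency per term
    df_in_class = defaultdict(Counter)  # document frequency per class and term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[label][term] += 1
    scores = {}
    for term, total in df.items():
        best = 0.0
        for c, size in class_size.items():
            n11 = df_in_class[c][term]
            n10 = total - n11
            n01 = size - n11
            n00 = n_docs - size - n10
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[term] = best
    # Break ties alphabetically so the selection is deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one smoothing over the selected vocab."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for c, n_c in class_docs.items():
        total = sum(term_counts[c].values())
        log_prior = math.log(n_c / len(docs))
        log_like = {
            t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[c] = (log_prior, log_like)
    return model


def classify(model, tokens):
    """Return the class with the highest log posterior; unseen terms are skipped."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(sorted(model), key=score)


# Toy corpus (hypothetical): each document is a list of already-stemmed tokens.
toy_docs = [
    ["goal", "match", "team"],
    ["team", "score", "match"],
    ["market", "stock", "price"],
    ["price", "bank", "market"],
]
toy_labels = ["sport", "sport", "economy", "economy"]

toy_vocab = select_features(toy_docs, toy_labels, k=4)
toy_model = train_nb(toy_docs, toy_labels, toy_vocab)
print(sorted(toy_vocab))
print(classify(toy_model, ["match", "team", "goal"]))
```

Varying `k` in `select_features` corresponds to the abstract's experiments with different numbers of selected features; the 1% to 3% figure reported there refers to the authors' corpus, not to this toy data.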
