
W-POS Language Model and Its Selecting and Matching Algorithms (cited by: 3)
Abstract: The n-grams language model aims to generate text features from combinations of words in order to train a classifier for text classification. However, n-grams contains redundant words, and matching and quantifying against the training set produces large amounts of sparse data, which severely degrades classification accuracy and limits the model's applicability. To address this, an improved language model named W-POS (Word-Parts of Speech) was proposed on the basis of the n-grams language model. After word segmentation, words that appear with low probability, as well as redundant words, were replaced by their parts of speech, yielding the W-POS language model: an irregular sequence composed of both words and part-of-speech tags. Selection rules, a selecting algorithm, and an algorithm for matching against the test set were also put forward for this model. Experimental results on the Fudan University Chinese Corpus and the English corpus 20Newsgroups show that the W-POS language model inherits the advantages of the n-grams language model, including reducing the number of features, carrying partial semantics, and improving precision, while overcoming its drawbacks of generating large amounts of sparse data and containing redundant words. The experiments also verify the effectiveness and feasibility of the selecting and matching algorithms.
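The abstract describes the core W-POS construction: after segmentation, low-probability and redundant words are replaced by their part-of-speech tags, and features are then extracted from the resulting mixed word/POS sequence. A minimal sketch in Python of that idea, assuming a simple frequency threshold as the replacement criterion (the paper's actual selection rules are not given in the abstract); the function names, threshold, and toy data are all hypothetical:

```python
from collections import Counter

def build_wpos_sequence(tagged_tokens, word_counts, min_count=3):
    """Replace words below a (hypothetical) frequency threshold with their
    POS tag; frequent words are kept, giving a mixed word/POS sequence."""
    return [word if word_counts[word] >= min_count else pos
            for word, pos in tagged_tokens]

def wpos_ngrams(seq, n=2):
    """Extract contiguous n-gram features from the mixed word/POS sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# Toy corpus statistics and a POS-tagged sentence (illustrative only).
counts = Counter({"the": 10, "model": 8, "reduces": 1, "sparsity": 1})
tagged = [("the", "DT"), ("model", "NN"),
          ("reduces", "VBZ"), ("sparsity", "NN")]

seq = build_wpos_sequence(tagged, counts)
# "reduces" and "sparsity" fall below the threshold, so POS tags stand in:
print(seq)                 # ['the', 'model', 'VBZ', 'NN']
print(wpos_ngrams(seq, 2))  # [('the', 'model'), ('model', 'VBZ'), ('VBZ', 'NN')]
```

Because rare words collapse onto a small POS tag set, many distinct rare n-grams map to the same W-POS feature, which is how the model reduces the sparsity that plain word n-grams produce.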
Source: Journal of Computer Applications (《计算机应用》, CSCD, Peking University Core Journal), 2015, No. 8: 2210-2214, 2248 (6 pages)
Funding: National Natural Science Foundation of China (70971059); Liaoning Provincial Innovation Team Project (2009T045); Liaoning Province Program for Distinguished Young Scholars in Universities (LJQ2012027)
Keywords: n-grams language model; parts of speech; redundancy; sparse data; feature selection

