W-POS language model and its selecting and matching algorithms (cited by: 3)

Abstract: The n-grams language model uses combinations of several words as text features for training a classifier. However, n-grams contain redundant words, and a large amount of sparse data is generated when n-grams are matched against and quantified on the test data, which seriously degrades classification accuracy and limits the model's applicability. To address this, an improved language model based on n-grams, named W-POS (Word-Parts of Speech), is proposed. After word segmentation, words that occur with low probability and redundant words are replaced by their parts of speech, so that the W-POS language model consists of irregular arrangements of words and parts of speech. Selection rules, a selecting algorithm, and an algorithm for matching against the test set are also proposed. Experimental results on the Fudan University Chinese Corpus and the English corpus 20Newsgroups show that the W-POS language model inherits the advantages of n-grams (reducing the number of features, carrying partial semantics, and improving precision) while overcoming their shortcomings of producing large amounts of sparse data and containing redundant words. The experiments also verify the effectiveness and feasibility of the selecting and matching algorithms.
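The replacement and matching idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the count threshold `min_count` standing in for "low probability", the hand-supplied `redundant` word set, the `("W", word)` / `("POS", tag)` encoding, and the element-wise matching rule. The paper's actual selection rules and algorithms are not given in this record.

```python
from collections import Counter

def to_wpos(tagged, counts, min_count, redundant):
    """Replace rare or redundant words by their POS tag (illustrative rule)."""
    out = []
    for word, pos in tagged:
        if counts[word] < min_count or word in redundant:
            out.append(("POS", pos))   # keep only the part of speech
        else:
            out.append(("W", word))    # keep the word itself
    return out

def ngrams(seq, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def matches(pattern, tagged_ngram):
    """A word element must equal the token's word; a POS element its tag."""
    if len(pattern) != len(tagged_ngram):
        return False
    for (kind, value), (word, pos) in zip(pattern, tagged_ngram):
        if kind == "W" and value != word:
            return False
        if kind == "POS" and value != pos:
            return False
    return True

# Hypothetical, already segmented and POS-tagged training text.
train = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VB"),
         ("the", "DT"), ("fox", "NN"), ("jumps", "VB")]
counts = Counter(word for word, _ in train)
features = ngrams(to_wpos(train, counts, min_count=2, redundant={"the"}), 2)
```

In this toy input, the pattern `(("POS", "JJ"), ("W", "fox"))` matches an unseen bigram such as `lazy/JJ fox/NN`, whereas a plain word bigram `quick fox` would not. This is the kind of sparsity reduction the abstract attributes to W-POS.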

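Authors: Qiu Yunfei, Liu Shixing, Wei Haichao, Shao Liangshan

Affiliations: School of Software, Liaoning Technical University; Institute of Systems Engineering, Liaoning Technical University

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 8, pp. 2210-2214, 2248 (6 pages)

Funding: National Natural Science Foundation of China (70971059); Liaoning Province Innovation Team Project (2009T045); Liaoning Province Growth Program for Outstanding Young Scholars in Higher Education Institutions (LJQ2012027)

Keywords: n-grams language model; parts of speech; redundancy; sparse data; feature selection

Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]; TP301.6 [Automation and Computer Technology: Computer System Architecture]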

