Research on Effect of Adding Internal Semantic Relationship into Text Categorization (融入内部语义关系对文本分类的影响研究)

Cited by: 3

Abstract: To improve text categorization with the vector space model without adding external semantic knowledge, this paper presents a feature-matrix-based categorization framework. Word-to-word and text-to-text relationships latent in the corpus itself are mined and merged into the original word-text matrix in different ways. Two common algorithms, SVM and KNN, are then used in comparative text categorization experiments on a strongly domain-specific legal corpus and a broad-domain news corpus. The experimental results show that adding word-to-word and text-to-text relationships usually improves categorization effectively, but the effect differs across classification methods and domain characteristics, so the way of adding them should be chosen case by case in practice.
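As a rough illustration of the idea in the abstract, the sketch below builds a small word-text matrix, derives word-to-word and text-to-text similarities from that matrix alone, and merges them back in before running KNN. The abstract does not say how the relationships are computed or combined, so cosine similarity and the product S_w · X · S_d are assumptions made here, not the authors' method. The toy corpus and all function names (`build_word_text_matrix`, `enrich`, `knn_predict`) are hypothetical, and the SVM part of the experiments is left out to keep the example free of external libraries.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def similarity_matrix(rows):
    """Pairwise cosine similarity between the given row vectors."""
    return [[cosine(r, s) for s in rows] for r in rows]

def build_word_text_matrix(docs):
    """Rows are words, columns are texts, entries are raw term counts."""
    vocab = sorted({w for d in docs for w in d})
    return vocab, [[d.count(w) for d in docs] for w in vocab]

def enrich(x, use_words=True, use_texts=True):
    """Merge corpus-internal relations into x (assumed form: S_w * x * S_d)."""
    out = x
    if use_words:
        out = matmul(similarity_matrix(x), out)             # word-word relations
    if use_texts:
        out = matmul(out, similarity_matrix(transpose(x)))  # text-text relations
    return out

def knn_predict(x, labels, query_col, k=3):
    """Majority vote among the k labeled texts most similar to text `query_col`."""
    cols = transpose(x)
    scored = sorted(
        ((cosine(cols[query_col], cols[j]), labels[j])
         for j in range(len(cols))
         if j != query_col and labels[j] is not None),
        reverse=True,
    )
    votes = {}
    for _, label in scored[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Toy corpus (invented for illustration): two legal texts, two news texts,
# and one unlabeled text to classify.
docs = [
    ["court", "judge", "law"],
    ["law", "contract", "court"],
    ["match", "team", "goal"],
    ["team", "coach", "goal"],
    ["judge", "contract"],
]
labels = ["legal", "legal", "news", "news", None]

vocab, x = build_word_text_matrix(docs)
enriched = enrich(x)

pred_raw = knn_predict(x, labels, query_col=4)
pred_enriched = knn_predict(enriched, labels, query_col=4)
print(pred_raw, pred_enriched)
```

In this sketch the unlabeled text never contains the word "law", yet after enrichment its column carries a positive weight for "law" because "law" co-occurs with "judge" and "contract" elsewhere in the corpus. Switching `use_words` or `use_texts` off gives a simple way to compare the two kinds of relationship separately, in the spirit of the comparison the abstract describes.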

Authors: Zhu Jianlin (朱建林), Yang Xiaoping (杨小平), Peng Jingqiao (彭鲸桥)

Affiliations: School of Finance, Renmin University of China; School of Information, Renmin University of China

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2016, Issue 9, pp. 82-86 (5 pages)

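Funding: Supported by the National Natural Science Foundation of China (71271209), the Beijing Natural Science Foundation (4132067), the Ministry of Education Humanities and Social Sciences Youth Fund (11YJC630268), and the Natural Science Foundation of Hebei Province (A2013410011)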

Keywords: vector space model, text categorization, semantic mining, feature matrix

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

