Research on a Web Text Classification Algorithm Based on Lexical Relatedness

Abstract: In the feature selection stage, traditional text classification algorithms handle the relationship between words and categories mechanically, using purely statistical methods. They assume that words are mutually independent and ignore the semantic relationships between feature keywords. This paper proposes a new feature selection method: it computes the lexical relatedness between feature words with a measure based on context statistics, sets a relatedness threshold, and selects features against that threshold. The method reduces the high-dimensional sparsity of the feature space, effectively reduces noise, and improves both classification accuracy and algorithm efficiency.
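The abstract does not give the relatedness formula or say how the threshold is applied, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes a document-level co-occurrence Dice coefficient as the context-based relatedness measure, and it assumes the threshold is used to prune a term when it is strongly related to a term already kept. The function names `dice_relatedness` and `select_features`, the toy corpus, and the threshold value are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def dice_relatedness(docs):
    """Return per-term document frequencies and a Dice score per term pair.

    Assumption: relatedness is estimated from document-level co-occurrence;
    the paper's actual context-based measure is not stated in the abstract.
    """
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms of a pair
    for tokens in docs:
        terms = sorted(set(tokens))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    rel = {
        pair: 2.0 * n / (term_df[pair[0]] + term_df[pair[1]])
        for pair, n in pair_df.items()
    }
    return term_df, rel


def select_features(docs, threshold):
    """Greedy pruning (assumed selection rule): visit terms by descending
    document frequency and drop a term whose relatedness to an already
    kept term reaches the threshold."""
    term_df, rel = dice_relatedness(docs)
    kept = []
    for term in sorted(term_df, key=lambda t: (-term_df[t], t)):
        redundant = any(
            rel.get(tuple(sorted((term, k))), 0.0) >= threshold for k in kept
        )
        if not redundant:
            kept.append(term)
    return kept


# Hypothetical toy corpus of tokenized documents.
docs = [
    ["network", "security", "attack"],
    ["network", "security", "firewall"],
    ["text", "classification", "feature"],
    ["text", "classification", "network"],
]
features = select_features(docs, threshold=0.75)
print(features)
```

In this toy run, "security" is pruned because it nearly always co-occurs with "network", and "text" is pruned because it always co-occurs with "classification". The smaller feature set illustrates how a relatedness threshold could shrink a sparse, high-dimensional feature space.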

Authors: 邱前智 (Qiu Qianzhi), 刘忠 (Liu Zhong)

Institution: 桂林理工大学 (Guilin University of Technology)

Source: 《网络安全技术与应用》 (Network Security Technology & Application), 2012, No. 5, pp. 33-34, 40 (3 pages)

Keywords: text classification; feature selection; lexical relatedness

Classification code: TP391.1 [Automation and Computer Technology / Computer Application Technology]
