
A method of feature selection for Chinese Web page based on improved HTML-Tree
Abstract: Extracting feature vectors from Chinese web pages is key to improving the accuracy and recall of Chinese web page classification. Based on a study of the structural characteristics of HTML pages, the paper proposes a text preprocessing method for Chinese web pages built on an improved HTML-Tree and on web page element (HTML tag) weights, and extracts the feature vector of the page text on that basis. The method exploits the characteristics of different categories of web pages and takes into account the weight contributions of the various elements within a page. Experimental results show that the method improves the efficiency of feature vector extraction and effectively improves the accuracy and recall of Chinese web page classification.
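The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of weighting terms by the HTML element that contains them. It is not the authors' method. The weight table `TAG_WEIGHTS`, the function name `weighted_terms`, and the whitespace tokenization are all assumptions made for illustration; real Chinese text would need a word segmenter in place of `str.split`.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights, chosen only for illustration (not from the paper).
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}            # content that is never page text
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class WeightedTermCounter(HTMLParser):
    """Accumulates a weight per term, based on the enclosing HTML elements."""

    def __init__(self):
        super().__init__()
        self.open_tags = []        # stack of currently open elements
        self.weights = Counter()   # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.open_tags):
            return
        # A term takes the largest weight among its enclosing elements.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        # Assumption: whitespace tokenization stands in for word segmentation.
        for term in data.split():
            self.weights[term.lower()] += weight


def weighted_terms(html):
    """Return a term -> weight mapping for one HTML document."""
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return parser.weights


if __name__ == "__main__":
    page = (
        "<html><head><title>sports news</title></head>"
        "<body><h1>sports</h1><p>football news today</p></body></html>"
    )
    for term, weight in weighted_terms(page).most_common():
        print(term, weight)
```

Taking the maximum weight over the enclosing elements is one possible design choice; summing or multiplying nested weights would be equally plausible, and the abstract does not say which the paper uses.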
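Authors: 李铭岳 (Li Mingyue), 周军 (Zhou Jun)
Source: Information Technology (《信息技术》), 2009, No. 1, pp. 10-14 (5 pages)
Funding: National Natural Science Foundation of China, CNGI project (CNGI-04-15-2A); Science and Technology Commission of Shanghai Municipality funded project (05DZ22102)
Keywords: HTML-Tree; feature vector; feature selection; web page classification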


