
Research on Webpage Classification Technology Based on Web Hyperlink Structure Information (cited by: 4)

Improving Technology of Webpage Classification Based on Hyperlinks Structure Information
Abstract: This paper proposes an improved webpage classification method that exploits Web hyperlink structure information by making full use of information from neighboring pages, both those that link in and those that are linked out to. The method has five steps: (1) preprocess the training set, extracting text and hyperlink structure information; (2) extract feature vectors and train a full-text classifier for Web pages; (3) create a virtual document for each page from the anchor text and extended anchor text of its inbound links, and train a virtual-document classifier on these virtual documents in place of the full page text; (4) coordinate the two classifiers with a Naive Bayes method to obtain a preliminary classification; (5) revise the preliminary classification using the pages that the page links out to, yielding the final classification. A prototype automatic webpage classification system was implemented based on this method, and classification experiments indicate that the method effectively improves classification performance.
Source: Journal of Quanzhou Normal University (《泉州师范学院学报》), 2008, No. 4, pp. 25-29, 47 (6 pages in total)
Keywords: webpage classification; anchor text; links; Naive Bayes
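The abstract names steps (3) to (5) but gives no formulas, so the following Python sketch is only one plausible reading of them, not the authors' implementation. It assumes that the two classifiers already output per-class posterior probabilities, that "coordinating with Naive Bayes" means multiplying the two posteriors and dividing by the class prior (a conditional-independence assumption), and that the outlink revision is a weighted blend with the mean posterior of the linked-out pages. The function names, the `window` size, and the blend weight `alpha` are all hypothetical.

```python
def build_virtual_document(inlinks, window=5):
    """Step (3): build a virtual document for one page.

    `inlinks` is a list of (text_before, anchor_text, text_after) tuples, one
    per inbound link. The extended anchor text is taken to be up to `window`
    words on each side of the anchor (an assumption of this sketch).
    """
    tokens = []
    for before, anchor, after in inlinks:
        left = before.split()[-window:] if window > 0 else []
        right = after.split()[:window]
        tokens.extend(left + anchor.split() + right)
    return tokens


def combine_naive_bayes(p_text, p_virtual, prior):
    """Step (4): combine two posteriors assuming conditional independence.

    P(c | text, virtual) is taken proportional to
    P(c | text) * P(c | virtual) / P(c), then normalised over the classes.
    """
    scores = {c: p_text[c] * p_virtual[c] / prior[c] for c in prior}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def revise_with_outlinks(p_page, outlink_posteriors, alpha=0.3):
    """Step (5): blend the page's posterior with the mean posterior of the
    pages it links out to. `alpha` is a hypothetical blend weight."""
    if not outlink_posteriors:
        return dict(p_page)
    n = len(outlink_posteriors)
    revised = {}
    for c in p_page:
        neighbor_mean = sum(p[c] for p in outlink_posteriors) / n
        revised[c] = (1 - alpha) * p_page[c] + alpha * neighbor_mean
    return revised


if __name__ == "__main__":
    # Toy posteriors standing in for the outputs of the two trained classifiers.
    prior = {"sports": 0.5, "news": 0.5}
    p_text = {"sports": 0.6, "news": 0.4}
    p_virtual = {"sports": 0.3, "news": 0.7}

    preliminary = combine_naive_bayes(p_text, p_virtual, prior)
    final = revise_with_outlinks(
        preliminary,
        [{"sports": 0.9, "news": 0.1}, {"sports": 0.7, "news": 0.3}],
    )
    print(max(preliminary, key=preliminary.get), max(final, key=final.get))
```

In this toy setup the preliminary label can change after the outlink revision, which is the kind of correction step (5) describes; how strongly neighbors should count in practice is something the abstract does not specify.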

