Research on Webpage Classification Technology Based on Web Hyperlink Structure Information (cited by: 4)

Improving Technology of Webpage Classification Based on Hyperlinks Structure Information


Abstract: This paper proposes an improved webpage classification method that exploits Web hyperlink structure information, making full use of information from neighboring pages (both pages that link in and pages that are linked out to). The method has five steps: (1) preprocess the training set and extract text information and hyperlink structure information; (2) extract feature vectors and train a full-text classifier on the Web pages; (3) build a virtual document for each page from the anchor text and extended anchor text of its inbound links, and train a virtual-document classifier on these virtual documents in place of the full page text; (4) coordinate the two classifiers with the Naive Bayes method to obtain a preliminary classification; (5) revise the preliminary classification using the pages that each page links out to, giving the final classification. A prototype system for automatic webpage classification was implemented on the basis of this method, and classification experiments indicate that the method effectively improves classification performance.
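Steps (3) to (5) of the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives neither the combination formula nor the revision rule, so the sketch assumes the two classifiers are conditionally independent given the class (the usual Naive Bayes product rule) and that revision is a weighted blend with the mean posterior of the out-linked pages. All function names and the weight `alpha` are hypothetical.

```python
from collections import defaultdict


def build_virtual_documents(links):
    """Step 3 (sketch): one virtual document per target page.

    links: iterable of (source_url, target_url, anchor_text, extended_anchor_text).
    Returns {target_url: text} made of all inbound anchor and extended anchor text.
    """
    parts = defaultdict(list)
    for _source, target, anchor, extended in links:
        parts[target].append(anchor)
        parts[target].append(extended)
    # Skip empty strings so the joined text has no double spaces.
    return {url: " ".join(p for p in texts if p) for url, texts in parts.items()}


def combine_naive_bayes(p_full, p_virtual, prior):
    """Step 4 (sketch): merge the posteriors of the two classifiers.

    Assumes the full text d1 and the virtual document d2 are independent
    given the class c, so P(c | d1, d2) is proportional to
    P(c | d1) * P(c | d2) / P(c).
    """
    scores = {c: p_full[c] * p_virtual[c] / prior[c] for c in prior}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def revise_with_outlinks(page_posterior, outlink_posteriors, alpha=0.7):
    """Step 5 (sketch): blend a page's posterior with its out-linked pages.

    alpha is an assumed weight on the page's own preliminary result;
    the remaining weight goes to the mean posterior of the out-linked pages.
    """
    if not outlink_posteriors:
        return dict(page_posterior)
    n = len(outlink_posteriors)
    mean = {c: sum(p[c] for p in outlink_posteriors) / n for c in page_posterior}
    return {c: alpha * page_posterior[c] + (1 - alpha) * mean[c]
            for c in page_posterior}


if __name__ == "__main__":
    prior = {"sports": 0.5, "tech": 0.5}
    p_full = {"sports": 0.6, "tech": 0.4}      # full-text classifier output
    p_virtual = {"sports": 0.2, "tech": 0.8}   # virtual-document classifier output
    preliminary = combine_naive_bayes(p_full, p_virtual, prior)
    final = revise_with_outlinks(preliminary, [{"sports": 0.1, "tech": 0.9}])
    print(max(final, key=final.get))
```

In this sketch the two classifiers disagree, and the more confident virtual-document classifier dominates the product; the out-linked page then reinforces that label. A real system would obtain `p_full` and `p_virtual` from trained text classifiers rather than fixed numbers.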

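Authors: Guo Miaoxia (郭淼霞), Wu Yangyang (吴扬扬)

Affiliations: College of Science and Engineering, Quanzhou Normal University; Department of Computer Science, Huaqiao University

Source: Journal of Quanzhou Normal University (《泉州师范学院学报》), 2008, No. 4, pp. 25-29, 47 (6 pages)

Keywords: webpage classification; anchor text; links; Naive Bayes

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]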


References (8)

1. YANG Y, SLATTERY S, GHANI R. A study of approaches to hypertext categorization[J]. Journal of Intelligent Information Systems, 2002, 18(2/3): 219-241.
2. GLOVER E J, TSIOUTSIOULIKLIS K, LAWRENCE S, et al. Using web structure for classifying and describing Web pages[C]//Proc. of the Int'l Conf. on the World Wide Web (WWW-2002). Honolulu: ACM Press, 2002: 562-569.
3. CHOON Y. Classification of world wide web documents[D]. Pittsburgh: Carnegie Mellon Univ, 2000.
4. FAN Yan, ZHENG Cheng, WANG Qingyi, CAI Qingsheng, LIU Jie. Coordinated classification of Web pages using the Naive Bayes method[J]. Journal of Software (软件学报), 2001, 12(9): 1386-1392. Cited by: 53
5. CRASWELL N, HAWKING D, ROBERTSON S. Effective site finding using link anchor information[C]//Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New Orleans, Louisiana, United States), SIGIR '01. New York: ACM Press, 2001: 250-257.
6. DOU SHEN, SUN JIANTAO, QIANG YANG, et al. A comparison of implicit and explicit links for Web page classification[C]//Proceedings of the 15th International Conference on World Wide Web, 2006: 643-650.
7. LEWIS D D, SCHAPIRE R E, CALLAN J P, et al. Training algorithms for linear text classifiers[C]//FREI H-P, HARMAN D, SCHÄUBLE P, WILKINSON R, eds. Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, Zürich, CH, 1996. New York: ACM Press, 1996: 298-306.
8. FENG Shicong, SHAN Songwei, GONG Bihong, ZHANG Zhigang, LI Xiaoming. Research on the "Tianwang" directory navigation service[J]. Journal of Computer Research and Development (计算机研究与发展), 2004, 41(4): 653-659. Cited by: 8

