
Nave Bayes分类器制导的专业网页爬取算法 被引量:3

Nave Bayesian Classifier Guided Domain Specific Webpage Crawling Algorithm
Abstract: The pressing need to retrieve information from the Web quickly and accurately has given rise to domain-specific search engines. In such an engine, the crawler collects information from the Web for a particular domain and is a key component. This paper studies the crawling of domain-specific Chinese webpages. It first uses the KL distance to verify that webpage content and link contexts differ in distribution, and on that basis proposes a crawling algorithm for domain-specific Chinese webpages guided by a Naïve Bayes classifier, which takes the anchor text of a hyperlink and its surrounding context as features. An algorithm is also designed to automatically collect the labeled hyperlinks needed to train the classifier. Taking financial webpages as an example, the algorithm is evaluated in both offline and online tests. The results show that the crawler guided by the Naïve Bayes classifier achieves a harvest rate of nearly 90% on domain-specific webpages.
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2010, No. 4, pp. 32-38, 62 (8 pages)
Keywords: computer application; Chinese information processing; search engine; domain-specific crawler; Naïve Bayesian classifier; hyperlink context
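
The two ingredients named in the abstract, a KL distance between word distributions and a Naïve Bayes classifier over anchor text plus its context, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function and class names, the add-alpha smoothing, and the assumption that Chinese text has already been segmented into tokens are all choices made here for illustration.

```python
import math
from collections import Counter


def kl_divergence(p_tokens, q_tokens, alpha=1.0):
    """KL(P || Q) between two add-alpha smoothed unigram distributions.

    Hypothetical helper: the smoothing scheme is an assumption, chosen so
    that every vocabulary word has non-zero probability under both P and Q.
    """
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    vocab = set(p_counts) | set(q_counts)
    p_total = len(p_tokens) + alpha * len(vocab)
    q_total = len(q_tokens) + alpha * len(vocab)
    kl = 0.0
    for word in vocab:
        p = (p_counts[word] + alpha) / p_total
        q = (q_counts[word] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


class NaiveBayesLinkClassifier:
    """Multinomial Naive Bayes over tokens from anchor text and link context.

    Illustrative only: assumes the training data contains at least one
    relevant and one irrelevant link, and that tokens are pre-segmented.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.token_counts = {True: Counter(), False: Counter()}
        self.link_counts = {True: 0, False: 0}
        self.vocab = set()

    def train(self, labeled_links):
        # labeled_links: iterable of (tokens, is_relevant) pairs.
        for tokens, relevant in labeled_links:
            self.link_counts[relevant] += 1
            self.token_counts[relevant].update(tokens)
            self.vocab.update(tokens)

    def _log_joint(self, tokens, label):
        # log P(label) + sum of log P(token | label), with add-alpha smoothing.
        total_links = sum(self.link_counts.values())
        log_p = math.log(self.link_counts[label] / total_links)
        denom = sum(self.token_counts[label].values()) + self.alpha * len(self.vocab)
        for token in tokens:
            if token in self.vocab:  # tokens unseen in training are ignored
                log_p += math.log((self.token_counts[label][token] + self.alpha) / denom)
        return log_p

    def prob_relevant(self, tokens):
        # Normalise the two log joints into P(relevant | tokens).
        pos = self._log_joint(tokens, True)
        neg = self._log_joint(tokens, False)
        m = max(pos, neg)
        e_pos, e_neg = math.exp(pos - m), math.exp(neg - m)
        return e_pos / (e_pos + e_neg)
```

In a crawler, a score such as `prob_relevant(tokens)` could be used to order the frontier so that links judged likely to lead to domain-specific pages are fetched first; how the paper actually schedules links is not stated in this record.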
