Effective strategy of topic distillation and retrieval (cited by: 4)
Abstract: Gathering and retrieving topic-specific Web pages on the Internet is a core problem for vertical search engines. HITS is the classic early algorithm for this problem, and many later papers have improved on it, but both the topic relevance rate of the index and the precision of the engine still leave room for improvement. This paper proposes a HITS-based topic retrieval strategy that filters Web pages by anchor text and title information and combines this with a relevance judgment on page content. Topic relevance is judged against a topic training set, which overcomes the drawback of judging by the query string alone. Experiments show that the strategy improves both the precision of topic distillation and the accuracy of retrieval, and reduces the number of irrelevant URLs downloaded.
Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2010, No. 6, pp. 2106-2108 (3 pages)
Keywords: HITS algorithm; anchor text; Web page title; topic relevance; vector model; topic training set
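
The abstract gives no formulas, so the following is only a minimal Python sketch of the kind of pipeline it describes: build a vector-model profile from a topic training set, keep only links whose anchor text and target title look on-topic, then run standard HITS on the filtered link graph. The function names, whitespace tokenization, raw term-frequency weighting, the centroid profile, and the 0.2 threshold are all assumptions made for illustration and are not taken from the paper. The paper additionally judges page content relevance, which could reuse the same cosine measure on the page body.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector for lower-cased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def topic_centroid(training_docs):
    """Sum the term vectors of the topic training set into one profile."""
    centroid = Counter()
    for doc in training_docs:
        centroid.update(tf_vector(doc))
    return centroid


def filter_links(links, centroid, threshold):
    """Keep links whose anchor text plus target title is on-topic.

    links: iterable of (source, target, anchor_text, target_title).
    The threshold is a hypothetical tuning parameter.
    """
    kept = []
    for src, dst, anchor, title in links:
        if cosine(tf_vector(anchor + " " + title), centroid) >= threshold:
            kept.append((src, dst))
    return kept


def hits(edges, iterations=50):
    """Standard HITS power iteration with L2 normalization."""
    nodes = {n for edge in edges for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in.
        auth = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of authority scores of pages linked to.
        new_hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return hub, auth


# Toy data, invented for this sketch.
training = [
    "hits algorithm link analysis search engine",
    "vertical search engine topic crawler",
]
links = [
    ("a", "b", "vertical search engine", "Topic crawler overview"),
    ("a", "c", "cheap flights", "Holiday deals"),
    ("d", "b", "link analysis", "HITS algorithm notes"),
]
centroid = topic_centroid(training)
edges = filter_links(links, centroid, threshold=0.2)
hub, auth = hits(edges)
```

In this toy run the off-topic link to page `c` is dropped before HITS runs, so `c` is never scored. This mirrors the abstract's claim that filtering reduces the number of irrelevant URLs that have to be downloaded.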
References (10)

  • [1] AWEKAR A, KANG J. Selective approach to handling topic oriented tasks on the World Wide Web[C]//Proc of IEEE Symposium on Computational Intelligence and Data Mining. 2007: 343-348.
  • [2] FLAKE G, LAWRENCE S, GILES C L. Efficient identification of Web communities[C]//Proc of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press, 2000: 150-160.
  • [3] CHAU M, CHEN H. Comparison of three vertical search spiders[J]. Computer, 2003, 36(5): 56-62.
  • [4] LV Lin-tao, CHEN Li-ping, ZHOU Hong-fang. An improved topic relevance algorithm for vertical search engines[C]//Proc of International Conference on Wavelet Analysis and Pattern Recognition. 2008: 753-757.
  • [5] XIAO Ming-jun, HUANG Liu-sheng, LUO Yong-long. SHITS: a Web page ranking method based on hyperlinks and content[J]. Journal of Chinese Computer Systems (小型微型计算机系统), 2006, 27(12): 2177-2182. (in Chinese; cited by: 6)
  • [6] BHARAT K, HENZINGER M. Improved algorithms for topic distillation in a hyperlinked environment[C]//Proc of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM Press, 1998: 104-111.
  • [7] ALAMPANIDIS G, KOTROPOULOS C, PITAS I. Combining text and link analysis for focused crawling: an application for vertical search engines[J]. Information Systems, 2007, 32(6): 886-908.
  • [8] LIU Jin-hong, LU Yu-liang. Survey of research on topical Web crawlers[J]. Application Research of Computers (计算机应用研究), 2007, 24(10): 26-29. (in Chinese; cited by: 131)
  • [9] KLEINBERG J. Authoritative sources in a hyperlinked environment[C]//Proc of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms. New York: ACM Press, 1998: 668-677.
  • [10] ZHANG Bo, CAI Wan-dong. Research on topic-oriented Web spider technology and its system implementation[J]. Microelectronics & Computer (微电子学与计算机), 2009, 26(5): 52-55. (in Chinese; cited by: 13)

Second-level references (41)

  • [1] LIN Hai-xia, YUAN Fu-yong, CHEN Jin-sen. A solution to the greediness of topical Web spider search strategies[J]. Microelectronics & Computer (微电子学与计算机), 2006, 23(z1): 278-280. (in Chinese; cited by: 4)
  • [2] WU Li-hui, WANG Bin, YU Zhi-hua. Framework and implementation of a Web-based information acquisition system[J]. Microelectronics & Computer (微电子学与计算机), 2004, 21(10): 121-123. (in Chinese; cited by: 2)
  • [3] ZHOU Li-zhu, LIN Ling. Survey of focused crawler technology[J]. Journal of Computer Applications (计算机应用), 2005, 25(9): 1965-1969. (in Chinese; cited by: 153)
  • [4] LI Wei, LIU Jian-yi, HE Hua-can, WANG Cong. Research and implementation of a topic-based intelligent Web information collection system[J]. Application Research of Computers (计算机应用研究), 2006, 23(2): 163-166. (in Chinese; cited by: 15)
  • [5] Brin S, Page L. The anatomy of a large-scale hypertextual Web search engine[J]. Computer Networks and ISDN Systems, April 1998, 30(1-7): 107-117.
  • [6] Kleinberg J. Authoritative sources in a hyperlinked environment[C]. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, United States, January 1998: 668-677.
  • [7] Lempel R, Moran S. The stochastic approach for link-structure analysis (SALSA) and the TKC effect[J]. Computer Networks, June 2000, 33(1-6): 387-401.
  • [8] Cohn D, Chang H. Learning to probabilistically identify authoritative documents[C]. In Proceedings of the 17th International Conference on Machine Learning (ICML-2000), Stanford University, United States, June 2000: 167-174.
  • [9] Borodin A, Roberts G O, Rosenthal J S, et al. Finding authorities and hubs from link structures on the World Wide Web[C]. In Proceedings of the 10th International Conference on World Wide Web, Hong Kong, China, May 2001: 415-429.
  • [10] Chakrabarti S, Dom B, Raghavan P, et al. Automatic resource compilation by analyzing hyperlink structure and associated text[J]. Computer Networks and ISDN Systems, April 1998, 30(1-7): 65-74.
