Focused Crawler Based on Ontology Semantics (cited by: 11)

Published English title: Ontology based on focused crawler
Abstract: A focused crawler can quickly gather large amounts of information on a specific topic from the Web, which is valuable for both specialized search engines and data mining applications. To address the shortcomings of the keyword-based topic filtering strategy in common use today, this paper proposes a topic filtering strategy based on ontology semantics, inspired by the idea of concept aggregation. Because information at different positions in a Web page differs in importance, the paper also proposes an improved formula for computing weighted feature-term weights, which enables real-time, semantics-based filtering of Web pages. To further improve the crawler's efficiency, a link relevance prediction algorithm is proposed. Comparative experiments show that the strategy is feasible.
Source: Journal of Shandong University (Natural Science) (《山东大学学报(理学版)》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2006, No. 3, pp. 106-110 (5 pages)
Funding: Supported by the Xiamen University 985 Project (Phase II) Information Innovation Platform (0000-X07204)
Keywords: focused crawler; topic filtering; ontology semantics; link analysis
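
The abstract does not give the paper's weighting formula, so the following is only a minimal sketch of the general idea it describes: terms are mapped to ontology concepts (so that synonyms reinforce one another instead of being counted as unrelated keywords), and each occurrence is weighted by where it appears in the page. The position weights, the toy term-to-concept table, the topic set, and the function names `concept_weights` and `topic_relevance` are all illustrative assumptions, not the authors' method.

```python
from collections import Counter

# Assumed position weights (hypothetical values): a term in the title
# counts more than the same term in the body text.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "anchor": 1.5, "body": 1.0}

# Assumed toy ontology lookup: surface term -> concept identifier.
TERM_TO_CONCEPT = {
    "crawler": "web_crawler",
    "spider": "web_crawler",
    "robot": "web_crawler",
    "ontology": "ontology",
}

# Assumed set of concepts that define the crawl topic.
TOPIC_CONCEPTS = {"web_crawler", "ontology"}


def concept_weights(page):
    """Sum position-weighted occurrences per concept.

    `page` maps a position name (e.g. "title") to a list of tokens.
    Tokens with no concept in the lookup table are ignored.
    """
    weights = Counter()
    for position, tokens in page.items():
        w = POSITION_WEIGHTS.get(position, 1.0)
        for token in tokens:
            concept = TERM_TO_CONCEPT.get(token.lower())
            if concept is not None:
                weights[concept] += w
    return weights


def topic_relevance(page):
    """Share of the page's position-weighted token mass on topic concepts."""
    total = sum(
        POSITION_WEIGHTS.get(position, 1.0) * len(tokens)
        for position, tokens in page.items()
    )
    if total == 0:
        return 0.0
    weights = concept_weights(page)
    on_topic = sum(v for c, v in weights.items() if c in TOPIC_CONCEPTS)
    return on_topic / total


if __name__ == "__main__":
    page = {
        "title": ["Spider", "design"],
        "body": ["the", "robot", "fetches", "pages"],
    }
    print(concept_weights(page))
    print(topic_relevance(page))
```

In this sketch "Spider" and "robot" both count toward the single concept `web_crawler`, which a plain keyword match on "crawler" would miss. A crawler could compare the returned score against a threshold to decide whether to keep a page; the threshold itself would have to be tuned and is not specified in the abstract.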