Title Extraction from HTML Documents Based on Similarity (基于相似度的网页标题抽取方法) Cited by: 6
Abstract: Most existing methods for extracting titles from web pages rely on HTML structure and tag features, but they do not consider the content-level relationship between the title and the body text. This paper proposes a similarity-based method for title extraction from HTML documents that exploits this relationship. The method computes the similarity between linguistic "units" together with their corresponding weights, adjusts the weights with the HITS algorithm model, and then extracts the real title according to a specific selection procedure. Experimental results show that the method achieves satisfactory results on "nonstandard web pages" and also generalizes well to "standard web pages".
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2011, No. 2, pp. 32-37 (6 pages)
Funding: National Natural Science Foundation of China (60970083); National Social Science Foundation of China (09BTQ027)
Keywords: title extraction; similarity; Web information retrieval (Chinese keyword: WEB信息抽取, Web information extraction)
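
The abstract describes the pipeline only at a high level: similarities between text units, weights adjusted with HITS, then selection of the real title. The sketch below is one possible reading of that pipeline, not the authors' implementation. Everything specific in it is an assumption: whitespace tokenization (Chinese text would need a word segmenter instead), cosine similarity over bag-of-words counts, a fully connected similarity graph, and the rule of picking the candidate with the highest authority score. The names `cosine`, `hits_scores`, and `pick_title` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hits_scores(sim, iterations=50):
    """HITS-style power iteration on a weighted graph.

    sim[i][j] is the edge weight from unit i to unit j.
    Returns the authority score of every unit.
    """
    n = len(sim)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority update: sum of hub scores of incoming neighbours, weighted.
        auths = [sum(sim[j][i] * hubs[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(a * a for a in auths)) or 1.0
        auths = [a / norm for a in auths]
        # Hub update: sum of authority scores of outgoing neighbours, weighted.
        hubs = [sum(sim[i][j] * auths[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(h * h for h in hubs)) or 1.0
        hubs = [h / norm for h in hubs]
    return auths


def pick_title(candidates, body_units):
    """Return the candidate whose text is best supported by the other units."""
    units = list(candidates) + list(body_units)
    bags = [Counter(u.lower().split()) for u in units]  # assumption: whitespace tokens
    n = len(units)
    # Similarity graph without self-loops (assumption: all pairs are connected).
    sim = [[0.0 if i == j else cosine(bags[i], bags[j]) for j in range(n)]
           for i in range(n)]
    scores = hits_scores(sim)
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]


if __name__ == "__main__":
    candidates = ["Home | News Portal", "Similarity based title extraction"]
    body = [
        "We study title extraction from web pages",
        "The method uses similarity between title and body",
        "Extraction results are evaluated",
    ]
    print(pick_title(candidates, body))
```

In this toy input the first candidate shares no tokens with the body, so its authority score stays at zero and the second candidate is selected. Because the similarity matrix here is symmetric, hub and authority scores converge to the same ranking; a directed or thresholded graph would be needed for the two to differ.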