DOM-based Web Document Segmentation (基于DOM的Web文本分割)
Abstract: Exploiting the semi-structured information in web documents, this paper proposes a DOM-based algorithm for web text segmentation. The algorithm uses the HTML tags that control the structure and display of page content to build an HTML DOM tree. First, traditional plain-text segmentation methods are adapted so that they apply to web text, and the theoretical basis for the adaptation is given. Then the nodes of the DOM tree are used to smooth the result of the plain-text segmentation: the boundaries between DOM nodes serve as evidence for where topic boundaries should fall. Preliminary experiments indicate that the algorithm improves the precision of web text segmentation.
Author: Luo Jianli (罗建利)
Source: Library and Information Service (《图书情报工作》), indexed in CSSCI and the Peking University Core Journals list (北大核心), 2009, No. 4, pp. 116-120 (5 pages)
Keywords: DOM; text segmentation; topic boundary; text nodes
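
The two-stage idea in the abstract (flat segmentation first, then smoothing with DOM node boundaries) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the similarity measure, window size, threshold, or smoothing rule, so the cosine-similarity windows, the `BLOCK_TAGS` set, the `max_shift` parameter, and all function names below are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: these tags are treated as DOM nodes whose edges may carry topic boundaries.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "section", "article"}


class BlockTextCollector(HTMLParser):
    """Flattens a page into word tokens and records where block-level nodes begin."""

    def __init__(self):
        super().__init__()
        self.tokens = []         # all words of the page, in document order
        self.block_offsets = []  # token offsets that coincide with a block edge
        self._at_block_edge = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._at_block_edge = True

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._at_block_edge = True

    def handle_data(self, data):
        words = re.findall(r"\w+", data.lower())
        if not words:
            return
        # Offset 0 is never a boundary, so only record edges after some text.
        if self._at_block_edge and self.tokens:
            self.block_offsets.append(len(self.tokens))
        self._at_block_edge = False
        self.tokens.extend(words)


def extract_tokens(html):
    """Return (tokens, block_offsets) for an HTML string."""
    collector = BlockTextCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens, collector.block_offsets


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def flat_boundaries(tokens, window=4, threshold=0.1):
    """Stage 1 (flat text): offsets where adjacent windows share almost no vocabulary."""
    cuts = []
    for i in range(window, len(tokens) - window + 1):
        left = Counter(tokens[i - window:i])
        right = Counter(tokens[i:i + window])
        if cosine(left, right) < threshold:
            cuts.append(i)
    return cuts


def smooth(cuts, block_offsets, max_shift=3):
    """Stage 2: snap each flat cut to the nearest DOM block edge, if one is close enough."""
    smoothed = set()
    for cut in cuts:
        if not block_offsets:
            break
        nearest = min(block_offsets, key=lambda b: abs(b - cut))
        if abs(nearest - cut) <= max_shift:
            smoothed.add(nearest)
    return sorted(smoothed)


if __name__ == "__main__":
    page = (
        "<div><p>cats sleep and cats purr</p><p>cats purr when cats sleep</p></div>"
        "<div><p>markets rise and stocks rise</p><p>markets fall when stocks fall</p></div>"
    )
    tokens, offsets = extract_tokens(page)
    cuts = flat_boundaries(tokens)
    print("DOM block offsets:", offsets)
    print("flat cuts:", cuts)
    print("smoothed cuts:", smooth(cuts, offsets))
```

In this toy page the flat stage reports several adjacent candidate cuts around the change of topic, and the smoothing stage collapses them onto the single block edge between the two `div` elements. A real system would need a proper DOM parser, language-appropriate tokenization, and tuned parameters.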