Abstract
Drawing on the semi-structured information in web documents, this paper proposes a DOM-based algorithm for web text segmentation. The algorithm parses the HTML tags that control the structure and display of a page's content and builds an HTML DOM tree from them. It first adapts traditional plain-text segmentation methods so that they apply to web text, and gives the theoretical basis for the adapted algorithms. It then uses the nodes of the DOM tree to smooth the plain-text segmentation results. Preliminary experiments show that the algorithm can effectively improve the precision of web text segmentation.
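The abstract does not spell out the algorithm, so the following is only a minimal sketch of the smoothing idea it describes, under stated assumptions. It assumes that a plain-text segmenter has already produced cut positions (token offsets), that block-level HTML elements stand in for DOM nodes, and that smoothing means moving each cut to the nearest node boundary within a tolerance. The names `BLOCK_TAGS`, `dom_boundaries`, `smooth`, and `max_shift` are hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser

# Assumption: these tags approximate the DOM nodes that delimit text blocks.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "td", "section"}


class BlockCollector(HTMLParser):
    """Collect the tokens of each block-level element as one list."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # one token list per non-empty text block
        self._buf = []    # tokens seen since the last block tag

    def flush_block(self):
        if self._buf:
            self.blocks.append(self._buf)
            self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush_block()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush_block()

    def handle_data(self, data):
        self._buf.extend(data.lower().split())


def dom_boundaries(html):
    """Return (tokens, interior token offsets where one block ends)."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush_block()
    tokens, bounds = [], []
    for block in parser.blocks:
        tokens.extend(block)
        bounds.append(len(tokens))
    return tokens, bounds[:-1]  # drop the end-of-document offset


def smooth(cuts, dom_bounds, max_shift=5):
    """Move each plain-text cut to the nearest block boundary
    if that boundary lies within max_shift tokens; otherwise keep it."""
    result = set()
    for cut in cuts:
        if dom_bounds:
            nearest = min(dom_bounds, key=lambda b: abs(b - cut))
            if abs(nearest - cut) <= max_shift:
                cut = nearest
        result.add(cut)
    return sorted(result)
```

For example, with block boundaries at token offsets 3 and 5, a plain-text cut at offset 2 would be moved to 3 when `max_shift=1`, while a cut at offset 8 would stay where it is. A fuller version would walk an actual DOM tree and weight boundaries by node depth, which this sketch does not attempt.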
Source
《图书情报工作》
CSSCI
PKU Core (Peking University core journal list)
2009, Issue 4, pp. 116-120 (5 pages)
Library and Information Service