Abstract
To address the high complexity that webpage noise and unstructured content introduce into web information extraction, this paper proposes a text information extraction algorithm based on tag path (XPath) clustering. The algorithm first preprocesses the page to remove noise, then clusters tag paths according to the DOM tree structure of the page. It uses an automatically trained threshold together with a webpage segmentation algorithm to quickly identify the key parts of the page, and derives a text extraction template from the nested structure of the data blocks. Experiments on several different types of websites show that the method is fast and achieves relatively high accuracy.
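The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of grouping text nodes by tag path and keeping the dominant group. It is not the authors' algorithm. The names `TagPathCollector` and `extract_main_text`, the fixed `min_share` threshold (the paper trains its threshold automatically), and the choice of noise tags are all assumptions made for illustration, and the segmentation and template-generation steps are omitted.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags whose content is treated as noise in this sketch (an assumption).
NOISE_TAGS = {"script", "style", "noscript"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Groups the text nodes of a page by their tag path (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # currently open tags, root first
        self.clusters = defaultdict(list)  # tag path -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Noise preprocessing: skip text nested inside any noise tag.
        if text and not NOISE_TAGS.intersection(self.stack):
            self.clusters["/" + "/".join(self.stack)].append(text)


def extract_main_text(html, min_share=0.5):
    """Return (tag_path, texts) for the cluster holding the most text.

    min_share is a fixed, hand-picked threshold used only in this sketch:
    the dominant cluster must hold at least this share of all text characters.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    if not collector.clusters:
        return None, []
    sizes = {path: sum(len(t) for t in texts)
             for path, texts in collector.clusters.items()}
    best = max(sizes, key=sizes.get)
    if sizes[best] / sum(sizes.values()) < min_share:
        return None, []
    return best, collector.clusters[best]
```

Calling `extract_main_text(page_html)` on a page whose body text sits mostly in sibling `<p>` elements would return their shared tag path and the paragraph texts. Grouping by the exact tag path is the simplest possible clustering criterion; a fuller implementation would likely also compare paths by similarity and use element attributes.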
Source
《计算机应用与软件》
CSCD
2010, Issue 11, pp. 199-202 (4 pages)
Computer Applications and Software
Keywords
XPath
Webpage segmentation
Information extraction
Clustering
Threshold