摘要
网页中的信息主要以重复的HTML结构进行组织并形成一致的展现形式,主要研究具备复杂重复模式的网页主题信息块识别,提出一种改进的基于逆序匹配重复模式的算法。该算法依据HTML标签结构和class属性改进DOM树,重构页面的向量空间模型,逆序匹配重复结构模式并完成对主题信息的提取。实验结果表明,该方法能准确识别复杂页面结构中主题重复模式,有效避免非主题重复模式的干扰,有较好的召回率和准确率。
The information in webpage is mainly arranged with repetitive HTML structure and presents in consistent display style.In the paper we put emphasis on studying the recognition of the webpage theme information with complicated repetitive pattern and propose an improved algorithm which is based on repetitive pattern reverse matching.The method improves document tree model in accordance with HTML tag structure and class property,reconstructs vector space model of the pages,reversely matches the repetitive structure pattern and then completes the extraction of the theme information.Experimental results suggest that this method can precisely recognise the theme repetitive pattern in complicated webpage structure,effectively avoid the disturbance from non-theme repetitive pattern blocks and performs well in precision and recall.
出处
《计算机应用与软件》
CSCD
北大核心
2013年第4期88-91,共4页
Computer Applications and Software
基金
国家自然科学基金项目(61003045)
关键词
信息提取
重复模式
主题识别
逆序匹配
Information extraction Repetitive pattern Theme recognition Reverse match