Abstract
[Purpose/Significance] With the explosive growth of web pages and the increasing amount of noise they contain, enterprise competitive intelligence systems, intelligent website development, and reading on mobile devices all urgently need a method that extracts web page information efficiently and accurately. [Method/Process] This paper proposes a new information extraction method based on repeated pattern recognition. Through page parsing, similarity calculation, clustering into groups, and removal of banner ads and navigation links, it extracts the title and main content of detail pages. [Result/Conclusion] For pages with a stable structure, the method achieves relatively high-quality information extraction. Its drawback is that the clustering and similarity calculations are computationally expensive and time-consuming.
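The similarity and clustering steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract and keywords confirm only that edit distance and clustering are used. Representing each page block as a sequence of tag names, normalizing the distance by the longer sequence, the greedy first-match grouping, and the `threshold` value are all assumptions made here for illustration.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 1]; normalizing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster(blocks: List[List[str]], threshold: float = 0.6) -> List[List[List[str]]]:
    """Greedy grouping (hypothetical strategy): a block joins the first group
    whose first member is similar enough, otherwise it starts a new group."""
    groups: List[List[List[str]]] = []
    for block in blocks:
        for group in groups:
            if similarity(block, group[0]) >= threshold:
                group.append(block)
                break
        else:
            groups.append([block])
    return groups


# Hypothetical tag sequences for five page blocks.
blocks = [
    ["li", "a"],
    ["li", "a"],
    ["li", "a"],
    ["div", "h1", "p", "p", "p"],
    ["li", "a", "span"],
]
groups = cluster(blocks, threshold=0.6)
```

In this toy input the four `li`-based blocks fall into one group and the `div` block forms its own. Treating large groups of near-identical blocks as candidates for navigation links or ads is one plausible reading of "repeated pattern" and would need to be checked against the full paper. The pairwise comparisons also hint at why the abstract reports a high computational cost.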
Authors
李志义
沈之锐
LI Zhi-yi; SHEN Zhi-rui (School of Economics and Management, South China Normal University, Guangzhou 510006, China; Baidu Online Network Technology (Beijing) Co., Ltd., Beijing 100085, China)
Source
《情报科学》
CSSCI
PKU Core (Peking University core journal list)
2019, No. 3, pp. 88-92, 96 (6 pages)
Information Science
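Funding
National Social Science Fund of China project "Research on Cross-Modal Retrieval Models and Feature Extraction Based on Representation Learning" (17BTQ062)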
Keywords
repeating pattern
information extraction
edit distance
clustering