Abstract
The rapid growth of the Internet and the ever-increasing volume of Web data make it more and more difficult for users to obtain useful information from the Web. Extracting information from the Web quickly, effectively, and accurately has become an urgent problem, and Web information extraction technology has emerged in response. This paper proposes a new XML-based method for automatic Web information extraction. A data conversion algorithm standardizes HTML documents; the XPath expressions of sample instances are learned to form an extraction rule base; and the rule base is then applied to other pages of the same kind to extract their information automatically. Experimental results show that the method achieves high recall and precision, and that the extracted results are self-describing, which makes it convenient to build data extraction systems for different domains.
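The pipeline summarized in the abstract (standardize HTML, learn XPath expressions from a sample, apply them to similar pages) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the helper names (`normalize`, `learn_rule`, `extract`, `to_xml`), the sample pages, and the purely positional rule format are all assumptions made for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

class TreeBuilder(HTMLParser):
    """Hypothetical normalizer: turns loosely written HTML into a well-formed tree."""
    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Simplification: all text is attached to the enclosing element's .text.
        if data.strip():
            elem = self.stack[-1]
            elem.text = (elem.text or "") + data.strip()

def normalize(html):
    """Step 1: standardize an HTML string into an ElementTree element."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def learn_rule(root, sample_value):
    """Step 2: return a positional path to the first element whose text is sample_value."""
    def walk(elem, path):
        if (elem.text or "") == sample_value:
            return path
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            found = walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
            if found:
                return found
        return None
    return walk(root, ".")

def extract(root, rules):
    """Step 3: apply the learned rule base to another page of the same kind."""
    record = {}
    for field, path in rules.items():
        node = root.find(path)
        record[field] = node.text if node is not None else None
    return record

def to_xml(record):
    """Serialize a record as self-describing XML (field names become tags)."""
    node = ET.Element("record")
    for field, value in record.items():
        ET.SubElement(node, field).text = value
    return ET.tostring(node, encoding="unicode")

# Made-up pages: uppercase tags, unquoted attributes, and unclosed void tags.
SAMPLE_PAGE = "<HTML><BODY><h1>Blue Kettle</h1><img src=a.png><p class=price>19.90</p><br><p>In stock</p></BODY></HTML>"
OTHER_PAGE = "<html><body><h1>Red Mug</h1><img src=b.png><p class=price>4.50</p><br><p>Sold out</p></body></html>"

# The user labels three values on the sample page; paths to them form the rule base.
labels = {"name": "Blue Kettle", "price": "19.90", "stock": "In stock"}
sample_tree = normalize(SAMPLE_PAGE)
rules = {field: learn_rule(sample_tree, value) for field, value in labels.items()}

print(rules["price"])  # ./html[1]/body[1]/p[1]
print(to_xml(extract(normalize(OTHER_PAGE), rules)))
# <record><name>Red Mug</name><price>4.50</price><stock>Sold out</stock></record>
```

Positional paths are the simplest possible rule format and break as soon as a page inserts or drops an element; a practical rule base would likely generalize paths across several samples or anchor them on attributes, which this sketch does not attempt.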
Source
《河北工业大学学报》
CAS
PKU Core Journal (北大核心)
2010, No. 5, pp. 73-77 (5 pages)
Journal of Hebei University of Technology
Funding
Tianjin Research Program of Application Foundation and Advanced Technology (10JCZDJC16000)