Abstract
To improve the accuracy and speed of Web document services, Web information extraction has become an important means of processing massive amounts of network data; however, effectively extracting large volumes of heterogeneous information remains difficult. To raise the recall and precision of extraction from massive heterogeneous Web pages, this paper proposes a new extraction method based on the Hidden Markov Model (HMM). Exploiting the HMM's strength in handling rule knowledge, the algorithm builds an HTML tree for each page, uses Shannon entropy to locate the data fields, and then constructs the HMM by Maximum Likelihood estimation to extract the Web information. Simulation results on extracting header structure information from a large collection of academic papers show that the method yields a clear improvement in both recall and precision.
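The abstract names two quantitative steps: Shannon entropy to locate data fields among HTML-tree nodes, and Maximum Likelihood estimation of the HMM parameters. The Python sketch below illustrates both under simplifying assumptions; it is not the paper's implementation, and the function names, the entropy threshold, and the toy training data are illustrative only.

```python
# A minimal sketch (not the paper's implementation) of the two quantitative
# steps in the abstract: Shannon entropy over candidate HTML-tree nodes to
# locate data-rich fields, and maximum-likelihood (frequency-count) estimation
# of HMM parameters from labeled token sequences. All names are illustrative.
from collections import Counter, defaultdict
import math

def shannon_entropy(tokens):
    """H = -sum p(x) * log2 p(x) over the token distribution of one node."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def locate_data_fields(nodes, threshold=2.0):
    """Keep HTML-tree nodes whose text entropy exceeds a threshold, assuming
    data fields are more diverse than boilerplate (threshold is illustrative)."""
    return [n for n, tokens in nodes.items() if shannon_entropy(tokens) > threshold]

def fit_hmm_ml(sequences):
    """Maximum-likelihood estimates of initial, transition, and emission
    probabilities from fully labeled (state, token) sequences."""
    init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in sequences:
        states = [s for s, _ in seq]
        init[states[0]] += 1
        for state, token in seq:
            emit[state][token] += 1
        for a, b in zip(states, states[1:]):
            trans[a][b] += 1
    def normalize(c):
        total = sum(c.values())
        return {k: v / total for k, v in c.items()}
    return (normalize(init),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})

if __name__ == "__main__":
    # Toy paper-header example: states are header fields, tokens are words.
    train = [[("title", "web"), ("title", "extraction"), ("author", "li")],
             [("title", "hmm"), ("author", "wang"), ("author", "chen")]]
    pi, A, B = fit_hmm_ml(train)
    print(pi, A["title"], B["author"], sep="\n")
```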
Source
Computer Simulation (《计算机仿真》), 2010, No. 5, pp. 132-135 (4 pages); indexed in CSCD and the Peking University Core Journals list (北大核心)
Funding
Natural Science Foundation of Shaanxi Province (2007F25)
Scientific Research Fund of Xi'an University of Finance and Economics (07XCK04)
Special Scientific Research Program of the Shaanxi Provincial Department of Education (09JK440)
Keywords
Hidden Markov model
Web information extraction
Maximum likelihood
Machine learning