A Method for Web Information Extraction Based on Multi-Learning Strategies (基于多学习策略的网页信息抽取方法)

Abstract: Because web content is heterogeneous and dynamic, most existing web information extraction (IE) methods generalize poorly to new pages. To address this, the authors combine a conventional text classifier with a Hidden Markov Model (HMM) learning strategy and propose a web IE method based on multi-learning strategies (MLS). The method first obtains a locally optimal classification for each text record (fragment) in a page, then uses the structural information of the whole page text to refine the extraction result. Experimental results show that the MLS method achieves high recall and extraction precision without having to be trained on each new website, which indicates strong applicability.
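The two-stage idea described in the abstract (classify each fragment locally, then refine the labels using page-level structure) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the labels, all probability values, and the names `local_labels` and `viterbi` are hypothetical, and the classifier posteriors are used directly as HMM emission scores, which is a simplifying assumption.

```python
import math

# Hypothetical label set for fragments of a product listing page.
LABELS = ["name", "price"]

# Hypothetical per-fragment classifier outputs P(label | fragment).
POSTERIORS = [
    {"name": 0.90, "price": 0.10},
    {"name": 0.55, "price": 0.45},  # ambiguous fragment
    {"name": 0.80, "price": 0.20},
    {"name": 0.30, "price": 0.70},
]

# Hypothetical HMM parameters: names and prices tend to alternate.
INITIAL = {"name": 0.6, "price": 0.4}
TRANSITION = {
    "name": {"name": 0.1, "price": 0.9},
    "price": {"name": 0.9, "price": 0.1},
}


def local_labels(posteriors):
    """Stage 1: pick the locally best label for each fragment on its own."""
    return [max(p, key=p.get) for p in posteriors]


def viterbi(posteriors, initial, transition):
    """Stage 2: find the jointly most likely label sequence (log space)."""
    score = {
        lab: math.log(initial[lab]) + math.log(posteriors[0][lab])
        for lab in LABELS
    }
    backpointers = []
    for p in posteriors[1:]:
        new_score, pointers = {}, {}
        for cur in LABELS:
            # Best previous label for reaching `cur` at this position.
            prev = max(
                LABELS,
                key=lambda q: score[q] + math.log(transition[q][cur]),
            )
            new_score[cur] = (
                score[prev] + math.log(transition[prev][cur]) + math.log(p[cur])
            )
            pointers[cur] = prev
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the best final label.
    path = [max(LABELS, key=score.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


print(local_labels(POSTERIORS))                    # ['name', 'name', 'name', 'price']
print(viterbi(POSTERIORS, INITIAL, TRANSITION))    # ['name', 'price', 'name', 'price']
```

In this toy input the second fragment is labeled `name` when classified in isolation, but the sequence-level pass relabels it as `price` because the assumed transition probabilities favor alternation.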

Authors: Zhu Ming (朱明), Li Xiang (李香), Zheng Quan (郑烇)

Affiliation: Department of Automation, University of Science and Technology of China

Source: Computer Applications and Software (《计算机应用与软件》), indexed in CSCD and the Peking University Core Journals list, 2008, No. 12, pp. 68-69, 115 (3 pages in total)

Funding: National Development and Reform Commission project "Video-on-Demand System" (CNGI-04-15-2A)

Keywords: information extraction; machine learning; text classifier; HMM

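Classification codes: TP391.4 [Automation and Computer Technology: Computer Application Technology]; TP393.092 [Automation and Computer Technology: Computer Application Technology]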
