
Information Extraction from Web Pages Based on Multi-Knowledge (cited by: 10)
Abstract: Automatically extracting the required information from Web pages is an important research topic in intelligent information retrieval on the Internet. To address the problem of acquiring the information-description knowledge that Web page information extraction requires, this paper proposes a multi-knowledge-based Web page information extraction method (MKIE). The method divides the knowledge needed for extraction into two classes. The first class characterizes how the content of a Web page is represented and provides definite patterns for recognizing the information objects in the page. The second class describes the information record blocks of a page and provides non-definite patterns for its information objects. MKIE derives the second class of knowledge dynamically by analyzing pages with the first class, and then uses both classes together to extract the required information from Web pages whose content is similar but whose presentation differs. Experiments on extracting information from the publication pages of U.S. university faculty show that MKIE is effective at recognizing information display patterns and at extracting the corresponding information.
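The two-stage idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation, because the abstract does not give the MKIE algorithm in detail. Everything here is an assumption made for illustration: the regular expressions standing in for "definite patterns", the candidate tag list, and the helper names `split_blocks`, `learn_record_tag`, and `extract`. The sketch uses page-independent patterns (a publication year, a page range) to infer which HTML tag delimits record blocks on a given page, then extracts records using both kinds of knowledge.

```python
import re

# Assumed "definite" knowledge: page-independent patterns for information objects.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
PAGES = re.compile(r"\b\d+\s*-\s*\d+\b")

# Assumed candidate tags that may delimit a record block on a page.
CANDIDATE_TAGS = ("li", "p", "tr")


def split_blocks(html, tag):
    """Return the content following each opening <tag>, up to the next
    opening <tag>, the closing </tag>, or the end of the page."""
    pattern = re.compile(
        rf"<{tag}\b[^>]*>(.*?)(?=<{tag}\b|</{tag}>|$)", re.I | re.S
    )
    return [m.group(1) for m in pattern.finditer(html)]


def strip_tags(fragment):
    """Replace markup with spaces and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", fragment)).strip()


def learn_record_tag(html):
    """Infer the page-specific ("non-definite") knowledge: the candidate tag
    whose blocks most often contain a year. Returns None if no tag scores."""
    scores = {
        tag: sum(1 for b in split_blocks(html, tag) if YEAR.search(b))
        for tag in CANDIDATE_TAGS
    }
    best = max(CANDIDATE_TAGS, key=lambda t: scores[t])
    return best if scores[best] > 0 else None


def extract(html):
    """Extract one record per block, using the learned tag and the fixed patterns."""
    tag = learn_record_tag(html)
    if tag is None:
        return []
    records = []
    for block in split_blocks(html, tag):
        text = strip_tags(block)
        year = YEAR.search(text)
        if year is None:
            continue  # block does not look like a publication record
        pages = PAGES.search(text)
        records.append({
            "text": text,
            "year": year.group(0),
            "pages": pages.group(0) if pages else None,
        })
    return records


# Two made-up pages with similar content but different layouts.
page_a = ("<ul><li>Smith J. Web mining. 1999, 10-20.</li>"
          "<li>Lee K. Wrappers. 2000, 5-9.</li></ul>")
page_b = "<table><tr><td>Chen H.</td><td>Agents</td><td>1998</td></tr></table>"

for page in (page_a, page_b):
    print(learn_record_tag(page), extract(page))
```

The same fixed patterns work on both sample pages, while the record delimiter is inferred separately for each page. This mirrors the abstract's distinction between knowledge given in advance and knowledge obtained by analyzing the page at hand.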
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, Peking University Core Journal, 2001, No. 9, pp. 1058-1061 (4 pages)
Funding: Supported by the Natural Science Foundation of Anhui Province (Grant No. 98312820)
Keywords: Web; Web pages; information extraction; knowledge; Internet; semi-structured data; pattern recognition


