
Automatic Extraction of Question-Answer Pairs Based on Decision Trees and Markov Chains (基于决策树和马尔可夫链的问答对自动提取). Cited by: 5

Decision Tree and Markov Model Based Question-Answer Pair Extraction
Abstract: A question answering system answers questions posed in natural language with accurate, concise answers, and the scale of its question-answer pair collection is a major factor in its final performance. To enlarge that collection and make full use of Internet resources, this paper proposes an algorithm based on decision trees and Markov chains for automatically extracting question-answer pairs from web pages. The method first represents a web page as a DOM tree according to its HTML tags. It then extracts features from the structural and textual information of each node in the tree. Finally, it classifies the node features with a classification model that combines a decision tree with a first-order Markov chain. Experiments show a precision of 90.398% and a recall of 86.032%. Extraction results over a large number of web pages indicate that the classification model adapts to a wide variety of page layouts.
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2007, No. 2, pp. 46-51 (6 pages)
Funding: National Natural Science Foundation of China (60672056); Microsoft fund project (2006120809)
Keywords: artificial intelligence; pattern recognition; information extraction; DOM tree; decision tree; Markov chain (Markov model)
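The three-stage pipeline described in the abstract (DOM representation, per-node features, decision tree combined with a first-order Markov chain) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the three labels, the feature set, the hand-written rules that stand in for a trained decision tree, the probability values, and the use of Viterbi decoding to combine the two models.

```python
import math
from html.parser import HTMLParser

LABELS = ("Q", "A", "O")  # question, answer, other (hypothetical label set)

class TextNodeCollector(HTMLParser):
    """Record every non-empty text node with its tag path (a flattened DOM view)."""
    def __init__(self):
        super().__init__()
        self.stack, self.nodes = [], []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))

def features(path, text):
    """Structural and textual features of one node (illustrative choices)."""
    return {
        "ends_with_qmark": text.endswith(("?", "\uff1f")),
        "emphasized": any(t in ("b", "strong", "h1", "h2", "h3", "dt") for t in path),
        "length": len(text),
    }

def emission(f):
    """Hand-written rules standing in for a trained decision tree.
    Each leaf returns an assumed class distribution, not learned values."""
    if f["ends_with_qmark"]:
        return {"Q": 0.85, "A": 0.05, "O": 0.10}
    if f["emphasized"]:
        return {"Q": 0.40, "A": 0.10, "O": 0.50}
    if f["length"] >= 20:
        return {"Q": 0.05, "A": 0.60, "O": 0.35}
    return {"Q": 0.05, "A": 0.30, "O": 0.65}

# Assumed first-order Markov chain over labels of consecutive nodes.
START = {"Q": 0.30, "A": 0.05, "O": 0.65}
TRANS = {
    "Q": {"Q": 0.05, "A": 0.90, "O": 0.05},
    "A": {"Q": 0.45, "A": 0.25, "O": 0.30},
    "O": {"Q": 0.30, "A": 0.05, "O": 0.65},
}

def viterbi(emissions):
    """Most probable label sequence under the chain and the per-node scores."""
    score = {l: math.log(START[l]) + math.log(emissions[0][l]) for l in LABELS}
    back = []
    for e in emissions[1:]:
        prev, score, ptr = score, {}, {}
        for l in LABELS:
            best = max(LABELS, key=lambda p: prev[p] + math.log(TRANS[p][l]))
            score[l] = prev[best] + math.log(TRANS[best][l]) + math.log(e[l])
            ptr[l] = best
        back.append(ptr)
    path = [max(LABELS, key=score.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def extract_pairs(html):
    """Return (question, answer) pairs found in document order."""
    parser = TextNodeCollector()
    parser.feed(html)
    if not parser.nodes:
        return []
    labels = viterbi([emission(features(p, t)) for p, t in parser.nodes])
    pairs, question = [], None
    for (_, text), label in zip(parser.nodes, labels):
        if label == "Q":
            question = text
        elif label == "A" and question is not None:
            pairs.append((question, text))
            question = None
    return pairs

SAMPLE = ("<html><body><h1>FAQ</h1>"
          "<p><b>What is a DOM tree?</b></p>"
          "<p>It is a tree representation of the HTML tags in a page.</p>"
          "<p><b>Is it free?</b></p><p>Yes.</p></body></html>")
print(extract_pairs(SAMPLE))
```

In this sample the short node "Yes." would be labeled "other" by the per-node rules alone, but the transition from a preceding question makes "answer" the more probable label. That is the kind of context a first-order chain can add on top of a per-node classifier.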

