Constrained Conditional Random Fields for Semantic Annotation of Web Data

Cited by: 9
Abstract: Semantic annotation of Web data is a key step in Web information extraction. Its goal is to assign meaningful semantic labels to the data elements of an extracted Web object, and the topic has drawn increasing research attention in recent years. Conditional random fields (CRFs) are a state-of-the-art approach to sequence labeling that exploits sequence features. However, existing CRF models cannot jointly exploit existing Web database information and the logical relationships among Web data elements, which leads to low annotation accuracy. To address this, the paper proposes a constrained conditional random fields (CCRF) model. The model introduces confidence constraints and logical constraints to make effective use of existing Web databases and of the logical relationships among Web data elements. Because the Viterbi inference used by conventional CRF models cannot incorporate both kinds of constraints at once, the model instead uses an inference procedure based on integer linear programming, which brings both kinds of constraints into inference together. Experiments on real-world datasets from multiple domains show that the proposed model significantly improves the performance of Web data semantic annotation and lays a solid foundation for Web information extraction.
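To make the idea of constrained inference concrete, the following is a minimal sketch and not the paper's implementation. It scores label sequences the way a linear-chain CRF does (per-position scores plus label-transition scores) and picks the best sequence that satisfies two kinds of constraints. Exhaustive enumeration stands in here for the integer linear programming solver, which is practical only for toy inputs. The label set, the scores, the reading of a confidence constraint as "this position's label is fixed because it matched a trusted database value", and the helper names `sequence_score`, `constrained_decode`, and `at_most_one_title` are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical label set for a toy book record (an assumption, not from the paper).
LABELS = ["title", "author", "price"]


def sequence_score(labels, emission, transition):
    """Linear-chain score: sum of per-position scores plus pairwise transition scores."""
    score = sum(emission[i][y] for i, y in enumerate(labels))
    score += sum(transition[(a, b)] for a, b in zip(labels, labels[1:]))
    return score


def constrained_decode(emission, transition, trusted, logical):
    """Return (best_labels, best_score) among sequences satisfying all constraints.

    trusted: dict mapping position -> required label (assumed form of a
             confidence constraint, e.g. the value matched an existing database).
    logical: list of predicates over the whole label sequence.
    Exhaustive search is used in place of an ILP solver; both optimise the
    same objective under the same constraints, but this one is exponential.
    """
    best, best_score = None, float("-inf")
    for labels in product(LABELS, repeat=len(emission)):
        if any(labels[i] != y for i, y in trusted.items()):
            continue  # violates a confidence constraint
        if not all(rule(labels) for rule in logical):
            continue  # violates a logical constraint
        score = sequence_score(labels, emission, transition)
        if score > best_score:
            best, best_score = labels, score
    return best, best_score


def at_most_one_title(labels):
    """Example logical constraint: a record has at most one title."""
    return labels.count("title") <= 1


# Toy scores (assumed values) for a record with three data elements.
emission = [
    {"title": 2.0, "author": 1.0, "price": 0.0},
    {"title": 1.5, "author": 1.0, "price": 0.0},
    {"title": 0.0, "author": 0.5, "price": 1.0},
]
transition = {(a, b): 0.0 for a in LABELS for b in LABELS}
transition[("title", "title")] = -0.2

unconstrained = constrained_decode(emission, transition, {}, [])
with_logic = constrained_decode(emission, transition, {}, [at_most_one_title])
with_both = constrained_decode(emission, transition, {1: "title"}, [at_most_one_title])
```

With these toy scores the unconstrained decode labels two elements as titles, the logical constraint removes the duplicate, and adding the trusted label at position 1 forces the first element to take a different label. A global rule such as "at most one title" depends on the whole sequence rather than on adjacent labels, which is why plain Viterbi cannot enforce it directly and why an ILP formulation is attractive.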
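Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list, 2012, No. 2, pp. 361-371 (11 pages)
Funding: National Natural Science Foundation of China (61100167, 90818001); Natural Science Foundation of Jiangsu Province (BK2011204); Natural Science Foundation of the Jiangsu Higher Education Institutions (11KJB520019); Natural Science Foundation of Shandong Province (Y2007G24)
Keywords: semantic annotation; Web information extraction; conditional random field; integer linear programming; Web data integration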
References (14)

  • [1] Zhai Yanhong, Liu Bing. Web data extraction based on partial tree alignment [C] //Proc of the 14th Int Conf on World Wide Web. New York: ACM, 2005: 76-85.
  • [2] Haas L M. Beauty and the beast: The theory and practice of information integration [C] //Proc of the 12th Int Conf on Database Theory. Berlin: Springer, 2007: 28-43.
  • [3] Lafferty J D, McCallum A, Pereira F C. Conditional random fields: Probabilistic models for segmenting and labeling sequence data [C] //Proc of the 18th Int Conf on Machine Learning. San Francisco: Morgan Kaufmann, 2001: 282-289.
  • [4] Embley D W, Campbell D M, Jiang Y S, et al. Conceptual-model-based data extraction from multiple-record Web pages [J]. Data & Knowledge Engineering, 1999, 31(3): 227-251.
  • [5] Arlotta L, Crescenzi V, Mecca G, et al. Automatic annotation of data extracted from large Web sites [C] //Proc of the 6th Int Workshop on the Web and Databases. New York: ACM, 2003: 7-12.
  • [6] Ma Anxiang, Zhang Bin, Gao Kening, Qi Peng, Zhang Yin. Deep Web data extraction based on result pattern [J]. Journal of Computer Research and Development, 2009, 46(2): 280-288. Cited by: 15
  • [7] Nie Zaiqing, Wu Fei, Wen Jirong, et al. Extracting objects from the Web [C] //Proc of the 22nd Int Conf on Data Engineering. Piscataway, NJ: IEEE, 2006: 123-134.
  • [8] Kristjansson T, Culotta A, Viola P, et al. Interactive information extraction with constrained conditional random fields [C] //Proc of the 19th National Conf on Artificial Intelligence. Menlo Park, CA: AAAI, 2004: 412-418.
  • [9] Punyakanok V, Roth D, Yih W, et al. Semantic role labeling via integer linear programming inference [C] //Proc of the 20th Int Conf on Computational Linguistics. Morristown: Association for Computational Linguistics, 2004: 1346-1352.
  • [10] Sha F, Pereira F. Shallow parsing with conditional random fields [C] //Proc of the 2003 Conf of the American Chapter on Human Language Technology. Morristown: Association for Computational Linguistics, 2003: 134-141.

Secondary references (9)

  • [1] Pinto D, McCallum A, Wei X. Table extraction using conditional random fields [C] //Proc of the 26th Annual Int ACM SIGIR Conf on Research and Development in Information Retrieval. New York: ACM, 2003: 235-242.
  • [2] Wang Y, Hu J. A machine learning based approach for table detection on the Web [C] //Proc of the 11th Int Conf on World Wide Web. New York: ACM, 2002: 242-250.
  • [3] Wang Jiying, Lochovsky F. Data extraction and label assignment for Web databases [C] //Proc of the 12th Int Conf on World Wide Web. New York: ACM, 2003: 187-196.
  • [4] Zhai Y, Liu B. Web data extraction based on partial tree alignment [C] //Proc of the 14th Int Conf on World Wide Web. New York: ACM, 2005: 76-85.
  • [5] Liu B, Grossman R L, Zhai Yanhong. Mining data records in Web pages [C] //Proc of the 9th Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2003: 601-606.
  • [6] Liu W, Meng X, Meng W. Vision based Web data records extraction [C] //Proc of the 9th Int Workshop in Web and Databases. New York: ACM, 2006: 20-25.
  • [7] Arasu A, Garcia-Molina H. Extracting structured data from Web pages [C] //Proc of the Int Conf on Management of Data. New York: ACM, 2003: 337-348.
  • [8] Hsu J L, Liu C C, Chen Arbee L P. Efficient repeating pattern finding in music databases [C] //Proc of the 7th Int Conf on Information and Knowledge Management. New York: ACM, 1998: 281-288.
  • [9] Liu Wei, Meng Xiaofeng, Meng Weiyi. A survey of Deep Web data integration [J]. Chinese Journal of Computers, 2007, 30(9): 1475-1489. Cited by: 136

Co-citing documents: 14

Co-cited documents: 115

Citing documents: 9

Secondary citing documents: 35
