Research on Entity-based Matching Technology Between Text and XML (基于实体的文本数据与XML文档的匹配技术研究)

Abstract: Enterprises such as aircraft companies currently store large amounts of data in XML, and this data has little connection to their other business text data. In the field of heterogeneous data integration, schema matching between text data and XML documents has received little attention. This paper proposes a method for matching text data with XML documents. The method uses a two-stage algorithm: first, an entity extraction algorithm based on conditional random fields extracts entity information from the text document; then, the Entity-based Closest Semantic Fragment (ECSF) search algorithm queries the XML tree for the closest semantic fragment that covers all of the extracted entities and instances, and this fragment serves as the matching object. In the ECSF search algorithm, the entity-based closest semantic fragment is the smallest subtree of the XML tree that covers all entity and instance information, subject to the constraint that the entity corresponding to an instance must be an ancestor node of that instance. Experiments verify the feasibility and effectiveness of the proposed method and show that it achieves good matching results in terms of recall and accuracy.
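The fragment definition in the abstract (the smallest subtree covering every entity and instance, with each entity an ancestor of its instance) can be illustrated with a minimal Python sketch. This is not the authors' ECSF algorithm, whose details are not given here; it is a brute-force illustration under stated assumptions: entities are taken to be XML element names, instances are taken to be element text values, subtree size is counted in elements, and the names `covers`, `closest_fragment`, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def covers(node, pairs):
    """Return True if the subtree rooted at `node` contains every
    (entity, instance) pair, with the instance text located inside
    an element named after the entity (assumed mapping)."""
    for entity, instance in pairs:
        found = any(
            (d.text or "").strip() == instance
            for e in node.iter(entity)   # elements named after the entity
            for d in e.iter()            # the entity element and its descendants
        )
        if not found:
            return False
    return True

def closest_fragment(root, pairs):
    """Brute-force search: among all subtrees that cover the pairs,
    return the one with the fewest elements, or None if none covers them."""
    best, best_size = None, None
    for node in root.iter():
        if covers(node, pairs):
            size = sum(1 for _ in node.iter())
            if best is None or size < best_size:
                best, best_size = node, size
    return best

# Hypothetical sample document, for illustration only.
DOC = """
<fleet>
  <aircraft>
    <model>A320</model>
    <engine><type>CFM56</type></engine>
  </aircraft>
  <aircraft>
    <model>B737</model>
    <engine><type>LEAP</type></engine>
  </aircraft>
</fleet>
"""

root = ET.fromstring(DOC)
frag = closest_fragment(root, [("model", "A320"), ("engine", "CFM56")])
print(frag.tag)  # aircraft
```

The sketch re-scans each candidate subtree, so it is quadratic or worse in the size of the tree; it is meant only to make the coverage and ancestor constraints concrete, not to reflect the efficiency reported for ECSF.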

Authors: Liu Muqiang (刘木强), Yang Weidong (杨卫东)

Affiliation: School of Computer Science and Technology, Fudan University

Source: Journal of Chinese Computer Systems (小型微型计算机系统), indexed in CSCD and the Peking University Core Journals list, 2015, No. 11, pp. 2473-2478 (6 pages)

Funding: Supported by the Shanghai Key Project for High-Tech Industrialization (11-43) and the National Industry Special Project (CHIN-ARE2015-04-07)

Keywords: XML; matching technique; entity extraction; entity-based closest semantic fragment; ECSF

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]


