Abstract
Enterprises such as aircraft manufacturers currently store large amounts of data in XML, and these data are rarely linked to other business text data. In the field of heterogeneous data integration, schema matching between text data and XML documents has received little attention. This paper proposes a method for matching text data against XML documents. The method uses a two-stage algorithm: first, an entity extraction algorithm based on conditional random fields extracts entity information from the text document; then the Entity-based Closest Semantic Fragment (ECSF) search algorithm queries the XML tree for the closest semantic fragment that covers all extracted entities and instances, and this fragment is taken as the matching object. In the ECSF search algorithm, the entity-based closest semantic fragment is the minimal subtree of the XML tree that covers all entity and instance information, with the constraint that the entity corresponding to an instance must be an ancestor node of that instance. Experiments verify the feasibility and effectiveness of the proposed method and show that it achieves good matching results in terms of recall and accuracy.
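The fragment definition above (a minimal subtree covering every entity and instance, with each entity an ancestor of its instance) can be illustrated with a small brute-force sketch. This is not the paper's ECSF search algorithm, which the abstract does not spell out. It assumes that the entity/instance pairs have already been produced by the extraction stage, that an entity corresponds to an element tag, and that an instance corresponds to the text of that element or of one of its descendants. The function names `covers` and `closest_fragment` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def covers(node, entity, instance):
    """True if an element tagged `entity` in `node`'s subtree carries
    `instance` as its own text or as the text of a descendant
    (assumption: text values are treated as child nodes of their element)."""
    for cand in node.iter(entity):
        for desc in cand.iter():
            if (desc.text or "").strip() == instance:
                return True
    return False


def closest_fragment(root, pairs):
    """Brute force: return the subtree with the fewest elements that
    covers every (entity, instance) pair, or None if no subtree does."""
    best, best_size = None, None
    for node in root.iter():
        if all(covers(node, e, i) for e, i in pairs):
            size = sum(1 for _ in node.iter())
            if best is None or size < best_size:
                best, best_size = node, size
    return best


# Hypothetical sample document and extracted pairs, for illustration only.
DOC = """
<fleet>
  <aircraft>
    <model>A320</model>
    <engine><type>CFM56</type></engine>
  </aircraft>
  <aircraft>
    <model>B737</model>
    <engine><type>LEAP</type></engine>
  </aircraft>
</fleet>
"""

root = ET.fromstring(DOC)
frag = closest_fragment(root, [("model", "B737"), ("engine", "LEAP")])
print(frag.tag, frag.find("model").text)
```

Here both pairs fall inside the second `aircraft` element, so that element is returned instead of the larger `fleet` root. The sketch rescans every subtree, which is only reasonable for small documents.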
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2015, No. 11, pp. 2473-2478 (6 pages)
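Funding
Supported by the Shanghai Key Project for High-Tech Industrialization (11-43)
Supported by the National Industry-Specific Program (CHIN-ARE2015-04-07)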