Funding: Supported by the National Defense Pre-Research Foundation of China (4110105018).
Abstract: Integrating heterogeneous semi-structured Web records into a relational database is an important and challenging research topic. An improved conditional random field model is presented that combines learning from labeled samples with learning from unlabeled database records, reducing the dependence on tediously hand-labeled training data. The proposed model is applied to the problem of schema matching between the data source schema and the database schema. Experimental results on a large number of Web pages from diverse domains show the effectiveness of the approach.
Funding: Projects (2007AA01Z126, 2007AA01Z474) supported by the National High-Tech Research and Development Program of China; Project (NCET-06-0928) supported by the Program for New Century Excellent Talents in University.
Abstract: When an overall picture of the situation must be derived from distributed monitoring nodes, two critical obstacles stand in the way of integrating heterogeneous data from different monitoring sensors: heterogeneous schemas and heterogeneous instances. To tackle heterogeneous schemas, an instance-based approach to schema mapping, named the instance-based machine-learning (IML) approach, is described. To handle heterogeneous instances, a novel statistic-based clustering (SBC) approach is proposed, which uses clustering and statistical techniques to match large-scale sources holistically. Both algorithms use machine-learning and clustering techniques to improve accuracy. Experimental analysis shows that the IML approach is more precise than the SBC approach, reaching at least 81% precision and 82% recall. Simulation studies further show that SBC can handle large-scale sources holistically, with a recall of 85% when there are 38 data sources.
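The precision and recall figures quoted in this abstract are standard measures of schema-matching quality. As a minimal sketch of how such figures are typically computed, and not the authors' evaluation code, the snippet below compares a set of predicted attribute correspondences with a reference mapping. The function name `precision_recall` and all attribute names are hypothetical and chosen only for illustration.

```python
from typing import Set, Tuple

# A correspondence pairs a source attribute with a target attribute.
Correspondence = Tuple[str, str]


def precision_recall(
    predicted: Set[Correspondence],
    reference: Set[Correspondence],
) -> Tuple[float, float]:
    """Return (precision, recall) of predicted correspondences
    measured against a reference (hand-checked) mapping."""
    true_positives = len(predicted & reference)
    # Precision: share of predicted correspondences that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference correspondences that were found.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical sensor schema attributes, for illustration only.
reference = {
    ("temp_c", "temperature"),
    ("ts", "timestamp"),
    ("node", "sensor_id"),
    ("hum", "humidity"),
}
predicted = {
    ("temp_c", "temperature"),
    ("ts", "timestamp"),
    ("node", "sensor_id"),
    ("hum", "pressure"),  # an incorrect match
}

p, r = precision_recall(predicted, reference)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here three of the four predicted correspondences are correct and three of the four reference correspondences are recovered, so both measures come out at 0.75.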
Abstract: Data exchange is a goal-oriented social communications system implemented through computerized technology. Data definition languages (DDLs) provide the syntax for communicating, within and between organizations, illocutionary acts such as informing, ordering and warning. Data exchange results in a meaning-preserving mapping between an ensemble (a constrained variety) and its external (unconstrained) variety. Research on unsupervised structured and semi-structured data exchange has not produced any significant successes over the past fifty years. As a step towards finding a solution, this article proposes a new look at data exchange, using the principles of complex adaptive systems (CAS) to analyze current shortcomings and to propose a direction that may lead to a workable and mathematically grounded solution. Three CAS attributes are key to this research: variety, tension and entropy. We use them to show that older and contemporary DDLs are identical in their core, which explains why even XML and ontologies have failed to create a fully automated data exchange mechanism. We then show that it is possible to construct a radically different DDL that overcomes existing data exchange limitations, because its variety, tension and entropy differ from those of existing solutions. The article has these major parts: a definition of the key CAS attributes; a quantitative examination of representative old and new DDLs using these attributes; a presentation of the results and their pessimistic ramifications; and a section that proposes a new theoretical way to construct DDLs based entirely on CAS principles, thus enabling unsupervised data exchange. The theory is then tested, showing very promising results.
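Abstract: To address the limitations of existing schema matching methods based on schema structure, a schema matching model that exploits both schema structure information and existing matching knowledge, SKM (schema and reused knowledge based matching model), is proposed. In this model, semantic matching inference is realized by borrowing the process by which neurons in a neural network influence one another; deep semantic relationships between schema elements are mined by reusing existing matching knowledge; the matching threshold is determined by automatically narrowing the uncertain threshold interval on the basis of existing matching knowledge, which effectively reduces manual intervention; a simple method for determining the matching relationships between schema elements is given; and an adaptive iterative model is used to further mine and refine the existing matching knowledge. Experimental results show that the SKM model is feasible.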