Abstract
Query interface schema extraction is a precondition for Deep Web data integration. A query interface schema generally consists of a set of domain-related attributes, and each attribute is formed by a single element or a combination of multiple elements. Current research on attribute extraction mostly addresses the single-element case, and work on multi-element attributes is scarce. To handle multi-element attributes, a method for query interface schema extraction based on the DOM structure of the query interface is proposed. The method parses the query interface into a DOM and extracts the form elements from the corresponding DOM nodes. It then applies two-phase clustering to the extracted set of elements, mines the combination relationships among them, and combines elements to form attributes. The method extracts both single-element and multi-element attributes well, and the experimental results demonstrate its effectiveness.
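The pipeline described above (parse the interface, collect form elements from DOM nodes, group elements into attributes) can be illustrated with a minimal sketch. The abstract does not specify the clustering features or algorithm, so the sketch substitutes a much simpler rule as an assumption: consecutive form elements that share the same parent DOM node are treated as one attribute. The names FormElementCollector and group_into_attributes, the sample form, and the decision to skip hidden and submit inputs are all hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser
from itertools import groupby

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"input", "br", "img", "hr", "meta", "link", "area",
             "base", "col", "embed", "source", "track", "wbr"}
FORM_TAGS = {"input", "select", "textarea"}
SKIPPED_INPUT_TYPES = {"hidden", "submit"}  # assumption: not query attributes


class FormElementCollector(HTMLParser):
    """Records each form element together with the id of its parent node."""

    def __init__(self):
        super().__init__()
        self._stack = []    # (tag, node_id) for every currently open node
        self._next_id = 0
        self.elements = []  # (parent_node_id, tag, name) in document order

    def handle_starttag(self, tag, attrs):
        node_id = self._next_id
        self._next_id += 1
        if tag in FORM_TAGS:
            attr_map = dict(attrs)
            if attr_map.get("type") not in SKIPPED_INPUT_TYPES:
                parent = self._stack[-1][1] if self._stack else None
                self.elements.append((parent, tag, attr_map.get("name") or ""))
        if tag not in VOID_TAGS:
            self._stack.append((tag, node_id))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates unclosed tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def group_into_attributes(html):
    """Group consecutive form elements that share a parent node (simplified
    stand-in for the paper's two-phase clustering, which is not specified)."""
    collector = FormElementCollector()
    collector.feed(html)
    collector.close()
    runs = groupby(collector.elements, key=lambda e: e[0])
    return [[name for _, _, name in run] for _, run in runs]


SAMPLE_FORM = """
<form>
  <div>Title <input type="text" name="title"></div>
  <div>Price <input type="text" name="price_min"> to
       <input type="text" name="price_max"></div>
  <div>Format <select name="format"><option>Any</option></select></div>
  <input type="submit" value="Search">
</form>
"""

if __name__ == "__main__":
    for attribute in group_into_attributes(SAMPLE_FORM):
        print(attribute)
```

On the sample form, the two price inputs end up in one group (a multi-element attribute), while the title and format fields each form a single-element attribute. A real interface would need richer evidence than a shared parent, such as label text or layout distance, which is presumably where the clustering described in the abstract comes in.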
Source
Journal of Guilin University of Electronic Technology (《桂林电子科技大学学报》)
2012, No. 6, pp. 468-472 (5 pages)
Funding
National Natural Science Foundation of China (61163057)