Data Record Extraction Using a Clustering Algorithm Based on Tree Edit Distance

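Abstract: This paper studies how to extract data records from list pages. The system works in two stages. The first stage combines three heuristics to identify the root node of the main data region. The second stage separates the data records: a new clustering algorithm based on tree edit distance is proposed to reduce the number of candidate segmentations, after which a similarity score is computed by formula to select the best segmentation. Tests on a large number of web pages from different domains show that the method achieves high accuracy.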
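The second stage can be illustrated with a minimal sketch. The abstract does not give the paper's distance definition, clustering rule, or similarity formula, so everything below is an assumption made for illustration: the `Node` type, a top-down (Selkow-style) tree edit distance swapped in for simplicity, the normalization by combined tree size, the greedy first-fit clustering, and the `threshold` value of 0.3 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A DOM-like node: a tag name plus an ordered tuple of child nodes."""
    tag: str
    children: tuple = ()


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Top-down tree edit distance (assumed variant, not the paper's).

    Relabeling a root costs 1; deleting or inserting a child costs the
    size of that child's whole subtree; matched children recurse.
    """
    root_cost = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                dp[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                dp[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return root_cost + dp[m][n]


def normalized_distance(a: Node, b: Node) -> float:
    """Distance scaled to [0, 1] by the combined size (an assumption)."""
    return tree_distance(a, b) / (size(a) + size(b))


def cluster_subtrees(subtrees, threshold=0.3):
    """Greedy first-fit clustering: join the first cluster whose
    representative (its first member) is within `threshold`."""
    clusters, labels = [], []
    for idx, t in enumerate(subtrees):
        for cid, members in enumerate(clusters):
            if normalized_distance(subtrees[members[0]], t) <= threshold:
                members.append(idx)
                labels.append(cid)
                break
        else:
            clusters.append([idx])
            labels.append(len(clusters) - 1)
    return labels


# Hypothetical children of a main data region: records separated by <hr>.
rec = Node("div", (Node("img"), Node("a"), Node("span")))
rec_short = Node("div", (Node("img"), Node("a")))  # same record type, one field missing
sep = Node("hr")
labels = cluster_subtrees([rec, sep, rec_short, sep, rec])
print(labels)
```

Under these assumptions, each child subtree of the main data region receives a cluster label, and the label sequence exposes a repeating pattern. Candidate segmentations then only need to be considered at positions consistent with that pattern, which is one plausible reading of how clustering reduces the number of candidates before the final similarity scoring.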

Authors: 宫丽娜 (Gong Lina), 祝美莲 (Zhu Meilian)

Affiliations: 枣庄学院 (Zaozhuang University); 中国石油大学(华东) (China University of Petroleum (East China))

Source: 《赤峰学院学报（自然科学版）》 (Journal of Chifeng University (Natural Science Edition)), 2013, No. 12, pp. 28-30 (3 pages)

Keywords: main data region; data record extraction; tree edit distance; clustering algorithm

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]


References (8)

1. A.H.F. Laender, B.A. Ribeiro-Neto, A. Soares da Silva, J.S. Teixeira, A brief survey of web data extraction tools, ACM SIGMOD Record 31 (2) (2002) 84-93.
2. V. Crescenzi, G. Mecca, P. Merialdo, ROADRUNNER: towards automatic data extraction from large web sites, in: Proceedings of the 2001 International VLDB Conference, (2001) 109-118.
3. B. Liu, R. Grossman, Y. Zhai, Mining data records in Web pages, KDD, (2003) 601-606.
4. Y. Zhai, B. Liu, Structured data extraction from the web based on partial tree alignment, IEEE Transactions on Knowledge and Data Engineering 18 (12) (2006) 1614-1628.
5. A. Arasu, H. Garcia-Molina, Extracting structured data from web pages, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, (2003).
6. C. Chang, S. Lui, IEPAD: information extraction based on pattern discovery, in: Proceedings of the 2001 International World Wide Web Conference, (2001) 681-688.
7. B. Liu, Y. Zhai, NET: System for extracting Web data from flat and nested data records, in: Proceedings of the Conference on Web Information Systems Engineering, (2005) 487-495.
8. Manuel Álvarez, Alberto Pan, Juan Raposo, Fernando Bellas, Fidel Cacheda, Extracting lists of data records from semi-structured web pages, Data & Knowledge Engineering 64 (2008) 491-509.
