Automatic Duplication Deletion Method for Structured Web Data (结构化Web数据的自动去重方法)

Abstract: Based on the characteristics of web pages that carry structured data, this paper proposes a new strategy for discovering valid data fields and, building on it, designs a learning-based automatic deduplication method. A sample set of web pages is clustered and a wrapper is generated for each class of pages; the valid data fields in each wrapper are then identified and mapped to one another. Whether two web pages are duplicates is determined by computing the similarity of the contents of their valid data fields. Experiments show that the method achieves good recall and accuracy when deduplicating structured web data.
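The last step of the abstract (comparing the contents of mapped valid data fields) can be illustrated with a minimal sketch. It assumes that clustering, wrapper generation, and valid-field identification have already produced one dictionary of field values per page. The abstract does not state the similarity measure, the field weighting, or the decision threshold, so the use of `difflib.SequenceMatcher`, the unweighted average, the `0.8` threshold, and all function and field names below are illustrative assumptions rather than the authors' method.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values.

    Assumption: a character-level ratio stands in for whatever
    measure the paper actually uses.
    """
    a, b = a.strip().lower(), b.strip().lower()
    if not a and not b:
        return 1.0  # two empty values are treated as identical
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(rec_a: dict, rec_b: dict, field_map: dict) -> float:
    """Unweighted mean similarity over the mapped valid data fields.

    field_map maps a field name in rec_a to the corresponding
    field name in rec_b (the "mapping" step of the abstract).
    """
    scores = [
        field_similarity(rec_a.get(fa, ""), rec_b.get(fb, ""))
        for fa, fb in field_map.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


def is_duplicate(rec_a: dict, rec_b: dict, field_map: dict,
                 threshold: float = 0.8) -> bool:
    """Flag two pages as duplicates when their mean field similarity
    reaches the (hypothetical) threshold."""
    return record_similarity(rec_a, rec_b, field_map) >= threshold


# Hypothetical records extracted from pages of two different sites.
page_a = {"title": "Data Mining: Concepts and Techniques",
          "author": "Jiawei Han", "price": "59.00"}
page_b = {"name": "Data Mining: Concepts and Techniques",
          "writer": "Jiawei Han", "cost": "59.00"}
page_c = {"name": "Introduction to Algorithms",
          "writer": "Thomas Cormen", "cost": "120.50"}

# Hypothetical mapping between the valid fields of the two wrappers.
mapping = {"title": "name", "author": "writer", "price": "cost"}

print(is_duplicate(page_a, page_b, mapping))
print(is_duplicate(page_a, page_c, mapping))
```

Comparing only the mapped valid fields, rather than the full page text, keeps navigation bars, advertisements, and other template content out of the duplicate decision, which is the motivation the abstract gives for identifying valid data fields first.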

Authors: He Sheng (贺晟), Cheng Jiaxing (程家兴), Wang Weiwei (王为为), Cai Xinbao (蔡欣宝)

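Affiliations: Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University; Institute of Intelligent Information Processing and Application, Soochow University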

Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2010, No. 12, pp. 12-14, 54 (4 pages in total)

Funding: National Natural Science Foundation of China (60273043); Anhui University Graduate Innovation Fund (20073053)

Keywords: Duplication deletion; Document object model (DOM); Clustering; Structured data

Classification number: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
