Abstract
The development of pre-trained language models has created a surge in demand for web data, which tends to be highly redundant and repetitive and must be deduplicated before it can be used effectively for model training. Existing deduplication algorithms can remove similar and identical text, but their low computational efficiency makes them hard to apply to large-scale text data. This paper proposes a deduplication algorithm tailored for large-scale text data that adopts a local-first, global-later strategy to greatly improve computational efficiency. Experimental results show that the algorithm deduplicates 371 GB of data within 50 h, a substantial efficiency improvement over existing algorithms.
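As a rough illustration of the techniques named in the keywords (MinHash and locality-sensitive hashing) and of the local-first, global-later strategy described above, the following is a minimal, self-contained Python sketch. It is not the authors' implementation: the shingle length, number of permutations, band/row split, shard layout, and the omitted candidate-verification step are all illustrative assumptions.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 128          # number of MinHash permutations (assumed value)
BANDS, ROWS = 32, 4     # LSH banding: 32 bands x 4 rows = 128 (assumed split)
SHINGLE = 5             # character shingle length (assumed value)

def minhash_signature(text):
    """Compute a MinHash signature over character shingles of the text."""
    shingles = {text[i:i + SHINGLE] for i in range(max(1, len(text) - SHINGLE + 1))}
    sig = []
    for seed in range(NUM_PERM):
        # Simulate one hash permutation by salting MD5 with the seed.
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles))
    return tuple(sig)

def dedup_shard(docs):
    """Drop near-duplicates inside one batch of (doc_id, text) pairs via LSH banding."""
    buckets = defaultdict(list)   # band key -> ids of kept docs sharing that band
    kept = []
    for doc_id, text in docs:
        sig = minhash_signature(text)
        band_keys = [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]
        if any(buckets[k] for k in band_keys):
            # Shares at least one band with an already-kept doc: treat as a
            # near-duplicate. (A full implementation would verify the candidate
            # pair, e.g. by estimated Jaccard similarity, before discarding it.)
            continue
        for k in band_keys:
            buckets[k].append(doc_id)
        kept.append((doc_id, text))
    return kept

# Local pass inside each shard, then a global pass over the merged survivors.
shards = [
    [(0, "the cat sat on the mat today"), (1, "the cat sat on the mat today!")],
    [(2, "an entirely different sentence about deduplication")],
]
survivors = []
for shard in shards:
    survivors.extend(dedup_shard(shard))   # local deduplication per shard
unique_docs = dedup_shard(survivors)       # global deduplication over survivors
print([doc_id for doc_id, _ in unique_docs])
```

With this banding split, two documents are flagged as duplicates only if their signatures agree on all rows of at least one band, so highly similar texts collide with high probability while dissimilar ones rarely do; running the shard-level pass first shrinks the input that the global pass must examine.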
Authors
Shen Junyu; Li Dongwen; Zhong Zhenyu; Zhang Yuzhi (College of Software, Nankai University, Tianjin 300350, China)
Source
Acta Scientiarum Naturalium Universitatis Nankaiensis (南开大学学报(自然科学版))
CAS
CSCD
Peking University Core Journals (北大核心)
2023, No. 6, pp. 29-35 (7 pages)
Funding
National Key Research and Development Program of China (2021YFB0300104).
Keywords
text deduplication
MinHash
locality-sensitive hashing