Abstract
The development of pre-trained language models has created a surge in demand for web data, which tends to be highly redundant and repetitive and must be deduplicated before it can be used effectively for model training. Existing deduplication algorithms can remove similar and identical text, but their low computational efficiency makes them hard to apply to large-scale text data. This paper proposes a deduplication algorithm tailored for large-scale text data that adopts a local-first, global-later strategy to greatly improve computational efficiency. Experimental results show that the algorithm deduplicates 371 GB of data within 50 h, a substantial efficiency improvement over existing algorithms.
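As a rough illustration of the techniques named in the keywords (MinHash and locality-sensitive hashing) and of the local-first, global-later strategy described above, the following is a minimal, self-contained Python sketch. It is not the authors' implementation: the shingle length, number of permutations, band/row split, shard layout, and the omitted candidate-verification step are all illustrative assumptions.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 128          # number of MinHash permutations (assumed value)
BANDS, ROWS = 32, 4     # LSH banding: 32 bands x 4 rows = 128 (assumed split)
SHINGLE = 5             # character shingle length (assumed value)

def minhash_signature(text):
    """Compute a MinHash signature over character shingles of the text."""
    shingles = {text[i:i + SHINGLE] for i in range(max(1, len(text) - SHINGLE + 1))}
    sig = []
    for seed in range(NUM_PERM):
        # Simulate one hash permutation by salting MD5 with the seed.
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles))
    return tuple(sig)

def dedup_shard(docs):
    """Drop near-duplicates inside one batch of (doc_id, text) pairs via LSH banding."""
    buckets = defaultdict(list)   # band key -> ids of kept docs sharing that band
    kept = []
    for doc_id, text in docs:
        sig = minhash_signature(text)
        band_keys = [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]
        if any(buckets[k] for k in band_keys):
            # Shares at least one band with an already-kept doc: treat as a
            # near-duplicate. (A full implementation would verify the candidate
            # pair, e.g. by estimated Jaccard similarity, before discarding it.)
            continue
        for k in band_keys:
            buckets[k].append(doc_id)
        kept.append((doc_id, text))
    return kept

# Local pass inside each shard, then a global pass over the merged survivors.
shards = [
    [(0, "the cat sat on the mat today"), (1, "the cat sat on the mat today!")],
    [(2, "an entirely different sentence about deduplication")],
]
survivors = []
for shard in shards:
    survivors.extend(dedup_shard(shard))   # local deduplication per shard
unique_docs = dedup_shard(survivors)       # global deduplication over survivors
print([doc_id for doc_id, _ in unique_docs])
```

With this banding split, two documents are flagged as duplicates only if their signatures agree on all rows of at least one band, so highly similar texts collide with high probability while dissimilar ones rarely do; running the shard-level pass first shrinks the input that the global pass must examine.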
Authors
Shen Junyu; Li Dongwen; Zhong Zhenyu; Zhang Yuzhi (College of Software, Nankai University, Tianjin 300350, China)
Source
Acta Scientiarum Naturalium Universitatis Nankaiensis (南开大学学报(自然科学版))
CAS
CSCD
Peking University Core Journals (北大核心)
2023, No. 6, pp. 29-35 (7 pages)
Funding
National Key Research and Development Program of China (2021YFB0300104).
Keywords
text deduplication
MinHash
locality-sensitive hashing