Abstract
Advances in data collection technology have led to rapid growth in dataset size. The large scale and high complexity of the data give rise to serious data quality problems, which makes data cleaning a necessary and important step in data activities. To effectively reduce human annotation costs while maintaining cleaning accuracy, an iterative data cleaning method with human involvement (IDCHI) is proposed. In its detection module, the method introduces a data selection optimization method that gives the classifier high accuracy from the initial stage. It further introduces a method for selecting the data to be manually annotated, which effectively reduces the amount of data that requires manual annotation. Experimental results show that the proposed method cleans erroneous data effectively and efficiently.
Authors
刘一达
丁小欧
王宏志
杨东华
LIU Yida; DING Xiaoou; WANG Hongzhi; YANG Donghua (School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China)
Source
Big Data Research (《大数据》)
2023, Issue 4, pp. 59-68 (10 pages)
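Funding
National Key Research and Development Program of China (No. 2021YFB3300502)
National Natural Science Foundation of China (No. 62202126, No. 62232005)
China Postdoctoral Science Foundation (No. 2022M720957)
Heilongjiang Provincial Postdoctoral General Funding Project (No. LBH-Z21137)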
Keywords
data cleaning
human-in-the-loop
iteration
mini-batch gradient descent