Abstract
To address the small dataset size, unbalanced sample distribution, and poor named entity recognition (NER) performance encountered when deep learning models are used for text mining of safety inspection minutes, a new data augmentation method for NER is proposed. First, the named entities in the dataset are separated out and randomly replaced with entities of the same type, which prevents data augmentation from corrupting named entity information and makes the distribution of named entities more uniform. Then, NER performance is further improved by optimizing the noise data and ratio parameters applied to the remaining, non-entity text. Finally, the separated data are labeled automatically and recombined, which avoids manually annotating a large amount of data. The results show that the method quickly alleviates both the small data volume and the uneven distribution of named entities in the dataset. Compared with the AEDA (An Easier Data Augmentation) method, it achieves better recognition results on datasets such as health, safety and environment (HSE) inspection minutes, raising the model's comprehensive evaluation index on onefold-expanded data from 92.83% to 97.23%. The spatial distribution patterns and strong association rules of safety hazards during building construction can also be obtained.
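The three steps in the abstract (separate entities, replace them with same-type entities, add noise to the rest, then relabel and recombine) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes BIO-tagged token sequences, assumes the noise is AEDA-style random punctuation insertion, and uses hypothetical names (`split_segments`, `build_entity_pool`, `augment`, `noise_ratio`) and a hypothetical default ratio of 0.3.

```python
import random

# Punctuation set used by AEDA; applying it to non-entity text here is an assumption.
PUNCT = [".", ";", "?", ":", "!", ","]


def split_segments(tokens, tags):
    """Group a BIO-tagged sentence into (label, tokens) segments.

    label is the entity type (e.g. "LOC") or "O" for non-entity text.
    """
    segments = []
    for tok, tag in zip(tokens, tags):
        label = "O" if tag == "O" else tag[2:]
        starts_new = (
            tag.startswith("B-") or not segments or segments[-1][0] != label
        )
        if starts_new:
            segments.append((label, [tok]))
        else:
            segments[-1][1].append(tok)
    return segments


def build_entity_pool(dataset):
    """Collect the distinct entities of each type across the whole dataset."""
    pool = {}
    for tokens, tags in dataset:
        for label, seg in split_segments(tokens, tags):
            if label != "O":
                pool.setdefault(label, set()).add(tuple(seg))
    # Sorted lists keep sampling reproducible for a seeded generator.
    return {label: sorted(ents) for label, ents in pool.items()}


def augment(tokens, tags, pool, rng, noise_ratio=0.3):
    """Return one augmented (tokens, tags) pair.

    Entities are swapped for a random same-type entity and relabeled
    automatically; only non-entity segments receive punctuation noise.
    """
    new_tokens, new_tags = [], []
    for label, seg in split_segments(tokens, tags):
        if label == "O":
            seg = list(seg)
            n_insert = rng.randint(1, max(1, int(noise_ratio * len(seg))))
            for _ in range(n_insert):
                seg.insert(rng.randint(0, len(seg)), rng.choice(PUNCT))
            new_tokens += seg
            new_tags += ["O"] * len(seg)
        else:
            repl = list(rng.choice(pool[label]))
            new_tokens += repl
            new_tags += ["B-" + label] + ["I-" + label] * (len(repl) - 1)
    return new_tokens, new_tags


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the HSE inspection minutes.
    data = [
        (["scaffold", "on", "floor", "3", "lacks", "guardrail"],
         ["B-EQP", "O", "B-LOC", "I-LOC", "O", "B-HAZ"]),
        (["crane", "near", "gate", "has", "frayed", "cable"],
         ["B-EQP", "O", "B-LOC", "O", "B-HAZ", "I-HAZ"]),
    ]
    pool = build_entity_pool(data)
    print(augment(*data[0], pool, random.Random(0)))
```

Sampling from a deduplicated pool gives rare entities the same chance of being chosen as frequent ones, which is one plausible way to obtain the more uniform entity distribution the abstract describes. Because tags are regenerated from the segment labels, no manual annotation is needed for the augmented sentences.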
Authors
XIA Zhanjie (夏占杰)
ZHANG Beike (张贝克)
GAO Dong (高东)
School of Information and Technology, Beijing University of Chemical Technology, Beijing 100029, China
Source
China Safety Science Journal (《中国安全科学学报》)
Indexed in: CAS, CSCD, Peking University Core Journals (北大核心)
2022, No. 12, pp. 53-62 (10 pages)
Funding
Supported by the National Natural Science Foundation of China (61703026).