Abstract
To address the imbalanced label distribution, excessively long clinical notes, and large label space in the International Classification of Diseases (ICD) coding task, this paper proposes an ICD coding classification method based on data augmentation and dilated convolution. First, the method introduces the pre-trained model BioLinkBERT, which is trained on biomedical-domain text in an unsupervised manner, to alleviate the domain mismatch problem. Second, it applies the Mixup data augmentation technique to augment the hidden representations, increasing data diversity and improving classification robustness to address the imbalanced label distribution. Finally, it uses multi-granularity dilated convolution to capture long-range dependencies in the text, so that long inputs do not degrade model performance. Experiments on two subsets of the MIMIC-III dataset, with comparisons against multiple methods, show that the proposed model improves on the baseline model by 0.4% to 1.5% in F1 score and by 1.2% to 1.6% in precision@k. The study thus provides an effective solution to the challenges of ICD coding classification.
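The abstract names two mechanisms without giving their formulas: Mixup applied to hidden representations, and dilated convolution at several granularities. The sketch below illustrates both in plain Python under stated assumptions. It follows the standard Mixup formulation (a mixing coefficient drawn from Beta(alpha, alpha), then linear interpolation of features and labels) and a textbook valid-mode dilated 1D convolution. The function names, `alpha=0.4`, and the dilation rates `(1, 2, 4)` are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def sample_lambda(alpha=0.4, rng=random):
    """Draw a mixing coefficient from Beta(alpha, alpha) (standard Mixup).

    alpha=0.4 is an illustrative default, not a value from the paper.
    """
    return rng.betavariate(alpha, alpha)


def mixup(h_a, y_a, h_b, y_b, lam):
    """Linearly interpolate two hidden vectors and their multi-hot labels."""
    h = [lam * a + (1.0 - lam) * b for a, b in zip(h_a, h_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return h, y


def dilated_conv1d(seq, kernel, dilation):
    """Valid-mode dilated 1D convolution (cross-correlation) over scalars.

    Kernel taps are spaced `dilation` positions apart, so the receptive
    field is (len(kernel) - 1) * dilation + 1 positions wide.
    """
    span = (len(kernel) - 1) * dilation
    return [
        sum(w * seq[t + i * dilation] for i, w in enumerate(kernel))
        for t in range(len(seq) - span)
    ]


def multi_granularity(seq, kernel, dilations=(1, 2, 4)):
    """Apply one kernel at several dilation rates; returns {rate: output}."""
    return {d: dilated_conv1d(seq, kernel, d) for d in dilations}


if __name__ == "__main__":
    lam = sample_lambda()
    mixed_h, mixed_y = mixup([1.0, 0.0], [1, 0], [0.0, 1.0], [0, 1], lam)
    print(lam, mixed_h, mixed_y)
    print(multi_granularity(list(range(1, 12)), [1, 1, 1]))
```

With a kernel of width 3, dilation rates 1, 2, and 4 cover 3, 5, and 9 input positions respectively, which is how stacking or combining several rates lets a convolutional encoder see longer spans of a clinical note without enlarging the kernel.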
Authors
闫婧
赵迪
孟佳娜
林鸿飞
Yan Jing; Zhao Di; Meng Jiana; Lin Hongfei (School of Computer Science & Engineering, Dalian Minzu University, Dalian, Liaoning 116600, China; School of Computer Science & Technology, Dalian University of Technology, Dalian, Liaoning 116024, China; Dalian Yongjia Electronic Technology Co., Dalian, Liaoning 116024, China)
Source
Application Research of Computers (《计算机应用研究》)
CSCD
PKU Core Journal (北大核心)
2024, No. 11, pp. 3329-3336 (8 pages)
Funding
Supported by the Natural Science Foundation of Liaoning Province (2022-BS-104).