Abstract
[Objective] This paper proposes a method for recognizing the intensity of medical information query intentions, based on task knowledge fusion and text data enhancement (i.e., data augmentation). It aims to improve recognition accuracy and to address two problems: query word vectors are hard to represent, and labeled data sets are scarce. [Methods] For text data enhancement, we used the SimBERT model to expand the small-sample task data set. For text representation, we incrementally pre-trained the BERT model on a corpus of medical information query texts to obtain MQ-BERT (Medical-Query BERT), a model fused with task knowledge. For text classification, we added Bi-LSTM and other models after MQ-BERT and compared classification performance before and after text data enhancement. [Results] MQ-BERT reached an F-Score of 92.22%, exceeding the best result of MC-BERT, proposed by the Alibaba team, on the same task data set (F-Score = 87.5%). After text data enhancement, classification performance improved further: the model based on MQ-BERT and Bi-LSTM achieved the best result, an F-Score of 95.34%, which is 7.84 percentage points higher than MC-BERT. [Limitations] The data selection method used in incremental pre-training could be further optimized. [Conclusions] Task knowledge fusion and text data enhancement effectively improve the accuracy of recognizing the intensity of medical query intentions. Query results should be presented differently for query intentions of different intensity, in order to improve the query accuracy of medical information retrieval systems and better meet users' medical information needs.
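The reported gain over MC-BERT is a difference in percentage points, not a relative percentage. The minimal sketch below reproduces that arithmetic from the figures in the abstract. It also shows a general F-beta formula, on the assumption (not stated in the abstract) that "F-Score" means the usual F1 measure. The function names `f_score` and `point_gain` are hypothetical and chosen only for illustration.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """General F-beta score; beta=1 gives F1 (assumed here, not confirmed by the source)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def point_gain(new_pct: float, old_pct: float) -> float:
    """Absolute difference between two scores given in percent (percentage points)."""
    return new_pct - old_pct


# Figures taken from the abstract.
mc_bert = 87.5              # best MC-BERT F-Score (%)
mq_bert_bilstm = 95.34      # MQ-BERT + Bi-LSTM after text data enhancement (%)

gain_points = point_gain(mq_bert_bilstm, mc_bert)   # absolute gain in percentage points
gain_relative = gain_points / mc_bert * 100          # relative gain in percent

print(f"gain: {gain_points:.2f} percentage points")
print(f"relative gain: {gain_relative:.2f}%")
```

The absolute gain matches the 7.84 figure in the abstract, while the relative gain is a different, larger number, which is why the unit matters.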
Authors
赵一鸣
潘沛
毛进
Zhao Yiming; Pan Pei; Mao Jin (Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China; School of Information Management, Wuhan University, Wuhan 430072, China; Big Data Institute, Wuhan University, Wuhan 430072, China; National Demonstration Center for Experimental Library and Information Science Education, Wuhan University, Wuhan 430072, China)
Source
《数据分析与知识发现》
CSSCI
CSCD
PKU Core (Peking University Core Journals)
2023, Issue 2, pp. 38-47 (10 pages)
Data Analysis and Knowledge Discovery
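Funding
This work is one of the research outcomes of the following projects:
National Natural Science Foundation of China (Grant Nos. 71874130, 72274146)
Humanities and Social Sciences Research Project of the Ministry of Education (Grant No. 18YJC870026)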
Keywords
Medical Information Query
Intention Intensity Recognition
Text Data Enhancement
Task Knowledge Fusion
BERT Model