Abstract
AIM To construct negative-case analysis and category-similarity skewness analysis methods, based on a text classification algorithm that uses single-character and exhaustive tokenization, in order to find and correct possible problems in the Medical Dictionary for Regulatory Activities (MedDRA). METHODS Using the Chinese version of MedDRA 25.1, text vectors of the terms were generated by single-character and exhaustive tokenization. Class feature vectors were generated with inverse document frequency and chi-square weighting, and cosine similarity was computed between each text vector and the class feature vectors. In the negative-case analysis, the class with the maximum similarity was taken as the predicted category of a term. For the exhaustive-tokenization vectors, skewness was calculated over the top 20 cosine similarities of each term; a term whose similarity distribution was negatively skewed under both kinds of class features was flagged as easily confused. The calculated results were checked and analyzed manually. RESULTS The negative-case analysis found 594 misclassified lowest level terms (LLTs): 346 were synonyms sharing the same preferred term (PT), 154 were caused by Roman numerals, and 94 were caused by added or dropped characters or changes in character order, of which 16 were suspected translation errors or inaccuracies. The category-similarity skewness analysis found 165 literally confusable medical terms among LLTs and PTs, the most typical being terms related to ovarian germ cells and lymphoma. CONCLUSION Negative-case analysis of a text classification algorithm can be used to infer errors in the dictionary data itself, and category-similarity skewness analysis can find medical terms that are easily confused at the literal level.
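The pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: "exhaustive tokenization" is assumed to mean all contiguous substrings of a term, the IDF form log(1 + N/df) is one common variant chosen here, chi-square weighting is omitted, and all function names and the toy term list are hypothetical (the toy data is invented for illustration and is not taken from MedDRA).

```python
import math
from collections import Counter, defaultdict


def exhaustive_tokens(term):
    """All contiguous substrings of a term (length-1 substrings are single characters)."""
    n = len(term)
    return [term[i:j] for i in range(n) for j in range(i + 1, n + 1)]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def class_vectors(labelled_terms):
    """Build IDF-weighted class feature vectors from (term, class_label) pairs.

    Assumption: IDF is computed over classes as log(1 + n_classes / df).
    """
    class_tf = defaultdict(Counter)
    for term, label in labelled_terms:
        class_tf[label].update(exhaustive_tokens(term))
    n_classes = len(class_tf)
    df = Counter(tok for tf in class_tf.values() for tok in tf)
    return {
        label: {tok: c * math.log(1 + n_classes / df[tok]) for tok, c in tf.items()}
        for label, tf in class_tf.items()
    }


def classify(term, vectors):
    """Return (best_label, similarities); the maximum similarity decides the class."""
    tv = Counter(exhaustive_tokens(term))
    sims = {label: cosine(tv, vec) for label, vec in vectors.items()}
    return max(sims, key=sims.get), sims


def negative_cases(labelled_terms, vectors):
    """Terms whose predicted class differs from the class the dictionary assigns."""
    return [
        (term, label, classify(term, vectors)[0])
        for term, label in labelled_terms
        if classify(term, vectors)[0] != label
    ]


def skewness(values):
    """Population (moment-based) skewness; 0.0 when the values have no variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5 if m2 else 0.0


def top_k_skewness(sims, k=20):
    """Skewness of the k largest similarities of one term (k=20 as in the abstract)."""
    top = sorted(sims.values(), reverse=True)[:k]
    return skewness(top)


# Invented toy data: (lower-level term, class it is assigned to).
toy = [
    ("headache", "Headache"),
    ("head pain", "Headache"),
    ("nausea", "Nausea"),
    ("nauseous", "Nausea"),
]
vectors = class_vectors(toy)
label, sims = classify("headache", vectors)
print(label, negative_cases(toy, vectors), top_k_skewness(sims))
```

In this sketch, a negative case is simply a term whose highest-similarity class is not the class it is filed under, and a term would be flagged as easily confused when `top_k_skewness` is negative. With a real dictionary, the abstract's criterion would require the skewness to be negative under both weighting schemes, which needs a second set of class vectors built with chi-square weights.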
Authors
韩兵
杨桂秀
磨筱垚
HAN Bin; YANG Gui-xiu; MO Xiao-yao (Beijing Shijitan Hospital, Capital Medical University, BEIJING 100038, China; Beijing PVing Medical Technology Co., Ltd., BEIJING 100120, China)
Source
《中国新药与临床杂志》
CAS
CSCD
PKU Core (北大核心)
2023, Issue 5, pp. 331-336 (6 pages)
Chinese Journal of New Drugs and Clinical Remedies
Keywords
MedDRA
terminology
algorithm
text mining
negative cases analysis
similarity