Abstract
[Purpose/significance] Few knowledge-discovery studies on entity associations in medical texts, such as disease-gene relations, take topics into account, so the deep semantic associations of medical domain knowledge at the topic level remain insufficiently revealed. To address this, we propose a semantic similarity calculation method that combines full text with domain knowledge topics. [Method/process] Taking the full text of oncology journals as the research object, we use the TWE model to obtain embedded representations of word vectors and topic vectors, and compute similarity within a Siamese Network framework that combines the text with domain knowledge topics. [Result/conclusion] Experiments show that the proposed similarity calculation method achieves a predicted F value of 94% on the validation set. Finally, through cluster analysis of the test set data, diseases and their associated genes are analyzed from the perspectives of high, medium, and low frequency and of genes not yet in registered clinical trials, identifying current research hotspots and target genes that may become research hotspots in the future.
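The similarity step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how word and topic vectors are combined or which layers the network uses, so the sketch assumes simple concatenation, a mean-pooling encoder shared by both branches, and cosine similarity. All names and vectors (`WORD_VECS`, `TOPIC_VECS`, `TOPIC_OF`, `siamese_similarity`) are hypothetical toy values.

```python
import math

# Toy embeddings (hypothetical values, for illustration only).
WORD_VECS = {"egfr": [0.9, 0.1], "mutation": [0.7, 0.3], "diet": [0.1, 0.8]}
TOPIC_VECS = {0: [1.0, 0.0], 1: [0.0, 1.0]}
TOPIC_OF = {"egfr": 0, "mutation": 0, "diet": 1}  # assumed topic assignment per word


def concat_embed(tokens):
    """Assumption: each token is the word vector concatenated with its topic vector."""
    return [WORD_VECS[t] + TOPIC_VECS[TOPIC_OF[t]] for t in tokens]


def encode(vectors):
    """Shared encoder used by both branches; here plain mean pooling."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def siamese_similarity(tokens_a, tokens_b):
    """Encode both inputs with the same encoder, then compare the two encodings."""
    return cosine(encode(concat_embed(tokens_a)), encode(concat_embed(tokens_b)))


if __name__ == "__main__":
    print(siamese_similarity(["egfr", "mutation"], ["egfr"]))
    print(siamese_similarity(["egfr", "mutation"], ["diet"]))
```

In a trained Siamese network the shared encoder would carry learned weights; mean pooling stands in for it here only to keep the sketch self-contained.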
Source
Information Studies: Theory & Application (《情报理论与实践》)
CSSCI
Peking University Core Journal
2020, Issue 5, pp. 183-190 (8 pages)
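Funding
Natural Science Foundation of Jiangsu Province, Youth Project, "Research on Temporal Semantic Knowledge Annotation and Retrieval Model Construction for Academic Full Text Based on Deep Learning" (Grant No. BK20190450)
National Natural Science Foundation of China, General Program, "Research on Knowledge Graph Construction and Retrieval for Academic Full Text Based on Deep Learning" (Grant No. 71974094)
National Social Science Fund of China, Post-funded Project, "Research on Retrieval of Temporal Text Features Oriented to Scientific Research Topics" (Grant No. 19FTQB015). This work is one of the outcomes of these projects.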
Keywords
deep learning
semantic similarity
Siamese network
knowledge discovery