Abstract
Protein-protein interaction (PPI) is an important research topic in biomedicine. PPI information obtained through biomedical experiments is currently stored mainly as text in the relevant literature. With the rapid growth of the biomedical literature, manual identification of PPI can no longer meet practical needs. This paper adopts a weakly supervised basic framework for PPI recognition: starting from a small set of interacting protein pairs as seeds, the seed set is expanded iteratively until the protein interactions are identified. Compared with other existing methods, this approach achieves good recognition results with only a small amount of labeled data, saving considerable manpower and resources. On this basis, word embeddings are used to expand the existing set of keywords that express interaction, and the reliability of each keyword is scored. Using the expanded keyword set, the clustering step of the basic framework is improved: the lexical patterns given as input to clustering are sorted in descending order of the scores of the keywords they contain. Experiments show that the basic PPI recognition framework achieves good results with only a small amount of labeled data, and that the improved keyword expansion algorithm raises the results further. The highest F-score after the first iteration is 67.20%, 1.54% higher than before the improvement, and the F-score after three iterations is 69.05%.
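The keyword expansion and pattern ordering described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the scoring formula, so the sketch assumes a keyword's reliability is its highest cosine similarity to any seed keyword, and that a pattern's score is the highest score among the keywords it contains. The function names, the threshold of 0.8, and the toy two-dimensional vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seeds, vectors, threshold=0.8):
    """Return {keyword: score}. Seeds score 1.0; a new word is added when
    its best similarity to a seed reaches the (assumed) threshold."""
    known = [s for s in seeds if s in vectors]
    scores = {s: 1.0 for s in known}
    for word, vec in vectors.items():
        if word in scores:
            continue
        best = max((cosine(vec, vectors[s]) for s in known), default=0.0)
        if best >= threshold:
            scores[word] = best
    return scores


def pattern_score(pattern, scores):
    """Assumed pattern score: the best keyword score among its tokens."""
    return max((scores.get(tok, 0.0) for tok in pattern.split()), default=0.0)


def rank_patterns(patterns, scores):
    """Sort lexical patterns in descending order of keyword score."""
    return sorted(patterns, key=lambda p: pattern_score(p, scores), reverse=True)


# Toy word vectors (hypothetical values, for illustration only).
vectors = {
    "interacts": [1.0, 0.0],
    "binds": [0.9, 0.1],
    "activates": [0.6, 0.8],
    "located": [0.0, 1.0],
}
scores = expand_keywords(["interacts"], vectors)
patterns = [
    "PROT1 located near PROT2",
    "PROT1 binds PROT2",
    "PROT1 interacts with PROT2",
]
ranked = rank_patterns(patterns, scores)
for p in ranked:
    print(round(pattern_score(p, scores), 3), p)
```

In this toy run, "binds" joins the keyword set because it lies close to the seed "interacts", while "activates" and "located" fall below the assumed threshold; patterns containing higher-scoring keywords are then placed first in the clustering input.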
Authors
MAO Yu-wei (毛宇薇)
NIU Yun (牛耘)
School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Source
Computer Technology and Development (《计算机技术与发展》)
2019, Issue 3, pp. 18-22 (5 pages)
Funding
National Natural Science Foundation of China (61202132)
Keywords
protein-protein interaction
weak supervision
distributional hypothesis
word embedding
keywords