Abstract
Distant supervision approaches to protein-protein interaction (PPI) extraction generate large-scale training data by aligning entity pairs in a knowledge base with entities in text, which effectively addresses the shortage of annotated data. Building on PPI recognition with the expectation maximization (EM) algorithm, this paper proposes a PPI recognition method based on word frequency statistics. The method processes the signature of each protein pair and extracts the words between the two target proteins. These words are then POS-tagged, only nouns and verbs are kept, and the retained words are stemmed, which yields word frequency statistics for each protein pair's signature. A threshold on these frequencies selects the high-frequency words of a signature, which are used to improve the initialization step of the EM algorithm. Experimental results show that using high-frequency word information to assign initial sentence categories yields higher and better-balanced precision and recall than the original EM-based model, a clear improvement over current distant-supervision PPI recognition methods.
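The preprocessing steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The POS lookup table `TOY_POS`, the suffix-stripping `toy_stem`, the absolute-count threshold, and the rule that a sentence starts as positive when it contains a high-frequency word are all assumptions made for illustration; a real pipeline would use a trained POS tagger and a standard stemmer.

```python
from collections import Counter

# Hypothetical toy POS lexicon (assumption); unknown words are simply skipped.
TOY_POS = {
    "binds": "VERB", "interacts": "VERB", "binding": "NOUN",
    "interaction": "NOUN", "to": "ADP", "with": "ADP",
    "and": "CCONJ", "strongly": "ADV",
}


def toy_stem(word):
    """Naive suffix stripping, standing in for a real stemmer (assumption)."""
    for suffix in ("ion", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def words_between(tokens, protein_a, protein_b):
    """Return the tokens strictly between the two target proteins."""
    lo, hi = sorted((tokens.index(protein_a), tokens.index(protein_b)))
    return tokens[lo + 1 : hi]


def content_stems(sentence, protein_a, protein_b):
    """Stems of the nouns and verbs found between the two proteins."""
    middle = words_between(sentence.split(), protein_a, protein_b)
    return [
        toy_stem(w.lower())
        for w in middle
        if TOY_POS.get(w.lower()) in ("NOUN", "VERB")
    ]


def signature_counts(sentences, protein_a, protein_b):
    """Word frequency statistics over all sentences of one protein pair."""
    counts = Counter()
    for sentence in sentences:
        counts.update(content_stems(sentence, protein_a, protein_b))
    return counts


def high_frequency_words(counts, threshold):
    """Stems whose count reaches the threshold (absolute count is an assumption)."""
    return {word for word, n in counts.items() if n >= threshold}


def initial_labels(sentences, protein_a, protein_b, high_words):
    """Assumed initialization rule: 1 if a high-frequency stem occurs, else 0."""
    return [
        int(any(s in high_words for s in content_stems(sent, protein_a, protein_b)))
        for sent in sentences
    ]


# Toy signature for one hypothetical protein pair.
signature = [
    "PROT1 binds to PROT2",
    "PROT1 interacts with PROT2",
    "The interaction of PROT1 and PROT2",
    "PROT1 binding to PROT2",
    "PROT1 strongly binds PROT2",
]
counts = signature_counts(signature, "PROT1", "PROT2")
high = high_frequency_words(counts, threshold=2)
labels = initial_labels(signature, "PROT1", "PROT2", high)
```

In this toy signature the stem `bind` occurs three times and `interact` once, so with a threshold of 2 only `bind` counts as high-frequency. The resulting `labels` would then serve as the starting sentence categories for EM in place of a uninformed initialization.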
Authors
CAI Song-cheng (蔡松成)
NIU Yun (牛耘)
School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Source
Computer Technology and Development (《计算机技术与发展》)
2019, No. 2, pp. 65-68, 72 (5 pages)
Funding
National Natural Science Foundation of China (61202132)
Keywords
distant supervision
protein-protein interaction
expectation maximization algorithm
word frequency count