
Protein-protein Interaction Identification Based on Word Frequency Count (cited by: 3)
Abstract: Current distant-supervision approaches to protein-protein interaction (PPI) extraction generate large-scale training data by aligning entity pairs in a knowledge base with entities in text, which effectively addresses the shortage of annotated data. Building on PPI identification based on the expectation maximization (EM) algorithm, this paper proposes a PPI identification method based on word frequency counts. The method processes the signature of each protein pair and extracts the words that lie between the two target proteins. These words are POS-tagged, only nouns and verbs are kept, and the retained words are stemmed, which yields word frequency statistics for the signature of each protein pair. A threshold on these frequencies selects the high-frequency words of each signature, and these words are used to improve the initialization step of the EM algorithm. Experiments show that using high-frequency word information to determine the sentence categories that serve as initial values gives higher and better-balanced precision and recall than the original EM-based model, a clear improvement over current distant-supervision methods for PPI identification.
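The word-frequency steps described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the protein names `PROT1` and `PROT2`, the example sentences, the toy POS lexicon `TOY_POS`, the rough suffix stripper `crude_stem`, the threshold value, and the rule that a sentence starts as positive when it contains a high-frequency word. A real system would use a trained POS tagger and a proper stemmer (for example NLTK's `pos_tag` and `PorterStemmer`) in place of the stand-ins.

```python
import re
from collections import Counter

# Hypothetical toy POS lexicon standing in for a real POS tagger.
TOY_POS = {
    "binds": "VERB", "binding": "NOUN", "interacts": "VERB",
    "forms": "VERB", "complex": "NOUN", "directly": "ADV",
    "with": "ADP", "to": "PART", "of": "ADP", "and": "CONJ",
    "the": "DET", "a": "DET", "in": "ADP",
}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ion", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def words_between(sentence, protein_a, protein_b):
    """Return the lowercased tokens located between the two protein mentions."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    try:
        i, j = tokens.index(protein_a), tokens.index(protein_b)
    except ValueError:
        return []  # one of the proteins is not mentioned in this sentence
    lo, hi = sorted((i, j))
    return [t.lower() for t in tokens[lo + 1 : hi]]


def content_stems(sentence, protein_a, protein_b):
    """Keep only nouns and verbs between the proteins, then stem them."""
    return [
        crude_stem(w)
        for w in words_between(sentence, protein_a, protein_b)
        if TOY_POS.get(w) in ("NOUN", "VERB")
    ]


def signature_counts(sentences, protein_a, protein_b):
    """Word frequency statistics over all sentences of one protein pair."""
    counts = Counter()
    for s in sentences:
        counts.update(content_stems(s, protein_a, protein_b))
    return counts


def high_frequency_words(counts, threshold):
    """Stems whose count reaches the threshold."""
    return {w for w, c in counts.items() if c >= threshold}


def initial_labels(sentences, protein_a, protein_b, high_freq):
    """Assumed rule: a sentence starts as positive if it has a high-frequency stem."""
    return [
        bool(set(content_stems(s, protein_a, protein_b)) & high_freq)
        for s in sentences
    ]


# Made-up sentences for a hypothetical protein pair.
sentences = [
    "PROT1 binds to PROT2 in the nucleus",
    "PROT1 interacts directly with PROT2",
    "PROT2 binding of PROT1 forms a complex",
    "PROT1 and PROT2 were studied",
]
counts = signature_counts(sentences, "PROT1", "PROT2")
high_freq = high_frequency_words(counts, threshold=2)
labels = initial_labels(sentences, "PROT1", "PROT2", high_freq)
print(counts, high_freq, labels)
```

The resulting `labels` would only replace the starting point of an EM procedure; the EM iterations themselves are outside this sketch.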
Authors: CAI Song-cheng; NIU Yun (School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
Source: Computer Technology and Development, 2019, No. 2, pp. 65-68, 72 (5 pages)
Funding: National Natural Science Foundation of China (61202132)
Keywords: distant supervision; protein-protein interaction; expectation maximization algorithm; word frequency count
