Research on Different Text Representation Methods of Large-scale Protein Function Prediction
Abstract: Determining protein function through biochemical experiments takes considerable time and resources, so automatically annotating protein function with computational methods is of great significance. Text-based methods for protein function prediction can make full use of data other than protein sequences. To explore how different text representations in text classification affect the protein function prediction task, the paper experimentally analyzes a series of mainstream text representation methods, including the traditional sparse bag-of-words representation (TFIDF) and dense representations carrying deep semantic information (W2V, GloVe, D2V). It also makes two extensions: (1) using an IDF-weighted average for word-embedding-based text representations (WW2V, WGloVe); (2) concatenating sparse and dense representations (WW2V-TFIDF, WGloVe-TFIDF, D2V-TFIDF, Combined). The results show that the IDF-weighted average outperforms the plain average; that each individual representation has a different emphasis, with its own strengths and weaknesses; that sparse and dense representations are complementary; and that the combination of multiple representations (TFIDF, WW2V, WGloVe, and D2V) performs best.
Authors: QIAO Yu (乔羽); YAO Shuwei (姚舒威) (School of Computer Science and Technology, Fudan University, Shanghai 20043)
Source: Microcomputer Applications (《微型电脑应用》), 2018, No. 7, pp. 1-5 (5 pages)
Funding: National Natural Science Foundation of China (61572139)
Keywords: protein function prediction; machine learning; text representation; semantic similarity
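
The two extensions named in the abstract, IDF-weighted averaging of word vectors and concatenation of a sparse TFIDF vector with a dense vector, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy corpus, the made-up two-dimensional word vectors, the smoothed IDF formula `log(N / df) + 1`, and all function names.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical stand-ins for protein-related text).
docs = [
    ["kinase", "binds", "atp"],
    ["kinase", "phosphorylates", "substrate"],
    ["transporter", "binds", "substrate"],
]

# Made-up 2-dimensional word vectors; real ones would come from a trained
# embedding model such as word2vec or GloVe.
word_vecs = {
    "kinase": [1.0, 0.0],
    "binds": [0.0, 1.0],
    "atp": [1.0, 1.0],
    "phosphorylates": [2.0, 0.0],
    "substrate": [0.0, 2.0],
    "transporter": [2.0, 2.0],
}
DIM = 2

vocab = sorted({w for d in docs for w in d})


def compute_idf(documents):
    """IDF(w) = log(N / df(w)) + 1, one common variant (an assumption here)."""
    n = len(documents)
    df = Counter(w for d in documents for w in set(d))
    return {w: math.log(n / df[w]) + 1.0 for w in df}


def tfidf_vector(doc, idf, vocabulary):
    """Sparse-style representation: raw term frequency times IDF per vocab word."""
    tf = Counter(doc)
    return [tf[w] * idf[w] for w in vocabulary]


def plain_average(doc, vectors):
    """Dense representation: unweighted mean of the word vectors in the document."""
    known = [vectors[w] for w in doc if w in vectors]
    if not known:
        return [0.0] * DIM
    return [sum(v[i] for v in known) / len(known) for i in range(DIM)]


def idf_weighted_average(doc, vectors, idf):
    """Dense representation: mean of word vectors, each weighted by its IDF."""
    total = [0.0] * DIM
    weight_sum = 0.0
    for w in doc:
        if w in vectors and w in idf:
            for i in range(DIM):
                total[i] += idf[w] * vectors[w][i]
            weight_sum += idf[w]
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


def combined_vector(doc, vectors, idf, vocabulary):
    """Concatenate the sparse TFIDF vector with the IDF-weighted dense vector."""
    return tfidf_vector(doc, idf, vocabulary) + idf_weighted_average(doc, vectors, idf)


idf = compute_idf(docs)
plain = plain_average(docs[0], word_vecs)
weighted = idf_weighted_average(docs[0], word_vecs, idf)
combined = combined_vector(docs[0], word_vecs, idf, vocab)
```

In this toy setup, the rarer word "atp" gets a higher IDF than "kinase" or "binds", so the weighted average of the first document is pulled toward the vector for "atp" relative to the plain average. The concatenated vector has one entry per vocabulary word plus one per embedding dimension.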