Abstract
Four R packages (Rsubread, Rsamtools, refGenome, and GenomicRanges) were integrated into a single R-based workflow for the independent annotation (self-annotation) of probe sequences on gene expression microarrays. The most widely used platforms, GPL570 and GPL10558, together with the previously used GPL21163 platform, served as test data for re-annotation, and the new GPL570 annotation was compared with existing annotations. To test the practicality of the workflow, the relatively new long non-coding RNA (lncRNA) expression array GPL16956 was also annotated. The self-annotation covered 89.58% of the probes on GPL570, and 81.54%, 84.68%, and 76.15% of the probes on GPL10558, GPL21163, and GPL16956, respectively. Of the 7107 genes matched only by the new GPL570 annotation, 411 protein-coding genes were enriched in GO terms; the two other annotations did not map to these genes, which supports the reliability and advancement of the workflow. The workflow is therefore practical and effective, and provides a powerful new tool for data mining.
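The abstract's central step, assigning each aligned probe to the genes it overlaps and then reporting the share of probes that received an annotation, can be illustrated with a minimal sketch. This is not the authors' R code: it is a plain-Python stand-in for the kind of interval overlap that a package such as GenomicRanges performs. The function names, the toy coordinates, and the gene and probe identifiers are all hypothetical, intervals are assumed to be 1-based and closed, and strand is ignored.

```python
def annotate_probes(probe_hits, genes):
    """Map each probe ID to the sorted names of genes overlapping its alignment.

    probe_hits: iterable of (probe_id, chrom, start, end) alignment records.
    genes:      iterable of (chrom, start, end, gene_name) annotation records.
    Intervals are treated as 1-based and closed; strand is ignored (assumption).
    """
    # Group gene intervals by chromosome so each probe is compared only
    # against genes on the chromosome it aligned to.
    by_chrom = {}
    for chrom, start, end, name in genes:
        by_chrom.setdefault(chrom, []).append((start, end, name))

    annotation = {}
    for probe_id, chrom, start, end in probe_hits:
        # Two closed intervals overlap when each starts before the other ends.
        hits = {
            name
            for g_start, g_end, name in by_chrom.get(chrom, [])
            if start <= g_end and g_start <= end
        }
        if hits:
            annotation[probe_id] = sorted(hits)
    return annotation


def coverage_rate(annotation, n_probes):
    """Percentage of probes that received at least one gene annotation."""
    return 100.0 * len(annotation) / n_probes


# Hypothetical toy data, not taken from any GPL platform.
genes = [
    ("chr1", 100, 500, "GENE_A"),
    ("chr1", 450, 900, "GENE_B"),
    ("chr2", 50, 300, "GENE_C"),
]
probe_hits = [
    ("p1", "chr1", 120, 144),
    ("p2", "chr1", 460, 484),
    ("p3", "chr2", 400, 424),
    ("p4", "chr3", 10, 34),
]

annotation = annotate_probes(probe_hits, genes)
rate = coverage_rate(annotation, len(probe_hits))
print(annotation, rate)
```

The linear scan over genes keeps the sketch short; a real workflow over whole-genome annotation would rely on an indexed overlap query instead.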
Authors
SUN Xiaojie (孙小洁)
ZHENG Fangqiang (郑方强)
ZENG Jianming (曾健明)
College of Plant Protection, Shandong Agricultural University, Tai'an 271018, China; Zhuhai Jianming Biomedical Technology Co., Ltd., Zhuhai 519000, China
Source
Chinese Journal of Bioprocess Engineering (《生物加工过程》)
CAS
2021, Issue 1, pp. 17-22 (6 pages)