Abstract
Online Mendelian Inheritance in Man (OMIM) is a knowledge base of human genetic diseases and their related genes. Each OMIM entry includes the clinical features of a disease, gene linkage analysis, chromosomal localization, and animal models, making OMIM an important basis for studying the relationship between diseases and genes. Because similarity between disease phenotypes may reflect interactions at the molecular level, comparing phenotypes can help predict candidate disease genes and analyze relationships between molecules. However, OMIM describes disease phenotypes in free text, which is not suited to computer analysis. Standardizing OMIM data is therefore important for large-scale comparison and analysis of disease phenotype data and for establishing correspondences between phenotypes and genes. Recently, researchers have advanced OMIM data mining by introducing standard medical language systems, applying term frequency-inverse document frequency (TF-IDF) weighting from text mining and the cosine similarity measure used for document classification (termed the "law of cosines" in the original abstract), and combining these with Gene Ontology and its comparison methods. This article summarizes the main recent results in OMIM data standardization, phenotype similarity measurement, and data mining, and discusses likely trends in their development.
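As a rough illustration of the TF-IDF and cosine-based comparison mentioned in the abstract, the following minimal sketch scores phenotype similarity between diseases. It is not the method of any specific study reviewed in the article: the disease names, the phenotype terms, and the particular TF-IDF variant (relative term frequency times the natural log of inverse document frequency) are all assumptions chosen for illustration, and a real pipeline would first map OMIM free text onto a standardized vocabulary.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: (term count / document length) * ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, made-up phenotype term lists (not taken from OMIM).
phenotypes = {
    "disease_a": ["seizures", "ataxia", "hypotonia"],
    "disease_b": ["seizures", "ataxia", "deafness"],
    "disease_c": ["cardiomyopathy", "arrhythmia", "deafness"],
}

names = list(phenotypes)
vectors = dict(zip(names, tf_idf_vectors([phenotypes[n] for n in names])))

sim_ab = cosine_similarity(vectors["disease_a"], vectors["disease_b"])
sim_ac = cosine_similarity(vectors["disease_a"], vectors["disease_c"])
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

In this toy data, disease_a and disease_b share two terms and so score above zero, while disease_a and disease_c share none and score zero. Terms that occur in fewer descriptions receive a larger weight, which is the reason for using TF-IDF instead of raw counts.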
Source
《生物医学工程学杂志》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2014, Issue 6, pp. 1400-1404 (5 pages)
Journal of Biomedical Engineering
Funding
National Natural Science Foundation of China (grant nos. 81072899, 61071213, 81473446)