An algorithm for motif finding based on MapReduce (基于MapReduce的模体发现算法)

Cited by: 7

Abstract: Motif finding plays an important role in gene finding and in understanding gene regulatory relationships, and it is one of the most challenging problems in bioinformatics. This paper presents three data partitioning methods for the PMSP algorithm and, building on them, proposes a MapReduce-based motif finding algorithm (PMSPMR). Experimental results on a Hadoop cluster, for problems of varying difficulty, demonstrate that PMSPMR has good scalability. In particular, for the more difficult motif finding instances, the speedup of PMSPMR approaches the number of nodes in the Hadoop cluster, that is, it scales almost linearly. In experiments on real biological data, PMSPMR identifies known transcriptional regulatory motifs in eukaryotes and in promoter sequences extracted from Saccharomyces cerevisiae, which indicates that the algorithm is effective.

Authors: Huo Hongwei (霍红卫), Lin Shuai (林帅), Yu Qiang (于强), Zhang Yipu (张懿璞)

Affiliation: School of Computer Science, Xidian University (西安电子科技大学计算机学院)

Source: China Sciencepaper (《中国科技论文》), indexed in CAS and the Peking University Core Journals list, 2012, No. 7, pp. 487-494, 502 (9 pages)

Funding: National Natural Science Foundation of China (61173025); Specialized Research Fund for the Doctoral Program of Higher Education (20100203110010)

Keywords: motif finding; data partitioning; scalability

Classification number: TP301.6 [Automation and Computer Technology: Computer System Architecture]
