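In recent years, ISO-seq data generated by single-molecule sequencing have been increasingly used to predict novel isoforms in transcriptome studies because of their very long read lengths. Most existing studies, however, use only the full-length reads and discard much of the useful information carried by non-full-length reads, so the data are not fully exploited. To address this problem, this paper retains the non-full-length reads and proposes two models that simultaneously predict isoform structures and estimate their expression proportions: Dirichlet sampling for isoform detection and prediction (DSIDP) and Markov chain for isoform detection and prediction (MCIDP). Both models build a candidate isoform set from the full-length reads and use both full-length and non-full-length reads to estimate isoform expression proportions. DSIDP aligns all reads to the candidate isoform set and uses Dirichlet sampling to resolve multi-mapped reads. MCIDP uses a Markov chain to model alternative splicing between the exons of a gene, and it can also predict isoforms for which the data contain no full-length reads. Both models are validated on simulated and real data.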
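As a rough illustration of how Dirichlet sampling can resolve multi-mapped reads, the sketch below runs a small Gibbs sampler: each read is assigned to one of its compatible isoforms in proportion to the current expression estimate, and the proportions are then redrawn from a Dirichlet posterior. This is not the DSIDP implementation. The function names, the flat prior, and the input format (one list of compatible isoform indices per read) are assumptions made for the example, and isoform length and read-position effects are ignored.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from Dirichlet(alphas) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def estimate_proportions(read_compat, n_isoforms, n_iter=2000, burn_in=500,
                         prior=1.0, seed=0):
    """Estimate isoform proportions from reads that may map to several isoforms.

    read_compat: one list per read holding the indices of the isoforms the
    read is compatible with (hypothetical input format).
    Returns the posterior mean of the proportions after burn-in.
    """
    rng = random.Random(seed)
    theta = [1.0 / n_isoforms] * n_isoforms
    sums = [0.0] * n_isoforms
    for it in range(n_iter):
        # Step 1: assign every read to one compatible isoform, weighted by theta.
        counts = [0] * n_isoforms
        for compat in read_compat:
            weights = [theta[k] for k in compat]
            chosen = rng.choices(compat, weights=weights)[0]
            counts[chosen] += 1
        # Step 2: redraw the proportions from the Dirichlet posterior.
        theta = sample_dirichlet([prior + c for c in counts], rng)
        if it >= burn_in:
            for k in range(n_isoforms):
                sums[k] += theta[k]
    kept = n_iter - burn_in
    return [s / kept for s in sums]


if __name__ == "__main__":
    # 30 reads unique to isoform 0, 10 unique to isoform 1, 20 ambiguous.
    reads = [[0]] * 30 + [[1]] * 10 + [[0, 1]] * 20
    print(estimate_proportions(reads, n_isoforms=2))
```

In this toy setting the ambiguous reads are shared out according to the evidence from the uniquely mapped reads, rather than being discarded or split evenly.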
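RNA-Seq (RNA sequencing), based on high-throughput sequencing, is a new technology for transcriptome research. To address the difficulties it poses for transcriptome expression analysis, such as multi-mapped reads and non-uniform read distribution, an improved method for transcriptome expression analysis, LDASeqII (Improvement of latent Dirichlet allocation for sequencing data), is proposed. The model uses the structure of splice isoforms to constrain its parameters and normalizes exon read counts, which resolves the multi-mapping problem under non-uniform read distribution. "Pseudo-exons" and "pseudo-transcripts" are introduced to handle junction reads and noise reads, respectively. The model is applied to real datasets and compared with the original LDASeq (Latent Dirichlet allocation for sequencing data) model and the widely used Cufflinks and RSEM (RNA-Seq by expectation maximization) methods. The results show that the improved method yields more accurate estimates of transcript and gene expression levels.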
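The abstract does not spell out how exon read counts are normalized. One common reading, assumed here purely for illustration, is to divide each exon's read count by the exon length and rescale so the values sum to one. The function name and the dictionary-based input are hypothetical and are not taken from LDASeqII.

```python
def normalize_exon_counts(counts, lengths):
    """Convert raw exon read counts into length-normalized relative densities.

    counts:  dict mapping exon id to number of reads (missing exons count as 0)
    lengths: dict mapping exon id to exon length in bases (must be positive)
    """
    densities = {exon: counts.get(exon, 0) / lengths[exon] for exon in lengths}
    total = sum(densities.values())
    if total == 0:
        # No reads at all: return zeros rather than dividing by zero.
        return {exon: 0.0 for exon in lengths}
    return {exon: d / total for exon, d in densities.items()}


if __name__ == "__main__":
    # Two exons with equal read counts but different lengths.
    print(normalize_exon_counts({"e1": 100, "e2": 100}, {"e1": 1000, "e2": 500}))
```

With equal raw counts, the shorter exon ends up with the larger share, which is the effect length normalization is meant to capture.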
This paper studies the characteristics of power big data. Based on its "large volume", "species diversity", "sparse value density", and "fast speed", a multi-source information fusion prediction model for big data is established that realizes fusion prediction of various parameters of the same object. A combined algorithm of MapReduce and a neural network is used. The clustering and nonlinear mapping abilities of the neural network allow it to approximate nonlinear objective functions effectively, and the network is applied to fusion prediction. The neural network model is a multilayer feed-forward network, the BP neural network. To process large-scale data sets in parallel, the parallelism and real-time performance of the algorithm must also be considered, so the algorithm is further combined with the MapReduce model for parallel processing, making it more suitable for the fusion of big data. Finally, simulation verifies the feasibility of the proposed model and algorithm.
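One plausible way to combine a BP network with MapReduce, sketched below under stated assumptions, is data-parallel training: a map step backpropagates over each data shard and returns summed gradients, and a reduce step adds the partial gradients before a single weight update. The abstract does not describe the paper's actual scheme, so the network size, learning rate, and all function names are illustrative, and the "map" here runs sequentially in one process rather than on a cluster.

```python
import math
import random
from functools import reduce


def init_params(n_in, n_hidden, rng):
    """Small random weights for a one-hidden-layer network (sizes are illustrative)."""
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(p, x):
    """Return the tanh hidden activations and the linear scalar output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    return h, sum(w * hj for w, hj in zip(p["W2"], h)) + p["b2"]


def map_gradients(p, shard):
    """Map step: backpropagate over one data shard and return summed gradients."""
    n_hidden, n_in = len(p["W1"]), len(p["W1"][0])
    g = {"W1": [[0.0] * n_in for _ in range(n_hidden)], "b1": [0.0] * n_hidden,
         "W2": [0.0] * n_hidden, "b2": 0.0, "loss": 0.0, "n": 0}
    for x, target in shard:
        h, y = forward(p, x)
        err = y - target                      # derivative of 0.5 * err**2
        g["loss"] += 0.5 * err * err
        g["n"] += 1
        g["b2"] += err
        for j in range(n_hidden):
            g["W2"][j] += err * h[j]
            dz = err * p["W2"][j] * (1.0 - h[j] ** 2)   # back through tanh
            g["b1"][j] += dz
            for i in range(n_in):
                g["W1"][j][i] += dz * x[i]
    return g


def reduce_gradients(a, b):
    """Reduce step: add two partial results element-wise."""
    return {
        "W1": [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a["W1"], b["W1"])],
        "b1": [u + v for u, v in zip(a["b1"], b["b1"])],
        "W2": [u + v for u, v in zip(a["W2"], b["W2"])],
        "b2": a["b2"] + b["b2"],
        "loss": a["loss"] + b["loss"],
        "n": a["n"] + b["n"],
    }


def train(shards, n_in, n_hidden=4, lr=0.1, epochs=200, seed=0):
    """Full-batch gradient descent; each epoch is one map pass plus one reduce."""
    p = init_params(n_in, n_hidden, random.Random(seed))
    history = []
    for _ in range(epochs):
        g = reduce(reduce_gradients, [map_gradients(p, s) for s in shards])
        step = lr / g["n"]
        history.append(g["loss"] / g["n"])    # mean loss before this update
        p = {
            "W1": [[w - step * d for w, d in zip(row, grow)]
                   for row, grow in zip(p["W1"], g["W1"])],
            "b1": [b - step * d for b, d in zip(p["b1"], g["b1"])],
            "W2": [w - step * d for w, d in zip(p["W2"], g["W2"])],
            "b2": p["b2"] - step * g["b2"],
        }
    return p, history
```

Because the gradient of a summed loss is the sum of per-sample gradients, reducing the shard results gives the same update as processing the whole data set in one place, which is what makes the map step safe to distribute.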