Abstract
There are many methods for classifying DNA sequences. The two models presented in this paper are both graph-based: they exploit the fact that plots are intuitive and easy to analyze in order to identify the distinguishing features of the individual bases and arrive at a reasonable classification method. To build the models, we first compute the content of each base (A, G, C, T) in the first 20 given DNA sequences, which reduces each long sequence to four percentage values and greatly simplifies it, and then plot these values as two-dimensional curves in a rectangular coordinate system. From the characteristics of the curves we derive two algorithms. The first assigns a class according to whether the content of one particular base in a sequence exceeds that of the other three; it classifies sequences 21 to 40 with 80% accuracy and the 182 sequences given in the problem with 42% accuracy. The second transforms the curves into straight lines to find an interval that matches the class feature and assigns a class according to whether a sequence falls within that interval; it classifies sequences 21 to 40 with 100% accuracy and the 182 sequences given in the problem with 85% accuracy. Finally, we compare the results of the two models, assess their relative merits, and analyze the reasons for the difference.
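The base-content step that both models start from can be illustrated with a short Python sketch. It is not the authors' code: the function names `base_composition` and `dominant_base` and the example sequence are hypothetical, and the abstract does not state which base or which class labels the first algorithm uses, so the sketch only reports the base with the highest content and leaves the mapping to a class open.

```python
from collections import Counter

BASES = "AGCT"  # order used in the abstract: A, G, C, T


def base_composition(seq):
    """Return the percentage of each base (A, G, C, T) in a DNA sequence.

    Characters other than A, G, C, T are ignored (an assumption; the
    abstract does not say how such characters are handled).
    """
    counts = Counter(ch for ch in seq.upper() if ch in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, G, C or T")
    return {b: 100.0 * counts[b] / total for b in BASES}


def dominant_base(seq):
    """Return the base whose content is highest.

    Ties are resolved in favour of the base that comes first in BASES,
    which is an arbitrary choice made for this sketch.
    """
    comp = base_composition(seq)
    return max(BASES, key=lambda b: comp[b])


if __name__ == "__main__":
    example = "AAGGCTAAGA"  # hypothetical sequence, not from the paper
    print(base_composition(example))
    print(dominant_base(example))
```

Reducing each sequence to four percentages is what makes the later plotting and thresholding steps possible; a rule of the first kind would then compare the output of `dominant_base` against whichever base the training sequences single out.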
Source
《大连大学学报》
2001, Issue 4, pp. 95-100 (6 pages)
Journal of Dalian University