This work investigates frequency distributions of strings within a text. The mathematical derivation accounts for variable alphabet size, character probabilities, and string/text lengths, under both the Bernoullian an...This work investigates frequency distributions of strings within a text. The mathematical derivation accounts for variable alphabet size, character probabilities, and string/text lengths, under both the Bernoullian and the Markovian model for string generation. The analysis is limited to the set of nonclumpable strings, that cannot overlap with them selves. Two formulae (exact and approximated) are derived, calculating the frequency distribution of a string of length m found inside a text of length n (with m 〈: n). The approximated formula has a constant complexity (in contrast to an exponential com plexity of the exact) and makes it applicable to very long texts. The proposed formulae were applied to analyze string frequencies in a portion of the human genome, and to recalculate frequencies of known repeated motif within genes, associated to genetic dis eases. A comparison with stateoftheart methods was provided. The formulae presentedhere can be of use in the statistical evaluation of specific motif frequencies within very long texts (e.g. genes or genomes) and help in characterizing motifs in pathologic conditions.展开更多
文摘This work investigates frequency distributions of strings within a text. The mathematical derivation accounts for variable alphabet size, character probabilities, and string/text lengths, under both the Bernoullian and the Markovian model for string generation. The analysis is limited to the set of nonclumpable strings, that cannot overlap with them selves. Two formulae (exact and approximated) are derived, calculating the frequency distribution of a string of length m found inside a text of length n (with m 〈: n). The approximated formula has a constant complexity (in contrast to an exponential com plexity of the exact) and makes it applicable to very long texts. The proposed formulae were applied to analyze string frequencies in a portion of the human genome, and to recalculate frequencies of known repeated motif within genes, associated to genetic dis eases. A comparison with stateoftheart methods was provided. The formulae presentedhere can be of use in the statistical evaluation of specific motif frequencies within very long texts (e.g. genes or genomes) and help in characterizing motifs in pathologic conditions.