A Survey of the Research on Similarity Query Technique of Sequence Data (序列数据相似性查询技术研究综述)

Cited by: 13
Abstract: Sequence data is ubiquitous in domains such as text, Web access logs, and biological databases, and similarity query over such data is an important means of extracting useful information. With the recent growth of scientific computing and the generation of large-scale sequence data, similarity query on sequence data has become a hot research topic in data analysis. The important issues involved are: (1) the similarity metrics used in different application fields and the relationships among them; (2) the statistics of the distance distribution over random sequence collections and their role in analyzing the performance of query algorithms; and (3) the key techniques for answering similarity queries efficiently on large-scale datasets, together with a comparison of their merits and drawbacks. This survey summarizes the classification and characteristics of sequence data, presents several similarity metrics and the statistics of the distance distribution between random sequences, and analyzes the relationships among these metrics. It then introduces several types of sequence similarity query and the core problems they must solve. On this basis, the key techniques for sequence similarity query are classified and evaluated. Finally, the challenges facing research on similarity query of sequence data are discussed and future research directions are summarized.
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list (北大核心), 2010, No. 2, pp. 264-276 (13 pages).
Funding: National Natural Science Foundation of China (60573093); National High Technology Research and Development Program of China (863 Program) (2006AA02Z329); National Basic Research Program of China (973 Program) under grant No. 2005CB321905; Plan Program of the Science and Technology Commission of Shanghai Municipality under grant No. 08511500203.
Keywords: sequence data; similarity metric; distance distribution; filtering technique; similarity query
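
To make the notions of a similarity metric, a filtering technique, and a similarity query concrete, the following is a minimal sketch and not a method taken from the survey. It assumes edit (Levenshtein) distance as the metric and a simple length-difference filter; the function names `edit_distance` and `range_query` and the threshold parameter `k` are hypothetical names chosen for illustration.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # iterate the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def range_query(query: str, data: List[str], k: int) -> List[str]:
    """Return every sequence in data within edit distance k of query.

    Assumed filter: the length difference is a lower bound on edit
    distance, so candidates whose length differs by more than k are
    discarded before the costly verification step.
    """
    results = []
    for s in data:
        if abs(len(s) - len(query)) > k:
            continue  # filtered out without computing the distance
        if edit_distance(query, s) <= k:
            results.append(s)
    return results
```

Calling `range_query("ACGT", ["ACGT", "ACGA", "TTTTTTTT", "AGT"], 1)` skips the length-8 string in the filter step and verifies only the remaining three candidates. The filter never discards a true answer, because each edit operation changes the length by at most one.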