
Document Clustering Algorithm Based on Dynamic Interval Mapping
Abstract With the rapid digitization of information, archival storage has become a research hotspot, and space utilization and scalability are its key problems. Content-based chunk storage with data deduplication is an effective way to improve storage space utilization, but because archival data sets are huge, searching all of the data for shared chunks is very inefficient. This paper introduces the idea of dynamic interval mapping into information clustering and proposes DC-DIM (Document Clustering algorithm based on Dynamic Interval Mapping). The algorithm uses chunking and feature extraction to generate a chunk feature set for each document, maps that feature set onto an interval chain, and selects the document's storage container according to how the feature set is distributed over the chain, thereby clustering the documents. Documents with high content similarity (much shared content) are gathered together, which creates favorable conditions for chunk storage and simplifies data management.
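Authors Sun Yonglin (孙永林), Liu Zhong (刘仲)
Source Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2010, No. 6, pp. 23-27 (5 pages)
Funding Supported by the National Natural Science Foundation of China (60503042)
Keywords Document clustering, Archival storage, Dynamic interval mapping, Space utilization, Scalability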
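
The pipeline named in the abstract (chunk the document, extract a feature set, map the features onto an interval chain, pick a storage container from the resulting distribution) can be illustrated with a minimal Python sketch. The abstract gives no implementation details, so everything concrete here is an assumption made for illustration and not the authors' DC-DIM: the window-hash chunker, the choice of the `k` smallest chunk digests as features, the equal-width initial intervals, the majority vote, and all names (`chunk`, `features`, `IntervalChain`, `place`).

```python
import hashlib
from bisect import bisect_right
from collections import Counter

HASH_SPACE = 2 ** 32  # assumed size of the hash space covered by the interval chain


def chunk(data: bytes, window: int = 16, mask: int = 0x3F, max_len: int = 1024) -> list[bytes]:
    """Content-defined chunking (assumed scheme): cut after a byte when the hash
    of the trailing `window` bytes matches a bit pattern, or at `max_len`."""
    chunks, start = [], 0
    for i in range(len(data)):
        length = i + 1 - start
        if length < window:
            continue
        h = int.from_bytes(hashlib.md5(data[i + 1 - window:i + 1]).digest()[:4], "big")
        if (h & mask) == 0 or length >= max_len:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def features(data: bytes, k: int = 8) -> list[int]:
    """Feature set (assumed): take the k smallest chunk digests, then use a
    different slice of each digest as its position in the hash space."""
    digests = sorted({hashlib.sha1(c).digest() for c in chunk(data)})
    return [int.from_bytes(d[4:8], "big") for d in digests[:k]]


class IntervalChain:
    """Ordered intervals over [0, HASH_SPACE), each owned by one container."""

    def __init__(self, containers: list[str]):
        step = HASH_SPACE // len(containers)
        self.starts = [i * step for i in range(len(containers))]
        self.owners = list(containers)

    def locate(self, position: int) -> str:
        """Return the container whose interval contains `position`."""
        return self.owners[bisect_right(self.starts, position) - 1]

    def split(self, position: int, new_container: str) -> None:
        """Hand [position, next boundary) to a new container (assumed growth rule)."""
        idx = bisect_right(self.starts, position)
        self.starts.insert(idx, position)
        self.owners.insert(idx, new_container)


def place(data: bytes, chain: IntervalChain, k: int = 8) -> str:
    """Pick the container that receives the most features (ties broken by name)."""
    votes = Counter(chain.locate(p) for p in features(data, k))
    if not votes:  # empty document: fall back to the first container
        return chain.owners[0]
    return min(votes, key=lambda c: (-votes[c], c))


if __name__ == "__main__":
    chain = IntervalChain(["c0", "c1", "c2", "c3"])
    base = b"archival storage and deduplication " * 100
    print(place(base, chain), place(base + b"small edit", chain))
```

Under these assumptions, two documents that share most of their chunks also tend to share most of their selected features, so their votes tend to land on the same container; the sketch does not guarantee this for any particular pair.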

References (12)

  • 1. Sarbanes-Oxley. http://www.sarbanes-oxley.com/index.php.
  • 2. Storer M W. Secure, Energy-Efficient, Evolvable, Long-Term Archival Storage[R]. UCSC-SSRC. 2009.
  • 3. Bradshaw P L, Brannon K W, et al. Archive storage system design for long-term storage of massive amounts of data[J]. IBM J. Res. & Dev., 2008, 52(4/5).
  • 4. You L L, Pollack K T, Long D D E. Deep Store: An Archival Storage System Architecture[C]//Proceedings of the 21st International Conference on Data Engineering (ICDE '05). 2005.
  • 5. You L L, Karamanolis C. Evaluation of efficient archival storage techniques[C]//Proceedings of the 21st IEEE/12th NASA Goddard Conference on Mass Storage Systems and Technologies. 2004.
  • 6. Eshghi K, Tang H K. A Framework for Analyzing and Improving Content-Based Chunking Algorithms[R]. Hewlett Packard Labs Technical Report TR. 2005.
  • 7. Forman G, Eshghi K, Chiocchetti S. Finding similar files in large document repositories[C]//Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. 2005: 394-400.
  • 8. Bhagwat D, Eshghi K, Mehra P. Content-based Document Routing and Index Partitioning for Scalable Similarity-based Searches in a Large Corpus[C]//Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07). 2007: 105-112.
  • 9. Liu Z, Zhou X M. A data object placement algorithm based on dynamic interval mapping[J]. Journal of Software (软件学报), 2005, 16(11): 1886-1893. (in Chinese)
  • 10. Liu Z, Zhou X M. A scalable distributed dynamic interval mapping algorithm[J]. Chinese Journal of Computers (计算机学报), 2006, 29(10): 1757-1763. (in Chinese)

