A New Parallelizable Algorithm for Clustering Chinese Co-Topic Words (cited by: 2)

A Parallable Algorithm for Chinese Co-Topic Words Clustering
Abstract: A simple but efficient algorithm is presented for automatically clustering Chinese words by topic. The method first uses the enumeration comma (、) to split sentences in a corpus and extract the coordinate (paratactic) Chinese words they list, then builds a co-citation graph with the extracted words as nodes. Next, it generates several locality sensitive hashing (LSH) signature combinations for each node in the graph. Finally, any two words that share at least one LSH signature combination are assigned to the same topic cluster; most words grouped this way are expected to belong to the same topic. The main advantages of the algorithm are its speed and the ease with which it can be parallelized. Experiments indicate that, given a very large corpus, it runs efficiently and clusters well.
Source: Journal of Beijing University of Posts and Telecommunications (indexed in EI, CAS, CSCD, and the Peking University Core Journals list), 2009, No. 4, pp. 122-127 (6 pages)
Funding: National Natural Science Foundation of China (60872051, 60432010); National Key Basic Research and Development Program of China (2007CB307100)
Keywords: Chinese word clustering; co-citation graph; locality sensitive hashing signature; parallelization
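
The three steps named in the abstract (extraction by the enumeration comma, a co-citation graph, and grouping by shared LSH signature combinations) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: each input string is treated as a bare enumeration with no surrounding sentence text or word segmentation, a word's feature set is taken to be its neighbors in the graph plus the word itself, signatures are MinHash values computed with seeded MD5, a "signature combination" is read as a band of `rows` consecutive MinHash values, and the names `extract_lists`, `build_graph`, `minhash_signature`, and `cluster_words` are hypothetical.

```python
import hashlib
from collections import defaultdict


def extract_lists(sentences):
    """Split each string on the enumeration comma; keep lists of 2+ items.
    Assumption: inputs are bare enumerations, so no word segmentation is done."""
    lists = []
    for sentence in sentences:
        items = [w.strip() for w in sentence.split("、") if w.strip()]
        if len(items) >= 2:
            lists.append(items)
    return lists


def build_graph(lists):
    """Map each word to the set of words listed alongside it.
    Assumption: the set also contains the word itself (closed neighborhood)."""
    neighbors = defaultdict(set)
    for items in lists:
        for word in items:
            neighbors[word].update(items)
    return neighbors


def _seeded_hash(seed, token):
    """Deterministic 64-bit hash; one seed stands in for one hash function."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes):
    """MinHash: for each seed, keep the smallest hash value over the set."""
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def cluster_words(sentences, bands=4, rows=2):
    """Merge words that share at least one band of their MinHash signature."""
    graph = build_graph(extract_lists(sentences))
    parent = {word: word for word in graph}

    def find(word):
        # Union-find lookup with path halving.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    buckets = {}  # (band index, band values) -> first word seen with that band
    for word, nbrs in graph.items():
        # Each word's signature depends only on its own neighbor set,
        # so this loop body could be distributed across workers.
        sig = minhash_signature(nbrs, bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            if key in buckets:
                parent[find(word)] = find(buckets[key])
            else:
                buckets[key] = word

    clusters = defaultdict(set)
    for word in graph:
        clusters[find(word)].add(word)
    return list(clusters.values())


if __name__ == "__main__":
    corpus = ["苹果、香蕉、梨", "北京、上海、广州", "香蕉、苹果、梨"]
    for cluster in cluster_words(corpus):
        print(sorted(cluster))
```

In this toy corpus the fruit words all have the same neighbor set and the city words share no neighbors with them, so the two groups stay separate. On real text the `bands` and `rows` values trade recall against precision and would need tuning; the values here are arbitrary.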