
基于关键词重提取的密文文本相似性度量方法研究 (Research on a Ciphertext Similarity Measurement Method Based on Keyword Re-extraction) Cited by: 2

Similarity Measure Algorithm of Cipher-text Based on Re-extracted Keywords
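Abstract: To address the problem of measuring similarity between ciphertexts, a new ciphertext similarity measurement method is proposed. By defining the effective scope, relative scope, and distributed scope of a keyword, the method overcomes a shortcoming of existing keyword-weighting schemes, which cannot weight keywords fairly across documents of different length and structure, and it markedly reduces the number of keywords on which the measurement depends. The keywords of each document are then re-extracted and used to build an encrypted keyword index entry for that document, and the similarity of ciphertexts is measured through these index entries. Experiments on real documents, with comparison against other algorithms, show that the proposed method outperforms the compared algorithms in both accuracy and recall and strikes a good balance between the two.

Abstract (as published in English): To measure the similarity or dissimilarity between cipher texts, a new similarity measure algorithm for cipher text based on re-extracted keywords, called SMCTBRK, was proposed. By defining the new concepts of effective scope, relative scope, and distributed scope of keywords, and by re-extracting the keywords in documents, SMCTBRK constructs an encryption index item for each compared document from a smaller number of re-extracted keywords. Here, the encryption index item is organized as a feature vector. SMCTBRK then computes the similarity between different cipher texts from the encryption index items instead of from separate keywords. Experiments on real documents were conducted, and the results show that SMCTBRK is more promising than the Shingling algorithm and the Simhash algorithm in accuracy and recall.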
Authors: 李志华 (LI Zhi-hua), 陈超群 (CHEN Chao-qun), 李村 (LI Cun), 胡振宇 (HU Zhen-yu), 张华伟 (ZHANG Hua-wei) (Department of Computer Science, School of IOT Engineering, Jiangnan University, Wuxi 214122, China)
Source: 《计算机科学》 (Computer Science), CSCD, Peking University Core Journal, 2016, No. 8, pp. 95-99 (5 pages)
Funding: Supported by the Industry-University-Research Prospective Project of the Jiangsu Provincial Department of Science and Technology (BY2013015-23)
Keywords: re-extracted keywords, similarity measure, cipher texts, effective scope
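
The pipeline outlined in the abstract (select a small set of keywords, build an encrypted index entry organized as a feature vector, then compare index entries rather than plaintext) can be illustrated with a minimal Python sketch. This is not the SMCTBRK algorithm itself: the abstract does not give the formulas for effective, relative, or distributed scope, nor the encryption scheme, so the sketch substitutes plain term frequency for the keyword weights and a keyed HMAC-SHA256 tag for the encrypted keyword. The function names `top_keywords`, `encrypted_index`, and `cosine_similarity` are hypothetical, chosen only for illustration.

```python
import hashlib
import hmac
import math
from collections import Counter


def top_keywords(text, k=5):
    """Pick the k most frequent tokens longer than 3 characters.

    Assumption: term frequency stands in for the paper's scope-based
    keyword weighting, which the abstract does not specify.
    """
    tokens = [t for t in text.lower().split() if len(t) > 3]
    return dict(Counter(tokens).most_common(k))


def encrypted_index(keywords, key):
    """Replace each keyword with an HMAC-SHA256 tag so the index holds no plaintext.

    Assumption: a deterministic keyed hash stands in for the paper's
    encryption of index entries. Equal keywords yield equal tags under
    the same key, which is what makes the entries comparable.
    """
    return {
        hmac.new(key, word.encode("utf-8"), hashlib.sha256).hexdigest(): weight
        for word, weight in keywords.items()
    }


def cosine_similarity(index_a, index_b):
    """Cosine similarity between two sparse weight vectors keyed by tag."""
    dot = sum(w * index_b.get(tag, 0) for tag, w in index_a.items())
    norm_a = math.sqrt(sum(w * w for w in index_a.values()))
    norm_b = math.sqrt(sum(w * w for w in index_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

To compare two documents, each one would be passed through `top_keywords` and `encrypted_index` with the same secret key, and the two resulting entries handed to `cosine_similarity`. Only the tagged entries need to be shared for the comparison. Note that deterministic tags reveal which entries share a keyword, a trade-off this sketch accepts for simplicity.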
