Abstract
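To address the problem of measuring the similarity of ciphertexts, a new similarity measure for ciphertext documents is proposed. By defining the effective scope, relative scope, and distributed scope of keywords, the method overcomes a shortcoming of existing keyword weighting methods, which cannot weight keywords in a relatively fair way across documents of different lengths and structures. It also significantly reduces the number of keywords on which the text measurement depends. The keywords of each document are then re-extracted, an encrypted keyword index entry is built for the document, and the similarity of ciphertexts is measured through these encrypted index entries. Experiments on real documents, with comparisons against other algorithms, show that the proposed method outperforms the compared algorithms in both accuracy and recall and achieves a relatively good balance between the two.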
To address the problem of measuring similarity between ciphertexts, a new ciphertext similarity measure based on re-extracted keywords, called SMCTBRK, is proposed. By defining the effective scope, relative scope, and distributed scope of keywords and re-extracting the keywords in each document, SMCTBRK constructs an encrypted index entry for each compared document from a smaller number of re-extracted keywords. The encrypted index entry is organized as a feature vector. SMCTBRK then computes the similarity between ciphertexts from these index entries rather than from individual keywords. Experiments on real documents show that SMCTBRK outperforms the Shingling and Simhash algorithms in accuracy and recall.
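The idea of comparing documents through encrypted index entries organized as feature vectors can be illustrated with a minimal sketch. The abstract does not define the effective, relative, or distributed scope weights, nor the encryption scheme, so the code below is only an assumption-laden stand-in and not the SMCTBRK algorithm itself: keywords are replaced by HMAC-SHA256 tokens, weights are plain keyword counts, and similarity is the cosine of the two sparse vectors. The names `build_index_entry`, `cosine_similarity`, and the demo key are hypothetical.

```python
import hashlib
import hmac
import math
from collections import Counter


def build_index_entry(keywords, key):
    """Map each keyword to a keyed-hash token; the weight is its raw count.

    Assumption: HMAC-SHA256 stands in for the paper's unspecified encryption,
    and raw counts stand in for its scope-based keyword weights.
    """
    counts = Counter(keywords)
    return {
        hmac.new(key, word.encode("utf-8"), hashlib.sha256).hexdigest(): float(n)
        for word, n in counts.items()
    }


def cosine_similarity(entry_a, entry_b):
    """Cosine similarity of two sparse token -> weight vectors."""
    dot = sum(w * entry_b.get(tok, 0.0) for tok, w in entry_a.items())
    norm_a = math.sqrt(sum(w * w for w in entry_a.values()))
    norm_b = math.sqrt(sum(w * w for w in entry_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


key = b"demo-secret-key"  # hypothetical key shared by both index builders
doc_a = build_index_entry(["cipher", "index", "keyword", "keyword"], key)
doc_b = build_index_entry(["cipher", "keyword", "scope"], key)
print(cosine_similarity(doc_a, doc_b))
```

Because both entries are built with the same key, equal keywords map to equal tokens, so the comparison never needs the plaintext keywords.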
Authors
李志华
陈超群
李村
胡振宇
张华伟
LI Zhi-hua, CHEN Chao-qun, LI Cun, HU Zhen-yu, ZHANG Hua-wei (Department of Computer Science, School of IOT Engineering, Jiangnan University, Wuxi 214122, China)
Source
《计算机科学》
CSCD
Peking University Core Journals (北大核心)
2016, Issue 8, pp. 95-99 (5 pages)
Computer Science
Funding
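Supported by the Prospective Industry-University-Research Project of the Jiangsu Provincial Department of Science and Technology (BY2013015-23)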
Keywords
Keyword re-extraction
Similarity measure
Ciphertext documents
Scope
Re-extracted keywords, Similarity measure, Cipher texts, Effective scope