S-SimRank: A Document Similarity Calculation Method Combining Content and Link Information (in English) Cited by: 3

S-SimRank:Combining Content and Link Information to Cluster Papers Effectively and Efficiently

Abstract: Content analysis and link analysis are two common methods for computing document similarity in recommender systems. Compared with content analysis, link analysis can discover more implicit relationships between documents. However, because of noise among the documents, link analysis alone rarely yields precise results. To solve this problem, a new algorithm, S-SimRank (Star-SimRank), is proposed that effectively combines the content information and the link information of documents to improve the accuracy of similarity calculation. Experimental results on the ACM data set show that S-SimRank outperforms other algorithms in both accuracy and efficiency. Finally, a mathematical proof of the convergence of S-SimRank is given.
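The abstract does not state the S-SimRank update rule, so the following is only a minimal sketch of the plain SimRank baseline (Jeh and Widom, 2002) that the algorithm's name refers to. It uses link information alone and is not the authors' method. The function name `simrank`, the `in_links` dictionary layout, the decay factor 0.8, and the toy graph are assumptions made for illustration.

```python
from itertools import product


def simrank(in_links, decay=0.8, iterations=10):
    """Plain SimRank on a directed graph (link information only).

    in_links maps every node to the list of nodes that link to it;
    every node must appear as a key, even if its list is empty.
    """
    nodes = sorted(in_links)
    # Initial scores: a node is fully similar to itself, unrelated to others.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
                continue
            ia, ib = in_links[a], in_links[b]
            if not ia or not ib:
                # A node with no in-links has no evidence of similarity.
                new[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbours, damped by decay.
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


# Hypothetical toy graph: paper "a" cites both "p1" and "p2".
scores = simrank({"a": [], "p1": ["a"], "p2": ["a"]})
print(scores[("p1", "p2")])
```

In this toy graph the two cited papers share their only citing paper, so their score equals the decay factor. How S-SimRank brings content similarity into this iteration is described in the paper itself, not in the abstract.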

Authors: Cai Yuanzhe, Li Pei, Liu Hongyan, He Jun, Du Xiaoyong

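Affiliations: Key Laboratory of Data Engineering and Knowledge Engineering (Ministry of Education), Renmin University of China; School of Information, Renmin University of China; Department of Management Science and Engineering, Tsinghua University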

Source: Journal of Frontiers of Computer Science and Technology (计算机科学与探索), CSCD, 2009, No. 4, pp. 378-391 (14 pages)

Funding: The National Natural Science Foundation of China under Grant Nos. 70871068, 70621061, 70890083, 60873017, 60573092

Keywords: linkage mining; similarity calculation; text mining

Classification code: TP182 [Automation and Computer Technology: Control Theory and Control Engineering]
