Abstract
Microblogs are relationship-based platforms for sharing, spreading and acquiring information. They are a source of online public opinion and an important channel for information dissemination. Because forwarding is so convenient, large numbers of identical or similar microblog pages spread rapidly through the microblog space. Detecting similar microblog pages therefore matters both for reducing users' browsing burden and for making the analysis of online public opinion more efficient. This paper proposes a method based on the longest common subsequence (LCS) for detecting similar microblog pages: it first computes the subset of microblog page documents that may be similar, then computes their LCS and extracts the reliable part, and finally identifies the similar microblog pages. Experiments show that the method detects similar pages in microblog data accurately and efficiently.
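The LCS step named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's similarity formula, threshold, candidate-subset construction, or rule for extracting the reliable part, so everything below is an assumption made for illustration: the function names `lcs_length`, `lcs_similarity` and `is_near_duplicate` are hypothetical, the score is simply the LCS length divided by the length of the longer text, and the 0.8 threshold is arbitrary.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Standard dynamic programming, keeping only two rows of the table.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer text so each row stays short
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed measure: LCS length normalised by the longer text's length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag two texts as similar when the assumed score reaches the threshold.

    The 0.8 default is illustrative only; it is not taken from the paper.
    """
    return lcs_similarity(a, b) >= threshold
```

For two page texts `p` and `q`, `is_near_duplicate(p, q)` returns a boolean. The table costs time proportional to the product of the two text lengths, which is presumably why the method first narrows the collection to a subset of possibly similar pages before computing any LCS.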
Source
《集成技术》
2013, No. 3, pp. 5-9 (5 pages)
Journal of Integration Technology
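Funding
National Natural Science Foundation of China (Grant No. 61272013)
Guangdong Provincial Education Science "Twelfth Five-Year Plan" 2012 Research Project (Grant No. 2010TJK311)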
Keywords
LCS
Longest Common Subsequence
near-duplicate detection
similarity measurement
microblog page