Detecting Near Duplicate Messages in Twitter

Original title: Twitter中近似重复消息的判定方法研究. Cited by: 16
Abstract: Microblogging is a new concept that emerged with Web 2.0. Twitter is a representative microblogging system, with more than 160 million users worldwide and considerable global influence: well-known politicians, celebrities, and large companies are nearly all Twitter users. Twitter messages are limited to 140 characters and are often informal in syntax and grammar. Moreover, Twitter does not strictly define the syntax of a retweet and lets users forward messages in many free-form formats, so the system contains a large number of duplicate or near-duplicate messages. These messages increase the storage burden and make it harder for users to read, understand, and analyze message content. This paper analyzes the syntax of retweeted messages and uses it to derive rules that remove retweet symbols, turning retweets into ordinary messages. It also proposes two string-distance methods, character statistics (counting the kinds of characters) and shortest edit distance, to identify near-duplicate messages in Twitter and cluster them into groups. In addition, it analyzes how Twitter messages are sent and the message characteristics of different log-in methods. Experimental results show that the two methods are extensible, simple to implement, and efficient, and that they can be used to discover and filter near-duplicate messages in microblogs.
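The pipeline the abstract outlines (strip retweet markers, then compare strings with a character-count distance and an edit distance) might look roughly like the following Python sketch. It is not the paper's implementation: the retweet patterns, the function names, the 0.2 threshold ratio, and the use of the character-count distance as a cheap prefilter are all assumptions made for illustration, since the abstract does not give the actual rules or thresholds.

```python
import re
from collections import Counter

# Hypothetical retweet markers; the paper's real extraction rules are not
# stated in the abstract, so these two patterns are illustrative only.
_RT_PREFIX = re.compile(r"^\s*(?:RT|via|retweeting)\s*:?\s*@\w+\s*:?\s*", re.IGNORECASE)
_VIA_SUFFIX = re.compile(r"\s*\((?:via|RT)\s+@\w+\)\s*$", re.IGNORECASE)


def strip_retweet_markers(text: str) -> str:
    """Remove leading 'RT @user:' chains and a trailing '(via @user)'."""
    prev = None
    while prev != text:  # repeat until nothing more is stripped
        prev = text
        text = _RT_PREFIX.sub("", text)
        text = _VIA_SUFFIX.sub("", text)
    return text.strip()


def char_count_distance(a: str, b: str) -> int:
    """Sum of absolute differences between per-character counts."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[ch] - cb[ch]) for ch in ca.keys() | cb.keys())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rolling rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_near_duplicate(a: str, b: str, max_ratio: float = 0.2) -> bool:
    """Assumed decision rule: edit distance <= max_ratio * longer length."""
    a = strip_retweet_markers(a).lower()
    b = strip_retweet_markers(b).lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    threshold = max_ratio * longest
    # One edit changes the character-count distance by at most 2, so a
    # count distance above 2 * threshold already rules out a match.
    if char_count_distance(a, b) > 2 * threshold:
        return False
    return edit_distance(a, b) <= threshold
```

For example, `is_near_duplicate("RT @alice: Big news today!", "Big news today!!")` compares the normalized strings "big news today!" and "big news today!!", which differ by a single insertion. The character-count check runs in linear time, so in this sketch it screens out clearly different pairs before the quadratic edit-distance computation.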
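Source: Journal of Chinese Information Processing (《中文信息学报》), 2011, No. 1, pp. 20-27 (8 pages). Indexed in CSCD and the Peking University Core Journals list.
Funding: National 242 Special Program (2009F108, 2009A91, 2009A19); National Natural Science Foundation of China (60903139)
Keywords: microblog; Twitter; near duplicate message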