
Enriching short text representation in microblog for clustering (cited by: 14)

Abstract: Social media websites allow users to exchange short texts such as tweets via microblogs and user status in friendship networks. Their limited length, pervasive abbreviations, and coined acronyms and words exacerbate the problems of synonymy and polysemy, and bring about new challenges to data mining applications such as text clustering and classification. To address these issues, we dissect some potential causes and devise an efficient approach that enriches data representation by employing machine translation to increase the number of features from different languages. Then we propose a novel framework which performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. The proposed approach is evaluated extensively in terms of effectiveness on two social media datasets from Facebook and Twitter. With its significant performance improvement, we further investigate potential factors that contribute to the improved performance.
Source: Frontiers of Computer Science (SCIE, EI, CSCD), 2012, No. 1, pp. 88-101 (14 pages); Chinese journal title: 中国计算机科学前沿(英文版)
Keywords: short texts, text representation, multi-language knowledge, matrix factorization, social media



