LDA Feature Selection Based Text Classification and User Clustering in Chinese Online Health Community

Cited by: 44
Abstract: Online health communities (OHCs) have emerged with the rapid development of the Internet. Free of the limits of time and place, they provide users with abundant medical information and emotional support, have become an important source of social support, and are increasingly popular among people with health issues in China. Mining user-generated text in OHCs can reveal users' participation behavior and thus help optimize user management and information recommendation. Most existing studies examine English-language OHCs; few address Chinese ones. Based on social support theory, we design a Chinese text mining process to study the types of social support and user engagement in Chinese OHCs, and apply Chinese text mining and machine learning methods to the Chinese diabetes community "甜蜜家园". We first use an LDA (Latent Dirichlet Allocation) model for feature extraction to construct low-dimensional text representation vectors, and then use binary classification to assign users' posts and replies to different types of social support. Finally, we cluster users with the K-means algorithm based on the classification results to identify user roles. Compared with the traditional vector space model, LDA feature extraction not only significantly reduces the data dimension and the amount of human-annotated data, but also improves classification accuracy and efficiency. The results show that the proposed process performs well in both text classification and user clustering.
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》; indexed in CSSCI, CSCD, and the PKU Core list), 2017, No. 11, pp. 1183-1191 (9 pages)
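Funding: National Natural Science Foundation of China project "Research on the Evolution of User Behavior in Online Health Communities under Content-Relationship Interaction" (71573197)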
Keywords: online health communities (OHCs); LDA model; feature extraction; text classification; user clustering
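
The feature extraction step described in the abstract can be illustrated with a small sketch. The record does not say how the authors fitted their LDA model, so the collapsed Gibbs sampler below is only one common way to do it, not their method. The function name `lda_doc_topic`, the hyperparameters `alpha` and `beta`, and the toy pre-tokenized posts are all hypothetical; real Chinese posts would first need word segmentation. The point it shows is the dimension reduction: each post ends up as a vector with one entry per topic instead of one entry per vocabulary word. The downstream classifier and K-means steps are not shown.

```python
import random

def lda_doc_topic(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed fitting method).

    docs is a list of token lists. Returns (theta, vocab), where theta[d]
    is the smoothed topic-proportion vector of document d.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]                # tokens in doc d with topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word w assigned to topic k
    topic_total = [0] * num_topics                              # all tokens assigned to topic k
    assign = []                                                 # current topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assign[d][i]
                # Take the current token out of the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Put the token back under its newly sampled topic.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return theta, vocab

# Hypothetical, already tokenized posts (illustration only, not data from the paper).
posts = [
    ["insulin", "dose", "blood", "sugar", "diet"],
    ["blood", "sugar", "test", "insulin", "dose"],
    ["thanks", "hope", "support", "encourage", "together"],
    ["hope", "encourage", "thanks", "together", "support"],
]
theta, vocab = lda_doc_topic(posts, num_topics=2)
print(len(vocab), "vocabulary terms reduced to", len(theta[0]), "topic features per post")
```

In a pipeline like the one the abstract outlines, vectors such as `theta` would be the input to one binary classifier per social support type, and the per-user classification results would then be clustered with K-means.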