Abstract
This paper analyzes the post content and temporal patterns of online activity of high-risk users in the "Fear of HIV Bar" (恐艾吧) on Baidu Tieba. An LDA topic model is used to compare the topics discussed in main threads with and without participation by HIV-infected users. A keyword-based machine learning method is used to classify the sexual orientation of users who start threads in the "Fear of HIV Bar", and the prevalence of HIV is then calculated for groups with different sexual orientations. The results indicate that online data mining techniques are more efficient than traditional methods and can serve as an important supplement to research on high-risk populations. In addition, machine-learning-based classification of sexual orientation is of great significance for public health agencies monitoring how the epidemic develops in different populations.
Authors
肖时耀
吕慰
陈洒然
秦烁
黄格
蔡梦思
谭跃进
谭旭
吕欣
XIAO Shiyao; LYU Wei; CHEN Saran; QIN Shuo; HUANG Ge; CAI Mengsi; TAN Yuejin; TAN Xu; LU Xin (School of Systems Engineering, National University of Defense Technology, Changsha 410073, China; Department of Oncology, Kangya Hospital, Yiyang 413002, China; School of Software Engineering, Shenzhen Institute of Information Technology, Shenzhen 518172, China)
Source
Big Data Research (《大数据》)
2019, No. 1, pp. 98-108 (11 pages)
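Funding
National Natural Science Foundation of China (No.91846301, No.71771213, No.71790615, No.71690233)
Humanities and Social Sciences Foundation of the Ministry of Education of China (No.17YJCZH157)
Shenzhen "Pengcheng Scholar Program" fund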