Abstract
Mining and clustering course topics on online education platforms through effective data mining is one of the pressing problems in online education. This study analyzes the topic distribution and categories of 1,472 courses on an online education platform based on their text descriptions. The descriptions of the 1,472 courses were collected, segmented into words using a custom dictionary and a stop-word list, and weighted by TF-IDF. An LDA topic model was then used to identify the topic distribution of the courses; 230 topics were discovered, and the document-topic distribution of each course over these 230 topics, together with the topic-word distributions, was obtained. The courses were further clustered hierarchically using a distribution similarity function, which showed that courses are interrelated through topics at different levels of abstraction. Finally, 16 topics were visualized; from the perspectives of both content and quantity, these topics reflect the topic features of the courses and their aggregate distribution.
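The clustering step described above can be illustrated with a minimal sketch. The abstract does not name the distribution similarity function or the linkage rule, so the Jensen-Shannon divergence and average linkage used here are assumptions, and the document-topic vectors are hypothetical toy values rather than data from the study (which used 230 topics and 1,472 courses).

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Assumed stand-in for the paper's unspecified similarity function.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def average_linkage(doc_topic, n_clusters):
    """Agglomerative clustering: merge the two closest clusters until
    only n_clusters remain. Returns the clusters and the merge history."""
    clusters = [[i] for i in range(len(doc_topic))]
    merges = []

    def dist(a, b):
        # Average pairwise divergence between members of two clusters.
        total = sum(js_divergence(doc_topic[i], doc_topic[j])
                    for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[x], clusters[y]))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical document-topic distributions for five courses, three topics.
doc_topic = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.05, 0.90],
]

clusters, merges = average_linkage(doc_topic, n_clusters=3)
print(sorted(sorted(c) for c in clusters))
```

Stopping the merging at different values of `n_clusters` gives groupings at different levels of abstraction, which is one way to read the hierarchy mentioned in the abstract. For a corpus of realistic size, a library routine for hierarchical clustering would be a more practical choice than this quadratic pure-Python loop.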
Authors
李梦杰
刘建国
郭强
李仁德
汤晓雷
LI Mengjie; LIU Jianguo; GUO Qiang; LI Rende; TANG Xiaolei (Research Center of Complex Systems Science, University of Shanghai for Science and Technology, Shanghai 200093, China; Laboratory Center, Shanghai University of Finance and Economics, Shanghai 200433, China; Hujiang Education & Technology Co., Ltd., Shanghai 201203, China)
Source
Journal of University of Shanghai for Science and Technology (《上海理工大学学报》)
CAS
PKU Core
2018, No. 3, pp. 259-266 (8 pages)
Funding
National Natural Science Foundation of China (61773248, 71771152)
Keywords
topic discovery
hierarchical clustering
online education
text mining