Abstract
To address the rapid growth of WeChat official account data and the low efficiency with which people engaged in WeChat activities extract useful information from it, a method for organizing WeChat official account information and classifying and labeling it quickly in parallel was proposed. First, after introducing the practical applications of WeChat official accounts, the method took the classic K-Nearest Neighbor (KNN) classification algorithm as its basis, and the efficiency shortcomings of single-machine KNN were examined in practice. Then, a KNN algorithm based on the MapReduce model was implemented on the Hadoop platform; the efficiency of the single-machine and distributed versions was compared, and the value of K was tuned. In the experiments, the training sample set was specified manually, and text similarity was determined in four steps: word segmentation, feature word extraction, weight calculation, and calculation of the angle (cosine) between the test vector and the training vectors. In a classification experiment on 10 million official account records over 24 categories, each official account was assigned a single label or multiple labels. The classification accuracy after optimization reached 82%, and official accounts related to daily life made up more than 70% of the total. The study indicates that when information is spread to targeted groups based on the classification results, the conversion rate of that spread improves, and that the distributed KNN algorithm is more efficient and more robust than the single-machine algorithm for processing WeChat official account data.
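The similarity steps named in the abstract (weight calculation, cosine between test and training vectors, KNN vote) and the map/reduce split can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the TF-IDF weighting formula, the function names, and the "local top-k per partition, then merge" layout are all assumptions, and the input is taken to be already segmented into tokens. It runs in a single process and only mimics the shape of a MapReduce job; it does not use Hadoop.

```python
import heapq
import math
from collections import Counter


def build_idf(token_docs):
    """Inverse document frequency over the training set (smoothed; an assumed formula)."""
    n = len(token_docs)
    df = Counter(term for doc in token_docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def weigh(tokens, idf):
    """Turn a token list into a sparse {term: tf * idf} vector; unseen terms are dropped."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_top_k(test_vec, partition, k):
    """'Map' step: score one partition of (vector, label) pairs, emit its local top k."""
    scored = ((cosine(test_vec, vec), label) for vec, label in partition)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def reduce_vote(partial_results, k):
    """'Reduce' step: merge the local top-k lists, keep the global top k, majority vote."""
    merged = (pair for partial in partial_results for pair in partial)
    nearest = heapq.nlargest(k, merged, key=lambda pair: pair[0])
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def classify(test_tokens, partitions, idf, k=3):
    """Classify one tokenized document against a partitioned training set."""
    test_vec = weigh(test_tokens, idf)
    partials = [map_top_k(test_vec, part, k) for part in partitions]
    return reduce_vote(partials, k)
```

Because each partition emits only its k best candidates, the reduce step merges at most k pairs per partition, which is one common way to keep a distributed KNN job's shuffle volume small. Assigning multiple labels, as the abstract mentions, would need an extra rule (for example, a vote threshold) that the abstract does not specify.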
Source
《计算机应用》
CSCD
PKU Core Journals (北大核心)
2017, Issue A01, pp. 295-299 (5 pages)
Journal of Computer Applications
Funding
Projects of the State Administration of Work Safety (sichuan-0008-2016AQ, sichuan-0009-2016AQ)