
Application of distributed KNN algorithm in WeChat subscription classification (cited by: 4)
Abstract: WeChat subscription (official account) data is growing rapidly, and people engaged in WeChat activities extract useful information from it inefficiently. To address this, a method for sorting, classifying, and labeling WeChat subscription data in parallel was proposed. First, the practical applications of WeChat subscriptions were introduced and, taking the classic K-Nearest Neighbor (KNN) classification algorithm as the basis, the efficiency shortcomings of KNN on a single node were tested and analyzed. Then, a distributed KNN algorithm based on the MapReduce model was implemented on the Hadoop platform; the efficiency of the stand-alone and distributed algorithms was compared and the value of K was tuned. In the experiment, the training sample set was specified manually, and the text similarity between a test sample and a training sample was determined by the following steps: word segmentation, feature word extraction, weight calculation, and calculation of the angle (cosine coefficient) between the test vector and the training vector. Ten million records of real WeChat subscription data were classified into 24 categories, and every subscription was assigned a single label or multiple labels. The classification accuracy after optimization reached 82%, and life-related subscriptions accounted for more than 70% of the total. The research shows that when information is delivered to specific groups based on the classification results, the conversion rate of information dissemination improves, and that the distributed KNN algorithm is more efficient and more robust than the stand-alone algorithm for processing WeChat subscription data.
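The pipeline described in the abstract (feature weighting, cosine similarity, then KNN split into map and reduce steps) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the text has already been segmented into tokens, uses TF-IDF as one possible weighting scheme (the abstract says only "weight calculation"), assigns a single label by majority vote, and simulates the map and reduce phases in one Python process rather than calling Hadoop. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import chain

def weigh(tokens, idf):
    """Turn a token list into a sparse {term: tf * idf} vector."""
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

def tfidf_vectors(docs):
    """Compute a smoothed idf table from docs and weigh each doc with it."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [weigh(doc, idf) for doc in docs], idf

def cosine(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def map_phase(train_split, tests):
    """Mapper: for one split of the training set, emit (test_id, (similarity, label))."""
    for train_vec, label in train_split:
        for test_id, test_vec in tests.items():
            yield test_id, (cosine(test_vec, train_vec), label)

def reduce_phase(pairs, k):
    """Reducer: per test_id, keep the k most similar neighbors and take a majority vote."""
    grouped = defaultdict(list)
    for test_id, sim_label in pairs:
        grouped[test_id].append(sim_label)
    result = {}
    for test_id, candidates in grouped.items():
        top_k = sorted(candidates, key=lambda x: x[0], reverse=True)[:k]
        votes = Counter(label for _, label in top_k)
        result[test_id] = votes.most_common(1)[0][0]
    return result

def classify(train, tests, k=3, n_splits=2):
    """Run the simulated map and reduce phases over n_splits training splits."""
    vectors, idf = tfidf_vectors([tokens for tokens, _ in train])
    labeled = list(zip(vectors, (label for _, label in train)))
    splits = [labeled[i::n_splits] for i in range(n_splits)]
    test_vecs = {tid: weigh(tokens, idf) for tid, tokens in tests.items()}
    pairs = chain.from_iterable(map_phase(s, test_vecs) for s in splits)
    return reduce_phase(pairs, k)

# Toy, manually labeled training set (hypothetical data, already tokenized).
train = [
    (["recipe", "cooking", "dinner"], "life"),
    (["travel", "hotel", "food"], "life"),
    (["stock", "fund", "market"], "finance"),
    (["bank", "loan", "market"], "finance"),
]
tests = {
    "acct_1": ["cooking", "food", "dinner"],
    "acct_2": ["stock", "bank", "loan"],
}
predictions = classify(train, tests, k=3, n_splits=2)
print(predictions)
```

In a real Hadoop job the mapper and reducer would run on separate nodes and the shuffle step would do the grouping by test ID; the sketch only shows why the neighbor search parallelizes cleanly across training splits. Multi-label output, as mentioned in the abstract, could be approximated by returning every label whose vote share exceeds a threshold, but that rule is an assumption and is not stated in the source.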
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2017, Issue A01, pp. 295-299 (5 pages)
Funding: State Administration of Work Safety project (sichuan-0008-2016AQ, sichuan-0009-2016AQ)
Keywords: WeChat subscription; Hadoop platform; MapReduce model; K-Nearest Neighbor (KNN); classification
