Abstract
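When faced with massive amounts of information in a big data environment, traditional topic-tracking techniques cannot track follow-up reports on a topic in a timely and accurate manner. Based on the Hadoop platform, this paper studies and implements a KNN-based topic-tracking technique for big data. The method first implements, on the Hadoop platform, the parallel extraction of text features that combine word weight with document term frequency. It then parallelizes the traditional KNN algorithm on the Hadoop platform so that multiple topics can be tracked simultaneously, which yields a KNN-based topic-tracking technique for big data. Experiments show that the method solves the big data topic-tracking problem fairly effectively.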
When dealing with massive information in a big data environment, traditional topic-tracking technology cannot track follow-up reports in a timely and accurate manner, let alone meet the demand of intelligence personnel for open source intelligence collection and detection. Based on the Hadoop platform, this study implements a big data topic-tracking algorithm through parallel KNN. First, on the Hadoop platform, we realize parallel text feature extraction by combining word weight with term frequency. Then we improve the traditional KNN algorithm by adding a threshold setting and parallelize the algorithm on the Hadoop platform, which allows multiple topics to be tracked at the same time. Finally, we implement a big data topic-tracking algorithm based on parallel KNN.
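The pipeline summarized above (weighted term-frequency features, then KNN with a threshold, expressed as map and reduce steps) can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the abstract does not specify the word weight, the similarity measure, the value of k, or the threshold, so the IDF weight, cosine similarity, `k=3`, `threshold=0.5`, and all function names below are assumptions chosen for illustration, and the map and reduce phases are simulated in plain Python rather than run on Hadoop.

```python
import math
from collections import Counter, defaultdict

def tf_weight_vector(tokens, idf):
    """Feature vector: term frequency times a per-word weight (IDF is an assumption)."""
    tf = Counter(tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def map_phase(story_id, story_vec, training):
    """Mapper: emit (story_id, (similarity, topic)) for each labelled training story."""
    for topic, vec in training:
        yield story_id, (cosine(story_vec, vec), topic)

def reduce_phase(pairs, k, threshold):
    """Reducer: keep the k nearest neighbours, sum similarity per topic,
    and accept the best topic only if its score reaches the threshold."""
    nearest = sorted(pairs, reverse=True)[:k]
    scores = defaultdict(float)
    for sim, topic in nearest:
        scores[topic] += sim
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

def track(stories, labelled, k=3, threshold=0.5):
    """Assign each incoming story to a tracked topic, or None if no topic qualifies."""
    docs = [toks for _, toks in labelled] + list(stories.values())
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(len(docs) / df[w]) + 1.0 for w in df}
    training = [(topic, tf_weight_vector(toks, idf)) for topic, toks in labelled]
    grouped = defaultdict(list)  # stands in for the shuffle step between map and reduce
    for sid, toks in stories.items():
        for key, value in map_phase(sid, tf_weight_vector(toks, idf), training):
            grouped[key].append(value)
    return {sid: reduce_phase(vals, k, threshold) for sid, vals in grouped.items()}

# Hypothetical toy data: two tracked topics and two incoming stories.
labelled = [
    ("quake", ["earthquake", "rescue", "city"]),
    ("quake", ["earthquake", "aftershock"]),
    ("election", ["vote", "candidate", "poll"]),
    ("election", ["candidate", "campaign"]),
]
stories = {"s1": ["earthquake", "rescue", "team"], "s2": ["football", "match"]}
result = track(stories, labelled)
```

Because every story is scored against all tracked topics in one pass, several topics are followed at once, and a story that resembles none of them falls below the threshold and is left unassigned.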
Authors
单志佳
席耀一
唐永旺
杨航
张新宇
SHAN Zhijia; XI Yaoyi; TANG Yongwang; YANG Hang; ZHANG Xinyu (Unit 61849, Foshan 528220, China; Information Engineering University, Zhengzhou 450001, China)
Source
《信息工程大学学报》
2019, No. 3, pp. 379-384 (6 pages)
Journal of Information Engineering University