Abstract
With the advance of informatization and the growth of the Internet industry, information devices generate large volumes of web log data as they run and communicate. Web logs are unstructured, which makes relevant information hard to obtain, and their volume is growing rapidly, so extracting and processing the required information is a challenging task. Conventional data mining techniques are often time-consuming to apply and produce excessive output. In a big data environment, traditional serial web log clustering methods are limited in performance and are no longer suitable for data at this scale, while the parallel processing methods currently in common use for web logs leave room for improvement in computation time, parallel efficiency, and accuracy. This paper therefore proposes an improved web log clustering technique based on feature transition probability and implements it on the Apache Spark platform to extract frequent patterns from massive web logs. Experimental results show that the proposed method can extract the required information from complete web logs and analyze them efficiently in a big data environment; compared with commonly used clustering analysis algorithms, the proposed feature transition probability-based processing reduces the execution time to 75.97%.
Authors
QI Wen (齐文); ZHU Xi-yuan (朱曦源); SONG Jie (宋杰)
School of Engineering and Technology, Liaodong University, Dandong 118001, China; Software College, Northeastern University, Shenyang 110819, China
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journal (北大核心)
2023, Issue 3, pp. 514-520 (7 pages)
Funding
Supported by the National Natural Science Foundation of China (61672143).