Abstract
Traditional access log analysis systems collect data inefficiently, cannot recursively monitor the log collection directory, and rely on storage and computing systems that lack high availability. This paper builds a highly available log analysis system on a distributed cluster. Data for real-time analysis is collected by connecting Nginx directly to Kafka, and data for offline analysis is collected by Flume with a custom Source component. The highly available Hadoop Distributed File System (HDFS) provides persistent storage and Spark serves as the computing engine, while MySQL and HBase store the aggregated data and the detailed data, respectively. Experimental results show that the system's functions behave as expected: the direct Nginx-Kafka connection and the Flume custom Source component significantly improve collection efficiency, and HDFS and Spark, both coordinated by ZooKeeper, meet the high-availability requirement. The ALS algorithm is used to test the functions of the storage and computing systems.
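The abstract names recursive monitoring of the log collection directory as a gap in traditional systems. As a rough illustration of what that involves, the sketch below walks a directory tree and returns only the lines appended since the previous poll. It is not the authors' Flume Source component (which is not shown in this record); the function names, the `.log` suffix, and the in-memory offset dictionary are all assumptions made for the example.

```python
import os


def find_log_files(root, suffix=".log"):
    """Return every file under root, including nested subdirectories, that ends with suffix."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def read_new_lines(path, offsets):
    """Return complete lines appended to path since the last call.

    offsets maps each path to the byte position already consumed
    (a hypothetical stand-in for a persistent position file).
    """
    start = offsets.get(path, 0)
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read()
    # Consume only up to the last newline so a half-written line is left for the next poll.
    end = data.rfind(b"\n") + 1
    offsets[path] = start + end
    return data[:end].decode("utf-8", errors="replace").splitlines()


def poll_once(root, offsets):
    """Scan the whole tree once and collect the new lines from every log file."""
    events = []
    for path in find_log_files(root):
        events.extend(read_new_lines(path, offsets))
    return events
```

Calling `poll_once(root, offsets)` repeatedly with the same `offsets` dictionary yields each complete line once, including lines from log files that appear later in newly created subdirectories. A production collector would also need to persist the offsets and handle file rotation, which this sketch leaves out.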
Authors
CHEN Le (陈乐); YU Su (余粟); WANG Meng (王盟)
Shanghai University of Engineering Science, Shanghai 201620, China
Source
Journal of China Academy of Electronics and Information Technology (《中国电子科学研究院学报》)
Indexed in the Peking University Core Journals list (北大核心)
2020, Issue 5, pp. 420-426 (7 pages)
Funding
Project funded by the Science and Technology Commission of Shanghai Municipality (No. 175111110204).