A Survey of Log Division Technique in Process Mining (面向过程挖掘的日志划分技术综述)

Cited by: 2

Abstract: Process mining aims to extract valuable information from event logs generated by software systems, which is then used to configure or optimize business processes that are already deployed. Meanwhile, the development of information technologies (such as big data and the Internet of Things) not only makes process models structurally and behaviorally more complex, but also accelerates business evolution. Under these circumstances, it is necessary to divide the original event log into sub-logs so that it can be analyzed more effectively, which in turn improves the quality of process mining: mining process models from the sub-logs is more instructive than mining them from the original log. Just as data division improves model performance through the analysis of subsets, log division splits the original event log into multiple sub-logs, using different methods for different problems. Analysis of these sub-logs supports process mining research, especially in the process discovery scenario.

Process mining has three application scenarios: process discovery, conformance checking and process enhancement. The most crucial task is process discovery, defined as constructing a reasonable process model from the original event log. However, a model mined from the original event log is often too complex (a spaghetti-like model) and inaccurate (it neglects evolution). At present, the solutions to these two problems are trace clustering and concept drift, respectively: trace clustering is a versatile way to reduce the complexity of the mined models, while concept drift research detects changes in event logs to improve the accuracy of process discovery. In our view, the two solutions rest on the same principle, because both improve the quality of the mined models by dividing the original log into multiple sub-logs.

In this paper, we summarize these two key branches of log division, trace clustering and concept drift, and try to clarify their similarities and differences. We find that trace clustering considers only the similarity between traces in the original log and ignores their timestamp attribute. In contrast, concept drift research focuses on finding the time points (or the positions of traces in the original log) at which the process changes, and then divides the original log into sub-logs at these points. Moreover, we systematically collect and analyze the related studies through a literature review protocol, and find that over the past five years work on concept drift has grown faster than work on trace clustering. More concretely, we classify trace clustering methods into three categories (distance-driven, model-driven and hybrid clustering) and concept drift methods into two types (single type and composite type). Finally, we use publicly available datasets to evaluate the advantages and disadvantages of the different types of methods, and sketch some potential directions for future research.
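The contrast the abstract draws between the two branches of log division can be illustrated with a small sketch. It is not taken from the surveyed paper: the toy log, the Jaccard distance over activity sets, the greedy clustering rule, the threshold, and the function names `cluster_traces` and `split_at_drift_points` are all assumptions made for illustration. Drift detection itself is not implemented; the drift time point is assumed to be given.

```python
from datetime import datetime

# Toy event log (invented data): each entry is (case start time, activity sequence).
log = [
    (datetime(2022, 1, 1), ("a", "b", "c", "d")),
    (datetime(2022, 1, 2), ("a", "c", "b", "d")),
    (datetime(2022, 1, 3), ("a", "e", "d")),
    (datetime(2022, 2, 1), ("a", "b", "c", "d")),
    (datetime(2022, 2, 2), ("a", "e", "f", "d")),
    (datetime(2022, 2, 3), ("a", "e", "d")),
]


def jaccard_distance(trace_a, trace_b):
    """Distance between two traces based only on the sets of activities they contain."""
    set_a, set_b = set(trace_a), set(trace_b)
    return 1 - len(set_a & set_b) / len(set_a | set_b)


def cluster_traces(event_log, threshold=0.5):
    """Distance-driven division (hypothetical greedy variant).

    Each trace joins the first cluster whose first trace is within
    `threshold`; otherwise it starts a new cluster. Timestamps are ignored.
    """
    clusters = []
    for _, trace in event_log:
        for cluster in clusters:
            if jaccard_distance(cluster[0], trace) <= threshold:
                cluster.append(trace)
                break
        else:
            clusters.append([trace])
    return clusters


def split_at_drift_points(event_log, drift_points):
    """Time-based division: cut the chronologically ordered log at the given time points.

    The drift points are assumed to come from some drift detection step
    that is outside the scope of this sketch.
    """
    points = sorted(drift_points)
    sublogs = [[] for _ in range(len(points) + 1)]
    for timestamp, trace in sorted(event_log, key=lambda entry: entry[0]):
        index = sum(timestamp >= point for point in points)
        sublogs[index].append(trace)
    return sublogs


clusters = cluster_traces(log)
sublogs = split_at_drift_points(log, [datetime(2022, 1, 15)])
print("cluster sizes:", [len(c) for c in clusters])
print("sub-log sizes:", [len(s) for s in sublogs])
```

Both functions return sub-logs of the same original log, but they group the traces differently: the first by behavioral similarity regardless of when a trace occurred, the second by position in time regardless of how similar the traces are.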

Authors: LIN Lei-Lei (林雷蕾); WEN Li-Jie (闻立杰); QIAN Chen (钱忱); ZONG Zan (宗瓒); WANG Jian-Min (王建民) (Department of Software, Tsinghua University, Beijing 100084; Beijing Key Laboratory of Industrial Big Data System and Application, Beijing 100084)

Affiliations: School of Software, Tsinghua University; Beijing Key Laboratory of Industrial Big Data System and Application

Source: Chinese Journal of Computers (《计算机学报》), 2022, No. 9, pp. 1946-1968 (23 pages). Indexed in: EI, CAS, CSCD, PKU Core.

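Funding: Supported by the National Key Research and Development Program of China (2019YFB1704003) and the National Natural Science Foundation of China (71690231, 62021002).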

Keywords: process mining; trace clustering; concept drift; business evolution

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]
