Abstract
Most existing crash risk prediction studies rely on "matched case-control" under-sampling, in which non-crash observations are drawn from continuous traffic flow data at a set ratio for each crash, and models built this way predict poorly in a continuous data environment. This paper therefore studies the construction of road crash risk prediction models for a setting in which urban traffic is continuously observed and dynamically managed (referred to as a continuous data environment), as in an active traffic management (ATM) system. A traffic crash is a low-probability event, so crash and non-crash cases are heavily imbalanced and crash risk prediction becomes an imbalanced data classification problem. The paper proposes building the model on the full set of traffic flow data and handling the imbalance by adjusting the classification threshold that separates crashes from non-crashes. Loop detector traffic flow data and historical crash data from the Shanghai urban expressway system for May and June 2014 were used in an empirical study. The area under the receiver operating characteristic curve (AUC) was used to compare the overall predictive ability of commonly used crash risk prediction models (logistic regression and random forest) built on the full data set and on the sampled data set. The geometric mean of sensitivity and specificity was used to compare how three ways of calculating the classification threshold (Youden's index, the crash proportion method, and the cross point method) affect the combined prediction accuracy for crashes and non-crashes. The results show that, in a continuous data environment, modeling with the full data set improves overall predictive ability by 13.06%, and that calculating the classification threshold with Youden's index gives the best combined crash/non-crash prediction accuracy.
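The three threshold rules and the evaluation index named in the abstract can be illustrated with a minimal Python sketch. The definitions used here are common textbook ones and are assumptions, not necessarily the exact formulations in the paper: Youden's index picks the threshold maximizing sensitivity + specificity - 1, the crash proportion rule sets the threshold to the share of crash cases in the data, and the cross point rule picks the threshold where sensitivity and specificity are closest. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def sens_spec(y_true, scores, threshold):
    """Sensitivity and specificity when predicting 'crash' for scores >= threshold."""
    pred = scores >= threshold
    pos = y_true == 1
    neg = ~pos
    sensitivity = np.sum(pred & pos) / np.sum(pos)
    specificity = np.sum(~pred & neg) / np.sum(neg)
    return sensitivity, specificity

def g_mean(y_true, scores, threshold):
    """Geometric mean of sensitivity and specificity at a given threshold."""
    sens, spec = sens_spec(y_true, scores, threshold)
    return float(np.sqrt(sens * spec))

def candidate_thresholds(y_true, scores):
    """Three threshold rules (assumed definitions, see the lead-in)."""
    cands = np.unique(scores)  # every observed score is a candidate cut-off
    stats = np.array([sens_spec(y_true, scores, t) for t in cands])
    sens, spec = stats[:, 0], stats[:, 1]
    return {
        # Youden's index: maximize J = sensitivity + specificity - 1
        "youden": float(cands[np.argmax(sens + spec - 1)]),
        # Crash proportion: threshold equals the share of crash cases
        "crash_proportion": float(np.mean(y_true)),
        # Cross point: sensitivity and specificity curves are closest
        "cross_point": float(cands[np.argmin(np.abs(sens - spec))]),
    }

# Synthetic, heavily imbalanced toy data (not the Shanghai data set).
rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.02).astype(int)
scores = np.clip(rng.normal(0.02 + 0.05 * y, 0.02), 0, 1)

for name, t in candidate_thresholds(y, scores).items():
    print(name, round(t, 4), round(g_mean(y, scores, t), 3))
```

With a default threshold of 0.5, a model trained on data this imbalanced would label almost everything as non-crash, which is why the threshold is chosen from the score distribution instead.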
Authors
高珍
高屹
余荣杰
黄智强
王雪松
GAO Zhen; GAO Yi; YU Rong-jie; HUANG Zhi-qiang; WANG Xue-song (School of Software Engineering, Tongji University, Shanghai 201804, China; Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University, Shanghai 201804, China)
Source
《中国公路学报》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2018, No. 4, pp. 280-287 (8 pages)
China Journal of Highway and Transport
Funding
National Natural Science Foundation of China (71401127, 51522810)
Science and Technology Commission of Shanghai Municipality (15DZ1204800)
Keywords
traffic engineering
continuous data environment
crash risk prediction model
imbalanced data
binary classification threshold
urban expressway