Funding: This work was supported by the National Natural Science Foundation of China (Nos. 71871188 and U1834209) and the Research and Development Project of China National Railway Group Co., Ltd. (No. P2020X016).
Abstract: Delay-causing text data contain valuable information, such as the specific reasons for a delay and the location and time of the disturbance, which can support the prediction of train delays and improve guidance for train control. Based on the train operation data and delay-causing data of the Wuhan-Guangzhou high-speed railway, algorithms from the natural language processing field are used to process the delay-causing text data. The train operating-environment information and the delay-causing text information are then integrated to develop a cause-based train delay propagation prediction model. The Word2vec model is first used to vectorize the delay-causing text description after word segmentation. The mean model or the term frequency-inverse document frequency (TF-IDF)-weighted model is then used to generate the delay-causing sentence vector from the original word vectors. Afterward, the train operating-environment features and the delay-causing sentence vector are input into the extreme gradient boosting (XGBoost) regression algorithm to develop a delay propagation prediction model. In this work, 4 text feature processing methods and 8 regression algorithms are considered. The results demonstrate that the XGBoost regression algorithm achieves the highest prediction accuracy when using the text features processed by the continuous bag-of-words and mean models. Compared with a prediction model that considers only the train operating-environment features, prediction accuracy improves significantly across multiple regression algorithms after the delay-causing features are integrated.
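The sentence-vector step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional word vectors, the English tokens, and the function names below are hypothetical stand-ins for trained Word2vec output on segmented delay-cause text, and TF-IDF is computed with a plain `log(N / df)` weighting, which is only one of several common variants.

```python
import math
from collections import Counter

# Hypothetical toy word vectors standing in for trained Word2vec embeddings.
WORD_VECTORS = {
    "signal": [1.0, 0.0, 0.0],
    "failure": [0.0, 1.0, 0.0],
    "strong": [0.0, 0.0, 1.0],
    "wind": [1.0, 1.0, 0.0],
}

# Hypothetical segmented delay-cause descriptions (one token list per record).
CORPUS = [
    ["signal", "failure"],
    ["strong", "wind"],
    ["signal", "failure", "wind"],
]


def mean_sentence_vector(tokens, word_vectors):
    """Mean model: average the vectors of all tokens that have an embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def idf_table(corpus):
    """Inverse document frequency, log(N / df), for every token in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log(n_docs / c) for t, c in doc_freq.items()}


def tfidf_sentence_vector(tokens, word_vectors, idf):
    """TF-IDF-weighted model: weighted average of the word vectors."""
    tf = Counter(t for t in tokens if t in word_vectors and t in idf)
    weights = {t: (c / len(tokens)) * idf[t] for t, c in tf.items()}
    total = sum(weights.values())
    if total == 0:
        # All weights vanish (or no known tokens): fall back to the plain mean.
        return mean_sentence_vector(tokens, word_vectors)
    dim = len(next(iter(word_vectors.values())))
    return [
        sum(w * word_vectors[t][i] for t, w in weights.items()) / total
        for i in range(dim)
    ]


idf = idf_table(CORPUS)
mean_vec = mean_sentence_vector(CORPUS[0], WORD_VECTORS)
tfidf_vec = tfidf_sentence_vector(CORPUS[1], WORD_VECTORS, idf)

# Hypothetical operating-environment features (e.g. current delay in minutes,
# scheduled dwell time in minutes) concatenated with the sentence vector.
feature_row = [12.0, 3.0] + tfidf_vec
```

A row such as `feature_row` is the kind of input that would then be passed to a regressor, for example `xgboost.XGBRegressor`. In the TF-IDF variant, the rarer token "strong" receives more weight than "wind", so its direction dominates the sentence vector.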