Machine learning algorithms (MLs) can potentially improve disease diagnostics, leading to early detection and treatment of these diseases. As a malignant tumor whose primary focus is located in the bronchial mucosal e...Machine learning algorithms (MLs) can potentially improve disease diagnostics, leading to early detection and treatment of these diseases. As a malignant tumor whose primary focus is located in the bronchial mucosal epithelium, lung cancer has the highest mortality and morbidity among cancer types, threatening health and life of patients suffering from the disease. Machine learning algorithms such as Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Naïve Bayes (NB) have been used for lung cancer prediction. However they still face challenges such as high dimensionality of the feature space, over-fitting, high computational complexity, noise and missing data, low accuracies, low precision and high error rates. Ensemble learning, which combines classifiers, may be helpful to boost prediction on new data. However, current ensemble ML techniques rarely consider comprehensive evaluation metrics to evaluate the performance of individual classifiers. The main purpose of this study was to develop an ensemble classifier that improves lung cancer prediction. An ensemble machine learning algorithm is developed based on RF, SVM, NB, and KNN. Feature selection is done based on Principal Component Analysis (PCA) and Analysis of Variance (ANOVA). This algorithm is then executed on lung cancer data and evaluated using execution time, true positives (TP), true negatives (TN), false positives (FP), false negatives (FN), false positive rate (FPR), recall (R), precision (P) and F-measure (FM). Experimental results show that the proposed ensemble classifier has the best classification of 0.9825% with the lowest error rate of 0.0193. This is followed by SVM in which the probability of having the best classification is 0.9652% at an error rate of 0.0206. On the other hand, NB had the worst performance of 0.8475% classification at 0.0738 error rate.展开更多
The accuracy of landslide susceptibility prediction(LSP)mainly depends on the precision of the landslide spatial position.However,the spatial position error of landslide survey is inevitable,resulting in considerable ...The accuracy of landslide susceptibility prediction(LSP)mainly depends on the precision of the landslide spatial position.However,the spatial position error of landslide survey is inevitable,resulting in considerable uncertainties in LSP modeling.To overcome this drawback,this study explores the influence of positional errors of landslide spatial position on LSP uncertainties,and then innovatively proposes a semi-supervised machine learning model to reduce the landslide spatial position error.This paper collected 16 environmental factors and 337 landslides with accurate spatial positions taking Shangyou County of China as an example.The 30e110 m error-based multilayer perceptron(MLP)and random forest(RF)models for LSP are established by randomly offsetting the original landslide by 30,50,70,90 and 110 m.The LSP uncertainties are analyzed by the LSP accuracy and distribution characteristics.Finally,a semi-supervised model is proposed to relieve the LSP uncertainties.Results show that:(1)The LSP accuracies of error-based RF/MLP models decrease with the increase of landslide position errors,and are lower than those of original data-based models;(2)70 m error-based models can still reflect the overall distribution characteristics of landslide susceptibility indices,thus original landslides with certain position errors are acceptable for LSP;(3)Semi-supervised machine learning model can efficiently reduce the landslide position errors and thus improve the LSP accuracies.展开更多
为使用正例与未标注数据训练分类器(positive and unlabeled learning,PU learning),提出基于随机森林的PU学习算法。对POSC4.5算法进行扩展,在其生成决策树的过程中加入随机特征选择;在训练阶段,使用有放回抽样技术对PU数据集抽样,生...为使用正例与未标注数据训练分类器(positive and unlabeled learning,PU learning),提出基于随机森林的PU学习算法。对POSC4.5算法进行扩展,在其生成决策树的过程中加入随机特征选择;在训练阶段,使用有放回抽样技术对PU数据集抽样,生成多个不同的PU训练集,并以其训练扩展后的POSC4.5算法,构造多棵决策树;在分类阶段,采用多数投票策略集成各决策树输出。在UCI数据集上的实验结果表明,该算法的分类性能优于偏置支持向量机算法、POS4.5算法和基于装袋技术的POSC4.5算法。展开更多
文摘Machine learning algorithms (MLs) can potentially improve disease diagnostics, leading to early detection and treatment of these diseases. As a malignant tumor whose primary focus is located in the bronchial mucosal epithelium, lung cancer has the highest mortality and morbidity among cancer types, threatening health and life of patients suffering from the disease. Machine learning algorithms such as Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Naïve Bayes (NB) have been used for lung cancer prediction. However they still face challenges such as high dimensionality of the feature space, over-fitting, high computational complexity, noise and missing data, low accuracies, low precision and high error rates. Ensemble learning, which combines classifiers, may be helpful to boost prediction on new data. However, current ensemble ML techniques rarely consider comprehensive evaluation metrics to evaluate the performance of individual classifiers. The main purpose of this study was to develop an ensemble classifier that improves lung cancer prediction. An ensemble machine learning algorithm is developed based on RF, SVM, NB, and KNN. Feature selection is done based on Principal Component Analysis (PCA) and Analysis of Variance (ANOVA). This algorithm is then executed on lung cancer data and evaluated using execution time, true positives (TP), true negatives (TN), false positives (FP), false negatives (FN), false positive rate (FPR), recall (R), precision (P) and F-measure (FM). Experimental results show that the proposed ensemble classifier has the best classification of 0.9825% with the lowest error rate of 0.0193. This is followed by SVM in which the probability of having the best classification is 0.9652% at an error rate of 0.0206. On the other hand, NB had the worst performance of 0.8475% classification at 0.0738 error rate.
基金the National Natural Science Foundation of China(Grant Nos.42377164 and 52079062)the Interdisciplinary Innovation Fund of Natural Science,Nanchang University(Grant No.9167-28220007-YB2107).
文摘The accuracy of landslide susceptibility prediction(LSP)mainly depends on the precision of the landslide spatial position.However,the spatial position error of landslide survey is inevitable,resulting in considerable uncertainties in LSP modeling.To overcome this drawback,this study explores the influence of positional errors of landslide spatial position on LSP uncertainties,and then innovatively proposes a semi-supervised machine learning model to reduce the landslide spatial position error.This paper collected 16 environmental factors and 337 landslides with accurate spatial positions taking Shangyou County of China as an example.The 30e110 m error-based multilayer perceptron(MLP)and random forest(RF)models for LSP are established by randomly offsetting the original landslide by 30,50,70,90 and 110 m.The LSP uncertainties are analyzed by the LSP accuracy and distribution characteristics.Finally,a semi-supervised model is proposed to relieve the LSP uncertainties.Results show that:(1)The LSP accuracies of error-based RF/MLP models decrease with the increase of landslide position errors,and are lower than those of original data-based models;(2)70 m error-based models can still reflect the overall distribution characteristics of landslide susceptibility indices,thus original landslides with certain position errors are acceptable for LSP;(3)Semi-supervised machine learning model can efficiently reduce the landslide position errors and thus improve the LSP accuracies.
文摘为使用正例与未标注数据训练分类器(positive and unlabeled learning,PU learning),提出基于随机森林的PU学习算法。对POSC4.5算法进行扩展,在其生成决策树的过程中加入随机特征选择;在训练阶段,使用有放回抽样技术对PU数据集抽样,生成多个不同的PU训练集,并以其训练扩展后的POSC4.5算法,构造多棵决策树;在分类阶段,采用多数投票策略集成各决策树输出。在UCI数据集上的实验结果表明,该算法的分类性能优于偏置支持向量机算法、POS4.5算法和基于装袋技术的POSC4.5算法。