Abstract
To measure the prediction results of a webpage classification model on web logs, and to add the results judged reliable to the URL classification set so that links visited in the logs can be classified more efficiently, a confidence measure for classification results based on outlier detection is proposed. Multiple weak classifiers built with Bagging predict the data to be classified, a probability vector over the classes is constructed for each prediction result, and outlier detection is then used to decide whether the model's prediction is reliable. Comparison experiments were conducted on UCI public data sets using mainstream k-means-based and local-density-based measurement methods. The results show that, when the outlier-detection-based confidence of classification results is applied, both the k-means-based and the local-density-based measurement methods significantly improve accuracy. The same effect is also obtained in classifying web pages crawled in an engineering project.
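The pipeline summarized in the abstract (bagged weak classifiers, a per-sample class probability vector, then outlier detection on those vectors) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' procedure: the nearest-centroid weak learner, the use of vote fractions as the probability vector, the mean-plus-k-standard-deviations distance rule, and the names `centroid_classifier`, `bagging`, `probability_vector`, `trusted_mask` and `k` are all hypothetical choices made here.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def centroid_classifier(X, y):
    """Assumed weak learner: nearest class centroid, fitted on one sample."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    cents = {c: [mean(col) for col in zip(*rows)] for c, rows in groups.items()}

    def predict(row):
        # Return the class whose centroid is closest in squared distance.
        return min(
            cents,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, cents[c])),
        )

    return predict


def bagging(X, y, n_estimators=25, seed=0):
    """Fit one weak learner per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(
            centroid_classifier([X[i] for i in idx], [y[i] for i in idx])
        )
    return models


def probability_vector(models, row, classes):
    """Fraction of weak learners voting for each class (an assumption)."""
    votes = Counter(m(row) for m in models)
    return [votes[c] / len(models) for c in classes]


def trusted_mask(vectors, labels, k=1.0):
    """Mark a prediction untrusted if its probability vector is an outlier.

    Within each predicted class, a vector is an outlier when its distance
    to the class centroid exceeds mean + k * std of those distances.
    """
    mask = [True] * len(vectors)
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        cen = [mean(col) for col in zip(*(vectors[i] for i in idx))]
        dist = [
            sum((a - b) ** 2 for a, b in zip(vectors[i], cen)) ** 0.5
            for i in idx
        ]
        cut = mean(dist) + k * pstdev(dist)
        for i, d in zip(idx, dist):
            mask[i] = d <= cut
    return mask


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic two-class data; not taken from the paper.
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
    X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]
    y = [0] * 50 + [1] * 50
    models = bagging(X, y)
    queries = [[0.0, 0.0], [4.0, 4.0], [2.0, 2.0], [0.5, -0.3], [3.7, 4.4]]
    vecs = [probability_vector(models, q, [0, 1]) for q in queries]
    preds = [max((0, 1), key=lambda c: v[c]) for v in vecs]
    print(list(zip(preds, trusted_mask(vecs, preds))))
```

Only predictions whose mask entry is `True` would be added to the URL classification set. A density-based detector could replace `trusted_mask` without changing the rest of the sketch.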
Authors
Yan Yunyang
Qu Xuexin
Zhu Quanyin
Li Xiang
Zhao Yang
Faculty of Computer and Software Engineering, Huaiyin Institute of Technology, Huai'an, 223003, China; School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, 621010, China
Source
Journal of Nanjing University (Natural Science)
CAS
CSCD
PKU Core (Peking University core journal list)
2019, Issue 1, pp. 102-109 (8 pages)
Funding
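Jiangsu Province "Six Talent Peaks" Project (2013DZXX-023)
Jiangsu Province "Qing Lan Project"
Jiangsu Province Key Research and Development Program (BE2015127)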