Abstract
To better assess the data quality of Web articles, we propose a quality assessment index system and an assessment method based on the PAC-Bayes theory. The PAC-Bayes theory combines Probably Approximately Correct (PAC) learning theory with the Bayesian paradigm; by making full use of prior information about the samples, it derives the tightest generalization risk bounds, which measure the generalization performance of a learning algorithm. We first review the state of research on data quality assessment of articles and introduce the theoretical framework of the PAC-Bayes theory and its application to support vector machines (SVM). We then propose a method for data quality assessment of Web articles based on the PAC-Bayes theory (DQAPB), which applies the SVM algorithm and its PAC-Bayes bound to the quality evaluation of Web articles, and we establish a corresponding quality assessment index system. Finally, experiments on Wikipedia articles show that the proposed method is simple and fast, with strong stability and robustness.
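As a rough illustration of the kind of generalization bound the abstract refers to, the sketch below evaluates the standard PAC-Bayes bound in its KL form: with probability at least 1 - δ over a sample of size m, kl(empirical risk || true risk) ≤ (KL(Q||P) + ln((m+1)/δ)) / m. This form is taken from the general PAC-Bayes literature, not from this paper; the function names `binary_kl` and `pac_bayes_bound` and all numeric inputs are hypothetical and do not reproduce the DQAPB method or its experiments.

```python
import math


def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total


def pac_bayes_bound(emp_risk, kl_posterior_prior, m, delta=0.05):
    """Upper bound on the true risk of the stochastic classifier.

    Finds, by bisection, the largest p >= emp_risk that satisfies
    kl(emp_risk || p) <= (KL(Q||P) + ln((m + 1) / delta)) / m.
    """
    rhs = (kl_posterior_prior + math.log((m + 1) / delta)) / m
    lo, hi = emp_risk, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binary_kl(emp_risk, mid) <= rhs:
            lo = mid  # mid still satisfies the constraint
        else:
            hi = mid  # mid violates the constraint
    return lo


# Hypothetical inputs: 10% empirical risk, KL(Q||P) = 5, 1000 samples.
bound = pac_bayes_bound(emp_risk=0.1, kl_posterior_prior=5.0, m=1000)
print(f"true risk is at most {bound:.3f} with probability >= 0.95")
```

The bound tightens as the sample size grows and loosens as the posterior moves away from the prior, which is why it can serve as a training-set-only indicator of how well a classifier is likely to generalize.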
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
Peking University Core Journal
2017, No. 3, pp. 572-579 (8 pages)
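Funding
Natural Science Foundation of Tianjin (15JCYBJC16000)
General Project of Humanities and Social Sciences Research, Ministry of Education (14YJA630025)
Tianjin Social Science Foundation (TJYY15-017)
National Natural Science Foundation of China (61502331)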