Abstract
Web-page classification sorts massive collections of web pages into categories and is useful in many applications. Many automatic web-page classification methods exist; among them, the commonly used content-based methods leave considerable room for performance improvement because page content is impure. Based on query logs, this paper proposes a novel web-page classification method, NQPC (Novel Query log-based web-Page Classification). Its contributions are threefold: a low-dimensional feature vector extraction method that avoids the "curse of dimensionality"; classification based on high-quality query logs, whose content is purer than that of web pages; and a filtering method that improves classification accuracy. Experimental results show that the proposed method performs very well, which gives it good application prospects.
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
2012, No. 11, pp. 82-87, 128 (7 pages)
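Funding
National Natural Science Foundation of China (No. 60803085, No. 60873245)
Guangdong Province and Chinese Academy of Sciences Comprehensive Strategic Cooperation Project (No. 2009A0091100002, No. 2010A090100004)
Dongguan Major Science and Technology Special Project (No. 2009215102001)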
Keywords
query log
web-page classification
machine learning
text classification
feature extraction