Abstract
With the rapid growth of the Internet, the Web has become an important means of finding useful data. Because much of the information on the WWW is stored in HTML pages, web page classification has become essential. Using a range of open-source software, this paper designs and implements a Chinese web page classification model. Data collection is performed with meta-search technology, which effectively increases the breadth and depth of acquisition. A domain-specific lexicon is used during Chinese word segmentation, which improves segmentation accuracy. When building the vector space model (VSM), a web page feature extraction method based on a similarity curve is proposed. This method effectively addresses the high dimensionality of feature extraction, improves the discriminative power of the features, and reduces the amount of computation.
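To make the VSM step concrete, the following is a minimal sketch of how pre-segmented pages can be represented as weighted term vectors and compared by cosine similarity. It is an illustration under stated assumptions, not the paper's implementation: the TF-IDF weighting, the smoothing term, and the function names `tf_idf_vectors` and `cosine` are all hypothetical choices, and the similarity-curve feature extraction itself is not reproduced because the abstract does not specify it. Pages are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (dict: term -> weight) from token lists.

    Assumption: each document is already segmented into a list of tokens.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty pages
        # Add 1 to the IDF so terms present in every document keep a
        # nonzero weight (an illustrative smoothing choice).
        vectors.append(
            {t: (c / total) * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy, hand-segmented pages (hypothetical data for illustration only).
    pages = [
        ["网页", "分类", "模型"],
        ["网页", "分类", "特征"],
        ["足球", "比赛"],
    ]
    vecs = tf_idf_vectors(pages)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In this sketch every distinct term becomes one dimension of the vector space, which is why dimensionality grows quickly on real corpora and why a feature extraction step such as the one the paper proposes is needed before classification.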
Source
《信息技术》
2008, Issue 2, pp. 15-18 (4 pages)
Information Technology
Funding
National Natural Science Foundation of China (60673160)