Abstract
Feature vector extraction for Chinese web pages is key to improving the accuracy and recall of Chinese web page classification. Based on a study of the structural characteristics of HTML pages, a text preprocessing method for Chinese web pages is proposed that uses an improved HTML-Tree together with web page element (HTML tag) weights, and web page text feature vectors are then extracted on that basis. The method takes full advantage of the characteristics of different categories of web pages and accounts for the contribution of the weights of the various elements within a page. Experimental results show that the method improves the efficiency of web page feature vector extraction and effectively improves the accuracy and recall of Chinese web page classification.
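The idea of weighting terms by the page element they appear in can be illustrated with a minimal Python sketch. It is not the paper's implementation: the weight values, the names `TAG_WEIGHTS` and `weighted_term_scores`, and the rule of taking the largest weight among the enclosing tags are all assumptions made for illustration, and the improved HTML-Tree itself is not reproduced here. Whitespace splitting stands in for a real Chinese word segmenter, which would be passed in through the hypothetical `tokenize` parameter.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: the abstract does not list the paper's actual values.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5}
DEFAULT_WEIGHT = 1.0
SKIP_TAGS = {"script", "style"}  # content that is not page text
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed


class WeightedTextExtractor(HTMLParser):
    """Accumulates a weighted score per term while tracking open tags."""

    def __init__(self, tokenize=str.split):
        super().__init__()
        self.tokenize = tokenize  # placeholder for a Chinese segmenter
        self.open_tags = []
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in SKIP_TAGS for tag in self.open_tags):
            return
        # Assumed rule: a term takes the largest weight among enclosing tags.
        weights = [TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT) for tag in self.open_tags]
        weight = max(weights, default=DEFAULT_WEIGHT)
        for term in self.tokenize(data):
            self.scores[term] += weight


def weighted_term_scores(html, tokenize=str.split):
    """Return a {term: weighted score} mapping for one HTML page."""
    parser = WeightedTextExtractor(tokenize)
    parser.feed(html)
    parser.close()
    return dict(parser.scores)
```

The resulting scores could serve as input to a feature selection step, for example by keeping the highest-scoring terms per page; how the paper combines element weights with its feature vector is not stated in the abstract.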
Source
Information Technology (《信息技术》)
2009, No. 1, pp. 10-14 (5 pages)
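Funding
National Natural Science Foundation of China, CNGI project (CNG1-04-15-2A)
Project funded by the Science and Technology Commission of Shanghai Municipality (05DZ22102)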