Abstract
This paper studies automatic cross-language web page classification based on frequently co-occurring entropy (FCE). The method first translates all Chinese web pages into English with translation software. It then computes the FCE of the features that co-occur in the Chinese and English pages, ranks the features by FCE to select the knowledge common to both sets of pages, and finally trains a Chinese classification model on the English pages combined with that common part. Experiments on the ODP corpus show that the method performs better than the naive Bayes (NB), SVM, and information bottleneck (IB) models.
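The ranking and selection steps can be illustrated with a minimal sketch. The abstract does not give the FCE formula, so the score below is an assumed stand-in: the entropy of a term's relative frequency across the translated Chinese collection and the English collection, which is highest when the term is equally prominent in both. The function names `cooccurrence_entropy` and `select_common_terms` and the toy documents are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def cooccurrence_entropy(term, zh_counts, en_counts):
    """Assumed stand-in for FCE (not the paper's definition).

    Returns the binary entropy of the term's relative frequency in the
    translated Chinese pages versus the English pages. Terms missing
    from either collection score 0.
    """
    zh_total = sum(zh_counts.values())
    en_total = sum(en_counts.values())
    if zh_total == 0 or en_total == 0:
        return 0.0
    a = zh_counts.get(term, 0) / zh_total
    b = en_counts.get(term, 0) / en_total
    if a == 0 or b == 0:
        return 0.0
    entropy = 0.0
    for share in (a / (a + b), b / (a + b)):
        entropy -= share * math.log2(share)
    return entropy


def select_common_terms(zh_docs, en_docs, top_k):
    """Rank the joint vocabulary by the assumed score and keep the top_k terms."""
    zh_counts = Counter(word for doc in zh_docs for word in doc)
    en_counts = Counter(word for doc in en_docs for word in doc)
    vocab = set(zh_counts) | set(en_counts)
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(
        vocab,
        key=lambda t: (-cooccurrence_entropy(t, zh_counts, en_counts), t),
    )
    return ranked[:top_k]


# Toy data: tokenized pages, with the Chinese pages already machine-translated.
translated_zh = [["football", "match", "goal"], ["stock", "market", "price"]]
english = [["football", "goal", "league"], ["stock", "price", "bank"]]

common = select_common_terms(translated_zh, english, top_k=4)
```

Under these assumptions, a classifier such as naive Bayes would then be trained on the English pages with features restricted to `common`, and applied to the translated Chinese pages.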
Source
Journal of Jiangxi Normal University (Natural Science Edition) (《江西师范大学学报(自然科学版)》)
CAS
PKU Core (北大核心)
2011, Issue 3, pp. 240-245 (6 pages)
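Funding
National Natural Science Foundation of China (60963014)
Youth Science Foundation of the Jiangxi Provincial Department of Education (GJJ10116)
Science and Technology Project of the Jiangxi Provincial Department of Education (2007-129)
Natural Science Foundation of Jiangxi Province (2008GZS0052)
Jiangxi Provincial Key Science and Technology Project (2006-184)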
Keywords
cross-language
web page classification
frequently co-occurring entropy
naive Bayes
adaptive naive Bayes