Abstract
A new algorithm is proposed for the automatic classification of Chinese text. The basic idea is to construct a weighted list of classification subject terms, organized as a key tree. Strings in the document to be classified are then matched against this list using hashing and a longest-term-first matching rule, and the weights of the successful matches are summed; the category with the largest weight sum is taken as the classification result. The algorithm sidesteps the difficulties of Chinese word segmentation and their effect on classification results. Theoretical analysis and experimental results indicate that the method achieves relatively high accuracy and time efficiency, and that its overall performance is on par with current mainstream techniques.
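The matching procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the node layout, the function names (`build_trie`, `longest_match`, `classify`), and the per-category weight table are assumptions made for illustration, and the way the paper derives its gain weights is not covered by the abstract. A Python `dict` stands in for the hashed child lookup in each key-tree node.

```python
from collections import defaultdict


class TrieNode:
    """One node of the key tree (hypothetical layout)."""
    __slots__ = ("children", "weights")

    def __init__(self):
        self.children = {}   # char -> TrieNode; dict gives hashed child lookup
        self.weights = None  # {category: weight} if a subject term ends here


def build_trie(terms):
    """Build the key tree from (term, category, weight) triples."""
    root = TrieNode()
    for term, category, weight in terms:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        if node.weights is None:
            node.weights = {}
        node.weights[category] = node.weights.get(category, 0.0) + weight
    return root


def longest_match(root, text, start):
    """Return (length, weights) of the longest subject term at text[start:]."""
    node = root
    best_len, best_weights = 0, None
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break
        if node.weights is not None:
            # Keep walking: a longer term may still match (long-term priority).
            best_len, best_weights = i - start + 1, node.weights
    return best_len, best_weights


def classify(root, text):
    """Sum matched weights per category and return (best_category, scores)."""
    scores = defaultdict(float)
    pos = 0
    while pos < len(text):
        length, weights = longest_match(root, text, pos)
        if length:
            for category, weight in weights.items():
                scores[category] += weight
            pos += length  # skip past the matched term
        else:
            pos += 1       # no term starts here; advance one character
    if not scores:
        return None, {}
    return max(scores, key=scores.get), dict(scores)
```

Because the scan walks the raw character string and consults the key tree at each position, no separate word-segmentation pass is needed, which is the property the abstract emphasizes. The terms, categories, and weights passed to `build_trie` would come from the subject term list; any values used to try the sketch are made up.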
Source
《情报学报》
CSSCI
Peking University Core Journals (北大核心)
2008, No. 3, pp. 323-327 (5 pages)
Journal of the China Society for Scientific and Technical Information
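Funding
National Natural Science Foundation of China (60673193)
Key Project of the Hunan Provincial Department of Education (07A067)
General Project of the Hunan Provincial Department of Education (07C750)
Xiangtan University Interdisciplinary Spark Project (0609016)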
Keywords
text categorization, subject term list, key tree, hash function, gain weight