Abstract
To improve text categorization under the vector space model without introducing external semantic knowledge, this paper presents a feature matrix-based categorization framework. Word-word and text-text relationships latent in the corpus are mined and incorporated into the original word-text matrix in different ways. Two widely used algorithms, SVM and KNN, are then compared in text categorization experiments on a highly domain-specific legal corpus and a broad-domain news corpus. The experimental results show that adding word-word and text-text relationships usually improves categorization performance, but the effect differs across classification methods and domain characteristics, so the method of incorporation should be chosen accordingly in practice.
Source
《计算机科学》
CSCD
PKU Core (Peking University list of Chinese core journals)
2016, Issue 9, pp. 82-86 (5 pages)
Computer Science
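Funding
National Natural Science Foundation of China (71271209)
Beijing Natural Science Foundation (4132067)
Ministry of Education Humanities and Social Sciences Youth Fund (11YJC630268)
Natural Science Foundation of Hebei Province project (A2013410011)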
Keywords
Vector space model
Text categorization
Semantic mining
Feature matrix