Abstract
The vector space model is a mature text representation model in automatic text categorization. Words or phrases are usually used as feature items, but they typically provide only limited, local semantic information. To achieve content-based text categorization, this paper uses sentence categories from HNC theory, which carry more semantic information, as feature items. It reduces the dimensionality of the sentence category vector space through techniques such as decomposing mixed sentence categories, computes feature weights with the tfc algorithm, and classifies with the KNN algorithm. The average precision and recall of the classifier are acceptable, and the classifier places no requirement on the abstraction level of the categories: categories of higher and lower abstraction can be classified together. Performance can be improved further by using better machine learning algorithms and other HNC language understanding techniques.
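The weighting and classification steps named in the abstract can be illustrated with a minimal sketch. It assumes that "tfc" refers to the common tf-idf weighting with cosine normalization, and that KNN uses cosine similarity with similarity-weighted voting; the abstract confirms neither detail. The feature labels (`SC_A`, `SC_B`, ...), the class names, and all function names are hypothetical placeholders rather than real HNC sentence category codes, and the mixed sentence category decomposition step is not shown.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each feature, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def tfc_vector(doc, df, n_docs):
    """tf * log(N / df), then cosine (unit-length) normalization.

    Assumption: this is the usual reading of "tfc" weighting.
    Features unseen in training (absent from df) are ignored.
    """
    tf = Counter(f for f in doc if f in df)
    raw = {f: count * math.log(n_docs / df[f]) for f, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {f: 0.0 for f in raw}
    return {f: w / norm for f, w in raw.items()}


def cosine(u, v):
    """Dot product of two sparse vectors; equals cosine for unit vectors."""
    return sum(w * v.get(f, 0.0) for f, w in u.items())


def knn_classify(query, train_vectors, train_labels, k=3):
    """Return the label with the largest summed similarity among the k nearest."""
    scored = sorted(
        ((cosine(query, vec), label)
         for vec, label in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = Counter()
    for sim, label in scored:
        votes[label] += sim
    return votes.most_common(1)[0][0]


# Hypothetical toy data: each document is a list of sentence category labels.
train_docs = [
    ["SC_A", "SC_A", "SC_B"],
    ["SC_A", "SC_C"],
    ["SC_D", "SC_D", "SC_E"],
    ["SC_D", "SC_B"],
]
train_labels = ["sports", "sports", "finance", "finance"]

df = document_frequencies(train_docs)
train_vectors = [tfc_vector(d, df, len(train_docs)) for d in train_docs]

query_vector = tfc_vector(["SC_A", "SC_C", "SC_A"], df, len(train_docs))
predicted = knn_classify(query_vector, train_vectors, train_labels, k=3)
```

In this toy setup the query shares features only with the two "sports" documents, so `predicted` is "sports". With real data, the choice of `k` and of the voting scheme would need tuning.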
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journals
2007, Issue 22, pp. 45-47 (3 pages)
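Funding
National "973" Program of China, project "Research on Interactive Engines for Natural Language Understanding" (2004CB318104)
Knowledge Innovation Program of the Institute of Acoustics, Chinese Academy of Sciences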
Keywords
text classification
sentence category
vector space model (VSM)
HNC theory