Abstract
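This paper designs and implements an automatic classification and retrieval system for English text, with emphasis on improving the accuracy of the classification method. In automatic text classification systems, document content is generally stored as a vector in an N-dimensional feature space, so the method and accuracy of feature extraction strongly affect the correctness of the classification results. Traditional methods are based on word forms and do not examine word meaning; they ignore the diversity and uncertainty of the word forms that express the same meaning, as well as the relations between word senses, in particular hypernym-hyponym relations. The method proposed here builds on the Vector Space Model (VSM), takes "concepts" as its basis, and also considers the hypernym relations of word senses, so that more general information can be distilled from words during training, thereby improving classification accuracy.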
This paper aims at designing and implementing an automatic classification and retrieval system for English documents, focusing on improving the results of the classification algorithm. The documents in an automatic text classification system are represented by feature vectors, and the overall performance depends on the feature selection algorithm and its accuracy. Conventional word-form based automatic classification systems ignore all semantic information of the words, so the diversity and indeterminacy of word forms harm the result. This paper proposes a new feature extraction algorithm, which is based on the Vector Space Model and uses concepts as features, giving further consideration to the relations between concepts, especially hypernymy. The algorithm enables the extraction of more abstract concepts from a text, and thus improves the classification result.
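The idea of concept features with hypernym expansion can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract gives no weighting scheme or lexical resource, so the toy lexicon `WORD_TO_CONCEPT`, the `HYPERNYM` links, the `hypernym_weight` parameter, and the function names are all hypothetical, chosen only to show how synonyms collapse onto one feature and how a shared hypernym makes related documents overlap.

```python
import math
from collections import Counter

# Hypothetical toy lexicon (assumption): word form -> concept identifier.
WORD_TO_CONCEPT = {
    "car": "car", "automobile": "car", "auto": "car",
    "truck": "truck", "lorry": "truck",
    "dog": "dog", "puppy": "dog",
}

# Hypothetical hypernym (is-a) links (assumption): concept -> broader concept.
HYPERNYM = {
    "car": "vehicle",
    "truck": "vehicle",
    "dog": "animal",
}


def concept_vector(tokens, hypernym_weight=0.5, max_levels=1):
    """Build a sparse concept-frequency vector for one document.

    Each known word contributes 1.0 to its concept and a discounted
    amount to up to `max_levels` hypernyms. Words missing from the toy
    lexicon are skipped in this sketch.
    """
    vec = Counter()
    for tok in tokens:
        concept = WORD_TO_CONCEPT.get(tok.lower())
        if concept is None:
            continue
        vec[concept] += 1.0
        weight = hypernym_weight
        parent = HYPERNYM.get(concept)
        level = 0
        while parent is not None and level < max_levels:
            vec[parent] += weight
            weight *= hypernym_weight  # discount further for each level up
            parent = HYPERNYM.get(parent)
            level += 1
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = concept_vector(["car"])
doc_b = concept_vector(["lorry"])
print(cosine(doc_a, doc_b))
```

In this toy setting, "car" and "lorry" share no word form and no concept, so a purely word-form or concept-only vector gives them zero similarity; the shared hypernym "vehicle" is what produces a non-zero cosine.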
Source
《计算机工程与应用》
CSCD
Peking University Core Journals (北大核心)
2004, Issue 11, pp. 75-77 (3 pages)
Computer Engineering and Applications