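Abstract
Text feature description is the foundation of text classification. Its goal is to represent a text with computable features that are then used to distinguish texts during classification. The vector space model (VSM) handles text with the "bag of words" approach: a text is viewed as a set of mutually independent words, and the relations between words are not considered. This treatment is not well justified, because a text has an integral structure, and treating individual words in isolation loses information about its content. In real language use, a word has a contextual "scope", and the words within that scope share certain commonalities in expressing the same topic. This paper proposes a context-based text feature description method consisting of a feature selection method, CBFS, and a weight calculation method, CBFW. Starting from an initially extracted set of feature words, the method uses mutual information (MI) to measure how strongly words depend on one another in context, adds the words that contribute most to the topic to the feature set, and adjusts the weights of feature words according to their contribution, so that the text is represented more reasonably.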
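To make the "bag of words" limitation concrete, here is a minimal sketch (an illustration, not taken from the paper) that builds term-count vectors for two short sentences. The helper name `bag_of_words` and the whitespace tokenizer are assumptions made for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to term counts; word order and context are discarded."""
    return Counter(text.lower().split())

# Two sentences with different meanings but the same multiset of words.
a = bag_of_words("the bank approved the loan")
b = bag_of_words("the loan approved the bank")

# Under the bag-of-words view the two texts are indistinguishable.
same_vector = (a == b)
```

Because only counts survive, any information carried by which words appear near which other words is lost. This is the gap a context-based description tries to close.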
Text feature description is a basic problem in text classification; it aims to model documents with computable features. The most widely used feature description method, the "bag of words" model, treats a text as a set of words. Under this model, feature selection and weighting consider only the frequency of individual words and ignore the relations between words in context. In general, however, words within a given context convey related meanings about the same topic, so the "bag of words" model loses context information that is important for improving classification precision. This paper presents a new feature description method based on text context. First, a commonly used feature selection method is applied to obtain an initial set of feature words. Second, Mutual Information (MI) is used to compute word dependence in a concrete context, and feature words are then selected according to this dependence. Meanwhile, the weight of each feature is adjusted. Experimental results indicate the effectiveness of the new approach.
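The abstract does not give the definitions of CBFS or CBFW, so the following is only a rough sketch of the general idea: start from an initial feature set, score word pairs by mutual information within a context, and add strongly dependent words with an adjusted weight. Everything specific here is an assumption made for illustration: the use of pointwise mutual information over per-context co-occurrence counts, the function names `pmi_scores` and `expand_features`, the `threshold` parameter, and the rule that a new word's weight equals its best PMI score.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(contexts):
    """Pointwise MI of each word pair, counting one occurrence per context."""
    n = len(contexts)
    word_df, pair_df = Counter(), Counter()
    for ctx in contexts:
        words = sorted(set(ctx))          # sorted so each pair has one key
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    return {
        (w, v): math.log((c / n) / ((word_df[w] / n) * (word_df[v] / n)))
        for (w, v), c in pair_df.items()
    }

def expand_features(initial, contexts, threshold=0.0):
    """Add words whose PMI with an initial feature exceeds `threshold`.

    Hypothetical weighting rule: initial features keep weight 1.0 and
    each added word gets its highest PMI score with any initial feature.
    """
    weights = {w: 1.0 for w in initial}
    for (w, v), score in pmi_scores(contexts).items():
        if score <= threshold:
            continue
        for seed, cand in ((w, v), (v, w)):
            if seed in initial and cand not in initial:
                weights[cand] = max(weights.get(cand, 0.0), score)
    return weights

# Toy contexts (for example, tokenized sentences); invented for this sketch.
contexts = [
    ["stock", "market", "fell"],
    ["stock", "market", "rose"],
    ["weather", "rose"],
    ["rain", "weather"],
]
weights = expand_features({"stock"}, contexts)
```

In this toy input, "market" and "fell" are added because they co-occur with "stock" more often than chance would predict, while "rose" and "weather" are not. PMI estimated from such small counts tends to favor rare words, so a real system would need smoothing or a minimum-frequency cutoff.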
Source
《计算机科学》
CSCD
PKU Core Journal (北大核心)
2007, No. 5, pp. 183-186 (4 pages)
Computer Science
Funding
National Natural Science Foundation of China (Grant No. 60173060)
Keywords
Feature description
Text classification
Vector space model
Weight calculation
Feature description, Text categorization, Vector space model, Weighting