Abstract
In the vector space model of information retrieval, a text is represented as a vector of term weights. How to make this vector represent the content of a text as accurately and efficiently as possible is therefore a fundamental problem for the model. In this paper, we propose a feature selection and term weighting method based on text set density, which measures the value of a term by its contribution to the density of the text set. With this method, we can find the smallest set of feature terms that loses no useful information from the texts, and derive a more reasonable weighting scheme. We also introduce a new criterion for evaluating the quality of weighting schemes, meta scoring, and use it to show that the proposed approach is effective.
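The abstract does not give a formula for text set density, so the following is only a minimal sketch under a stated assumption: density is taken to be the mean pairwise cosine similarity of the texts (the definition used in Salton's classic term discrimination model), and a term's contribution is the change in density when that term is removed. The function names `cosine`, `density`, and `density_contribution`, and the toy documents, are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def density(vectors):
    """ASSUMPTION: density = mean pairwise cosine similarity of the text set."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def density_contribution(docs, term):
    """Density of the set without `term` minus density with it.

    A positive value means the term pushes texts apart (it discriminates);
    a negative value means the term makes the texts look more alike.
    """
    full = [Counter(d) for d in docs]
    reduced = [Counter(t for t in d if t != term) for d in docs]
    return density(reduced) - density(full)


# Toy text set: "the" occurs everywhere, every other term occurs once.
docs = [
    ["retrieval", "the", "vector"],
    ["density", "the", "weight"],
    ["the", "term", "feature"],
]
vocabulary = {w for d in docs for w in d}
scores = {t: density_contribution(docs, t) for t in vocabulary}
```

Under this assumed definition, a term shared by every text (here "the") gets a negative score and would be a candidate for removal from the feature set, while a term confined to one text gets a positive score. How the paper turns such scores into weights is not stated in the abstract.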
Source
《中文信息学报》
CSCD
PKU Core (Peking University Core Journals)
2004, No. 1, pp. 42-47 (6 pages)
Journal of Chinese Information Processing
Funding
Shandong Provincial Department of Education project (J00F04)
Keywords
computer application
Chinese information processing
information retrieval
text set density
weighting scheme
meta scoring