Abstract
The formal representation of text has long been a fundamental problem in information retrieval fields such as text retrieval, automatic summarization, and search engines. The TF.IDF text representation in the Vector Space Model is a widely used and effective method in this area. Differences in how a word is distributed across categories in a text collection are one of the important factors that determine how well the word expresses text content, but the existing TF.IDF method cannot capture this factor. To address this shortcoming, this article introduces the information gain formula into the text collection and proposes the TF.IDF.IG text representation method, compares and analyzes its advantages over the traditional TF.IDF formula, and verifies its feasibility and effectiveness with experiments.
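The abstract does not state the TF.IDF.IG formula itself, so the following is only a minimal sketch. It assumes the weight is the product of term frequency, inverse document frequency, and the information gain of a term's presence with respect to the category labels. The function names (`entropy`, `information_gain`, `tf_idf_ig`) and the toy corpus are hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (base 2) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(term, docs, labels):
    """Information gain of the presence/absence of `term` about the labels."""
    n = len(docs)
    base = entropy([c / n for c in Counter(labels).values()])
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    cond = 0.0
    for subset in (with_t, without_t):
        if subset:
            dist = [c / len(subset) for c in Counter(subset).values()]
            cond += len(subset) / n * entropy(dist)
    return base - cond


def tf_idf_ig(term, doc, docs, labels):
    """Assumed combination: TF * IDF * IG (not confirmed by the abstract)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf * information_gain(term, docs, labels)


# Toy corpus (hypothetical): tokenized documents with category labels.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "score"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]

w_goal = tf_idf_ig("goal", docs[0], docs, labels)
w_team = tf_idf_ig("team", docs[0], docs, labels)
```

In this toy corpus "goal" occurs only in one category while "team" is spread across both, so under the assumed formula "goal" receives the larger weight in the first document. This is the category-distribution effect that the abstract says plain TF.IDF cannot capture.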
Authors
张青
熊前兴
ZHANG Qing, XIONG Qianxing (Department of Computer Science and Technology, Wuhan University of Technology, Wuhan 430063, China)
Source
《电脑知识与技术》
2011, No. 1, pp. 204-206 (3 pages)
Computer Knowledge and Technology
Keywords
text representation
vector space model
term weight
information gain