Abstract
The formal representation of text has long been a fundamental problem in text mining. Within the vector space model, TFIDF (term frequency, inverse document frequency) is a classical term-weighting method that performs relatively well for text representation. After analysing the problems of the traditional TFIDF calculation, this paper addresses two of them: TFIDF takes into account neither how the documents containing a term are distributed across the categories nor the differing number of documents in each category. To remedy this, the paper proposes using the mathematical expectation of the term as an additional factor in the text representation (TFIDF-E). Experimental results show that text represented with TFIDF-E is classified better than text represented with traditional TFIDF, which supports the effectiveness and feasibility of the TFIDF-E method.
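The limitation described in the abstract can be illustrated with a minimal sketch. It assumes a plain textbook form of TFIDF (relative term frequency times the logarithm of the inverse document frequency) and a hypothetical four-document toy corpus. Neither the corpus nor the helper names `idf`, `tfidf`, and `class_distribution` come from the paper, and the sketch does not implement TFIDF-E, whose formula the abstract does not give. It only shows that two terms with the same document frequency receive the same IDF even when their distribution across categories differs.

```python
import math
from collections import Counter

# Hypothetical toy corpus of (category, tokens) pairs; not data from the paper.
corpus = [
    ("sports", ["match", "goal", "team"]),
    ("sports", ["match", "team", "coach"]),
    ("finance", ["stock", "market", "team"]),
    ("finance", ["stock", "price", "coach"]),
]

def idf(term, docs):
    """Textbook IDF: log(N / df). Assumes the term occurs in at least one document."""
    df = sum(1 for _, tokens in docs if term in tokens)
    return math.log(len(docs) / df)

def tfidf(term, tokens, docs):
    """Relative term frequency in one document times the corpus-level IDF."""
    tf = tokens.count(term) / len(tokens)
    return tf * idf(term, docs)

def class_distribution(term, docs):
    """Number of documents containing the term, counted per category."""
    return dict(Counter(cat for cat, tokens in docs if term in tokens))

# "match" and "coach" each occur in two documents, so their IDF is identical,
# although "match" is confined to one category and "coach" is spread over two.
print(idf("match", corpus), class_distribution("match", corpus))
print(idf("coach", corpus), class_distribution("coach", corpus))
```

In this toy setting "match" is the more useful term for telling the categories apart, yet plain IDF cannot distinguish it from "coach". A category-aware factor such as the one the paper proposes is meant to capture that kind of difference.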
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
2011, Issue 4, pp. 177-179 (3 pages)
Funding
Key Natural Science Project of the Anhui Provincial Department of Education (KJ2007A051)
Keywords
Text categorisation
Term weight
Differentiation
Mathematical expectation