Abstract
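Previous word-association-based methods treat an entire text as a single transaction when mining frequent itemsets and association rules, yet the basic semantic unit of a text is actually the sentence. A group of words that co-occur in one sentence is more strongly related semantically than a group of words that merely co-occur in the same document. Based on this consideration, we treat each sentence of a document as a separate transaction and thereby propose SAT-FOIL, a classification method based on sentence-level associations. In this paper we further propose new score models, obtaining the improved algorithm SAT-FOIL+. Extensive experiments on the standard Reuters text collection demonstrate the superiority of the new models and show that SAT-FOIL+ achieves classification performance comparable to several other classification methods and far better than previous classification methods based on document-level associations. In addition, the mined classification rules are easy to read and easy to modify.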
While previous association-based methods mainly mined frequently co-occurring words (frequent itemsets) at the document level, the basic semantic unit of a document is actually the sentence. Words within the same sentence are typically more semantically related than words that merely appear in the same document. Our proposed SAT-FOIL therefore treats a sentence, rather than a document, as a transaction. In this paper we propose new score models, yielding the improved algorithm SAT-FOIL+. Extensive experiments on the popular Reuters benchmark text collection show that SAT-FOIL+ not only outperforms our earlier SAT-FOIL algorithm but is also comparable to well-known alternatives and much better than previous document-level association-based methods. In addition, the classification rules acquired by SAT-FOIL+ are inherently readable and refinable.
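The difference between sentence-level and document-level transactions can be illustrated with a minimal Python sketch. This is not the SAT-FOIL or SAT-FOIL+ algorithm and does not implement its score models; it only shows how the choice of transaction unit changes which word pairs count as co-occurring. The function names, the naive punctuation-based sentence splitter, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_transactions(document):
    """Sentence-level view: each sentence becomes one transaction (a set of words).

    Sentences are split naively on ., ! and ? (an assumption for this sketch).
    """
    transactions = []
    for sentence in re.split(r"[.!?]+", document):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words:  # skip empty fragments left by the split
            transactions.append(words)
    return transactions


def document_transaction(document):
    """Document-level view: the whole document is a single transaction."""
    return [set(re.findall(r"[a-z]+", document.lower()))]


def pair_support(transactions):
    """Count how many transactions contain each unordered word pair."""
    counts = Counter()
    for words in transactions:
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample text, not taken from the Reuters collection.
doc = "Oil prices rose sharply. The bank cut interest rates. Oil exports fell."
sentence_level = pair_support(sentence_transactions(doc))
document_level = pair_support(document_transaction(doc))
```

In this sample, "bank" and "oil" co-occur at the document level but never within a single sentence, so only the document-level view counts them as a pair, while "oil" and "prices" are counted under both views.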
Source
《计算机科学》
CSCD
PKU Core Journals (北大核心)
2005, Issue 3, pp. 207-212 (6 pages)
Computer Science
Funding
National Natural Science Foundation of China (Grant No. 60303030)
Natural Science Foundation of Chongqing (Grant No. 8721)