
SAT-FOIL+: Sentence-Level Association Based Text Classification (cited by: 1)
Abstract: Previous word-association based methods treat an entire document as a single transaction when mining frequent itemsets and association rules, yet the basic semantic unit of a text is actually the sentence. Words that co-occur within one sentence are typically more closely related semantically than words that merely appear in the same document. Based on this observation, we treat each sentence in a document as a separate transaction, which yields the sentence-level association based classification method SAT-FOIL. In this paper we propose new score models that give the improved algorithm SAT-FOIL+. Extensive experiments on the standard Reuters text collection show that the new score models outperform our earlier SAT-FOIL, and that SAT-FOIL+ is comparable to several well-known classification methods and much better than previous document-level association based methods. In addition, the classification rules mined by SAT-FOIL+ are readable and easy to modify.
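Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2005, No. 3, pp. 207-212 (6 pages)
Funding: National Natural Science Foundation of China (No. 60303030); Chongqing Natural Science Foundation (No. 8721)
Keywords: SAT-FOIL+; sentence-level association; text classification; sentence level; frequent itemsets; association rules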
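
The difference between document-level and sentence-level transactions described in the abstract can be illustrated with a minimal sketch. This is not the SAT-FOIL+ algorithm: it includes neither the FOIL rule learner nor the paper's score models, and the function names, tokenization, and sample text are all assumptions made for illustration. It only shows how the choice of transaction granularity changes which word pairs are counted as co-occurring.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_transactions(doc):
    """Hypothetical helper: one transaction (set of words) per sentence."""
    transactions = []
    # Naive sentence split on ., ! and ? (an assumption, not the paper's method).
    for sentence in re.split(r"[.!?]+", doc):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words:
            transactions.append(words)
    return transactions


def document_transaction(doc):
    """Hypothetical helper: the whole document as a single transaction."""
    return [set(re.findall(r"[a-z]+", doc.lower()))]


def pair_support(transactions):
    """Count, for each word pair, how many transactions contain both words."""
    counts = Counter()
    for transaction in transactions:
        for pair in combinations(sorted(transaction), 2):
            counts[pair] += 1
    return counts


# Made-up sample text, not taken from the Reuters collection.
doc = "Oil prices rose sharply. The central bank kept interest rates unchanged."

sent_pairs = pair_support(sentence_transactions(doc))
doc_pairs = pair_support(document_transaction(doc))

# "oil" and "prices" share a sentence, so both views count the pair.
print(sent_pairs[("oil", "prices")], doc_pairs[("oil", "prices")])
# "bank" and "oil" only share the document, so the sentence-level view drops the pair.
print(sent_pairs[("bank", "oil")], doc_pairs[("bank", "oil")])
```

In this sketch the document-level view treats every pair of words in the text as co-occurring, while the sentence-level view keeps only pairs that appear in the same sentence, which is the intuition behind using sentences as transactions.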

References (18)

  • 1. Sebastiani F. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 2002, 34(1): 1-47
  • 2. Dumais S, Platt J, Heckerman D, Sahami M. Inductive Learning Algorithms and Representations for Text Categorization. In: CIKM, 1998
  • 3. Agrawal R, Srikant R. Fast Algorithms for Mining Association Rules. In: Proc. of the 20th Int. Conf. on Very Large Data Bases, 1994
  • 4. Liu B, Hsu W, Ma Y. Integrating Classification and Association Rule Mining. In: SIGKDD, 1998. 80-86
  • 5. Antonie M, Zaiane O R. Text Document Categorization by Term Association. In: Proc. of IEEE Intl. Conf. on Data Mining, 2002
  • 6. Joachims T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: European Conf. on Machine Learning, 1998. 137-142
  • 7. Meretakis D, Fragoudis D, Lu H, Likothanassis S. Scalable Association-based Text Classification. In: Proc. of ACM Int. Conf. on Information and Knowledge Management, 2000
  • 8. Bekkerman R, El-Yaniv R, Tishby N, Winter Y. Distributed Word Clusters vs. Words for Text Categorization. Journal of Machine Learning Research, 2003, 3: 1183-1208
  • 9. Quinlan J R, Cameron-Jones R M. FOIL: A Midterm Report. In: Proc. European Conf. Machine Learning, 1993. 3-20
  • 10. Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval. Addison-Wesley, 1999

Co-cited literature (11)

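  • 1. Zou Jing, Feng Jianlin, Li Qu, Wang Yuanzhen. Text classification based on sentence-level maximal frequent sequences [J]. Computer Science, 2006, 33(1): 236-239. (cited by: 1)
  • 2. Zhang Youhua, Xiong Fanlun. Automatic text classification based on sentence relevance [J]. Journal of University of Science and Technology of China, 2006, 36(5): 540-545. (cited by: 4)
  • 3. Salton G, Lesk M E. Computer evaluation of indexing and text processing [J]. Journal of the ACM, 1968, 15(1): 8-36.
  • 4. Salton G, Wong A, Yang C S. A vector space model for automatic indexing [J]. Communications of the ACM, 1975, 18(11): 613-620.
  • 5. Cover T M, Hart P E. Nearest Neighbor Pattern Classification [J]. IEEE Transactions on Information Theory, 1967, IT-13(1): 21-27.
  • 6. Zhang Huaping. ICTCLAS3.0 API [OL]. [2008-03-17]. http://www.nlp.org.cn/project/project.php?proj-id=6.
  • 7. Sogou Labs downloads. Text classification corpus: reduced version (tar.gz format) [OL]. [2008-03-18]. http://www.sogou.com/labs/dl/c.html.
  • 8. Li Ronglu. Text classification system (KNN and SVM) [OL]. [2008-03-17]. http://www.nlp.org.cn/docs/download.php?doc-id=1023.
  • 9. Huang Zengyang. HNC (Hierarchical Network of Concepts) theory [M]. Beijing: Tsinghua University Press, 1998.
  • 10. Zhang Yunliang, Zhang Quan. Research on automatic text classification based on a sentence-category vector space model [J]. Computer Engineering, 2007, 33(22): 45-47. (cited by: 6)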

