
Massive short documents classification method based on frequent term set clustering (cited by: 6)
Abstract: The rapid development of information technology has led to the accumulation of huge volumes of text data, a large part of which consists of short documents. Text classification is important for automatically extracting knowledge from such massive short-document collections. However, because keywords occur only a few times in a short document, most existing text mining algorithms cannot reach acceptable accuracy. Some semantics-based classification methods are more accurate, but they are too inefficient to apply to massive data sets. To address this problem, a novel short-document classification algorithm based on frequent term set clustering is proposed. The algorithm uses frequent term set clustering to compress the data and uses semantic information for classification. Experiments show that it outperforms other algorithms in both accuracy and efficiency when classifying massive short documents.
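Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal, 2007, No. 8, pp. 1744-1746, 1780 (4 pages in total)
Funding: National High Technology Research and Development Program of China (863 Program) (2004AA112020, 2003AA115210, 2003AA111020)
Keywords: text mining; classification; massive; short document; frequent term set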
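The abstract says that frequent term set clustering is used to compress the data before classification, but it does not give the procedure. The sketch below is only an illustration of the general idea, not the authors' algorithm: it treats each short document as a list of terms, finds term sets that occur in at least `min_support` documents, and groups each document under the most specific frequent term set it contains. The function names, the `min_support` and `max_size` parameters, and the greedy tie-breaking rule are all assumptions made for this example. The semantic classification step is omitted because the abstract gives no details about it.

```python
from collections import defaultdict
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each term set found in at least `min_support` documents to the
    indices of the documents that contain it (brute-force enumeration,
    which is affordable only because the documents are short)."""
    cover = defaultdict(set)
    for i, doc in enumerate(docs):
        terms = sorted(set(doc))
        for size in range(1, max_size + 1):
            for combo in combinations(terms, size):
                cover[frozenset(combo)].add(i)
    return {ts: ids for ts, ids in cover.items() if len(ids) >= min_support}


def cluster_by_term_sets(docs, min_support, max_size=2):
    """Assign each document to one frequent term set it contains.

    Hypothetical greedy rule: prefer larger term sets, then higher support,
    then alphabetical order. Documents matching no frequent term set are
    returned separately as unassigned."""
    frequent = frequent_term_sets(docs, min_support, max_size)
    ranked = sorted(
        frequent,
        key=lambda ts: (-len(ts), -len(frequent[ts]), sorted(ts)),
    )
    clusters = defaultdict(list)
    unassigned = []
    for i, doc in enumerate(docs):
        terms = set(doc)
        for ts in ranked:
            if ts <= terms:  # the document contains every term of the set
                clusters[ts].append(i)
                break
        else:
            unassigned.append(i)
    return dict(clusters), unassigned


if __name__ == "__main__":
    docs = [
        ["stock", "market", "rise"],
        ["stock", "market", "fall"],
        ["football", "match", "win"],
        ["football", "match", "draw"],
        ["weather"],
    ]
    clusters, unassigned = cluster_by_term_sets(docs, min_support=2)
    for term_set, members in clusters.items():
        print(sorted(term_set), members)
    print("unassigned:", unassigned)
```

In a pipeline of the kind the abstract describes, each cluster could then stand in for its member documents, so that a more expensive semantic classifier runs once per cluster rather than once per document. That division of work is an assumption about how the compression pays off, not a detail confirmed by the abstract.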

