Short Documents Classification Method in Very Large Text Database (大规模文本数据库中的短文分类方法) Cited by: 4
Abstract: The rapid development of information technology has led to the accumulation of huge volumes of text data, a large share of which consists of short documents. Text classification is important for automatically extracting knowledge from these massive collections of short documents. However, because keywords occur only a few times in a short document and labeled training examples are usually scarce, most existing general-purpose text mining algorithms fail to reach acceptable accuracy. Some semantics-based classification methods achieve better accuracy, but they are too inefficient to apply to very large document sets. This paper proposes a novel short-document classification algorithm that is based on a semantic text features graph and uses a kNN-like method for classification. Experiments show that, when classifying large-scale collections of short documents, the algorithm is more accurate and more efficient than other classification algorithms.
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, PKU Core Journal, 2006, No. 22, pp. 5-7 (3 pages)
Funding: National 863 High-Tech Research and Development Program of China (Grant Nos. 2004AA112020, 2003AA115210, 2003AA111020)
Keywords: text mining, classification, short document, very large text database
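
The abstract names a kNN-like classification step but gives no details of it. The sketch below illustrates only the generic idea of nearest-neighbor voting over short texts. It is not the paper's algorithm: it assumes plain bag-of-words vectors and cosine similarity in place of the semantic text features graph, and the function names (`vectorize`, `cosine`, `knn_classify`) and the toy training set are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a short document into a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, labeled_docs, k=3):
    """Label `query` by similarity-weighted vote among its k nearest labeled docs.

    Assumption: surface term overlap stands in for the semantic features
    used in the paper, which the abstract does not specify.
    """
    q = vectorize(query)
    scored = sorted(
        ((cosine(q, vectorize(text)), label) for text, label in labeled_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    if not votes:
        return None  # no labeled documents were supplied
    return votes.most_common(1)[0][0]


# Hypothetical toy training set, for illustration only.
TRAIN = [
    ("stock market falls sharply", "finance"),
    ("bank raises interest rates", "finance"),
    ("central bank cuts rates", "finance"),
    ("team wins championship game", "sports"),
    ("player scores winning goal", "sports"),
    ("coach praises team defense", "sports"),
]

if __name__ == "__main__":
    print(knn_classify("bank interest rates rise", TRAIN))
    print(knn_classify("team scores goal", TRAIN))
```

With so few words per document, two related texts often share no terms at all and score a similarity of zero, which is the sparsity problem the abstract attributes to short documents and the reason it turns to semantic features.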
