Abstract
With the rapid development of information technology, huge volumes of text data have accumulated, and a large share of them consists of short documents. Classifying such short documents is valuable for automatically extracting knowledge from these data. However, because keywords occur only a few times in a short document and labeled training examples are usually scarce, most general-purpose text mining algorithms fail to reach acceptable accuracy. Some semantics-based classification methods are more accurate, but they are too inefficient to apply to very large document sets. In this paper, we propose a novel short-document classification algorithm based on a semantic text feature graph and a kNN-like method. Our experiments show that the algorithm is more accurate and more efficient than other classification algorithms when classifying large-scale collections of short documents.
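The abstract does not spell out how the semantic text feature graph is built, so the sketch below illustrates only the generic kNN-style voting step on short texts. It is an assumption-laden illustration, not the paper's algorithm: plain Jaccard overlap of word sets stands in for the semantic similarity, and the names `tokenize`, `jaccard`, `knn_classify` and the toy `TRAIN` set are all hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Split a short text into a set of lowercase word tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_classify(query, labeled_docs, k=3):
    """Return the label with the highest similarity-weighted vote among
    the k training documents most similar to the query.

    Assumption: Jaccard word overlap is a stand-in for whatever semantic
    similarity the paper derives from its feature graph.
    Returns None when no neighbor shares any token with the query.
    """
    q = tokenize(query)
    scored = sorted(
        ((jaccard(q, tokenize(text)), label) for text, label in labeled_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for sim, label in scored[:k]:
        votes[label] += sim
    if not votes or max(votes.values()) == 0.0:
        return None
    return max(votes, key=votes.get)


# Hypothetical toy training set of labeled short documents.
TRAIN = [
    ("cheap flights to paris", "travel"),
    ("hotel deals in rome", "travel"),
    ("python list sort example", "programming"),
    ("java string format error", "programming"),
]

if __name__ == "__main__":
    print(knn_classify("cheap hotel in paris", TRAIN))
    print(knn_classify("python sort error", TRAIN))
```

Weighting each vote by similarity rather than counting neighbors equally keeps zero-overlap neighbors from influencing the result, which matters for short texts where most training documents share no words with the query.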
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal list (北大核心)
2006, Issue 22, pp. 5-7 (3 pages)
Funding
National 863 High-Tech Research and Development Program of China (Grant Nos. 2004AA112020, 2003AA115210, 2003AA111020)
Keywords
text mining, classification, short document, very large text database