Abstract
To address the inconvenience caused by redundant items in RSS readers, this paper preprocesses feed text with Chinese word segmentation and TF-IDF weighting, then applies three similarity measures (Levenshtein distance, cosine similarity, and the Jaccard coefficient) to identify redundant items. It discusses the characteristics of each method, analyzes and compares their strengths and weaknesses from a practical standpoint, and uses the Jaccard algorithm to implement a simple data filtering mechanism.
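As an illustration of the three similarity measures named in the abstract, the following is a minimal Python sketch, not the paper's implementation. It assumes each item has already been segmented into a list of tokens (the paper uses ICTCLAS for this step), uses raw term counts in place of TF-IDF weights for the cosine measure, and the function names and the 0.8 threshold in `filter_redundant` are hypothetical choices made for the example.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Normalize edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between term-count vectors (assumption: raw counts, not TF-IDF)."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the token-set intersection divided by the size of the union."""
    sa, sb = set(tokens_a), set(tokens_b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def filter_redundant(items, threshold=0.8):
    """Keep an item only if it is below the threshold against every kept item.

    The threshold value is a hypothetical default, not taken from the paper.
    """
    kept = []
    for tokens in items:
        if all(jaccard_similarity(tokens, k) < threshold for k in kept):
            kept.append(tokens)
    return kept
```

For example, `levenshtein("kitten", "sitting")` returns 3, and `jaccard_similarity(["a", "b", "c"], ["b", "c", "d"])` returns 0.5 because two of the four distinct tokens are shared. The Jaccard measure ignores token order and frequency, which keeps the filter cheap to compute on short feed titles.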
Source
Modern Computer (《现代计算机》)
2012, No. 12, pp. 18-20 (3 pages)
Keywords
Computer Applications Technology
TF-IDF
Similarity Calculation
ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System)