Abstract
To address the insufficient accuracy of the generalized sequence pattern (GSP) algorithm in extracting product features from Chinese online reviews, a secondary pruning algorithm is proposed. The GSP algorithm first generates a candidate feature set, and the term pair co-occurrence weight (TPCW) is then used as a threshold to filter the candidates further, thereby improving accuracy. A customized crawler was used to collect Chinese reviews of camera products from the Jingdong website; 1 000 of these reviews were selected as experimental data, and the segmentation tool ICTCLAS was used for word segmentation and data preprocessing. The proposed algorithm was compared with the GSP algorithm, the cross language model (CLM), and the likelihood ratio test (LRT). The results show that the proposed algorithm achieves an accuracy of 76.37% in product feature extraction from Chinese online reviews, which is 2.94%, 5.77%, and 7.57% higher than that of the GSP algorithm, CLM, and LRT, respectively.
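The two-stage idea in the abstract (candidates first, then a co-occurrence threshold) can be illustrated with a minimal Python sketch. The abstract does not define TPCW, so the score below is an assumed stand-in: the number of reviews containing both words divided by the number containing either. The names `pair_cooccurrence` and `secondary_prune`, the toy reviews, and the threshold of 0.6 are all hypothetical, and the candidate list is supplied by hand rather than produced by an actual GSP run.

```python
def pair_cooccurrence(w1, w2, reviews):
    """Assumed score (not the paper's TPCW definition):
    reviews containing both words / reviews containing either word."""
    both = sum(1 for r in reviews if w1 in r and w2 in r)
    either = sum(1 for r in reviews if w1 in r or w2 in r)
    return both / either if either else 0.0


def secondary_prune(candidates, reviews, threshold):
    """Keep candidates whose weakest adjacent word pair meets the threshold.

    candidates: list of word tuples (stand-in for first-stage output).
    reviews: list of tokenized reviews (lists of words).
    """
    review_sets = [set(r) for r in reviews]
    kept = []
    for cand in candidates:
        pairs = list(zip(cand, cand[1:]))
        if not pairs:
            # Single-word candidate: no pair to score, so keep it as-is.
            kept.append(cand)
            continue
        score = min(pair_cooccurrence(a, b, review_sets) for a, b in pairs)
        if score >= threshold:
            kept.append(cand)
    return kept


# Toy data, in English for readability; real input would be segmented Chinese.
reviews = [
    ["image", "quality", "good"],
    ["image", "quality", "clear"],
    ["price", "good"],
    ["image", "good"],
]
candidates = [("image", "quality"), ("image", "good"), ("price",)]
kept = secondary_prune(candidates, reviews, threshold=0.6)
print(kept)
```

In this toy run, ("image", "quality") scores 2/3 and survives, while ("image", "good") scores 2/4 and is pruned. In practice the threshold would have to be tuned on labeled data.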
Source
Journal of Southeast University (Natural Science Edition)
EI
CAS
CSCD
Peking University Core Journals
2016, No. 3, pp. 513-517 (5 pages)
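Funding
Fundamental Research Funds for the Central Universities
National High Technology Research and Development Program of China (863 Program) (2015AA015904)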
Keywords
feature extraction
secondary pruning
term pair co-occurrence weight
likelihood ratio test
cross language model