Abstract
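With the growing popularity of smart terminal devices, the volume of extremely short mixed Chinese-English texts rich in emotional tendency on Weibo, WeChat, and other highly popular Chinese social platforms has grown explosively. To effectively extract key features such as emotional tendency from these texts, this paper proposes an extremely short text classification model based on emotional tendency and SVM. First, the raw data are identified and processed with Kettle and the N-Gram model; then TF-IDF is used to extract the keywords needed for classification; next, the processed data are stored in a word-vector set; finally, SVM is used to classify the mixed extremely short texts. The validity of the model was checked by K-fold cross-validation. The experiments and analysis use 6905 extremely short texts from mainstream social platforms such as Weibo as the sample. The results show that, in terms of classification accuracy, the method effectively improves matching efficiency, and that the matching results are more balanced in terms of generalization error and precision.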
With the rapid development of the Internet, hybrid Chinese-English extremely short texts carrying emotional tendency, posted on Weibo and other popular Chinese social platforms, are growing explosively. This raises the demand for more efficient classification of hybrid extremely short texts. To handle the hybrid extremely short texts obtained after analyzing the quality of the original data, a classification technique is put forward. First, the original data are recognized and processed with Kettle, the N-Gram model, and emotional tendency. Then, the necessary keywords are extracted using TF-IDF. After that, the processed data are stored in a word-vector set. Finally, the mixed extremely short texts are classified using SVM. The validity of the model was verified by K-fold cross-validation. A sample of 6905 extremely short texts from mainstream platforms such as Weibo was used for the experiments and analysis. The results show that, in terms of classification accuracy, the constructed classification model improves matching efficiency. At the same time, the matching results are more balanced in terms of generalization error and accuracy.
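The processing chain named in the abstract (N-Gram features, TF-IDF weighting, SVM classification, K-fold validation) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the Kettle cleaning step and the emotion-tendency features are omitted, a bias-free linear SVM trained by Pegasos-style subgradient steps stands in for whatever SVM variant the paper used, and every function name, parameter value, and toy sentence below is a hypothetical choice made for illustration.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Overlapping character n-grams; needs no word segmenter for mixed Chinese/English text."""
    s = "".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def fit_idf(docs, n=2):
    """Smoothed inverse document frequency of every n-gram seen in docs."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc, n)))
    return {g: math.log((1 + len(docs)) / (1 + c)) + 1.0 for g, c in df.items()}

def tfidf_vector(doc, idf, n=2):
    """Sparse, L2-normalised TF-IDF vector; n-grams unseen at fit time are dropped."""
    tf = Counter(g for g in char_ngrams(doc, n) if g in idf)
    vec = {g: c * idf[g] for g, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}

def score(w, x):
    """Dot product of a sparse weight dict and a sparse feature dict."""
    return sum(w.get(g, 0.0) * v for g, v in x.items())

def train_linear_svm(X, y, lam=0.01, epochs=50):
    """Linear SVM (hinge loss, L2 penalty, no bias) via Pegasos-style steps in a fixed order."""
    w, t = {}, 0
    for _ in range(epochs):
        for x, label in zip(X, y):
            t += 1
            violated = label * score(w, x) < 1.0  # margin checked before the update
            for g in w:
                w[g] *= 1.0 - 1.0 / t  # shrink by (1 - eta * lam), with eta = 1 / (lam * t)
            if violated:
                eta = 1.0 / (lam * t)
                for g, v in x.items():
                    w[g] = w.get(g, 0.0) + eta * label * v
    return w

def predict(w, x):
    """Return +1 for a positive score, otherwise -1."""
    return 1 if score(w, x) > 0 else -1

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold i holds out every k-th sample starting at i."""
    for i in range(k):
        test = list(range(i, n_samples, k))
        held = set(test)
        yield [j for j in range(n_samples) if j not in held], test

def cross_val_accuracy(docs, labels, k=3):
    """Mean held-out accuracy, refitting the IDF table and the SVM inside each fold."""
    accs = []
    for train, test in k_fold_indices(len(docs), k):
        idf = fit_idf([docs[j] for j in train])
        w = train_linear_svm([tfidf_vector(docs[j], idf) for j in train],
                             [labels[j] for j in train])
        hits = sum(predict(w, tfidf_vector(docs[j], idf)) == labels[j] for j in test)
        accs.append(hits / len(test))
    return sum(accs) / len(accs)

# Hypothetical toy data (not from the paper): +1 = positive, -1 = negative sentiment.
docs = ["今天好开心 so happy", "love 这部电影", "太棒了 great",
        "真难过 sad", "很糟糕 awful", "讨厌 boring"]
labels = [1, 1, 1, -1, -1, -1]

idf = fit_idf(docs)
w = train_linear_svm([tfidf_vector(d, idf) for d in docs], labels)
print(predict(w, tfidf_vector("好开心 happy", idf)))
print(predict(w, tfidf_vector("太难过 sad", idf)))
```

Character bigrams are used here because they apply uniformly to Chinese and English characters without a segmenter; whether the paper's N-Gram step works at the character or word level is not stated in the abstract. On a six-sentence toy set `cross_val_accuracy` is not meaningful; it only shows where the IDF table and classifier would be refit for each fold.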
Source
《科技通报》
2018, No. 8, pp. 149-154 (6 pages)
Bulletin of Science and Technology
Funding
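National Natural Science Foundation of China (No. 61572036)
Key Project of Natural Science Research of Anhui Provincial Universities (No. KJ2016A167)
Key Project of Natural Science Research of Anhui Provincial Higher Education Institutions (No. KJ2017A639)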