摘要
【目的】利用Word2Vec和Sent2Vec算法生成新浪微博的文本的向量化表示形式,以期在文本分类时获得较低的计算成本和较高的分类效果。【方法】使用文本中词的0-1矩阵进行分类,将分类效果作为基准线;采用Word2Vec算法生成词向量并用不同方式合成句子的向量表示,进行文本分类,并与基准线进行对比;利用Sent2Vec算法直接生成句子向量进行分类,综合评价3种方法的优缺点。【结果】研究显示使用Word2Vec算法和Sent2Vec算法能够极大程度上压缩文本特征,对比于使用所有3万多个词作为特征,Word2Vec算法和Sent2Vec算法将特征数压缩在1 000以内。在分类准确率方面,Word2Vec算法的分类准确率比基准线低约3%,准确率为75.14%。Sent2Vec算法的分类效果远不如其他两种方法,准确率只有63.08%。【局限】由于语料有限,Word2Vec算法在计算词向量时可能缺少足够的语义信息,导致词向量的准确性不高,而Sent2Vec算法在中文文本语境下生成句向量的分类结果较差。【结论】Word2Vec算法更适用大规模语料文本分类,在文本量较少时应使用词为特征分类。
[Objective] This paper uses the Word2Vec and Sent2Vec algorithms to generate vectors for the text posts of Sina Weibo, aiming to achieve lower computational cost and higher efficiency in text classification. [Methods] First, we classified words from the posts with the 0-1 matrix and used results as the baseline. Then, we used the Word2Vec algorithm to generate the word vector and the vector representation of the sentences in different ways. Third, we classified the Weibo posts using sentence vectors generated by the Sent2Vec algorithm. Finally we comprehensively evaluated the advantages and disadvantages of the three methods. [Results] Both Word2Vec and Sent2Vec algorithms could reduce the text features significantly. We used 30,000 words as features and found Word2Vec and Sent2Vec algorithms could reduce feature numbers to less than 1000. The classification accuracy rate of the Word2Vec algorithm was 75.14%, which was 3% lower than the baseline. The accuracy rate of the Sent2Vec algorithm was far less than the other two methods, with the accuracy rate was only 63.08%. [Limitations] The corpus size of this paper needs to be expanded. We found that the Word2Vec algorithm did not have enough semantic information to calculate word vector. However, Sent2Vec has poor classification results for Chinese sentence vectors. [Conclusions] Word2Vec algorithm is suitable for large-scale corpus classification, and words should be used as classification features for lack of text.
作者
李心蕾
王昊
刘小敏
邓三鸿
Li Xinlei;Wang Hao;Liu Xiaomin;Deng Sanhong(School of Information Management, Nanjing University, Nanjing 210023, China;Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China)
出处
《数据分析与知识发现》
CSSCI
CSCD
北大核心
2018年第8期41-50,共10页
Data Analysis and Knowledge Discovery
基金
国家自然科学基金项目"面向学术资源的TSD与TDC测度及分析研究"(项目编号:71503121)
"江苏青年社科英才"人才培养项目的研究成果之一