Abstract
Chinese microblog posts are short and contain no delimiters between words. To address these characteristics and meet the demand for Chinese microblog authorship attribution, we build a multidimensional stylistic feature model of Chinese microblog authors that consists of lexical, shallow syntactic, and deep syntactic feature sets. We evaluate the model on a public microblog corpus with support vector machine (LibSVM), Sequential Minimal Optimization SVM, Naive Bayes, and decision tree algorithms. The evaluation covers algorithm comparison experiments, feature-set combination experiments, and authorship attribution experiments for each group of stylistic features. The results verify the contribution of each feature dimension and show that the model achieves high precision and recall with good time efficiency on the Chinese microblog authorship attribution task.
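As a rough illustration of the task setup only, the sketch below attributes short Chinese posts to authors with a multinomial Naive Bayes classifier over character bigrams. It is not the paper's model: character bigrams stand in for the lexical feature set as an assumption (they need no word segmentation), the syntactic feature sets are omitted, and the class name `NaiveBayesAuthorship`, the helper `char_bigrams`, and the toy posts are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return overlapping character bigrams; no word segmentation is required."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


class NaiveBayesAuthorship:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # smoothing constant
        self.feature_counts = defaultdict(Counter)  # author -> bigram counts
        self.doc_counts = Counter()                 # author -> number of posts
        self.vocab = set()                          # all bigrams seen in training

    def fit(self, posts, authors):
        for post, author in zip(posts, authors):
            grams = char_bigrams(post)
            self.feature_counts[author].update(grams)
            self.doc_counts[author] += 1
            self.vocab.update(grams)
        return self

    def predict(self, post):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_author, best_score = None, float("-inf")
        for author, counts in self.feature_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[author] / total_docs)
            denom = sum(counts.values()) + self.alpha * vocab_size
            for gram in char_bigrams(post):
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Toy, made-up posts for two hypothetical authors.
train_posts = [
    "今天天气真好啊哈哈",
    "哈哈哈真好玩啊",
    "据报道,该公司发布公告",
    "该公司据悉将发布新品",
]
train_authors = ["author_a", "author_a", "author_b", "author_b"]

model = NaiveBayesAuthorship().fit(train_posts, train_authors)
print(model.predict("哈哈今天真好"))
print(model.predict("该公司发布报道"))
```

A fuller pipeline along the lines the abstract describes would add syntactic features and compare several classifiers; this sketch fixes one classifier and one feature type to keep the example self-contained.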
Source
Journal of the China Society for Scientific and Technical Information (《情报学报》)
CSSCI
CSCD
PKU Core Journals
2017, No. 1, pp. 72-78 (7 pages)
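Funding
National Social Science Fund of China, General Program (15BYY028)
Scientific Research Starting Foundation for Returned Overseas Scholars, Ministry of Education of China (教外司[2015]1098)
Ministry of Education Humanities and Social Sciences Youth Fund Project (11YJCZH131)
Dalian University of Foreign Languages Research Project (2013XJQN20, 2014XJQN15)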
Keywords
Chinese
microblog
authorship attribution