Blogger Identification Based on Multidimensional Stylistic Features

Original title: 基于多层面文体特征的博客作者身份识别研究 (cited by: 14)
Abstract: Traditional stylistic feature models are poorly suited to the web texts now appearing in large numbers. Web texts such as blogs are short and use rich, flexible forms of expression. Following the principle of content independence, we extract character features, lexical features, syntactic features, and text layout features, and build a multidimensional stylistic feature model consisting of lexical features, shallow syntactic features, deep syntactic features, and structural features. We compare Naive Bayes, decision tree, Sequential Minimal Optimization (SMO) SVM, and LIBLINEAR (large-scale linear classification) SVM classifiers on a public blog corpus. The results verify the contribution of each feature dimension to authorship identification and indicate that the proposed method is accurate, general, and robust on short texts.
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), indexed in CSSCI and the Peking University Core Journals list, 2015, No. 6, pp. 628-634 (7 pages)
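Funding: This work is one of the research outcomes of the Ministry of Education Humanities and Social Sciences Research Youth Fund project "Authorship Identification of Online Information Based on Multi-Level Feature Analysis" (No. 11YJCZH131), the Liaoning Province Program for Excellent Talents in Higher Education Institutions (No. WJQ2013017), and the Dalian University of Foreign Languages research project "Online Public Opinion Mining Based on Linguistic Features".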
Keywords: stylistic features, blog, author identity, blogger identification