
Automatic Classification of Tang Poetry Themes (唐诗题材自动分类研究)

Cited by: 16
Abstract: This paper applies text classification techniques to the study of Tang poetry. Tang poems are first divided by theme into seven categories: love and marriage, frontier and war, friendship and farewell, travel and homesickness, landscape and pastoral life, history and nostalgia, and other. An automatic theme classification model is proposed on this basis. The 500 sample poems are drawn mainly from Three Hundred Tang Poems (《唐诗三百首》), with some additions. Each poem is converted to a vector with the vector space model (VSM), and word features are selected with the chi-square test to reduce the vector dimensions. Text classifiers are then built with the Naive Bayes and support vector machine algorithms, and they achieve good theme classification results. In addition, the authors verify their hypothesis that variables such as a poem's title, form, and author affect theme classification, which offers scientific grounding for related research on the poems themselves.
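The pipeline named in the abstract (bag-of-words vectors, chi-square feature selection, then a Naive Bayes classifier) can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the toy tokens and labels are invented, the per-term score is taken as the maximum chi-square over classes, the classifier is a multinomial Naive Bayes with add-one smoothing, and the support vector machine step is omitted.

```python
import math
from collections import Counter, defaultdict


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class table.

    n11: docs in the class that contain the term
    n10: docs outside the class that contain the term
    n01: docs in the class that lack the term
    n00: docs outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return 0.0 if denom == 0 else n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest max-over-classes chi-square score."""
    scores = {}
    for term in {t for d in docs for t in d}:
        best = 0.0
        for c in set(labels):
            n11 = n10 = n01 = n00 = 0
            for d, y in zip(docs, labels):
                has_term = term in d
                if has_term and y == c:
                    n11 += 1
                elif has_term:
                    n10 += 1
                elif y == c:
                    n01 += 1
                else:
                    n00 += 1
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[term] = best
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def train_nb(docs, labels, features):
    """Count class priors and per-class term frequencies over selected features."""
    feats = set(features)
    prior = Counter(labels)
    counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        counts[y].update(t for t in d if t in feats)
    return prior, counts, feats


def predict_nb(model, doc):
    """Return the class with the highest log posterior (add-one smoothing)."""
    prior, counts, feats = model
    total_docs = sum(prior.values())
    best_label, best_lp = None, -math.inf
    for y in sorted(prior):
        total = sum(counts[y].values())
        lp = math.log(prior[y] / total_docs)
        for t in doc:
            if t in feats:
                lp += math.log((counts[y][t] + 1) / (total + len(feats)))
        if lp > best_lp:
            best_label, best_lp = y, lp
    return best_label


# Invented toy corpus: each "poem" is an already-segmented list of tokens.
toy_docs = [
    ["frontier", "horse", "moon"],
    ["frontier", "sword", "horse"],
    ["mountain", "stream", "moon"],
    ["mountain", "bird", "stream"],
]
toy_labels = ["frontier_war", "frontier_war", "landscape", "landscape"]

selected = select_features(toy_docs, toy_labels, k=4)
model = train_nb(toy_docs, toy_labels, selected)
prediction = predict_nb(model, ["horse", "moon", "frontier"])
```

In this toy corpus the token "moon" appears equally often in both classes, so its chi-square score is zero and it is dropped. That is the dimension-reduction effect the abstract attributes to feature selection.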
Source: Acta Scientiarum Naturalium Universitatis Pekinensis (《北京大学学报(自然科学版)》), 2015, No. 2, pp. 262-268 (7 pages). Indexed in EI, CAS, CSCD, and the Peking University core journal list (北大核心).
Funding: Supported by the 863 Program (2012AA011104).
Keywords: Tang poetry; theme; text classification; chi-square test; Naive Bayes; support vector machine

