
Building a Foul Words Text Corpus (cited by: 9)
Abstract: Foul language, an informal linguistic phenomenon, is now pervasive in online reviews and harms the civility of online discourse. This paper describes the definition, characteristics, and harms of foul-language text and analyzes foul-language text found on the Web. It designs a corpus collection method that combines automatic machine classification with a small amount of manual annotation. Drawing on a very large body of real review text, the authors build a relatively large, high-quality foul words corpus, with an initial collection of more than 6,000 foul-language sentences. Automatic classification experiments using unigram, bigram, and trigram features with SVM and maximum entropy classifiers show that both classifiers reach accuracy and recall above 97%.
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2014, No. 11, pp. 126-129 (4 pages)
Keywords: foul words text; corpus; text classification; automatic identification
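The abstract mentions unigram, bigram, and trigram features and reports accuracy and recall figures. The following is a minimal sketch of those two pieces, not the authors' implementation. It assumes sentences are already segmented into tokens, and it reads the reported "accuracy" as precision, which this record does not confirm. The function names `ngram_features` and `precision_recall` and the sample tokens are hypothetical. The SVM and maximum entropy classifiers themselves are not shown.

```python
from collections import Counter
from typing import Sequence, Tuple


def ngram_features(tokens: Sequence[str], max_n: int = 3) -> Counter:
    """Count every n-gram of order 1..max_n in a pre-segmented sentence.

    Each n-gram is stored as a tuple of tokens, so a bigram can never be
    confused with a single token that happens to contain a separator.
    """
    feats: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[tuple(tokens[i:i + n])] += 1
    return feats


def precision_recall(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for the positive class (label 1 = foul sentence)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical, already-segmented sentence (placeholder tokens only).
features = ngram_features(["you", "are", "a", "fool"])

# Hypothetical gold labels and classifier predictions for four sentences.
precision, recall = precision_recall([1, 1, 0, 0], [1, 0, 1, 0])
```

A four-token sentence yields four unigrams, three bigrams, and two trigrams. In practice the resulting counts would be mapped to a sparse vector before being passed to a classifier.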


