Abstract
As an informal language phenomenon, foul language is now ubiquitous in Web reviews and has a negative effect on online civility. This paper describes the definition, characteristics, and harms of foul-language text and analyzes such text as it appears on the Web. It designs a corpus collection method that combines automatic machine judgment with a small amount of manual annotation. Drawing on a very large volume of real review text, the method is used to build a relatively large, high-quality foul-word corpus, with more than 6,000 sentences collected so far. Experiments on automatic foul-language classification are then run with SVM and maximum entropy classifiers using unigram, bigram, and trigram features. The results show that both classifiers exceed 97% in accuracy and in recall.
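The abstract names unigram, bigram, and trigram features and reports accuracy and recall, but gives no implementation details. The following is only a minimal sketch of how such features and metrics could be computed. It assumes sentences are already split into tokens (whether the paper uses word or character units is not stated here), and the function names `ngram_features` and `accuracy_and_recall` are hypothetical, not taken from the paper.

```python
from collections import Counter


def ngram_features(tokens, max_n=3):
    """Count every n-gram of order 1..max_n in a token sequence.

    Assumption: `tokens` is a pre-tokenized sentence; the paper's own
    tokenization and feature weighting are not specified in the abstract.
    """
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def accuracy_and_recall(gold, pred, positive=1):
    """Return (accuracy, recall) for the class labelled `positive`."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    true_pos = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    actual_pos = sum(1 for g in gold if g == positive)
    accuracy = correct / len(gold)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


if __name__ == "__main__":
    # Placeholder tokens stand in for a real review sentence.
    print(ngram_features(["t1", "t2", "t3"]))
    print(accuracy_and_recall([1, 1, 0, 0], [1, 0, 0, 0]))
```

The resulting count dictionaries could be fed to any classifier that accepts sparse features, such as an SVM or a maximum entropy model; the classifier settings used in the paper are not given in this record.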
Source
《计算机工程与应用》
CSCD
2014, Issue 11, pp. 126-129 (4 pages)
Computer Engineering and Applications
Keywords
foul words
corpus
text classification
automatic identification