
基于条件随机场模型和文本纠错的微博新词词性识别研究 (cited by: 7)

Part-of-speech tagging of microblog unknown words based on conditional random fields and error correction
Abstract: To address the characteristics of microblog data, this paper performs part-of-speech (POS) tagging with a rule-based noise-reduction step and a linear-chain conditional random field (CRF) model, and applies a Bayesian method as a second pass to correct the POS tags of homophonic words, which make up a large share of microblog new words. Raw microblog data are first collected through the Sina platform API and a web crawler, and hand-written rules based on the observed noise patterns are used to denoise them. Because CRFs handle feature extraction well in Chinese POS tagging, a CRF model then assigns initial POS tags to the denoised corpus. Next, since a homophonic word and its original word have similar or identical pronunciation, microblog words are converted to pinyin (the Chinese Phonetic Alphabet), a Naive Bayes method computes candidate original words for each homophonic word, and the context of each word is used to map the homophonic word to its original word. The known POS of the original word is then used to correct the POS tag of the homophonic word. Evaluation on a microblog test set shows that the method tags unknown microblog words well, with a best POS tagging precision of 95.23%.
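Source: Journal of Nanjing University (Natural Science) (《南京大学学报(自然科学版)》), indexed in CAS, CSCD, and the Peking University Core Journals list, 2016, No. 2, pp. 353-360 (8 pages)
Funding: National Natural Science Foundation of China (61202181); Postdoctoral Science Foundation (2012M512006); Fundamental Research Funds for the Central Universities (XJJ2013097)
Keywords: conditional random fields (CRFs); microblog; noisy data; homophonic words; word correction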
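
The homophone correction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon, the counts, the pinyin strings, and the function names (`candidates`, `log_score`, `correct_pos`) are all hypothetical toy values chosen for illustration, and the pinyin of the microblog word is assumed to be supplied by some external converter. The sketch looks up known words with the same pinyin, ranks them with a Naive Bayes score over context words (with add-one smoothing), and returns the POS tag of the best candidate.

```python
import math

# Hypothetical toy lexicon of known words: word -> (pinyin, POS tag).
LEXICON = {
    "有才": ("you cai", "a"),  # "talented", adjective
    "油菜": ("you cai", "n"),  # "rapeseed", noun
}

# Hypothetical corpus frequency of each known word (used for the prior).
WORD_COUNTS = {"有才": 30, "油菜": 20}

# Hypothetical counts of context words observed near each known word.
CONTEXT_COUNTS = {
    "有才": {"真": 12, "很": 10, "你": 8},
    "油菜": {"炒": 9, "吃": 7, "种": 4},
}

# Context vocabulary, needed for add-one smoothing.
VOCAB = {c for counts in CONTEXT_COUNTS.values() for c in counts}


def candidates(pinyin):
    """Return known words whose pinyin matches the given pinyin exactly."""
    return [w for w, (py, _) in LEXICON.items() if py == pinyin]


def log_score(word, context):
    """Naive Bayes log score: log P(word) + sum of log P(context_i | word)."""
    score = math.log(WORD_COUNTS[word] / sum(WORD_COUNTS.values()))
    counts = CONTEXT_COUNTS[word]
    denom = sum(counts.values()) + len(VOCAB)  # add-one smoothing
    for c in context:
        score += math.log((counts.get(c, 0) + 1) / denom)
    return score


def correct_pos(pinyin, context):
    """Pick the most probable original word and return (word, POS tag).

    Returns None when no known word shares the pinyin.
    """
    cands = candidates(pinyin)
    if not cands:
        return None
    best = max(cands, key=lambda w: log_score(w, context))
    return best, LEXICON[best][1]
```

With these toy counts, a token pronounced "you cai" that appears after 你 and 真 maps to 有才 and receives the adjective tag, while the same pronunciation after 炒 maps to 油菜 and keeps the noun tag. A real system would replace the toy tables with counts from a tagged corpus and would also need to handle similar, not only identical, pronunciations.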

