Twitter text normalization based on unsupervised learning algorithm


Abstract: Twitter messages contain a large number of nonstandard tokens, created intentionally or unintentionally by users. Normalizing these tokens beforehand is necessary for many natural language processing tasks. To address the poor performance of existing normalization systems, a novel unsupervised text normalization system is proposed. First, a constructed standard dictionary is used to determine whether a tweet needs to be normalized. Second, each nonstandard token is assigned to 1-to-1 or 1-to-N normalization according to its characteristics; for 1-to-N normalization, forward and backward search is used to compute all possible multi-word combinations. Third, for the tokens within a multi-word combination that are still nonstandard, suitable candidates are generated by combining a bipartite graph random walk with spelling checking. Finally, a context-based n-gram language model selects the most suitable standard words. On a manually annotated dataset, the proposed approach achieves an F-score of 86.4%, which is 10 percentage points higher than the current best graph-based random walk algorithm.
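The first two steps of the abstract (dictionary lookup, then forward and backward search for 1-to-N splitting) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the dictionary, the search procedure, or any function names, so `STANDARD_WORDS`, `needs_normalization`, `forward_split`, `backward_split`, and `split_candidates` are hypothetical. The sketch uses greedy longest matching in each direction, which is a simplification and does not enumerate all possible combinations as the paper describes.

```python
# Hypothetical toy dictionary; the paper's actual standard dictionary is not
# described in the abstract.
STANDARD_WORDS = {"see", "you", "tomorrow"}


def needs_normalization(tokens, dictionary=STANDARD_WORDS):
    """Return the tokens that are missing from the dictionary."""
    return [t for t in tokens if t.lower() not in dictionary]


def forward_split(token, dictionary=STANDARD_WORDS):
    """Greedy longest match from the left; None if the token is not fully covered."""
    words, start = [], 0
    while start < len(token):
        for end in range(len(token), start, -1):
            if token[start:end] in dictionary:
                words.append(token[start:end])
                start = end
                break
        else:  # no dictionary word begins at this position
            return None
    return words


def backward_split(token, dictionary=STANDARD_WORDS):
    """Greedy longest match from the right; None if the token is not fully covered."""
    words, end = [], len(token)
    while end > 0:
        for start in range(0, end):
            if token[start:end] in dictionary:
                words.insert(0, token[start:end])
                end = start
                break
        else:  # no dictionary word ends at this position
            return None
    return words


def split_candidates(token, dictionary=STANDARD_WORDS):
    """Collect the distinct splits found by the two search directions."""
    candidates = []
    for split in (forward_split(token, dictionary), backward_split(token, dictionary)):
        if split is not None and split not in candidates:
            candidates.append(split)
    return candidates


tweet = ["seeyou", "tomorrow"]
for token in needs_normalization(tweet):
    print(token, split_candidates(token))
```

In this sketch a token that cannot be split into dictionary words yields an empty candidate list; in the system described by the abstract, such tokens would instead go through candidate generation (random walk and spelling checking) and language-model ranking, which are not shown here.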

Authors: DENG Jiayuan (邓加原), JI Donghong (姬东鸿), FEI Chaoqun (费超群), REN Yafeng (任亚峰)

Affiliation: School of Computer Science, Wuhan University (武汉大学计算机学院)

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2016, Issue 7, pp. 1887-1892 (6 pages)

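Funding: Key Program of the National Natural Science Foundation of China (61133012); National Natural Science Foundation of China (61173062); Major Program of the National Philosophy and Social Science Foundation of China (11&ZD189)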

Keywords: normalization; unsupervised learning; bipartite graph; random walk; spelling checker

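Classification: TP391 [Automation and Computer Technology: Computer Application Technology]; TP18 [Automation and Computer Technology: Control Theory and Control Engineering]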


