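Research on an Incremental Training Method for Perceptron-Based Chinese Word Segmentation (基于感知器的中文分词增量训练方法研究). Cited by: 3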

An Incremental Learning Scheme for Perceptron Based Chinese Word Segmentation

Abstract: This paper proposes an incremental training method for perceptron-based Chinese word segmentation. Starting from an already trained source-domain model, the method continues training on annotated target-domain data. This addresses two problems: large-scale segmentation corpora are difficult to share, and mixing source-domain and target-domain data requires retraining from scratch. Experiments show that incremental training effectively improves domain adaptation and achieves results comparable to the traditional data-mixing approach. The method also yields a small model and a short training time, so a target-domain model can be obtained quickly.
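The idea of continuing training from an existing model can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a greedy per-character BMES tagger with a plain (non-averaged) perceptron update, and the feature templates, function names, and toy corpora are all hypothetical. The point it shows is that the target-domain pass starts from the source-domain weights and never revisits the source-domain data.

```python
# Hypothetical sketch: character-tagging perceptron with incremental training.
TAGS = ("B", "M", "E", "S")  # begin / middle / end of a word, single-character word


def words_to_tags(words):
    """Convert a segmented sentence (list of words) into per-character tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def features(chars, i):
    """Assumed feature templates: character unigrams and bigrams in a 3-char window."""
    prev = chars[i - 1] if i > 0 else "<s>"
    nxt = chars[i + 1] if i + 1 < len(chars) else "</s>"
    cur = chars[i]
    return [f"c0={cur}", f"c-1={prev}", f"c+1={nxt}",
            f"c-1c0={prev}{cur}", f"c0c+1={cur}{nxt}"]


def best_tag(weights, feats):
    """Pick the highest-scoring tag; ties go to the earliest tag in TAGS."""
    return max(TAGS, key=lambda t: sum(weights.get((f, t), 0) for f in feats))


def train(weights, corpus, max_epochs=1000):
    """Update `weights` in place with perceptron steps and return it.

    Passing an already trained weight dict continues training from it.
    The cap is generous; the loop exits after the first error-free pass.
    """
    for _ in range(max_epochs):
        mistakes = 0
        for words in corpus:
            chars = list("".join(words))
            for i, gold in enumerate(words_to_tags(words)):
                feats = features(chars, i)
                guess = best_tag(weights, feats)
                if guess != gold:
                    mistakes += 1
                    for f in feats:
                        weights[(f, gold)] = weights.get((f, gold), 0) + 1
                        weights[(f, guess)] = weights.get((f, guess), 0) - 1
        if mistakes == 0:
            break
    return weights


def segment(weights, sentence):
    """Segment a raw sentence by closing a word at every E or S tag."""
    chars = list(sentence)
    words, current = [], ""
    for i, ch in enumerate(chars):
        current += ch
        if best_tag(weights, features(chars, i)) in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Toy corpora (hypothetical): a "source" domain and a medical "target" domain.
source_corpus = [["我", "喜欢", "北京"], ["他", "在", "北京", "工作"]]
target_corpus = [["患者", "服用", "阿司匹林"], ["他", "服用", "阿司匹林"]]

source_model = train({}, source_corpus)
# Incremental step: copy the source weights and train on target data only.
adapted_model = train(dict(source_model), target_corpus)
print(segment(adapted_model, "患者服用阿司匹林"))
```

Only the weight dictionary has to be shipped to whoever performs the second call to `train`, which is what makes it possible to adapt a model without sharing the source-domain corpus. The sketch does not guard against the adapted model losing accuracy on the source domain.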

Authors: HAN Bing (韩冰), LIU Yijia (刘一佳), CHE Wanxiang (车万翔), LIU Ting (刘挺)

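Affiliation: Research Center for Social Computing and Information Retrieval, School of Computer Science, Harbin Institute of Technology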

Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2015, Issue 5, pp. 49-54 (6 pages)

Keywords: Chinese word segmentation; domain adaptation; incremental learning

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
