Abstract
To address the poor segmentation performance on Chinese texts in the electric power domain, which contain a large number of domain-specific terms, a Chinese Word Segmentation (CWS) method for the electric power domain based on an improved BERT (Bidirectional Encoder Representations from Transformers) was proposed. Firstly, two lexicons covering general words and domain words respectively were built, and a dual-lexicon matching fusion mechanism was designed to inject word features directly into the BERT model, enabling the model to exploit external knowledge more effectively. Secondly, the DEEPNORM method was introduced to improve the model's feature-extraction ability, and the optimal model depth was determined with the Bayesian Information Criterion (BIC), allowing the BERT model to be stably deepened to 40 layers. Finally, the classical self-attention layers in the BERT model were replaced with ProbSparse self-attention layers, and the optimal value of the sampling factor was determined with the Particle Swarm Optimization (PSO) algorithm, reducing model complexity while preserving performance. Segmentation performance was tested on a manually annotated dataset of patent texts in the electric power domain. Experimental results show that the proposed method achieves an F1 score of 92.87% on this dataset, which is 14.70, 9.89 and 3.60 percentage points higher than those of the Hidden Markov Model (HMM), the multi-criteria segmentation model METASEG (pre-training model with META learning for Chinese word SEGmentation) and the Lexicon Enhanced BERT (LEBERT) model respectively, verifying that the proposed method effectively improves the segmentation quality of Chinese texts in the electric power domain.
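The dual-lexicon matching fusion is closely related to LEBERT-style lexicon adapters: each character's hidden state attends over the embeddings of the general- and domain-lexicon words that cover it, and the weighted word feature is added back. The following is a minimal sketch under that assumption; lexicon matching and embedding lookup are presumed done upstream, and all class and parameter names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class DualLexiconFusion(nn.Module):
    """Illustrative fusion of matched lexicon-word features into
    character-level BERT hidden states (a LEBERT-style adapter sketch)."""
    def __init__(self, hidden_size: int, word_emb_size: int):
        super().__init__()
        self.proj = nn.Linear(word_emb_size, hidden_size)  # word emb -> hidden space
        self.attn = nn.Linear(hidden_size, 1)              # per-word relevance score

    def forward(self, char_hidden, word_embs, word_mask):
        # char_hidden: (B, T, H) character hidden states from a BERT layer
        # word_embs:   (B, T, K, E) embeddings of up to K words covering each char
        # word_mask:   (B, T, K) -- 1 where a general- or domain-lexicon word matched
        w = self.proj(word_embs)                           # (B, T, K, H)
        scores = self.attn(torch.tanh(w)).squeeze(-1)      # (B, T, K)
        scores = scores.masked_fill(word_mask == 0, -1e9)  # ignore empty slots
        alpha = torch.softmax(scores, dim=-1).unsqueeze(-1)
        word_feat = (alpha * w).sum(dim=2)                 # (B, T, H)
        return char_hidden + word_feat                     # residual fusion
```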
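DEEPNORM stabilizes very deep post-LayerNorm Transformers by up-weighting the residual branch: for an encoder-only stack of N layers the scale is alpha = (2N)^(1/4), about 2.99 at N = 40. Below is a minimal sketch of the residual wrapper, together with the standard BIC formula used to compare candidate depths; the wiring into BERT's encoder layers is omitted, and names are illustrative.

```python
import math
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    """Post-LayerNorm residual with DEEPNORM scaling (Wang et al., 2022):
    x_{l+1} = LN(alpha * x_l + F(x_l)), where alpha = (2N)^0.25 for an
    encoder-only model with N layers (~2.99 when N = 40)."""
    def __init__(self, hidden_size: int, num_layers: int):
        super().__init__()
        self.alpha = (2 * num_layers) ** 0.25
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + sublayer_out)

def bic(log_likelihood: float, num_params: int, num_samples: int) -> float:
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L-hat).
    Lower is better; the depth minimizing BIC is selected (40 layers here)."""
    return num_params * math.log(num_samples) - 2.0 * log_likelihood
```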
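ProbSparse self-attention (from Informer, Zhou et al., 2021) lets only the most "active" queries attend over all keys, cutting cost from O(L^2) toward O(L ln L), while the remaining queries fall back to the mean of V. The sketch below assumes the usual (B, H, L, D) tensor layout; the sampling factor c, which PSO tunes in the paper, controls both how many keys are sampled for scoring and how many queries keep full attention.

```python
import math
import torch

def probsparse_attention(Q, K, V, c: float = 5.0):
    """Minimal ProbSparse self-attention sketch (shapes: B, H, L, D)."""
    B, H, L_Q, D = Q.shape
    L_K = K.shape[2]
    scale = 1.0 / math.sqrt(D)

    # 1) Sample u_part = c * ln(L_K) keys per query to estimate each query's
    #    sparsity score M(q, K) = max_j(score) - mean_j(score).
    u_part = min(L_K, int(c * math.log(L_K)) + 1)
    idx = torch.randint(L_K, (L_Q, u_part))
    K_sample = K[:, :, idx, :]                              # (B, H, L_Q, u_part, D)
    scores_sample = torch.einsum('bhld,bhlud->bhlu', Q, K_sample) * scale
    sparsity = scores_sample.max(-1).values - scores_sample.mean(-1)

    # 2) Keep the u = c * ln(L_Q) most active queries and give them full attention.
    u = min(L_Q, int(c * math.log(L_Q)) + 1)
    top = sparsity.topk(u, dim=-1).indices                  # (B, H, u)
    Q_top = Q.gather(2, top.unsqueeze(-1).expand(-1, -1, -1, D))
    attn = torch.softmax(Q_top @ K.transpose(-2, -1) * scale, dim=-1)

    # 3) Lazy queries receive the mean of V; active ones get real attention output.
    out = V.mean(dim=2, keepdim=True).expand(B, H, L_Q, D).clone()
    out.scatter_(2, top.unsqueeze(-1).expand(-1, -1, -1, D), attn @ V)
    return out
```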
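The sampling factor can then be tuned with a standard particle swarm. A minimal one-dimensional PSO sketch follows; `objective` would wrap a run that trains or evaluates the segmenter at a given c and returns a loss such as negative validation F1 (hypothetical wiring, not the paper's exact setup).

```python
import random

def pso_minimize(f, lo, hi, n_particles=10, iters=30, w=0.7, c1=1.5, c2=1.5):
    """Minimal 1-D particle swarm optimizer for the sampling factor c."""
    pos = [random.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest, pbest_val = pos[:], [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            # Velocity update: inertia + pull toward personal and global bests.
            vel[i] = (w * vel[i] + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))  # clamp to search range
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i], val
    return gbest, gbest_val

# Example use (hypothetical objective): best_c, _ = pso_minimize(objective, 1.0, 8.0)
```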
Authors
XIA Fei; CHEN Shuaiqi; HUA Min; JIANG Bihong (College of Automation Engineering, Shanghai University of Electric Power, Shanghai 200090, China; Electric Power Research Institute, State Grid Shanghai Electric Power Company, Shanghai 200437, China; Library, Shanghai University of Electric Power, Shanghai 200090, China)
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal (北大核心)
2023, No. 12, pp. 3711-3718 (8 pages)
Funding
State Grid Science and Technology Project (52094020001A).
Keywords
Chinese Word Segmentation (CWS)
domain word segmentation
improved BERT (Bidirectional Encoder Representations from Transformers)
electric power text
deep learning
natural language processing