Abstract
Language models in computational linguistics fall into three types: rule-based, statistics-based, and neural-network-based. Rule-based language models consist mainly of phrase structure grammar models and dependency grammar models; they have achieved some success in computational linguistics application systems for certain "sub-languages", but they still face great difficulty in processing authentic texts. Statistics-based language models place great emphasis on the role of statistics in model construction: linguistic knowledge is acquired mainly through probabilistic and statistical computation over large-scale authentic corpora, and knowledge obtained in this way reflects the true face of natural language more comprehensively and accurately. For this reason, statistics-based language models became widely popular in computational linguistics. Since the beginning of the 21st century, neural-network-based language models have emerged; they outperform statistics-based language models and now occupy the mainstream of natural language processing research.
In computational linguistics, to process natural languages directly by computer, we need to formalize the linguistic problem mathematically, represent it algorithmically, and establish a language model. A language model is an abstract formal system of an objective language, and the study of language models has great theoretical significance and application value for computational linguistics. There are three language models in computational linguistics: the rule-based language model, the statistics-based language model, and the neural-network-based language model. The rule-based language model mainly includes phrase structure grammar and dependency grammar. Based on phrase structure grammar, computational linguists proposed the recursive transition network, augmented transition network, top-down parsing, bottom-up parsing, the general syntactic processor, chart parsing, left-corner parsing, CYK parsing, the Earley algorithm, the Tomita algorithm, tree-adjoining grammar, and left-associative grammar. Afterward, they proposed complex-feature-based and unification-based language models such as lexical functional grammar, functional unification grammar, the PATR algorithm, definite clause grammar, generalized phrase structure grammar, head-driven phrase structure grammar, the multiple-branched and multiple-labeled tree model (MMT model), etc. Based on dependency grammar, computational linguists proposed combinatory categorial grammar, word grammar, valency grammar, etc. The rule-based language model is successful in some sub-language fields of computational linguistics, but it is very difficult for this model to process large-scale authentic texts. The statistics-based language model is very successful in the fields of character recognition, speech recognition, speech synthesis, and machine translation. Statistics-based language models include the N-gram model, the noisy channel model, the hidden Markov model, the maximum entropy model, the conditional random field model, probabilistic context-free grammar, lexicalized probabilistic context-free grammar, the dynamic programming algorithm, the minimum edit distance algorithm, the decision tree model, weighted automata, the Viterbi algorithm, the forward algorithm, the forward-backward algorithm, etc. These statistical language models all place great emphasis on the role of statistics in their construction: linguistic knowledge is obtained mainly from large-scale authentic corpora using probabilistic and statistical approaches, so that the knowledge obtained reflects the true aspects of natural language more comprehensively and accurately. Statistical models have thus become widely popular in computational linguistics. Since the 21st century, the neural network model has been the mainstream of natural language processing. In a neural network language model, the context of a word is represented by word vectors. Representing context with word vectors, rather than with precise, concrete words as in traditional rule-based and statistical language models, allows the neural network language model to generalize to "unseen data", which makes it superior to the traditional rule-based and statistical language models.
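The statistical approach described above — estimating linguistic knowledge from corpus counts — can be illustrated with a minimal bigram N-gram sketch. The toy corpus and function names below are illustrative assumptions, not drawn from the paper:

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large authentic corpus.
corpus = "the cat sat on the mat the cat ate".split()

# Count unigrams and adjacent word pairs (bigrams).
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1: str, w2: str) -> float:
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))  # "the" is followed by "cat" in 2 of its 3 occurrences
```

An unseen pair such as ("cat", "on") receives probability zero under this raw estimate — the data-sparseness problem that smoothing techniques in statistical models, and distributed word vectors in neural models, are meant to overcome.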
Authors
冯志伟
丁晓梅
FENG Zhiwei; DING Xiaomei (Shandong Key Laboratory of Language Resources Development and Application, Ludong University, Yantai, Shandong 264026, China; Dalian Maritime University, Dalian, Liaoning 116026, China)
Source
《外语电化教学》
CSSCI
Peking University Core Journal (北大核心)
2021, No. 6, pp. 17-24 and 3 (9 pages in total)
Technology Enhanced Foreign Language Education
Funding
A phased result of the National Social Science Fund of China project "Research on the Compilation of a Russian-Chinese Dictionary of Linguistic Terminology Based on a Parallel Corpus" (Project No. 17BYY220).