Abstract
In the artificial intelligence (AI) era, neural networks have become an important tool for natural language processing (NLP). To build a neural network system for NLP, word vectors must be generated; this is a key topic in research on neural network systems for NLP. This paper discusses three approaches to generating word vectors: the skip-gram and CBOW (Continuous Bag of Words) approach, the Singular Value Decomposition (SVD) approach, and the Brown clustering approach. These approaches generate dense word vectors and thereby improve the performance of neural networks for NLP. Dense word vectors have a number of potential advantages. Because they contain fewer parameters than sparse vectors of explicit counts, they may generalize better and help avoid over-fitting. They are also easier to include as continuous real-valued features in deep learning systems, and they serve better as word embeddings in neural network systems for NLP. The skip-gram and CBOW approach learns word embeddings by finding embeddings that have a high dot product with neighboring words and a low dot product with noise words; in effect, it trains a neural network to predict neighboring words. Semantically similar words often occur near each other in text, so word embeddings that are good at predicting neighboring words are also good at representing similarity between words. This approach is fast and efficient to train, is easily available online with code and pre-trained embeddings, and is a popular and effective way to compute word embeddings. Singular Value Decomposition is an approach for finding the most important dimensions of a data set, the dimensions along which the data varies the most. SVD belongs to a family of methods that approximate an N-dimensional dataset using fewer dimensions, including the Principal Component Analysis (PCA) method and the factor analysis method. The PCA method first rotates the axes of the original dataset into a new space, chosen so that the highest-order dimension captures the most variance in the original dataset, the next dimension captures the next most variance, and so on. In this new space, the data can be represented with a smaller number of dimensions while still capturing much of the variation in the original data. The factor analysis method is used to produce a dense embedding from a sparse matrix: it factorizes the word-word PMI (Pointwise Mutual Information) matrix into three matrices, W, Σ, and C; the Σ and C matrices are discarded, and the W matrix is truncated, giving a matrix of k-dimensional embedding vectors, one for each word. The SVD approach can thus be applied to create lower-dimensional word embeddings from a full term-term matrix or term-document matrix. The Brown clustering approach is a clustering algorithm for deriving vector representations of words by clustering words based on their association with preceding and following words. The algorithm makes use of a class-based language model. Brown clusters can be used to create bit vectors for words that function as a syntactic representation.
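The skip-gram objective described above (high dot product with neighboring words, low dot product with noise words) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the toy corpus, dimensionality, window size, and learning rate are all invented for the example, and real systems use large corpora and frequency-weighted negative sampling.

```python
import numpy as np

# Toy corpus; in practice this would be a large tokenized text.
corpus = ("the cat sat on the mat the dog sat on the rug "
          "the cat chased the dog the dog chased the cat").split()

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 8                 # vocabulary size, embedding dimensionality

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (V, d))       # target-word embeddings
C = rng.normal(0, 0.1, (V, d))       # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr, window, k = 0.05, 2, 3           # learning rate, context window, negatives
for epoch in range(50):
    for pos, word in enumerate(corpus):
        t = idx[word]
        lo, hi = max(0, pos - window), min(len(corpus), pos + window + 1)
        for cpos in range(lo, hi):
            if cpos == pos:
                continue
            c = idx[corpus[cpos]]
            # Positive pair: push the dot product W[t]·C[c] up.
            wt = W[t].copy()
            g = sigmoid(W[t] @ C[c]) - 1.0
            W[t] -= lr * g * C[c]
            C[c] -= lr * g * wt
            # k random noise words: push their dot products down.
            for n in rng.integers(0, V, k):
                wt = W[t].copy()
                g = sigmoid(W[t] @ C[n])
                W[t] -= lr * g * C[n]
                C[n] -= lr * g * wt

def cos(a, b):
    """Cosine similarity between two learned word vectors."""
    u, v = W[idx[a]], W[idx[b]]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

After training, words that share contexts (here, "cat" and "dog") should tend to have higher cosine similarity than unrelated word pairs.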
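The SVD pipeline summarized above (factorize a word-word PPMI matrix, discard Σ and C, truncate W to k columns) can likewise be sketched with NumPy. The co-occurrence counts below are invented for illustration; only the procedure follows the abstract.

```python
import numpy as np

# Toy term-term co-occurrence counts (rows and columns share one word list).
words = ["cherry", "strawberry", "digital", "information"]
counts = np.array([
    [0, 8, 1, 2],
    [8, 0, 1, 1],
    [1, 1, 0, 9],
    [2, 1, 9, 0],
], dtype=float)

# Positive PMI: log of observed over expected co-occurrence, floored at 0.
total = counts.sum()
p_ij = counts / total
p_i = counts.sum(axis=1, keepdims=True) / total
p_j = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_ij / (p_i * p_j))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# SVD factorizes the PPMI matrix into W, Σ, C.  Keeping only the first
# k columns of W (and discarding Σ and C) gives each word a dense
# k-dimensional embedding.
W, sigma, C = np.linalg.svd(ppmi)
k = 2
embeddings = W[:, :k]                # one k-dimensional row vector per word
```

The same truncation works on a full term-document matrix; there it is the classic Latent Semantic Analysis construction.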
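Finally, the Brown-cluster bit vectors mentioned above can be illustrated once cluster paths are available. Running Brown clustering itself is beyond a short sketch; the bit-string assignments below are hypothetical stand-ins for the paths a real Brown clustering run over a corpus would produce.

```python
# Hypothetical Brown-cluster paths: each bit string records the left/right
# turns from the root of the merge hierarchy down to the word's cluster.
# These assignments are invented for illustration only.
brown_paths = {
    "apple": "0010",
    "pear":  "0011",
    "run":   "110",
    "walk":  "111",
}

def bit_vector(word, length=4):
    """Pad a word's Brown path to a fixed-length binary feature vector."""
    padded = brown_paths[word].ljust(length, "0")
    return [int(b) for b in padded]

def prefix_features(word, prefixes=(2, 4)):
    """Path prefixes at several depths, a common feature scheme: shorter
    prefixes name coarser clusters, longer prefixes name finer ones."""
    path = brown_paths[word]
    return {f"brown_{p}": path[:p] for p in prefixes}
```

Words in the same subtree share prefixes — here "apple" and "pear" share the prefix "001" — which is what lets these binary vectors serve as a coarse syntactic/semantic representation.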
Author
冯志伟
FENG Zhiwei (Heilongjiang University, Harbin, Heilongjiang 150080, China; Institute of Applied Linguistics, Ministry of Education, Beijing 100010, China)
Source
《外语电化教学》
CSSCI
Peking University Core Journal (北大核心)
2021, No. 1, pp. 18-26, 3 (10 pages in total)
Technology Enhanced Foreign Language Education
Keywords
Word Vector
Neural Network
Skip-Gram
CBOW
Singular Value Decomposition (SVD)
Brown Clustering