Abstract: Word embeddings grounded in distributional semantics encode rich semantic information, and their emergence has, to some extent, marked the entry of natural language processing and computational linguistics into the era of large language models (LLMs). Because word embeddings are directly computable, a variety of semantic computing tasks built on them have gradually emerged, among which semantic relation discrimination is an important one. In this study, we use two sets of pretrained vectors, the fastText Chinese word embeddings and the Tencent Chinese word embeddings, to compute cosine similarity as a measure of the strength of semantic association between words, and we reach the following conclusions. First, the fastText and Tencent Chinese embeddings differ to some extent in discriminating four types of Chinese semantic relations, namely synonymy, antonymy, hyponymy, and meronymy. Second, a comparison of Spearman correlation coefficients shows that, on our experimental data, the fastText embeddings have acquired stronger knowledge of semantic similarity between words, while the Tencent embeddings have acquired stronger knowledge of semantic relatedness. Third, in the antonym discrimination task, both the fastText and Tencent Chinese embeddings assign very high cosine similarity values to highly conventionalized antonym pairs.
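To make the method concrete, here is a minimal Python sketch of the two computational steps the abstract describes: cosine similarity between pretrained Chinese word vectors as the measure of association strength, and a Spearman correlation against human ratings as the evaluation. It assumes the vectors are loaded with gensim from the public fastText release (cc.zh.300.vec); the word pairs and gold ratings below are illustrative placeholders, not the paper's experimental data. The Tencent embeddings load the same way from their released word2vec-format text file.

```python
import numpy as np
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors, used here as
    the measure of semantic association strength between words."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Both embedding sets are distributed in word2vec text format;
# the file name below is the public fastText Chinese release.
vectors = KeyedVectors.load_word2vec_format("cc.zh.300.vec", binary=False)

# Illustrative word pairs for the four relation types under study.
pairs = {
    "synonymy": ("高兴", "快乐"),   # happy / joyful
    "antonymy": ("大", "小"),       # big / small
    "hyponymy": ("狗", "动物"),     # dog / animal
    "meronymy": ("轮子", "汽车"),   # wheel / car
}
for relation, (w1, w2) in pairs.items():
    sim = cosine_similarity(vectors[w1], vectors[w2])
    print(f"{relation}: cos({w1}, {w2}) = {sim:.3f}")

# Evaluation sketch: Spearman's rho between model similarities and
# human ratings. `gold` is a hypothetical placeholder dataset, not
# the benchmark used in the paper.
gold = [("高兴", "快乐", 9.1), ("大", "小", 3.2),
        ("狗", "动物", 7.0), ("轮子", "汽车", 6.5)]
model_scores = [cosine_similarity(vectors[a], vectors[b]) for a, b, _ in gold]
human_scores = [r for _, _, r in gold]
rho, p = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p:.3g})")
```

Running the same loop over the Tencent vectors and comparing the two Spearman coefficients against similarity-focused versus relatedness-focused benchmarks is what distinguishes the two kinds of knowledge the abstract reports.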