Funding: This work is supported by the National Natural Science Foundation of China (No. 61801440), the High-quality and Cutting-edge Disciplines Construction Project for Universities in Beijing (Internet Information, Communication University of China), the State Key Laboratory of Media Convergence and Communication (Communication University of China), and the Fundamental Research Funds for the Central Universities.
Abstract: Word similarity (WS) is a fundamental and critical task in natural language processing. Existing approaches to WS mainly calculate the similarity or relatedness of word pairs based on word embeddings obtained from massive, high-quality corpora. However, they may perform poorly when the corpus in a specific field is insufficient, and they cannot capture rich semantic and sentiment information. To address these problems, we propose an enhanced embedding-based word similarity evaluation model with character-word concepts and synonym knowledge, namely the EWS-CS model, which provides extra semantic information to enhance word similarity evaluation. The core of our approach consists of a knowledge encoder and a word encoder. In the knowledge encoder, we incorporate semantic knowledge extracted from knowledge resources, including character-word concepts, synonyms, and sentiment lexicons, to obtain a knowledge representation. The word encoder learns an enhanced embedding-based word representation from a pre-trained model and the knowledge representation, guided by the similarity task. Finally, experiments on four similarity evaluation datasets validate the effectiveness of our EWS-CS model on the WS task in comparison with baseline models.
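The abstract does not specify how the two encoders are fused, so the following is only a minimal sketch of the general idea: score a word pair on the concatenation of a pre-trained embedding and a knowledge-derived embedding. The `alpha` blending weight and both lookup tables are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def enhanced_similarity(w1, w2, pretrained, knowledge, alpha=0.5):
    """Score a word pair on embeddings fused from a pre-trained table
    and a knowledge-encoder table (hypothetical fusion scheme)."""
    e1 = np.concatenate([alpha * pretrained[w1], (1 - alpha) * knowledge[w1]])
    e2 = np.concatenate([alpha * pretrained[w2], (1 - alpha) * knowledge[w2]])
    return cosine(e1, e2)
```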
Abstract: Text similarity has a relatively wide range of applications in many fields, such as intelligent information retrieval, question answering systems, text rechecking, machine translation, and so on. Meaning-based text similarity computing has been widely used in the similarity computing of words and phrases. Using the knowledge structure of HowNet and its method of knowledge description, taking into account other factors and weights that influence similarity, and making full use of the depth and density of the Concept-Sememe tree, this paper provides an improved method of Chinese word similarity calculation based on semantic distance. Finally, the effectiveness of this method is verified by simulation results.
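As a concrete illustration, here is a minimal sketch of distance-based sememe similarity using the common sim = α/(d + α) mapping (α = 1.6 is the conventional constant in HowNet-based similarity work); the paper's depth and density refinements are not reproduced. The `parent` dictionary encoding the Concept-Sememe tree is an assumption.

```python
def path_to_root(node, parent):
    """Chain of ancestors from a sememe node up to the tree root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def semantic_distance(a, b, parent):
    """Edge count between two sememes via their lowest common ancestor."""
    pa, pb = path_to_root(a, parent), path_to_root(b, parent)
    rank = {n: i for i, n in enumerate(pa)}
    for j, n in enumerate(pb):
        if n in rank:
            return rank[n] + j
    return len(pa) + len(pb)  # no common ancestor: maximal distance

def sememe_similarity(a, b, parent, alpha=1.6):
    """Map semantic distance to similarity: sim = alpha / (d + alpha)."""
    return alpha / (semantic_distance(a, b, parent) + alpha)
```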
Funding: The research work is supported by the National Key R&D Program of China under Grant No. 2018YFC0831704, the National Natural Science Foundation of China under Grant No. 61502259, the Natural Science Foundation of Shandong Province under Grant No. ZR2017MF056, and the Taishan Scholar Program of Shandong Province in China (directed by Prof. Yinglong Wang).
Abstract: Word sense disambiguation (WSD) is a fundamental and significant task in natural language processing that directly affects the performance of downstream applications. However, WSD is very challenging due to the knowledge bottleneck problem, i.e., it is hard to acquire abundant disambiguation knowledge, especially in Chinese. To solve this problem, this paper proposes a graph-based Chinese WSD method with multi-knowledge integration. In particular, a graph model is designed that combines various Chinese and English knowledge resources through word sense mapping. Firstly, the content words in a Chinese ambiguous sentence are extracted and mapped to English words with BabelNet. Then, English word similarity is computed based on English word embeddings and a knowledge base, while Chinese word similarity is evaluated with Chinese word embeddings and HowNet, respectively. The weights of the three kinds of word similarity are optimized with a simulated annealing algorithm to obtain the overall similarities, which are used to construct a disambiguation graph. A graph scoring algorithm evaluates the importance of each word sense node and judges the correct senses of the ambiguous words. Extensive experimental results on the SemEval dataset show that the proposed WSD method significantly outperforms the baselines.
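A minimal sketch of the weight-tuning step, under stated assumptions: simulated annealing searches over three non-negative weights that sum to one, and `score_fn` (assumed, not shown) returns disambiguation accuracy on a development set for a given weighting. The step size and cooling schedule are illustrative choices.

```python
import math
import random

def combined_similarity(sims, weights):
    """Convex combination of English-embedding, Chinese-embedding,
    and HowNet similarities for one word pair."""
    return sum(w * s for w, s in zip(weights, sims))

def anneal_weights(score_fn, steps=1000, t0=1.0, cooling=0.995):
    """Simulated annealing over three similarity weights on the simplex."""
    cur = best = [1 / 3, 1 / 3, 1 / 3]
    cur_score = best_score = score_fn(cur)
    t = t0
    for _ in range(steps):
        cand = [max(x + random.gauss(0, 0.05), 1e-6) for x in cur]
        cand = [x / sum(cand) for x in cand]  # renormalize to sum to 1
        s = score_fn(cand)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature cools.
        if s > cur_score or random.random() < math.exp((s - cur_score) / t):
            cur, cur_score = cand, s
            if s > best_score:
                best, best_score = cand[:], s
        t *= cooling
    return best, best_score
```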
Funding: Supported by the National Natural Science Foundation of China (61771051, 61675025).
Abstract: Two learning models, Zolu-continuous bag of words (ZL-CBOW) and Zolu-skip-gram (ZL-SG), based on the Zolu function are proposed. The Zolu function changes the slope of the ReLU used in word2vec. The proposed models can process extremely large data sets, as word2vec does, without increasing the complexity. The models also outperform several word embedding methods in both word similarity and syntactic accuracy. ZL-CBOW outperforms CBOW in accuracy by 8.43% on the capital-world training set and by 1.24% on the plural-verbs training set. Moreover, experimental simulations on word similarity and syntactic accuracy show that ZL-CBOW and ZL-SG are superior to LL-CBOW and LL-SG, respectively.
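The abstract does not define the Zolu function itself, so the sketch below only marks where an alternative activation would replace ReLU in a CBOW-style forward pass; `zolu` is a deliberately unimplemented placeholder, and the overall layout is an assumption rather than the authors' model.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def zolu(x):
    # Hypothetical stand-in: the Zolu function is not defined in the abstract.
    raise NotImplementedError("Zolu's definition is not given here")

def cbow_forward(context_ids, W_in, W_out, activation=relu):
    """One CBOW-style forward pass: average the context vectors, apply the
    chosen activation to the hidden layer, and score the vocabulary."""
    h = activation(W_in[context_ids].mean(axis=0))
    logits = W_out @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax over the vocabulary
```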
Funding: Project (60763001) supported by the National Natural Science Foundation of China; Project (2010GZS0072) supported by the Natural Science Foundation of Jiangxi Province, China; Project (GJJ12271) supported by the Science and Technology Foundation of the Provincial Education Department of Jiangxi Province, China.
Abstract: The category-based statistical language model is an important method for solving the problem of sparse data, but it has two bottlenecks: 1) word clustering, for which it is hard to find a suitable method with good performance and low computational cost; 2) class-based methods always lose prediction ability when adapting to text in different domains. To solve these problems, a definition of word similarity utilizing mutual information is presented, and on this basis a definition of word set similarity is given. Experiments show that the similarity-based word clustering algorithm is better than the conventional greedy clustering method in both speed and performance, reducing the perplexity from 283 to 218. In addition, an absolute weighted difference method is presented and used to construct a vari-gram language model with good prediction ability. Compared with the category-based model, the perplexity of the vari-gram model is reduced from 234.65 to 219.14 on Chinese corpora and from 195.56 to 184.25 on English corpora.
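The abstract does not give the exact similarity definition, so the following is a minimal sketch of one common mutual-information-based formulation (in the spirit of Lin-style distributional similarity): two words are similar to the degree that their positive-PMI mass falls on shared context words. The co-occurrence and count structures are assumptions.

```python
import math

def pmi(cooc, w_count, c_count, total):
    """Pointwise mutual information between a word and a context word."""
    return math.log(cooc * total / (w_count * c_count))

def word_similarity(w1, w2, cooc, counts, total):
    """Share of positive-PMI mass that w1 and w2 place on common contexts.
    cooc[w] maps context words to co-occurrence counts; counts[w] is the
    unigram count of w; total is the corpus size."""
    def mass(w, contexts):
        return sum(max(pmi(n, counts[w], counts[c], total), 0.0)
                   for c, n in contexts)
    c1, c2 = cooc[w1], cooc[w2]
    shared = set(c1) & set(c2)
    num = mass(w1, [(c, c1[c]) for c in shared]) + \
          mass(w2, [(c, c2[c]) for c in shared])
    den = mass(w1, c1.items()) + mass(w2, c2.items())
    return num / den if den else 0.0
```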
Funding: Supported by the Foundation of the State Key Laboratory of Software Development Environment (No. SKLSDE-2015ZX-04).
Abstract: Long-document semantic measurement has great significance in many applications, such as semantic search, plagiarism detection, and automatic technical surveys. However, research efforts have mainly focused on the semantic similarity of short texts; document-level semantic measurement remains an open issue due to problems such as the omission of background knowledge and topic transition. In this paper, we propose a novel semantic matching method for long documents in the academic domain. To accurately represent the general meaning of an academic article, we construct a semantic profile in which key semantic elements such as the research purpose, methodology, and domain are included and enriched. We can then obtain the overall semantic similarity of two papers by computing the distance between their profiles, where the distances between the concepts of two different semantic profiles are measured by word vectors. To improve the semantic representation quality of the word vectors, we propose a joint word-embedding model that incorporates a domain-specific semantic relation constraint into the traditional context constraint. Our experimental results demonstrate that, in the measurement of document semantic similarity, our approach achieves substantial improvement over state-of-the-art methods, and our joint word-embedding model produces significantly better word representations than traditional word-embedding models.
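A minimal sketch of what such a joint objective could look like, under stated assumptions: a standard context loss (skip-gram or CBOW, computed elsewhere) plus a penalty pulling embeddings of domain-related word pairs together. The squared-distance penalty and `lambda_rel` are illustrative choices, not the paper's exact formulation.

```python
import torch

def joint_loss(context_loss, embedding, relation_pairs, lambda_rel=0.1):
    """context_loss: the usual context-constraint loss for a batch.
    embedding: torch.nn.Embedding holding the word vectors.
    relation_pairs: LongTensor of shape (k, 2) with ids of word pairs
    linked by a domain-specific semantic relation."""
    u = embedding(relation_pairs[:, 0])
    v = embedding(relation_pairs[:, 1])
    relation_penalty = ((u - v) ** 2).sum(dim=1).mean()
    return context_loss + lambda_rel * relation_penalty
```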
Funding: Supported by the Chongqing Education Committee (20SKGH059).
Abstract: To improve the accuracy of text similarity calculation, this paper presents a sentence-vector-based text similarity function, part-of-speech and word-order smooth inverse frequency (PO-SIF), which optimizes the classical SIF calculation method in two aspects: part of speech and word order. The classical SIF algorithm calculates sentence similarity by obtaining a sentence vector through weighting and noise reduction; however, different weighting or noise-reduction methods affect both the efficiency and the accuracy of the similarity calculation. In our proposed PO-SIF, the weight parameters of the SIF sentence vector are first updated by a part-of-speech subtraction factor to identify the most crucial words. Furthermore, PO-SIF calculates sentence vector similarity taking word order into account, which overcomes the drawback of similarity analysis that is based mostly on word frequency. The experimental results validate the ability of our proposed PO-SIF to improve the accuracy of text similarity calculation.
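For reference, a minimal sketch of the classical SIF baseline that PO-SIF builds on (Arora et al.'s smooth inverse frequency): weight each word by a/(a + p(w)), average, then remove the projection onto the first principal component of the sentence matrix. The part-of-speech and word-order adjustments that PO-SIF adds are not reproduced here, and the input structures are assumptions.

```python
import numpy as np

def sif_embeddings(sentences, vecs, word_freq, a=1e-3):
    """sentences: lists of tokens; vecs: dict word -> np.ndarray;
    word_freq: dict word -> corpus count. Returns one row per sentence."""
    total = sum(word_freq.values())
    rows = []
    for sent in sentences:
        words = [w for w in sent if w in vecs]
        weights = np.array([a / (a + word_freq.get(w, 1) / total)
                            for w in words])
        rows.append(weights @ np.array([vecs[w] for w in words]) / len(words))
    X = np.vstack(rows)
    # Remove the common component: projection onto the first right
    # singular vector of the sentence matrix.
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)
```

Sentence similarity is then the cosine between two rows of the returned matrix.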