Funding: This research is supported by the National Science Foundation of China (grant 61772278, author: Qu, W.; grant 61472191, author: Zhou, J.; http://www.nsfc.gov.cn/), the National Social Science Foundation of China (grant 18BYY127, author: Li, B.; http://www.cssn.cn), the Philosophy and Social Science Foundation of Jiangsu Higher Institutions (grant 2019SJA0220, author: Wei, T.; https://jyt.jiangsu.gov.cn), and the Jiangsu Higher Institutions' Excellent Innovative Team for Philosophy and Social Science (grant 2017STD006, author: Gu, W.; https://jyt.jiangsu.gov.cn).
Abstract: The meaning of a word includes both a conceptual meaning and a distributional meaning. Word embeddings based on distribution suffer from insufficient conceptual semantic representation caused by data sparsity, especially for low-frequency words. In knowledge bases, by contrast, manually annotated semantic knowledge is stable, and the essential attributes of words are accurately denoted. In this paper, we propose a Conceptual Semantics Enhanced Word Representation (CEWR) model, which computes the synset and hypernym embeddings of Chinese words based on the Tongyici Cilin thesaurus and aggregates them with distributed word representations, so that both distributional information and conceptual meaning are encoded in the representation of words. We evaluate the CEWR model on two tasks: word similarity computation and short text classification. The Spearman correlation between model results and human judgement is improved to 64.71%, 81.84%, and 85.16% on Wordsim297, MC30, and RG65, respectively. Moreover, CEWR improves the F1 score by 3% in the short text classification task. The experimental results show that CEWR represents words more informatively than distributed word embedding alone, demonstrating that conceptual semantics, especially hypernymous information, is a good complement to distributed word representation.
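The aggregation step can be illustrated with a minimal sketch. This is our hypothetical reading of the idea, not the authors' code: the synset embedding is taken as the mean of the synonym vectors, the hypernym embedding as the hypernym's own vector, and the three parts are combined with assumed weights; all vectors are toy values.

```python
# Hypothetical CEWR-style aggregation: a word's final vector combines its
# distributed embedding with a synset embedding (mean of synonym vectors)
# and a hypernym embedding from a thesaurus such as Tongyici Cilin.

def mean_vec(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Toy distributed embeddings (in practice from word2vec or similar).
dist = {
    "sedan":   [0.9, 0.1, 0.3],
    "car":     [0.8, 0.2, 0.2],
    "vehicle": [0.7, 0.3, 0.1],
}

# Toy thesaurus knowledge: synset members and one hypernym per word.
synset = {"sedan": ["sedan", "car"]}
hypernym = {"sedan": "vehicle"}

def cewr(word, alpha=0.5, beta=0.25, gamma=0.25):
    """Weighted aggregation of distributed, synset, and hypernym embeddings
    (the weights are our assumption, not the paper's)."""
    d = dist[word]
    s = mean_vec([dist[w] for w in synset[word]])
    h = dist[hypernym[word]]
    return [alpha * di + beta * si + gamma * hi for di, si, hi in zip(d, s, h)]

print(cewr("sedan"))  # conceptual knowledge pulls "sedan" toward its synset and hypernym
```

With the toy values above, the low-frequency word inherits part of its meaning from its stable thesaurus neighbours, which is the intended remedy for data sparsity.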
Funding: Supported by the National Natural Science Foundation of China under Grants No. 71271044, No. U1233118, and No. 71572029.
Abstract: Online social media carry massive numbers of messages relevant to organizational events, and well-categorized event information can be useful in many real-world applications. In this paper, we propose a research framework to extract high-quality event information from massive online media data. The main contributions lie in two aspects: first, we present an event-extraction and event-categorization system for online media data; second, we present a novel approach for both discovering important event categories and classifying extracted events based on word representations and a clustering model. Experimental results on a real dataset show that the proposed framework is effective in extracting high-quality event information.
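The category-discovery idea can be sketched with a toy example. This is a hedged stand-in, not the paper's system: event messages are embedded by averaging word vectors, and a single-pass threshold clustering discovers categories; the word vectors, messages, and threshold `tau` are all invented values.

```python
# Discover event categories by clustering embedded event messages:
# a message joins the closest existing category whose centroid similarity
# exceeds tau, otherwise it founds a new category.
import math

word_vec = {
    "layoff":      [1.0, 0.0],
    "cuts":        [0.9, 0.1],
    "merger":      [0.0, 1.0],
    "acquisition": [0.1, 0.9],
}

def embed(words):
    """Average the word vectors of a message (toy word representation)."""
    vs = [word_vec[w] for w in words if w in word_vec]
    return [sum(c) / len(vs) for c in zip(*vs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def centroid(cluster):
    return [sum(c) / len(cluster) for c in zip(*cluster)]

def discover_categories(messages, tau=0.8):
    clusters = []  # each cluster is a list of message vectors
    for msg in messages:
        v = embed(msg)
        best, best_sim = None, tau
        for c in clusters:
            sim = cosine(v, centroid(c))
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([v])
        else:
            best.append(v)
    return clusters

events = [["layoff", "cuts"], ["cuts"], ["merger", "acquisition"]]
print(len(discover_categories(events)))  # prints 2
```

The two layoff-related messages fall into one category and the merger message founds another, mirroring how clustering over word representations can surface event categories without predefined labels.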
Funding: This work is supported by the National Natural Science Foundation of China (No. 61801440), the High-quality and Cutting-edge Disciplines Construction Project for Universities in Beijing (Internet Information, Communication University of China), the State Key Laboratory of Media Convergence and Communication (Communication University of China), and the Fundamental Research Funds for the Central Universities.
Abstract: Word similarity (WS) is a fundamental and critical task in natural language processing. Existing approaches to WS mainly calculate the similarity or relatedness of word pairs based on word embeddings obtained from massive, high-quality corpora. However, they may perform poorly when the corpus for a specific field is insufficient, and they cannot capture rich semantic and sentiment information. To address these problems, we propose a model that enhances embedding-based word similarity evaluation with character-word concepts and synonym knowledge, namely the EWS-CS model, which provides extra semantic information to strengthen word similarity evaluation. The core of our approach consists of a knowledge encoder and a word encoder. The knowledge encoder incorporates semantic knowledge extracted from knowledge resources, including character-word concepts, synonyms, and sentiment lexicons, to obtain a knowledge representation. The word encoder learns an enhanced embedding-based word representation from a pre-trained model and the knowledge representation, trained on the similarity task. Compared with baseline models, experiments on four similarity evaluation datasets validate the effectiveness of our EWS-CS model on the WS task.
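The combination of pre-trained and knowledge representations can be sketched as follows. This is an illustrative simplification under our own assumptions (the actual encoders are learned): the knowledge vector here is just the mean of synonym embeddings plus a sentiment-polarity feature, concatenated with the pre-trained vector before scoring pairs by cosine similarity; all embeddings and lexicon entries are toy values.

```python
# Build a knowledge vector for a word from its synonyms and a sentiment
# lexicon, concatenate it with the pre-trained embedding, and score word
# pairs by cosine similarity over the enhanced vectors.
import math

pretrained = {
    "happy": [0.8, 0.1],
    "glad":  [0.7, 0.2],
    "sad":   [0.1, 0.9],
}
synonyms = {"happy": ["glad"], "glad": ["happy"], "sad": []}
sentiment = {"happy": 1.0, "glad": 1.0, "sad": -1.0}  # toy polarity lexicon

def knowledge_vec(word):
    """Mean of synonym embeddings (zeros if none) plus a sentiment feature."""
    syn = [pretrained[s] for s in synonyms[word]]
    base = [sum(c) / len(syn) for c in zip(*syn)] if syn else [0.0, 0.0]
    return base + [sentiment[word]]

def enhanced(word):
    """Concatenate the pre-trained and knowledge representations."""
    return pretrained[word] + knowledge_vec(word)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine(enhanced("happy"), enhanced("glad")) >
      cosine(enhanced("happy"), enhanced("sad")))  # prints True
```

Even in this crude form, the synonym and sentiment features push "happy" closer to "glad" and further from "sad" than raw corpus vectors alone would.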
Abstract: As a key technology for rapid and low-cost drug development, drug repositioning is gaining popularity. In this study, we tested a text mining approach to the discovery of unknown drug-disease relations. Using a word embedding algorithm, the senses of over 1.7 million words were well represented in sufficiently short feature vectors. The feasibility of our approach was tested through various analyses, including clustering and classification. Finally, our trained classification model achieved 87.6% accuracy in predicting drug-disease relations in cancer treatment and succeeded in discovering novel drug-disease relations that were actually reported in recent studies.
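To make the embedding-based relation idea concrete, here is a deliberately simplified stand-in. The study trains a classifier over embedding features; this sketch substitutes the simplest possible scorer, a thresholded cosine similarity between a drug vector and a disease vector, and all vectors and the threshold are invented toy values.

```python
# Predict a candidate drug-disease relation when the two word embeddings
# are close enough in the vector space (toy stand-in for a trained model).
import math

emb = {
    "tamoxifen":     [0.9, 0.2, 0.1],
    "breast_cancer": [0.8, 0.3, 0.2],
    "influenza":     [0.1, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def related(drug, disease, threshold=0.7):
    """Flag a candidate relation when the embeddings are sufficiently similar."""
    return cosine(emb[drug], emb[disease]) >= threshold

print(related("tamoxifen", "breast_cancer"), related("tamoxifen", "influenza"))
# prints True False
```

In practice such similarity scores (or the vectors themselves) would feed a trained classifier, as in the study, rather than a fixed threshold.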
Funding: Project supported by the National Natural Science Foundation of China (Nos. 61663041 and 61763041), the Program for Changjiang Scholars and Innovative Research Team in Universities, China (No. IRT_15R40), the Research Fund for the Chunhui Program of the Ministry of Education of China (No. Z2014022), the Natural Science Foundation of Qinghai Province, China (No. 2014-ZJ-721), and the Fundamental Research Funds for the Central Universities, China (No. 2017TS045).
Abstract: Most word embedding models have the following problems: (1) in models based on bag-of-words contexts, the structural relations of sentences are completely neglected; (2) each word uses a single embedding, which makes the model indiscriminative for polysemous words; (3) word embeddings tend to reflect only the contextual structure similarity of sentences. To solve these problems, we propose an easy-to-use representation algorithm for syntactic word embedding (SWE). The main procedures are: (1) a polysemy tagging algorithm based on latent Dirichlet allocation (LDA) is used for polysemous representation; (2) the symbols '+' and '-' are adopted to indicate the directions of dependency syntax; (3) stopwords and their dependencies are deleted; (4) dependency skip is applied to connect indirect dependencies; (5) dependency-based contexts are input to a word2vec model. Experimental results show that our model generates desirable word embeddings in similarity evaluation tasks. Moreover, semantic and syntactic features can be captured from dependency-based syntactic contexts, exhibiting less topical and more syntactic similarity. We conclude that SWE outperforms single-embedding learning models.
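Steps (2)-(4) can be sketched in a few lines. This is our reading of the procedure, not the authors' implementation, and the relation-label format (e.g. `prep:pobj`) is an assumption: edges carry '+'/'-' direction markers, stopword nodes are deleted, and dependency skip reconnects a removed node's head to its dependents with a merged relation label.

```python
# Build directed, stopword-skipped dependency contexts for word2vec.

def skip_stopwords(edges, stopwords):
    """edges: (head, relation, dependent) triples. Delete stopword nodes and
    reconnect through them with merged relation labels (dependency skip)."""
    edges = list(edges)
    for sw in stopwords:
        heads = [(h, r) for h, r, d in edges if d == sw]
        deps = [(r, d) for h, r, d in edges if h == sw]
        edges = [(h, r, d) for h, r, d in edges if h != sw and d != sw]
        edges += [(h, r1 + ":" + r2, d) for h, r1 in heads for r2, d in deps]
    return edges

def contexts(edges):
    """Directed contexts: '+' on the head side, '-' on the dependent side."""
    ctx = {}
    for h, r, d in edges:
        ctx.setdefault(h, []).append("+" + r + "_" + d)
        ctx.setdefault(d, []).append("-" + r + "_" + h)
    return ctx

# Toy parse of "John ate with a fork" (determiner omitted for brevity).
parse = [("ate", "nsubj", "John"), ("ate", "prep", "with"), ("with", "pobj", "fork")]
print(contexts(skip_stopwords(parse, {"with"})))
```

After skipping the stopword "with", the verb "ate" receives the indirect context `+prep:pobj_fork`, so the (word, context) pairs fed to word2vec carry both direction and merged-relation information.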
Funding: Supported by the National High-Tech Research and Development (863) Program (No. 2015AA015401), the National Natural Science Foundation of China (Nos. 61533018 and 61402220), the State Scholarship Fund of CSC (No. 201608430240), the Philosophy and Social Science Foundation of Hunan Province (No. 16YBA323), and the Scientific Research Fund of the Hunan Provincial Education Department (Nos. 16C1378 and 14B153).
Abstract: Word embedding has drawn much attention due to its usefulness in many NLP tasks. So far, a handful of neural-network-based word embedding algorithms have been proposed without considering the effects of pronouns in the training corpus. In this paper, we propose using co-reference resolution to improve word embeddings by extracting better contexts. We evaluate four word embeddings trained with co-reference resolution and compare their quality on word analogy and word similarity tasks over multiple datasets. Experiments show that by using co-reference resolution, word embedding performance on the word analogy task can be improved by around 1.88%. We find that words that are names of countries are affected the most, which is as expected.
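The effect on training contexts can be illustrated with a deliberately naive stand-in for co-reference resolution (real systems use trained resolvers such as those in Stanford CoreNLP): each pronoun is replaced by the most recent mention of a known entity, so the entity, rather than the pronoun, appears in the embedding training contexts. The token sequence and entity set are toy inputs.

```python
# Replace pronouns with their most recent entity antecedent before
# building word-embedding training contexts (naive heuristic, for
# illustration only; not a real co-reference resolver).

PRONOUNS = {"it", "he", "she", "they"}

def resolve(tokens, entities):
    last_entity = None
    resolved = []
    for tok in tokens:
        if tok in entities:
            last_entity = tok
            resolved.append(tok)
        elif tok.lower() in PRONOUNS and last_entity is not None:
            resolved.append(last_entity)  # substitute the antecedent
        else:
            resolved.append(tok)
    return resolved

print(resolve(["France", "is", "large", ".", "It", "has", "vineyards"], {"France"}))
```

After substitution, "France" co-occurs with "vineyards" in the training window, which is consistent with the paper's observation that country names benefit most from resolving pronouns.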