Abstract: How to select appropriate words in a translation is a significant problem in current studies of machine translation, because it directly determines translation quality. This paper uses an unsupervised corpus-based statistical method to select target words. Based on co-occurrence probabilities, all ambiguous words in a sentence are disambiguated at the same time. Because a corpus of limited size cannot cover all word collocations, we use an effective smoothing method to increase the coverage of the corpus. To solve the problem in our English-Chinese MT system, we have applied the algorithm to disambiguate the senses of verbs, nouns and adjectives in the target language, and the results show that the approach is very promising.
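Illustrative only, not the paper's exact algorithm: the following minimal Python sketch shows how all ambiguous words in a sentence could be disambiguated jointly from smoothed co-occurrence probabilities. The window size, add-alpha smoothing formula, and all function names are our own assumptions.

```python
import math
from itertools import product
from collections import Counter

def cooccurrence_counts(corpus_sentences, window=5):
    """Count pair and unigram frequencies in a target-language corpus (assumed window size)."""
    pair_counts, word_counts = Counter(), Counter()
    for sent in corpus_sentences:
        for i, w in enumerate(sent):
            word_counts[w] += 1
            for v in sent[i + 1:i + 1 + window]:
                pair_counts[frozenset((w, v))] += 1
    return pair_counts, word_counts

def smoothed_prob(w, v, pair_counts, word_counts, alpha=0.5):
    """Add-alpha smoothing: unseen collocations still get a small non-zero probability."""
    vocab = max(len(word_counts), 1)
    return (pair_counts[frozenset((w, v))] + alpha) / (word_counts[w] + alpha * vocab)

def select_targets(candidate_sets, pair_counts, word_counts):
    """Jointly pick one target word per ambiguous source word by maximizing
    the sum of log smoothed co-occurrence probabilities over all chosen pairs."""
    best, best_score = None, float("-inf")
    for combo in product(*candidate_sets):
        score = sum(
            math.log(smoothed_prob(w, v, pair_counts, word_counts))
            for i, w in enumerate(combo) for v in combo[i + 1:]
        )
        if score > best_score:
            best, best_score = combo, score
    return best
```

Scoring every combination is exponential in the number of ambiguous words; a practical system would prune candidates or use a dynamic-programming search.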
Funding: Sponsored by the Basic Research Development Program of China (Grant No. 2013CB03554) and the Fundamental Research Funds for Universities, Central South University (Grant No. 2017zzts394).
Abstract: Natural language processing has made great progress recently, and controlling robots with spoken natural language has become a realistic prospect. With the reliability of this kind of control in mind, a natural-language instruction should be confirmed before the robot carries it out autonomously, and a prototype dialog system was designed accordingly; this raised the standardization problem for natural and understandable language interaction. In the application setting of remotely navigating a mobile robot inside a building with spoken Chinese, and considering that a place name, an important navigation element in instructions, can be expressed with different lexical terms in spoken language, this paper proposes a model for substituting the different variants of a place name with a standard one (called standardization). First, a CRF (Conditional Random Fields) model is trained to label the term to be standardized; then a trained word embedding model represents lexical terms as numerical vectors. In this vector space, the similarity of lexical terms is defined and used to find the term most similar to the one picked out for standardization. Experiments show that the proposed method works well, and the dialog system's confirmation responses are natural and understandable.
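A minimal sketch of the similarity-matching step, assuming term vectors are already available from a trained word-embedding model and that the CRF tagger has already picked out the term to standardize; the function and variable names are hypothetical, not taken from the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def standardize(term, standard_names, embed):
    """Replace a spoken-language variant of a place name (already located, e.g. by a CRF tagger)
    with the most similar entry from the standard place-name list.
    `embed` maps a lexical term to its vector, e.g. lookups from a trained word2vec model."""
    if term not in embed:
        return term
    scored = [(name, cosine(embed[term], embed[name])) for name in standard_names if name in embed]
    return max(scored, key=lambda s: s[1])[0] if scored else term

# Hypothetical usage: vectors would come from the trained embedding model.
# embed = {"大会议室": vec1, "三楼会议室": vec2, "办公室": vec3}
# standardize("大会议室", ["三楼会议室", "办公室"], embed)
```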
Abstract: Global integration has brought China closer to the rest of the world, especially after China succeeded in entering the WTO and was entitled to host the 2008 Olympic Games in Beijing and the 2010 World Exposition in Shanghai. With the frequent exchanges between China and other countries in recent decades, more and more international friends have come to China. Meanwhile, more and more bilingual sign words appear in the big cities of China to facilitate international friends' stay. Therefore, the English translations of sign words are of paramount importance to international tourists.
Funding: Project supported by the National Natural Science Foundation of China (Nos. 61663041 and 61763041), the Program for Changjiang Scholars and Innovative Research Team in Universities, China (No. IRT_15R40), the Research Fund for the Chunhui Program of the Ministry of Education of China (No. Z2014022), the Natural Science Foundation of Qinghai Province, China (No. 2014-ZJ-721), and the Fundamental Research Funds for the Central Universities, China (No. 2017TS045).
Abstract: Most word embedding models have the following problems: (1) in models based on bag-of-words contexts, the structural relations of sentences are completely neglected; (2) each word uses a single embedding, which makes the model indiscriminative for polysemous words; (3) word embeddings easily tend toward the contextual structural similarity of sentences. To solve these problems, we propose an easy-to-use representation algorithm of syntactic word embedding (SWE). The main procedures are: (1) a polysemous tagging algorithm based on latent Dirichlet allocation (LDA) is used for polysemous representation; (2) the symbols '+' and '-' are adopted to indicate the directions of dependency syntax; (3) stopwords and their dependencies are deleted; (4) dependency skip is applied to connect indirect dependencies; (5) dependency-based contexts are fed into a word2vec model. Experimental results show that our model generates desirable word embeddings in similarity evaluation tasks. Moreover, semantic and syntactic features can be captured from dependency-based syntactic contexts, exhibiting less topical and more syntactic similarity. We conclude that SWE outperforms single-embedding learning models.
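A simplified sketch of step (2), building direction-marked dependency-based contexts; it omits the LDA polysemous tagging and the dependency-skip step, and the input format and naming are our own assumptions rather than the paper's. In practice such (word, context) pairs are trained with a word2vec variant that accepts arbitrary contexts instead of linear windows.

```python
def dependency_contexts(parsed_sentence, stopwords=frozenset()):
    """Build dependency-based (word, context) training pairs.
    `parsed_sentence` is a list of (token, head_index, relation) triples from any
    dependency parser; head_index is 0-based, with -1 for the root.
    '+' marks a governing context, '-' a governed one (the direction of the dependency)."""
    pairs = []
    for i, (tok, head, rel) in enumerate(parsed_sentence):
        if tok in stopwords or head < 0:
            continue
        head_tok = parsed_sentence[head][0]
        if head_tok in stopwords:
            continue
        pairs.append((tok, "+" + rel + "_" + head_tok))   # word paired with its governor
        pairs.append((head_tok, "-" + rel + "_" + tok))   # governor paired with its dependent
    return pairs

# Hypothetical usage for "She reads books":
# parsed = [("She", 1, "nsubj"), ("reads", -1, "root"), ("books", 1, "dobj")]
# dependency_contexts(parsed)
# -> [("She", "+nsubj_reads"), ("reads", "-nsubj_She"),
#     ("books", "+dobj_reads"), ("reads", "-dobj_books")]
```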