This paper describes the experiments with Korean-to-Vietnamese statistical machine translation(SMT). The fact that Korean is a morphologically complex language that does not have clear optimal word boundaries causes a...This paper describes the experiments with Korean-to-Vietnamese statistical machine translation(SMT). The fact that Korean is a morphologically complex language that does not have clear optimal word boundaries causes a major problem of translating into or from Korean. To solve this problem, we present a method to conduct a Korean morphological analysis by using a pre-analyzed partial word-phrase dictionary(PWD).Besides, we build a Korean-Vietnamese parallel corpus for training SMT models by collecting text from multilingual magazines. Then, we apply such a morphology analysis to Korean sentences that are included in the collected parallel corpus as a preprocessing step. The experiment results demonstrate a remarkable improvement of Korean-to-Vietnamese translation quality in term of bi-lingual evaluation understudy(BLEU).展开更多
This paper proposed a method to incorporate syntax-based language models in phrase-based statistical machine translation (SMT) systems. The syntax-based language model used in this paper is based on link grammar,which...This paper proposed a method to incorporate syntax-based language models in phrase-based statistical machine translation (SMT) systems. The syntax-based language model used in this paper is based on link grammar,which is a high lexical formalism. In order to apply language models based on link grammar in phrase-based models,the concept of linked phrases,an extension of the concept of traditional phrases in phrase-based models was brought out. Experiments were conducted and the results showed that the use of syntax-based language models could improve the performance of the phrase-based models greatly.展开更多
A novel model based on structure alignments is proposed for statistical machine translation in this paper. Meta-structure and sequence of meta-structure for a parse tree are defined. During the translation process, a ...A novel model based on structure alignments is proposed for statistical machine translation in this paper. Meta-structure and sequence of meta-structure for a parse tree are defined. During the translation process, a parse tree is decomposed to deal with the structure divergence and the alignments can be constructed at different levels of recombination of meta-structure (RM). This method can perform the structure mapping across the sub-tree structure between languages. As a result, we get not only the translation for the target language, but sequence of meta-stmctu .re of its parse tree at the same time. Experiments show that the model in the framework of log-linear model has better generative ability and significantly outperforms Pharaoh, a phrase-based system.展开更多
Lexicalized reordering models are very important components of phrasebased translation systems.By examining the reordering relationships between adjacent phrases,conventional methods learn these models from the word a...Lexicalized reordering models are very important components of phrasebased translation systems.By examining the reordering relationships between adjacent phrases,conventional methods learn these models from the word aligned bilingual corpus,while ignoring the effect of the number of adjacent bilingual phrases.In this paper,we propose a method to take the number of adjacent phrases into account for better estimation of reordering models.Instead of just checking whether there is one phrase adjacent to a given phrase,our method firstly uses a compact structure named reordering graph to represent all phrase segmentations of a parallel sentence,then the effect of the adjacent phrase number can be quantified in a forward-backward fashion,and finally incorporated into the estimation of reordering models.Experimental results on the NIST Chinese-English and WMT French-Spanish data sets show that our approach significantly outperforms the baseline method.展开更多
Companies like Google, MSN and Yahoo provide translation services on their websites, generating translations based on statistical bilingual text corpora. Human translation seems to be inferior in face of huge amount o...Companies like Google, MSN and Yahoo provide translation services on their websites, generating translations based on statistical bilingual text corpora. Human translation seems to be inferior in face of huge amount of information and fast development of computer science. Despite the functions and versatility of statistical machine translation, it may never take the place of human effort. Teachers are supposed to guide the students in using online translation system.展开更多
Retelling extraction is an important branch of Natural Language Processing(NLP),and high-quality retelling resources are very helpful to improve the performance of machine translation.However,traditional methods based...Retelling extraction is an important branch of Natural Language Processing(NLP),and high-quality retelling resources are very helpful to improve the performance of machine translation.However,traditional methods based on the bilingual parallel corpus often ignore the document background in the process of retelling acquisition and application.In order to solve this problem,we introduce topic model information into the translation mode and propose a topic-based statistical machine translation method to improve the translation performance.In this method,Probabilistic Latent Semantic Analysis(PLSA)is used to obtains the co-occurrence relationship between words and documents by the hybrid matrix decomposition.Then we design a decoder to simplify the decoding process.Experiments show that the proposed method can effectively improve the accuracy of translation.展开更多
Unknown words are one of the key factors that greatly affect the translation quality. Traditionally, nearly all the related researches focus on obtaining the translation of the unknown words. However, these approaches...Unknown words are one of the key factors that greatly affect the translation quality. Traditionally, nearly all the related researches focus on obtaining the translation of the unknown words. However, these approaches have two disadvantages. On the one hand, they usually rely on many additional resources such as bilingual web data; on the other hand, they cannot guarantee good reordering and lexical selection of surrounding words. This paper gives a new perspective on handling unknown words in statistical machine translation (SMT). Instead of making great efforts to find the translation of unknown words, we focus on determining the semantic function of the unknown word in the test sentence and keeping the semantic function unchanged in the translation process. In this way, unknown words can help the phrase reordering and lexical selection of their surrounding words even though they still remain untranslated. In order to determine the semantic function of an unknown word, we employ the distributional semantic model and the bidirectional language model. Extensive experiments on both phrase-based and linguistically syntax-based SMT models in Chinese-to-English translation show that our method can substantially improve the translation quality.展开更多
The pivot language approach for statistical machine translation(SMT) is a good method to break the resource bottleneck for certain language pairs. However, in the implementation of conventional approaches, pivotside c...The pivot language approach for statistical machine translation(SMT) is a good method to break the resource bottleneck for certain language pairs. However, in the implementation of conventional approaches, pivotside context information is far from fully utilized, resulting in erroneous estimations of translation probabilities. In this study, we propose two topic-aware pivot language approaches to use different levels of pivot-side context. The first method takes advantage of document-level context by assuming that the bridged phrase pairs should be similar in the document-level topic distributions. The second method focuses on the effect of local context. Central to this approach are that the phrase sense can be reflected by local context in the form of probabilistic topics, and that bridged phrase pairs should be compatible in the latent sense distributions. Then, we build an interpolated model bringing the above methods together to further enhance the system performance. Experimental results on French-Spanish and French-German translations using English as the pivot language demonstrate the effectiveness of topic-based context in pivot-based SMT.展开更多
Loose phrase extraction method is proposed and applied for phrase-based statistical ma- chine translation. The method extracts phrase pairs that are not strictly consistent with word align- ments. Two types of constra...Loose phrase extraction method is proposed and applied for phrase-based statistical ma- chine translation. The method extracts phrase pairs that are not strictly consistent with word align- ments. Two types of constraints on word positions are investigated for this method. Furthermore, n-best alignments are introduced for phrase extraction instead of the one-best. Experimental results show that the proposed approach outperforms the baseline system, Pharaoh system, for both one-best and n-best alignments.展开更多
基金supported by the Institute for Information&communications Technology Promotion under Grant No.R0101-16-0176the Project of Core Technology Development for Human-Like Self-Taught Learning Based on Symbolic Approach
文摘This paper describes the experiments with Korean-to-Vietnamese statistical machine translation(SMT). The fact that Korean is a morphologically complex language that does not have clear optimal word boundaries causes a major problem of translating into or from Korean. To solve this problem, we present a method to conduct a Korean morphological analysis by using a pre-analyzed partial word-phrase dictionary(PWD).Besides, we build a Korean-Vietnamese parallel corpus for training SMT models by collecting text from multilingual magazines. Then, we apply such a morphology analysis to Korean sentences that are included in the collected parallel corpus as a preprocessing step. The experiment results demonstrate a remarkable improvement of Korean-to-Vietnamese translation quality in term of bi-lingual evaluation understudy(BLEU).
基金National Natural Science Foundation of China ( No.60803078)National High Technology Research and Development Programs of China (No.2006AA010107, No.2006AA010108)
文摘This paper proposed a method to incorporate syntax-based language models in phrase-based statistical machine translation (SMT) systems. The syntax-based language model used in this paper is based on link grammar,which is a high lexical formalism. In order to apply language models based on link grammar in phrase-based models,the concept of linked phrases,an extension of the concept of traditional phrases in phrase-based models was brought out. Experiments were conducted and the results showed that the use of syntax-based language models could improve the performance of the phrase-based models greatly.
基金the National High Technology Research and Development Progran of China(No.200606010108.2006AA01Z150)
文摘A novel model based on structure alignments is proposed for statistical machine translation in this paper. Meta-structure and sequence of meta-structure for a parse tree are defined. During the translation process, a parse tree is decomposed to deal with the structure divergence and the alignments can be constructed at different levels of recombination of meta-structure (RM). This method can perform the structure mapping across the sub-tree structure between languages. As a result, we get not only the translation for the target language, but sequence of meta-stmctu .re of its parse tree at the same time. Experiments show that the model in the framework of log-linear model has better generative ability and significantly outperforms Pharaoh, a phrase-based system.
基金supported by the National Natural Science Foundation of China(No.61303082) the Research Fund for the Doctoral Program of Higher Education of China(No.20120121120046)
文摘Lexicalized reordering models are very important components of phrasebased translation systems.By examining the reordering relationships between adjacent phrases,conventional methods learn these models from the word aligned bilingual corpus,while ignoring the effect of the number of adjacent bilingual phrases.In this paper,we propose a method to take the number of adjacent phrases into account for better estimation of reordering models.Instead of just checking whether there is one phrase adjacent to a given phrase,our method firstly uses a compact structure named reordering graph to represent all phrase segmentations of a parallel sentence,then the effect of the adjacent phrase number can be quantified in a forward-backward fashion,and finally incorporated into the estimation of reordering models.Experimental results on the NIST Chinese-English and WMT French-Spanish data sets show that our approach significantly outperforms the baseline method.
文摘Companies like Google, MSN and Yahoo provide translation services on their websites, generating translations based on statistical bilingual text corpora. Human translation seems to be inferior in face of huge amount of information and fast development of computer science. Despite the functions and versatility of statistical machine translation, it may never take the place of human effort. Teachers are supposed to guide the students in using online translation system.
基金supported by National Social Science Fund of China(Youth Program):“A Study of Acceptability of Chinese Government Public Signs in the New Era and the Countermeasures of the English Translation”(No.:13CYY010)the Subject Construction and Management Project of Zhejiang Gongshang University:“Research on the Organic Integration Path of Constructing Ideological and Political Training and Design of Mixed Teaching Platform during Epidemic Period”(No.:XKJS2020007)Ministry of Education IndustryUniversity Cooperative Education Program:“Research on the Construction of Cross-border Logistics Marketing Bilingual Course Integration”(NO.:202102494002).
文摘Retelling extraction is an important branch of Natural Language Processing(NLP),and high-quality retelling resources are very helpful to improve the performance of machine translation.However,traditional methods based on the bilingual parallel corpus often ignore the document background in the process of retelling acquisition and application.In order to solve this problem,we introduce topic model information into the translation mode and propose a topic-based statistical machine translation method to improve the translation performance.In this method,Probabilistic Latent Semantic Analysis(PLSA)is used to obtains the co-occurrence relationship between words and documents by the hybrid matrix decomposition.Then we design a decoder to simplify the decoding process.Experiments show that the proposed method can effectively improve the accuracy of translation.
基金Supported by the National High Technology Research and Development 863 Program of China under Grant Nos. 2011AA01A207,2012AA011101, and 2012AA011102
文摘Unknown words are one of the key factors that greatly affect the translation quality. Traditionally, nearly all the related researches focus on obtaining the translation of the unknown words. However, these approaches have two disadvantages. On the one hand, they usually rely on many additional resources such as bilingual web data; on the other hand, they cannot guarantee good reordering and lexical selection of surrounding words. This paper gives a new perspective on handling unknown words in statistical machine translation (SMT). Instead of making great efforts to find the translation of unknown words, we focus on determining the semantic function of the unknown word in the test sentence and keeping the semantic function unchanged in the translation process. In this way, unknown words can help the phrase reordering and lexical selection of their surrounding words even though they still remain untranslated. In order to determine the semantic function of an unknown word, we employ the distributional semantic model and the bidirectional language model. Extensive experiments on both phrase-based and linguistically syntax-based SMT models in Chinese-to-English translation show that our method can substantially improve the translation quality.
基金Project supported by the National High-Tech R&D Program of China(No.2012BAH14F03)the National Natural Science Foundation of China(Nos.61005052 and 61303082)+2 种基金the Re-search Fund for the Doctoral Program of Higher Education of China(No.20120121120046)the Natural Science Foundation of Fujian Province of China(No.2011J01360)the Funda-mental Research Funds for the Central Universities,China(No.2010121068)
文摘The pivot language approach for statistical machine translation(SMT) is a good method to break the resource bottleneck for certain language pairs. However, in the implementation of conventional approaches, pivotside context information is far from fully utilized, resulting in erroneous estimations of translation probabilities. In this study, we propose two topic-aware pivot language approaches to use different levels of pivot-side context. The first method takes advantage of document-level context by assuming that the bridged phrase pairs should be similar in the document-level topic distributions. The second method focuses on the effect of local context. Central to this approach are that the phrase sense can be reflected by local context in the form of probabilistic topics, and that bridged phrase pairs should be compatible in the latent sense distributions. Then, we build an interpolated model bringing the above methods together to further enhance the system performance. Experimental results on French-Spanish and French-German translations using English as the pivot language demonstrate the effectiveness of topic-based context in pivot-based SMT.
基金the High Technology Research and Develop-ment Program of China (No.2004AA117010-08).
文摘Loose phrase extraction method is proposed and applied for phrase-based statistical ma- chine translation. The method extracts phrase pairs that are not strictly consistent with word align- ments. Two types of constraints on word positions are investigated for this method. Furthermore, n-best alignments are introduced for phrase extraction instead of the one-best. Experimental results show that the proposed approach outperforms the baseline system, Pharaoh system, for both one-best and n-best alignments.