Funding: National Natural Science Foundation of China (No. 60803078); National High Technology Research and Development Programs of China (No. 2006AA010107, No. 2006AA010108)
Abstract: This paper proposes a method for incorporating syntax-based language models into phrase-based statistical machine translation (SMT) systems. The syntax-based language model is based on link grammar, a highly lexicalized formalism. To apply link-grammar language models in phrase-based models, the paper introduces the concept of linked phrases, an extension of the traditional phrases used in phrase-based models. Experimental results show that syntax-based language models can greatly improve the performance of phrase-based models.
Funding: Supported by the Institute for Information & Communications Technology Promotion under Grant No. R0101-16-0176, the Project of Core Technology Development for Human-Like Self-Taught Learning Based on Symbolic Approach
Abstract: This paper describes experiments with Korean-to-Vietnamese statistical machine translation (SMT). Korean is a morphologically complex language without clear optimal word boundaries, which is a major problem when translating into or from Korean. To solve this problem, we present a method for Korean morphological analysis that uses a pre-analyzed partial word-phrase dictionary (PWD). In addition, we build a Korean-Vietnamese parallel corpus for training SMT models by collecting text from multilingual magazines. We then apply the morphological analysis to the Korean sentences in the collected parallel corpus as a preprocessing step. The experimental results demonstrate a remarkable improvement in Korean-to-Vietnamese translation quality in terms of bilingual evaluation understudy (BLEU) score.
Funding: Supported by the National Natural Science Foundation of China (No. 61303082) and the Research Fund for the Doctoral Program of Higher Education of China (No. 20120121120046)
Abstract: Lexicalized reordering models are very important components of phrase-based translation systems. Conventional methods learn these models from a word-aligned bilingual corpus by examining the reordering relationships between adjacent phrases, but they ignore the effect of the number of adjacent bilingual phrases. In this paper, we propose a method that takes the number of adjacent phrases into account to better estimate reordering models. Instead of only checking whether there is one phrase adjacent to a given phrase, our method first uses a compact structure, named a reordering graph, to represent all phrase segmentations of a parallel sentence. The effect of the number of adjacent phrases is then quantified in a forward-backward fashion and finally incorporated into the estimation of the reordering models. Experimental results on the NIST Chinese-English and WMT French-Spanish data sets show that our approach significantly outperforms the baseline method.
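For orientation, the conventional baseline that this abstract contrasts itself with can be sketched roughly as follows. This is a minimal illustration of the standard monotone/swap/discontinuous classification with relative-frequency estimation, not the reordering-graph method of the paper. The function names, the inclusive source-span convention, and the toy events are assumptions made for this example.

```python
from collections import Counter


def orientation(prev_src_span, cur_src_span):
    """Classify the orientation of the current phrase relative to the previous one.

    Spans are (start, end) source word indices, inclusive, and the two phrases
    are assumed to be adjacent on the target side (previous, then current).
    """
    prev_start, prev_end = prev_src_span
    cur_start, cur_end = cur_src_span
    if cur_start == prev_end + 1:
        return "monotone"       # current phrase directly follows on the source side
    if cur_end == prev_start - 1:
        return "swap"           # current phrase directly precedes on the source side
    return "discontinuous"      # anything else


def estimate_orientation_probs(events):
    """Relative-frequency estimate of p(orientation | phrase pair), unsmoothed."""
    events = list(events)
    pair_orientation_counts = Counter(events)
    pair_totals = Counter(pair for pair, _ in events)
    return {
        (pair, o): count / pair_totals[pair]
        for (pair, o), count in pair_orientation_counts.items()
    }


# Hypothetical extraction events for a single phrase pair.
pair = ("f_phrase", "e_phrase")
events = [
    (pair, orientation((0, 1), (2, 2))),  # monotone
    (pair, orientation((3, 4), (2, 2))),  # swap
    (pair, orientation((0, 0), (1, 2))),  # monotone
]
probs = estimate_orientation_probs(events)
```

Each extracted phrase pair contributes one count here regardless of how many adjacent phrases it has, which is the simplification the abstract says the proposed method addresses.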
Funding: Supported by the National High Technology Research and Development Program of China (No.200606010108.2006AA01Z150)
Abstract: This paper proposes a novel model based on structure alignments for statistical machine translation. The meta-structure and the sequence of meta-structures of a parse tree are defined. During translation, a parse tree is decomposed to deal with structural divergence, and alignments can be constructed at different levels of recombination of meta-structure (RM). This method can perform structure mapping across sub-tree structures between languages. As a result, we obtain not only the translation in the target language but also the sequence of meta-structures of its parse tree. Experiments show that the model, within a log-linear framework, has better generative ability and significantly outperforms Pharaoh, a phrase-based system.
Funding: Supported by the National Social Science Fund of China (Youth Program): "A Study of Acceptability of Chinese Government Public Signs in the New Era and the Countermeasures of the English Translation" (No. 13CYY010); the Subject Construction and Management Project of Zhejiang Gongshang University: "Research on the Organic Integration Path of Constructing Ideological and Political Training and Design of Mixed Teaching Platform during Epidemic Period" (No. XKJS2020007); and the Ministry of Education Industry-University Cooperative Education Program: "Research on the Construction of Cross-border Logistics Marketing Bilingual Course Integration" (No. 202102494002).
Abstract: Retelling extraction is an important branch of Natural Language Processing (NLP), and high-quality retelling resources are very helpful for improving the performance of machine translation. However, traditional methods based on bilingual parallel corpora often ignore the document background when acquiring and applying retellings. To solve this problem, we introduce topic model information into the translation model and propose a topic-based statistical machine translation method to improve translation performance. In this method, Probabilistic Latent Semantic Analysis (PLSA) is used to obtain the co-occurrence relationship between words and documents through hybrid matrix decomposition. We then design a decoder to simplify the decoding process. Experiments show that the proposed method can effectively improve translation accuracy.
Funding: Supported by the National High Technology Research and Development 863 Program of China under Grant Nos. 2011AA01A207, 2012AA011101, and 2012AA011102
Abstract: Unknown words are one of the key factors that greatly affect translation quality. Traditionally, nearly all related research has focused on obtaining translations of the unknown words. However, these approaches have two disadvantages. On the one hand, they usually rely on many additional resources, such as bilingual web data; on the other hand, they cannot guarantee good reordering and lexical selection of the surrounding words. This paper offers a new perspective on handling unknown words in statistical machine translation (SMT). Instead of making great efforts to find the translation of unknown words, we focus on determining the semantic function of the unknown word in the test sentence and keeping that semantic function unchanged during translation. In this way, unknown words can help the phrase reordering and lexical selection of their surrounding words even though they remain untranslated. To determine the semantic function of an unknown word, we employ a distributional semantic model and a bidirectional language model. Extensive experiments on both phrase-based and linguistically syntax-based SMT models in Chinese-to-English translation show that our method can substantially improve translation quality.
Funding: Sponsored by the "Overseas Reception of Suzhou Local Culture" fund (No. 2019SJA1330) and the Jiangsu Social Sciences and Humanities Fund (No. 18WWD005).
Abstract: Annotation in translation is of great value in communicating "the local" to a global readership. Based on content- and function-centered statistics on the 483 notes in the four English versions of Shen Fu's autobiographical work Fushengliuji, we find that: 1) in terms of content, cultural, geographic, historical, and literary references are the most important categories of annotation in the English translations of this work, and annotations in the four versions serve six major functions: to further inform, to facilitate understanding, to avoid misunderstanding, to interpret personally, to cite or allude, and to correct mistakes; 2) no correlation can be established between the use of annotation and the reception of the work per se, but annotation can reflect the translator's poise and strategy, which ultimately affect the reception of the work; and 3) Lin's version used relatively few notes and relied heavily on paraphrasing, a practice that makes his translation more accessible but possibly sacrifices some culturally and socially significant elements of the original. Black's translation used notes sparingly, and she went so far as to rearrange and edit the original text, revealing her approach of radical "reader-centeredness". Pratt and Chiang's version and Sanders' version used a large number of notes carrying a sinological mission, revealing their respect for the original and their decision to inform and inspire their readers. We argue that cultural translation, whether aided by annotation or not, is predominantly an art of "glocalism" and that author-centeredness and reader-centeredness can be reconciled, since ultimately they serve the same "communicative" purpose.
Funding: Project supported by the National High-Tech R&D Program of China (No. 2012BAH14F03), the National Natural Science Foundation of China (Nos. 61005052 and 61303082), the Research Fund for the Doctoral Program of Higher Education of China (No. 20120121120046), the Natural Science Foundation of Fujian Province of China (No. 2011J01360), and the Fundamental Research Funds for the Central Universities, China (No. 2010121068)
Abstract: The pivot language approach to statistical machine translation (SMT) is a good way to break the resource bottleneck for certain language pairs. However, conventional implementations make far from full use of pivot-side context information, resulting in erroneous estimates of translation probabilities. In this study, we propose two topic-aware pivot language approaches that use different levels of pivot-side context. The first method takes advantage of document-level context by assuming that bridged phrase pairs should be similar in their document-level topic distributions. The second method focuses on the effect of local context. Central to this approach are the ideas that the phrase sense can be reflected by local context in the form of probabilistic topics, and that bridged phrase pairs should be compatible in their latent sense distributions. We then build an interpolated model that combines the two methods to further enhance system performance. Experimental results on French-Spanish and French-German translation, using English as the pivot language, demonstrate the effectiveness of topic-based context in pivot-based SMT.
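As background, the conventional context-free way of bridging phrase pairs through a pivot language is usually phrase-table triangulation, p(t|s) = Σ_p p(p|s) · p(t|p), summing over pivot phrases p. The sketch below illustrates only that baseline, not the topic-aware methods proposed in the paper. The function name, the nested-dictionary table layout, and the toy probabilities are assumptions made for illustration.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Bridge two phrase tables through shared pivot phrases.

    src_to_pivot[s][p] holds p(p|s); pivot_to_tgt[p][t] holds p(t|p).
    Returns a table with p(t|s) = sum over p of p(p|s) * p(t|p).
    """
    src_to_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                src_to_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in src_to_tgt.items()}


# Toy French-English and English-Spanish tables (made-up probabilities).
fr_en = {"maison": {"house": 0.8, "home": 0.2}}
en_es = {
    "house": {"casa": 0.9, "vivienda": 0.1},
    "home": {"casa": 0.6, "hogar": 0.4},
}
fr_es = triangulate(fr_en, en_es)
```

Because the sum treats every pivot phrase identically whatever document or sentence it came from, a pivot phrase with several senses can bridge unrelated source and target phrases, which is the kind of estimation error the abstract attributes to ignoring pivot-side context.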