Funding: Supported by the National High Technology Research and Development Program of China under Grant No. 2012AA011104 and the Fundamental Research Funds for the Central Universities.
Abstract: We introduce a novel Semantic-Category-Tree (SCT) model to represent the semantic structure of a sentence for Chinese-English Machine Translation (MT). We use the SCT model to handle reordering in a hierarchical structure in which one reordering depends on the others. Unlike other reordering approaches, we handle reordering at three levels: sentence level, chunk level, and word level. The chunk-level reordering depends on the sentence-level reordering, and the word-level reordering depends on the chunk-level reordering. In this paper, we formally describe the SCT model and discuss the translation strategy based on it. Further, we present an algorithm for analyzing the source language in SCT and transforming the source SCT into the target SCT. We apply the SCT model to a rule-based patent text MT system to evaluate its ability. The experimental results show that SCT is efficient in handling the hierarchical reordering operation in MT.
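The dependency between reordering levels described in this abstract can be illustrated with a small sketch. This is not the SCT model itself: the `Node` class, its `order` field, and the hand-written permutations are hypothetical names and values chosen for illustration, and the leaves are English glosses of Chinese words. The sketch only shows how a lower-level permutation is applied inside the position already fixed by its parent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node in a hypothetical tree: sentence -> chunk -> word."""
    label: str
    children: List["Node"] = field(default_factory=list)
    # Target-side order of the children, as indices into `children`.
    # None means "keep the source order". Values here are hand-written
    # assumptions, not the output of any analysis algorithm.
    order: Optional[List[int]] = None


def reorder(node: Node) -> List[str]:
    """Apply reordering top-down and return the leaf labels in target order."""
    if not node.children:
        return [node.label]
    order = node.order if node.order is not None else range(len(node.children))
    leaves: List[str] = []
    for index in order:
        # A child's own permutation is applied only after its parent
        # has decided where the child goes.
        leaves.extend(reorder(node.children[index]))
    return leaves


# Source order (glossed): I / at Beijing of university / work
sentence = Node("S", [
    Node("subject", [Node("I")]),
    Node("location",
         [Node("at"), Node("Beijing"), Node("of"), Node("university")],
         order=[0, 3, 2, 1]),
    Node("verb", [Node("work")]),
], order=[0, 2, 1])

target = reorder(sentence)
print(" ".join(target))  # I work at university of Beijing
```

The sentence-level permutation moves the verb chunk ahead of the location chunk, and the permutation inside the location chunk is resolved afterwards, within the slot the parent assigned to it.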
Abstract: Translation lexicons are fundamental to natural language processing tasks such as machine translation and cross-language information retrieval. This paper presents a lexicon builder that can automatically extract word translations from a Chinese-English parallel corpus, or assist a lexicographer in compiling them. The key mechanisms of the builder are described, including the co-occurrence measure, indirect association resolution, and multi-word unit translation. Experimental results indicate the effectiveness of the authors' method and the potential of the lexicon builder system.
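As a rough illustration of a co-occurrence measure over a sentence-aligned parallel corpus, the sketch below scores word pairs with the Dice coefficient. The abstract does not say which measure the authors use, so Dice is an assumption here, and the function names and the toy corpus are made up for illustration.

```python
from collections import Counter
from itertools import product


def dice_scores(parallel_pairs):
    """Score every (source word, target word) pair by the Dice coefficient.

    parallel_pairs: iterable of (source_tokens, target_tokens) sentence pairs.
    Returns a dict mapping (src, tgt) -> 2 * c(src, tgt) / (c(src) + c(tgt)),
    where each count is the number of sentence pairs containing the word(s).
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in parallel_pairs:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        pair_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2.0 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def best_translation(scores, src_word):
    """Return the highest-scoring target word for src_word, or None."""
    candidates = {t: v for (s, t), v in scores.items() if s == src_word}
    return max(candidates, key=candidates.get) if candidates else None


# Toy corpus (made-up data, for illustration only).
corpus = [
    (["机器", "翻译"], ["machine", "translation"]),
    (["机器", "学习"], ["machine", "learning"]),
    (["信息", "检索"], ["information", "retrieval"]),
]
scores = dice_scores(corpus)
print(best_translation(scores, "机器"))
```

In this toy corpus the pair (翻译, machine) also receives a nonzero score simply because the two words share a sentence pair. Spurious pairs of this kind are presumably what the indirect association resolution step in the abstract is meant to filter out.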
Abstract: Chunk alignment of the bilingual corpus is the basis of Example-based Machine Translation. This paper presents an anchor-based English-Chinese bilingual chunk alignment model and the corresponding alignment algorithm. The model can effectively overcome the sparse data problem caused by the limited size of the bilingual corpus. In this model, chunk segmentation disambiguation is delayed to the alignment process, and hence the accuracy of chunk segmentation is improved. The experimental results demonstrate the feasibility and viability of this model.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61173100 and the Fundamental Research Funds for the Central Universities under Grant No. GDUT10RW202.
Abstract: A hybrid approach to English Part-of-Speech (PoS) tagging is presented, with English-Chinese machine translation in the business domain as its target application. It demonstrates how an existing tagger can be adapted to learn from a small amount of data and handle unknown words for the purpose of machine translation. A small annotated English corpus (998 k) in the business domain is built semi-automatically based on a new tagset; a maximum entropy model is adopted, and a rule-based approach is used in post-processing. The tagger is further applied to Noun Phrase (NP) chunking. Experiments show that our tagger achieves an accuracy of 98.14%. In the application to NP chunking, the tagger yields a 2.21% increase in F-score compared with the results obtained using the Stanford tagger.
Abstract: Text alignment is crucial to the accuracy of Machine Translation (MT) systems, some Natural Language Processing (NLP) tools, and any other text processing tasks requiring bilingual data. This research proposes a language-independent sentence alignment approach based on Polish-to-English experiments (Polish is not a position-sensitive language). The approach was developed on the TED (Translanguage English Database) talks corpus, but can be used for any text domain or language pair. It implements various heuristics for sentence recognition, some of which use synonyms and semantic text structure analysis as additional information. Minimization of data loss was ensured. The solution is compared with other sentence alignment implementations, and an improvement in MT system score for text processed with the described tool is shown.
Funding: Project supported by the National Natural Science Foundation of China (No. 61672440), the Natural Science Foundation of Fujian Province, China (No. 2016J05161), the Research Fund of the State Key Laboratory for Novel Software Technology in Nanjing University, China (No. KFKT2015B11), the Scientific Research Project of the National Language Committee of China (No. YB135-49), and the Fundamental Research Funds for the Central Universities, China (No. ZK1024).
Abstract: A lack of labeled corpora obstructs research progress on implicit discourse relation recognition (DRR) for Chinese, while discourse corpora are available in other languages, such as English. In this paper, we propose a cross-lingual implicit DRR framework that exploits an available English corpus for the Chinese DRR task. We use machine translation to generate Chinese instances from a labeled English discourse corpus. In this way, each instance has two independent views: a Chinese view and an English view. We then train two classifiers, one in Chinese and one in English, in a co-training fashion that exploits unlabeled Chinese data to achieve better implicit DRR for Chinese. Experimental results demonstrate the effectiveness of our method.
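The co-training idea in this abstract can be sketched generically as follows. This is a minimal illustration, not the authors' system: the nearest-centroid classifier is a stand-in (the abstract does not name the classifiers), the feature vectors are made-up numbers, and all function names are hypothetical. Each instance is a pair of feature vectors, one per view, standing in for the Chinese and English views.

```python
def train_centroids(xs, ys):
    """Fit a nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return (label, confidence). Confidence is the gap between the two
    nearest centroids' squared distances, so larger means more certain."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, c)), y) for y, c in centroids.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin


def co_train(labeled, labels, unlabeled, rounds=5, per_round=1):
    """Grow the labeled set: in each round, each view's classifier labels
    the unlabeled instances it is most confident about."""
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            model = train_centroids([inst[view] for inst in labeled], labels)
            scored = sorted(
                ((predict(model, inst[view]), idx) for idx, inst in enumerate(pool)),
                key=lambda item: item[0][1],
                reverse=True,
            )
            chosen = scored[:per_round]
            for (label, _), idx in chosen:
                labeled.append(pool[idx])  # both views join the labeled set
                labels.append(label)
            for idx in sorted((idx for _, idx in chosen), reverse=True):
                pool.pop(idx)
    models = [train_centroids([inst[v] for inst in labeled], labels) for v in (0, 1)]
    return models, labeled, labels


# Toy data (made up): each instance is (view-0 features, view-1 features).
seed = [((0.0,), (0.0,)), ((1.0,), (1.0,))]
seed_labels = [0, 1]
pool = [((0.1,), (0.2,)), ((0.9,), (0.8,)), ((0.2,), (0.1,)), ((0.8,), (0.9,))]
models, grown, grown_labels = co_train(seed, seed_labels, pool)
```

Because an instance labeled by one view's classifier is added with both of its views, each classifier ends up training on examples selected by the other, which is the mechanism that lets unlabeled data help.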