Abstract: Parallel corpora are of great importance to machine translation, and automatic sentence alignment is the first step in processing them. This paper puts forward a bilingual-dictionary-based sentence alignment method for Chinese-English parallel corpora, which differs from previous length-based algorithms in its knowledge-rich approach. Experimental results show that the method achieves over 93% accuracy with ordinary English-Chinese dictionaries whose translations cover 31.88% to 47.90% of the corpus.
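The abstract above does not state the scoring function or the search procedure, so the following is only a minimal sketch of what dictionary-based alignment can look like, not the paper's method. It assumes pre-tokenized sentences, a hypothetical `lexicon` mapping lowercase English words to sets of Chinese translations, a score equal to the fraction of English tokens with a dictionary translation in the Chinese sentence, and a monotone dynamic program over 1-1, 1-0 and 0-1 beads. The names `pair_score`, `align` and `skip_penalty` are illustrative.

```python
from typing import Dict, List, Set, Tuple

Sentence = List[str]


def pair_score(en: Sentence, zh: Sentence, lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of English tokens that have a dictionary translation in zh."""
    if not en or not zh:
        return 0.0
    zh_words = set(zh)
    hits = sum(1 for w in en if lexicon.get(w.lower(), set()) & zh_words)
    return hits / len(en)


def align(en_sents: List[Sentence], zh_sents: List[Sentence],
          lexicon: Dict[str, Set[str]],
          skip_penalty: float = -0.1) -> List[Tuple[int, int]]:
    """Monotone DP over 1-1, 1-0 and 0-1 beads; returns the 1-1 index pairs."""
    n, m = len(en_sents), len(zh_sents)
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            moves = []
            if i < n and j < m:  # pair English i with Chinese j
                gain = pair_score(en_sents[i], zh_sents[j], lexicon)
                moves.append((i + 1, j + 1, gain))
            if i < n:            # leave English i unaligned
                moves.append((i + 1, j, skip_penalty))
            if j < m:            # leave Chinese j unaligned
                moves.append((i, j + 1, skip_penalty))
            for ni, nj, gain in moves:
                if best[i][j] + gain > best[ni][nj]:
                    best[ni][nj] = best[i][j] + gain
                    back[ni][nj] = (i, j)
    # Walk back from the end and keep only the 1-1 beads.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if (i - pi, j - pj) == (1, 1):
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]


# Toy data, invented for illustration only.
toy_lexicon = {"cat": {"猫"}, "sleeps": {"睡觉"},
               "i": {"我"}, "like": {"喜欢"}, "tea": {"茶"}}
toy_en = [["the", "cat", "sleeps"], ["hello"], ["I", "like", "tea"]]
toy_zh = [["猫", "睡觉"], ["我", "喜欢", "茶"]]
print(align(toy_en, toy_zh, toy_lexicon))  # [(0, 0), (2, 1)]
```

In this toy run the untranslated English sentence is skipped rather than forced into a pair, which is the behavior a length-only aligner cannot obtain from lexical evidence. A fuller aligner would also need 1-2 and 2-1 beads.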
Abstract: Translation lexicons are fundamental to natural language processing tasks such as machine translation and cross-language information retrieval. This paper presents a lexicon builder that can automatically extract word translations from a Chinese-English parallel corpus, or assist a lexicographer in compiling them. The key mechanisms of the builder are described, including the co-occurrence measure, indirect association resolution, and multi-word unit translation. Experimental results indicate the effectiveness of the authors' method and the potential of the lexicon builder system.
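The abstract above names a co-occurrence measure and a step that resolves indirect associations without specifying either. As a hedged illustration only, the sketch below assumes the Dice coefficient as the measure and a greedy one-to-one linking as the resolution step; both are swapped-in choices, and the function names `dice_scores` and `greedy_links` are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

AlignedPair = Tuple[List[str], List[str]]


def dice_scores(pairs: List[AlignedPair]) -> Dict[Tuple[str, str], float]:
    """Dice coefficient 2*c(s,t) / (c(s) + c(t)), counted over aligned sentence pairs."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_types, t_types = set(src), set(tgt)
        src_count.update(s_types)
        tgt_count.update(t_types)
        joint.update(product(s_types, t_types))
    return {(s, t): 2 * c / (src_count[s] + tgt_count[t])
            for (s, t), c in joint.items()}


def greedy_links(scores: Dict[Tuple[str, str], float]) -> Dict[str, str]:
    """Take word pairs in descending score order, using each word at most once."""
    links, used_targets = {}, set()
    for (s, t), _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if s not in links and t not in used_targets:
            links[s] = t
            used_targets.add(t)
    return links


# Toy aligned corpus, invented for illustration only.
toy_pairs = [(["猫", "睡觉"], ["cat", "sleeps"]),
             (["猫", "吃"], ["cat", "eats"])]
toy_scores = dice_scores(toy_pairs)
toy_links = greedy_links(toy_scores)
```

Here "猫" and "sleeps" receive a non-zero score only because each co-occurs with the other's true translation. That is an indirect association, and the one-to-one constraint discards it once "猫" has been linked to "cat".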
Abstract: After describing why a Jiangxi red tourism resource E-C/C-E bilingual parallel corpus is needed, this paper discusses the design and construction of the corpus. For the design, it describes the general design and the framework of the corpus. For the construction, it covers data collection, the standard for the sorted data, data selection, data digitization, data tagging, and data alignment. The construction is intended not only to realize the purposes and functions of the corpus, but also to provide others with ways to use it and to build similar corpora.
Abstract: This paper discusses the construction of the Jiangxi Tourism Resource E-C Parallel Corpus, starting from a description of why the corpus is needed. It first describes the framework of the corpus and then its construction, which includes data collection, the standard for the sorted data, data selection, data digitization, data tagging, and data alignment. By introducing the construction, the paper not only realizes the purposes and functions of the corpus, but also provides others with ways to use it and to build similar corpora.
Funding: Supported by the National Basic Research Program of China (973 Program) (2013CB329303) and the National Natural Science Foundation of China (61502035).
Abstract: The performance of a machine translation system depends heavily on the quantity and quality of its bilingual language resources. However, obtaining a large, high-quality parallel corpus is very difficult, especially for low-resource language pairs such as Chinese-Vietnamese. Fortunately, multilingual user-generated content (UGC), such as bilingual movie subtitles, makes automatic construction of parallel corpora possible. Although the amount of UGC parallel data can be considerable, the raw corpus is not suitable for statistical machine translation (SMT) systems, because it may contain translation errors, mismatched sentences, free translations, and so on. To improve the quality of the bilingual corpus for SMT systems, three filtering methods are proposed, based on sentence length difference, the semantics of sentence pairs, and machine learning. Experiments are conducted on a Chinese-to-Vietnamese translation corpus. The results show that all three methods effectively improve corpus quality, and that machine translation performance (BLEU score) can be improved by 1.32.
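Of the three filters listed in the abstract above, the sentence length difference filter is the simplest to illustrate. The abstract gives neither the length unit nor the threshold, so the sketch below assumes token counts and a hypothetical `max_ratio` cutoff of 2.0; it is not the paper's configuration.

```python
from typing import List, Tuple

AlignedPair = Tuple[List[str], List[str]]


def length_ratio_ok(src: List[str], tgt: List[str], max_ratio: float = 2.0) -> bool:
    """True when the longer side is at most max_ratio times the shorter side."""
    if not src or not tgt:
        return False  # an empty side cannot be a valid translation pair
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    return longer / shorter <= max_ratio


def filter_by_length(pairs: List[AlignedPair], max_ratio: float = 2.0) -> List[AlignedPair]:
    """Keep only the sentence pairs that pass the length-ratio check."""
    return [(s, t) for s, t in pairs if length_ratio_ok(s, t, max_ratio)]


# Toy subtitle pairs, invented for illustration only.
toy_subtitles = [
    (["我", "喜欢", "茶"], ["tôi", "thích", "trà"]),       # similar lengths: kept
    (["好"], ["w1", "w2", "w3", "w4", "w5"]),               # 1 vs 5 tokens: dropped
    ([], ["w1"]),                                           # empty source: dropped
]
kept = filter_by_length(toy_subtitles)
```

A ratio test of this kind is cheap and catches gross mismatches, but it cannot detect same-length pairs with different meanings, which is presumably why the abstract pairs it with semantic and learned filters.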
Abstract: Based on a Chinese-English parallel corpus of Yu Hua's novel To Live (the first two sections), this paper analyzes the Chinese reporting verb "shuo" and its translations using the concordance, concordance plot, collocates, word list, and other functions of AntConc. It aims to study the diction style of the translator, Michael Berry, and to investigate how the source text has influenced the target text.
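The study above relies on AntConc, a desktop tool, and the abstract does not describe a scripted workflow. Purely as an illustration of the underlying idea, the sketch below tallies which English reporting verbs appear in the target sentence whenever a source sentence contains a given Chinese word. The function name `translation_counts`, the candidate verb set, and the sentences are all made up for the example and are not taken from the novel or its translation.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

AlignedPair = Tuple[List[str], List[str]]


def translation_counts(pairs: Iterable[AlignedPair], source_word: str,
                       candidate_verbs: Set[str]) -> Counter:
    """Count candidate target verbs in pairs whose source side contains source_word."""
    counts = Counter()
    for zh_tokens, en_tokens in pairs:
        if source_word not in zh_tokens:
            continue
        found = [w.lower() for w in en_tokens if w.lower() in candidate_verbs]
        # Record a placeholder when no candidate verb is present (e.g. omission).
        counts.update(found if found else ["<other>"])
    return counts


# Invented sentence pairs, for illustration only.
toy_corpus = [
    (["他", "说", "好"], ["he", "said", "fine"]),
    (["她", "说", "不"], ["she", "replied", "no"]),
    (["他", "笑", "了"], ["he", "laughed"]),
]
verb_counts = translation_counts(toy_corpus, "说", {"said", "replied", "asked"})
```

The resulting counts give a rough picture of how uniformly a reporting verb is rendered, which is the kind of evidence a concordance view provides interactively.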
Funding: Supported by the Institute for Information & communications Technology Promotion under Grant No. R0101-16-0176, the Project of Core Technology Development for Human-Like Self-Taught Learning Based on Symbolic Approach.
Abstract: This paper describes experiments with Korean-to-Vietnamese statistical machine translation (SMT). Korean is a morphologically complex language without clear optimal word boundaries, which poses a major problem when translating into or from Korean. To solve this problem, we present a method that performs Korean morphological analysis using a pre-analyzed partial word-phrase dictionary (PWD). In addition, we build a Korean-Vietnamese parallel corpus for training SMT models by collecting text from multilingual magazines. We then apply this morphological analysis to the Korean sentences in the collected parallel corpus as a preprocessing step. The experimental results demonstrate a remarkable improvement in Korean-to-Vietnamese translation quality in terms of bilingual evaluation understudy (BLEU) scores.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 70903032) and the Foundation for Humanities and Social Science of the Chinese Ministry of Education (Grant No. 08JC870007).
Abstract: A taxonomy denotes the hierarchical structure of a knowledge organization system. It has important applications in knowledge navigation, semantic annotation, and semantic search. It is useful to study multilingual taxonomies generated automatically in a dynamic information environment in which massive amounts of information are processed and found. A multilingual taxonomy is the core component of a multilingual thesaurus or ontology. This paper presents two methods for generating bilingual taxonomies: cross-language terminology clustering and mixed-language-based terminology clustering. In experiments on terminology clustering in four specific subject domains, we found that when a parallel corpus is used to cluster multilingual terminology, mixed-language-based terminology clustering outperforms cross-language terminology clustering.
Abstract: One of the characteristics of good writing is the appropriate use of cohesive devices, which many learners find difficult. Although there is a substantial body of research on cohesion in writing, little has been documented on how to teach it to EFL students. The present study attempts to address this under-researched issue by taking a terminological approach, as the diverse and vague terminology used to describe link words has been found to be more of a hindrance than a help in the teaching and learning process. A total of 16 cohesion-related terms were surveyed to find out about the quality and quantity of their use in a self-built corpus of 14 grammar books of different levels. Furthermore, a test was run to measure students' familiarity with these terms, and field notes were taken while scoring the students' papers in their presence. The findings, in addition to giving insights into the usage points regarding each of the 16 terms, led to a terminological network which, if used consistently, could help in teaching and learning to use cohesion effectively.