Alignment of the Polish-English Parallel Text for a Statistical Machine Translation
Abstract
Text alignment is crucial to the accuracy of Machine Translation (MT) systems, of some Natural Language Processing (NLP) tools, and of any other text processing task that requires bilingual data. This research proposes a language-independent sentence alignment approach based on experiments with Polish-to-English text, Polish being a language that is not position-sensitive. The approach was developed on the TED talks corpus but can be used for any text domain or language pair. It implements various heuristics for sentence recognition, some of which use synonyms and analysis of semantic text structure as additional information. The approach is designed to minimize data loss. The solution is compared with other sentence alignment implementations, and an improvement in MT system score for text processed with the described tool is also shown.