
Automatic Acquisition of Chinese and Uighur Corpora in a Web Environment (cited by: 1)

AUTOMATIC ACQUIRING CHINESE AND UIGHUR CORPUS LIBRARY IN WEB ENVIRONMENT
Abstract: Sentence-level corpora are an important resource for machine translation. Because the channels for acquiring them are limited, however, such corpora are not only few in number but also tend to be concentrated in a few specific domains, which makes it hard for them to meet the needs of real applications. In this paper, anchor text information is used with search engines to locate Chinese-Uighur bilingual parallel websites on the Web, and all bilingual parallel pages on those sites are downloaded. Pages that contain main body text are then selected, and an HTML tree is built from each page's HTML features. A web page analysis method is proposed that treats the structure of the HTML tree as an important feature for identifying main body content, and the main body text is extracted according to the similarity of content information. The extracted text is then segmented into sentences to create separate sentence-level Chinese and Uighur corpora, which will serve the later construction of a sentence-level Chinese-Uighur bilingual parallel corpus.
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2011, No. 12, pp. 19-21, 70 (4 pages in total)
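Funding: National Natural Science Foundation of China (60963017); National Social Science Foundation of China (10BTQ045); Key Project of the Xinjiang Autonomous Region Higher Education Scientific Research Program (XJEDU2009I05)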
Keywords: bilingual parallel corpus; bilingual parallel sentence pair; main text extraction
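
The abstract describes building an HTML tree, locating the main body text, and splitting it into sentences, but this record does not give the algorithm itself. The following is only a minimal sketch of that kind of pipeline, not the authors' method: it swaps in a simple heuristic (descend into the child subtree that holds most of the non-link text) in place of the paper's content-similarity measure. All names here (`TreeBuilder`, `main_node`, `split_sentences`, `extract_sentences`), the `dominance` threshold of 0.6, and the set of sentence terminators are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

SKIP = {"script", "style"}                           # never counted as text
VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags without an end tag
TERMINATORS = "。！？!?.\u061f"   # assumed enders; \u061f is the Arabic question mark
_T = re.escape(TERMINATORS)


class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []  # child Nodes and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree; tolerant of unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Climb to the nearest open ancestor with this tag; ignore stray end tags.
        n = self.cur
        while n is not None and n.tag != tag:
            n = n.parent
        if n is not None and n.parent is not None:
            self.cur = n.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.children.append(data.strip())


def subtree_text(node):
    """Visible text under a node, skipping script and style subtrees."""
    if node.tag in SKIP:
        return ""
    parts = [c if isinstance(c, str) else subtree_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def link_text_len(node):
    """Number of characters under a node that sit inside <a> elements."""
    if node.tag in SKIP:
        return 0
    if node.tag == "a":
        return len(subtree_text(node))
    return sum(link_text_len(c) for c in node.children if isinstance(c, Node))


def content_len(node):
    """Non-link text length: navigation bars and footers score close to zero."""
    return len(subtree_text(node)) - link_text_len(node)


def main_node(root, dominance=0.6):
    """Descend while one child holds at least `dominance` of the non-link text."""
    node = root
    while True:
        kids = [c for c in node.children if isinstance(c, Node)]
        total = content_len(node)
        best = max(kids, key=content_len, default=None)
        if best is None or total == 0 or content_len(best) < dominance * total:
            return node
        node = best


def split_sentences(text):
    """Cut after each run of terminators; naive about decimals and abbreviations."""
    pieces = re.findall(rf"[^{_T}]+[{_T}]*", text)
    return [p.strip() for p in pieces if p.strip()]


def extract_sentences(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return split_sentences(subtree_text(main_node(builder.root)))
```

Calling `extract_sentences(page_html)` on a downloaded page returns a list of sentence strings taken from the subtree chosen by `main_node`. A real system would need language-specific segmentation rules for Chinese and Uighur, and would likely combine the tree structure with further features, as the abstract indicates.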
