Abstract
Sentence-level corpora are an important resource for machine translation. Because the channels for acquiring them are limited, however, such corpora are not only small but also tend to be concentrated in a few specific domains, which makes it hard for them to meet the needs of real applications. In this paper, anchor-text information is used to find Chinese-Uighur bilingual parallel websites through search engines, and all bilingual parallel webpages on those sites are downloaded. Pages that contain a main text are selected, and an HTML tree is built for each page from its HTML features. A webpage analysis method is proposed that treats the structure of the HTML tree as an important feature for identifying main-text content, and the main text is then extracted according to the similarity of the content information. The extracted text is split into sentences to create separate sentence-level Chinese and Uighur corpora, which will serve the future construction of a sentence-level Chinese-Uighur bilingual parallel corpus.
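The pipeline summarized above (build an HTML tree, locate the main-text block, split it into sentences) can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not give the tree features or the similarity measure, so the sketch substitutes a simple stand-in heuristic that descends into the container holding most of the non-link text. All names (`TreeBuilder`, `main_block`, `split_sentences`, `extract_sentences`), the 0.6 threshold, the tag sets, and the list of sentence terminators are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

SKIP = {"script", "style", "a"}      # text ignored when measuring (assumption)
VOID = {"br", "img", "hr", "meta", "link", "input"}
CONTAINERS = {"html", "body", "div", "table", "tbody", "tr", "td",
              "article", "section"}
# Assumed terminators: CJK full stop, fullwidth ! and ?, ASCII . ! ?,
# and the Arabic-script question mark (U+061F) used in Uighur text.
TERMINATORS = "\u3002\uff01\uff1f.!?\u061f"
SENTENCE = re.compile(f"[^{TERMINATORS}]+[{TERMINATORS}]*")


class Node:
    def __init__(self, tag, parent=None, value=""):
        self.tag, self.parent, self.value = tag, parent, value
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; text becomes '#text' leaf nodes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:          # void elements never get children
            self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.children.append(Node("#text", self.cur, data.strip()))


def text_of(node):
    """Text chunks under node in document order, skipping SKIP subtrees."""
    if node.tag == "#text":
        return [node.value]
    if node.tag in SKIP:
        return []
    return [t for child in node.children for t in text_of(child)]


def length(node):
    return sum(len(t) for t in text_of(node))


def main_block(root, share=0.6):
    """Descend while one container child holds at least `share` of the text."""
    node = root
    while True:
        boxes = [c for c in node.children if c.tag in CONTAINERS]
        best = max(boxes, key=length, default=None)
        if best is None or length(best) == 0 or length(best) < share * length(node):
            return node
        node = best


def split_sentences(chunks):
    """Split each text chunk after runs of terminator characters."""
    out = []
    for chunk in chunks:
        out.extend(s.strip() for s in SENTENCE.findall(chunk) if s.strip())
    return out


def extract_sentences(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return split_sentences(text_of(main_block(builder.root)))


SAMPLE = """<html><head><title>t</title><script>var x = 1;</script></head>
<body>
<div id="nav"><a href="/">首页</a><a href="/news">新闻</a></div>
<div id="content"><p>今天天气很好。我们去公园吧!</p><p>你去吗?</p></div>
<div id="footer">版权所有</div>
</body></html>"""

sentences = extract_sentences(SAMPLE)
print(sentences)
```

Running the same function over the Chinese and Uighur versions of a page would give the two monolingual sentence lists that the abstract describes. Treating every ASCII period as a sentence end is a simplification that will mis-split decimals and abbreviations, so a real system would need language-specific rules.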
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
2011, No. 12, pp. 19-21, 70 (4 pages in total)
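Funding
National Natural Science Foundation of China (60963017)
National Social Science Foundation of China (10BTQ045)
Key Project of the Scientific Research Program of Colleges and Universities of Xinjiang Uygur Autonomous Region (XJEDU2009I05)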
Keywords
Bilingual parallel corpus
Bilingual parallel sentence pairs
Main-text extraction