Abstract
Parallel corpora are an important basic resource for natural language processing, and a large volume of them can be mined from bilingual parallel web pages. This paper first introduces a method for computing the credibility of bilingual URL pairing patterns (or keys), and then formulates an algorithm that recognizes bilingual parallel web pages based on the local credibility of these automatically discovered patterns. Based on the global credibility of the patterns, the algorithm is extended in two ways: rescuing weak keys that the initial algorithm filtered out because they fell below the local credibility threshold, and unearthing bilingual parallel deep web pages by applying keys of high global credibility together with web page detection. Furthermore, we detect more credible bilingual web sites around the seed web sites by combining each site's bilingual credibility with its link relationships to the seed set. In addition to automatic recognition of URL pairing patterns, we also use search engines to recognize bilingual web pages quickly from a small set of highly credible patterns. To further improve the accuracy of the candidate bilingual page pairs produced by these five methods, we calculate the cross-lingual similarity of each candidate pair and filter out non-parallel pairs with a threshold. A series of experiments confirms the effectiveness of the proposed approaches.
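As a rough illustration of what a URL pairing pattern (key) does, the sketch below pairs URLs within one site by substituting one key string for another and scores the key by how often the substitution lands on an existing URL. The abstract does not give the credibility formula, so the ratio used here is an assumption for illustration only, and the function names, the `example.org` URLs, and the `("/en/", "/zh/")` key are all hypothetical.

```python
from typing import Iterable, List, Tuple


def pair_urls(urls: Iterable[str], key: Tuple[str, str]) -> List[Tuple[str, str]]:
    """Return (source, target) URL pairs linked by substituting key[0] with key[1]."""
    url_set = set(urls)
    src, tgt = key
    pairs = []
    for url in sorted(url_set):
        if src in url:
            # Replace only the first occurrence of the source-language marker.
            candidate = url.replace(src, tgt, 1)
            if candidate != url and candidate in url_set:
                pairs.append((url, candidate))
    return pairs


def local_credibility(urls: Iterable[str], key: Tuple[str, str]) -> float:
    """Assumed score (not the paper's formula): fraction of URLs containing
    key[0] whose substituted counterpart also exists on the site."""
    url_set = set(urls)
    containing = [u for u in url_set if key[0] in u]
    if not containing:
        return 0.0
    return len(pair_urls(url_set, key)) / len(containing)


# Hypothetical URLs from a single site.
site_urls = [
    "http://example.org/en/about.html",
    "http://example.org/zh/about.html",
    "http://example.org/en/news.html",
    "http://example.org/zh/news.html",
    "http://example.org/en/contact.html",  # no Chinese counterpart
]
key = ("/en/", "/zh/")
pairs = pair_urls(site_urls, key)
score = local_credibility(site_urls, key)
```

Under this assumed definition, a key would be kept for a site when its score clears a chosen threshold; the candidate pairs it produces would still need the cross-lingual similarity check described in the abstract before being accepted as parallel.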
Authors
章成志
马舒天
揭春雨
姚旭晨
ZHANG Chengzhi; MA Shutian; KIT Chunyu; YAO Xuchen (Department of Information Management, Nanjing University of Science & Technology, Nanjing, Jiangsu 210094, China; Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China; Baidu Online Network Technology (Beijing) Co. Ltd., Beijing 100085, China)
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2018, No. 3, pp. 91-100 (10 pages)
Journal of Chinese Information Processing
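Funding
City University of Hong Kong SRG-Fd project (7008003)
Research Grants Council of Hong Kong GRF projects (CityU 144410, 11600415)
National Natural Science Foundation of China (70903032)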