Detection of Parallel Web Pages Based on the Automatically Discovered Bilingual URL Pairing Patterns (基于双语URL匹配模式可信度的平行网页识别研究)

Cited by: 3

Abstract: Parallel corpora are a fundamental resource for natural language processing, and a large volume of them can be mined from bilingual parallel web pages. This paper first describes how to compute the credibility of automatically discovered bilingual URL pairing patterns (or keys), and then formulates a practical algorithm that recognizes parallel web pages based on the local credibility of these patterns. The algorithm is extended in two ways using global credibility: weak keys that fall below the local credibility threshold, and were therefore filtered out by the initial algorithm, are rescued on the strength of their global credibility; and deep parallel web pages are unearthed by applying strong keys of high global credibility together with a web page detection method. Furthermore, more bilingual web sites around the seed web sites are detected by combining site-level bilingual credibility with link relationships, and search engines are used to recognize bilingual web pages quickly from only a small set of high-credibility URL pairing patterns. To improve the accuracy of the candidate pairs produced by these five methods, the cross-lingual similarity of each candidate pair of parallel web pages is computed, and pairs below a threshold are filtered out. A series of experiments confirms the effectiveness of the proposed approaches.
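The interplay of local and global credibility described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' formulation: the abstract does not give the credibility formulas, so the scores here are assumptions made for illustration (local credibility as the share of a site's URLs that a key pairs up, global credibility as the mean of the local scores over all sites). The function names, the thresholds `local_min` and `global_min`, and the prefix/suffix diffing used to discover keys are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def diff_key(url_a, url_b):
    """Strip the common prefix and suffix of two distinct URLs; the two
    remaining substrings form a candidate pairing key such as ("en", "zh")."""
    n = min(len(url_a), len(url_b))
    i = 0
    while i < n and url_a[i] == url_b[i]:
        i += 1
    j = 0
    while j < n - i and url_a[-1 - j] == url_b[-1 - j]:
        j += 1
    return url_a[i:len(url_a) - j], url_b[i:len(url_b) - j]


def local_credibility(urls):
    """Score each candidate key on one site as 2 * matched pairs / URLs,
    capped at 1.0 (an assumed stand-in for the paper's local credibility)."""
    urls = sorted(set(urls))
    pair_counts = defaultdict(int)
    for a, b in combinations(urls, 2):
        pair_counts[diff_key(a, b)] += 1
    return {key: min(1.0, 2 * count / len(urls))
            for key, count in pair_counts.items()}


def select_keys(sites, local_min=0.5, global_min=0.5):
    """Keep a key on a site if it is locally credible, or rescue it if its
    mean score over all sites (assumed global credibility) is high enough."""
    local = {site: local_credibility(urls) for site, urls in sites.items()}
    totals = defaultdict(float)
    for scores in local.values():
        for key, score in scores.items():
            totals[key] += score
    global_cred = {key: total / len(sites) for key, total in totals.items()}
    return {
        site: {key for key, score in scores.items()
               if score >= local_min or global_cred[key] >= global_min}
        for site, scores in local.items()
    }
```

Under these assumptions, a key such as `("en", "zh")` that pairs only a few pages on one site can still be kept there when it is strong on other sites, which mirrors the rescue step the abstract describes. Comparing all URL pairs is quadratic in the number of URLs, so this sketch only suits small inputs.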

Authors: ZHANG Chengzhi (章成志); MA Shutian (马舒天); KIT Chunyu (揭春雨); YAO Xuchen (姚旭晨) (Department of Information Management, Nanjing University of Science & Technology, Nanjing, Jiangsu 210094, China; Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China; Baidu Online Network Technology (Beijing) Co. Ltd., Beijing 100085, China)

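Affiliations: Department of Information Management, Nanjing University of Science & Technology; Department of Linguistics and Translation, City University of Hong Kong; Baidu Online Network Technology (Beijing) Co. Ltd.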

Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2018, No. 3, pp. 91-100 (10 pages)

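Funding: City University of Hong Kong SRG-Fd project (7008003); Hong Kong Research Grants Council GRF project (CityU 144410 11600415); National Natural Science Foundation of China (70903032)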

Keywords: parallel webpage mining; parallel corpora; bilingual URL pairing pattern; bilingual text mining

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
