The Study of Large-scale Web Term-pairs Extraction based on Regular Expressions (original title: 基于正则表达式的大规模网页术语对抽取研究)
Cited by: 13
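Abstract Collecting multilingual term pairs is valuable for cross-language information retrieval, machine translation, and language learning, but both traditional manual collection and term collection based on parallel corpora have their own limitations. Targeting the large number of terminology web pages available on the Web, this paper proposes a term-pair extraction method that uses regular expressions and builds on Web mining techniques. The method first obtains the source file of a web page, then extracts the correct term pairs from it according to predefined regular expressions, and finally stores them in a local term base. Experimental results show that the method can extract term pairs from 66.7% of terminology web pages and that, for the pages it can handle, the accuracy of the extracted term pairs is close to 100%.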
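Author 程岚岚 (Cheng Lanlan)
Source Journal of Intelligence (《情报杂志》), CSSCI, PKU Core Journal, 2008, No. 11, pp. 62-64, 68 (4 pages)
Funding Tianjin Municipal Higher Education Science and Technology Development Fund project "Research on Automatic Hierarchical Clustering Methods for Non-uniform Data" (No. 20071303)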
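
The three steps named in the abstract (obtain the page source, apply a predefined regular expression, store the pairs in a local term base) can be illustrated with a minimal Python sketch. This record does not give the paper's actual patterns or storage format, so everything here is an assumption made for illustration: the two-cell table-row layout, the `ROW_PATTERN` expression, the SQLite table `term_pairs`, and the helper names `fetch_page_source`, `extract_term_pairs`, and `store_term_pairs`. The sample HTML stands in for a downloaded page so the example runs offline.

```python
import re
import sqlite3
import urllib.request

def fetch_page_source(url):
    """Step 1 (hypothetical helper): download the raw HTML of a terminology page."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

# Stand-in for a fetched page. Assumed layout: one term pair per table row.
SAMPLE_HTML = """
<table>
<tr><td>机器翻译</td><td>machine translation</td></tr>
<tr><td>信息检索</td><td>information retrieval</td></tr>
<tr><td>row without a translation</td></tr>
</table>
"""

# Assumed pattern, not the paper's: two adjacent <td> cells inside one <tr>.
ROW_PATTERN = re.compile(
    r"<tr>\s*<td>\s*([^<]+?)\s*</td>\s*<td>\s*([^<]+?)\s*</td>\s*</tr>",
    re.IGNORECASE,
)

def extract_term_pairs(html):
    """Step 2: return (source_term, target_term) tuples matched by the pattern."""
    return ROW_PATTERN.findall(html)

def store_term_pairs(conn, pairs):
    """Step 3: save the pairs in a local term base, skipping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS term_pairs "
        "(source TEXT, target TEXT, UNIQUE(source, target))"
    )
    conn.executemany("INSERT OR IGNORE INTO term_pairs VALUES (?, ?)", pairs)
    conn.commit()

pairs = extract_term_pairs(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")  # an on-disk file would be used in practice
store_term_pairs(conn, pairs)
stored = conn.execute(
    "SELECT source, target FROM term_pairs ORDER BY rowid"
).fetchall()
```

A row that does not fit the pattern, such as the single-cell row in the sample, is simply skipped. That behaviour is consistent with the abstract's observation that only some pages are extractable while the pairs that are extracted are almost always correct, though the paper's own patterns may differ considerably from this one.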