Intelligent Extraction and Chinese Word Segmentation of Web Corpus (Web汉语料的智能抽取与词汇切分)

Cited by: 4
Abstract: A wrapper for intelligent extraction of Chinese text from Web corpora and for Chinese word segmentation is proposed. Without opening a website or clicking any link, the user only enters a URL (Uniform Resource Locator) to retrieve Chinese text from the Web and segment it into words stored in a Chinese lexicon database. The overall system architecture is presented, and the design principles and technical implementation of each functional module are described. Test results show that the wrapper fetches Web pages quickly and effectively and separates the Chinese text they contain; its recognition rates for ambiguous sentences and for new words reach 70% and 60%, respectively. It can be applied to collecting and separating Chinese vocabulary on the Web.
Source: Computer Engineering and Design (《计算机工程与设计》), indexed in CSCD and the Peking University Core Journals list, 2005, No. 6, pp. 1422-1424 (3 pages)
Funding: Humanities and Social Sciences Research Fund of the Overseas Chinese Affairs Office of the State Council (04CQBYB0011)
Keywords: Web corpus; HTML format; wrapper; Web page fetcher; vocabulary separator
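
The abstract names two stages: separating the Chinese text from a fetched Web page, and segmenting that text into words. The record does not say which algorithms the authors used, so the following is only a minimal sketch under stated assumptions: Chinese text is taken to be runs of characters in the basic CJK Unified Ideographs block, segmentation uses plain forward maximum matching (a common dictionary-based baseline, not necessarily the paper's method), and the names `extract_chinese`, `segment_fmm`, and the toy lexicon are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: "Chinese text" means runs of basic CJK Unified Ideographs.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


class ChineseTextExtractor(HTMLParser):
    """Collects Chinese character runs from text nodes, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.runs = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.runs.extend(CHINESE_RUN.findall(data))


def extract_chinese(html):
    """Return the Chinese character runs found in the visible text of an HTML page."""
    parser = ChineseTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.runs


def segment_fmm(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word.

    Characters not covered by any lexicon word are emitted one at a time.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    page = "<html><body><p>Web语料抽取</p><script>var x = '脚本';</script></body></html>"
    lexicon = {"语料", "抽取"}  # hypothetical toy lexicon
    for run in extract_chinese(page):
        print(segment_fmm(run, lexicon))
```

A purely greedy matcher like this cannot resolve ambiguous sentences or recognise new words on its own, which are the two cases the abstract reports recognition rates for, so the system described presumably adds further logic on top of any such baseline.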
References (10)

  • 1 Joachim Hammer. Semi-structured information from the web [C]. Proceedings of the First Workshop on Management of Semistructured Data, Tucson, Arizona, 1997. 18-25.
  • 2 Arnaud Sahuguet, Fabien Azavant. Building light-weight wrappers for legacy web data-sources using W4F [C]. International Conference on Very Large Database, Edinburgh, Scotland, 1997. 738-741.
  • 3 Hammer J, McHugh J. Semi-structured data: The TSIMMIS experience [C]. In: Proceedings of the First East-European Symposium on Advance in Database and Information System, 1997. 1-8.
  • 4 Sahuguet A, Azavant F. Building intelligent web applications using lightweight wrappers [J]. Data and Knowledge Engineering, 2001, 36(3): 283-316.
  • 5 Valter Crescenzi, Giansalvatore Mecca. RoadRunner: Towards automatic data extraction from large Web sites [C]. In: Proceedings of the 26th International Conference on Data Engineering, 2000. 611-620.
  • 6 Alberto H F Laender, Berthier A Ribeiro-Neto. A brief survey of web data extraction tools [J]. SIGMOD Record, 2002, 31(2): 84-93.
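  • 7 Guo Qinglin (郭庆琳), Fan Xiaozhong (樊孝忠). Research on NLU-based intelligent search and information extraction technology [J]. Application Research of Computers (计算机应用研究), 2004, 21(2): 14-16. Cited by: 2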
  • 8 Daisuke Ikeda, Yasuhiro Yamada. Expressive power of tree and string based wrapper [C]. In: On-Line Proceedings of IJCAI'03 Workshop on Information Integration on the Web, 2003.
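  • 9 Liu Yuan (刘源), Liang Nanyuan (梁南元). A basic project for Chinese language processing: modern Chinese word frequency statistics [J]. Journal of Chinese Information Processing (中文信息学报), 1986, (1): 17-25.
  • 10 Huang Xuanjing (黄萱菁), Wu Lide (吴立德), Wang Wenxin (王文欣), Ye Danjin (叶丹瑾). A machine-learning-based word segmentation system that needs no manually compiled dictionary [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 1996, 9(4): 297-303. Cited by: 24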

Second-level references: 6

Co-citing documents: 29

Co-cited documents: 32

Citing documents: 4

Second-level citing documents: 28