Abstract
This paper proposes a wrapper for intelligent extraction of Chinese corpus material from the Web and for Chinese word segmentation. Without opening a website or clicking any links, a user only needs to enter a URL (Uniform Resource Locator) to retrieve the Chinese text on a Web page and have it segmented into words that are stored in a Chinese lexicon database. The overall system architecture is presented, and the design principles and technical implementation of each functional module are described. Test results show that the wrapper fetches Web pages quickly and effectively and separates out the Chinese text they contain; its recognition rates for ambiguous sentences and new words reach 70% and 60%, respectively. It can be applied to collecting and separating Chinese vocabulary on the Web.
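The pipeline the abstract describes (URL in, Chinese text isolated, words segmented) can be illustrated with a minimal sketch. The abstract does not state which extraction or segmentation method the wrapper uses, so everything below is an assumption: the function names `fetch_html`, `extract_chinese_runs`, and `forward_max_match` are hypothetical, the default UTF-8 decoding is a guess, and forward maximum matching against a toy lexicon stands in as a common dictionary-based baseline rather than the authors' algorithm. This baseline does not resolve ambiguity or detect new words, which are the harder cases the paper reports on.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url, encoding="utf-8"):
    """Download a page from a URL (the encoding default is an assumption)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode(encoding, errors="replace")


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


# Runs of characters in the CJK Unified Ideographs block.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_chinese_runs(html):
    """Return the runs of Chinese characters found in the page text."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    runs = []
    for chunk in collector.chunks:
        runs.extend(CJK_RUN.findall(chunk))
    return runs


def forward_max_match(text, lexicon, max_len=4):
    """Greedy segmentation: take the longest lexicon word at each position,
    falling back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words
```

In use, the output of `fetch_html(url)` would be passed to `extract_chinese_runs`, and each run would then be segmented with `forward_max_match` against whatever lexicon is available before being stored.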
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
Peking University Core Journal (北大核心)
2005, Issue 6, pp. 1422-1424 (3 pages)
Funding
Humanities and Social Sciences Research Fund of the Overseas Chinese Affairs Office of the State Council (04CQBYB0011)