Abstract
This paper presents a system, built on search engine technology, for collecting Chinese ambiguous words. The system crawls web pages from the Internet, strips HTML tags and scripts to obtain the plain text of each page, and then applies a bidirectional scanning method to locate and save the positions where word segmentation is ambiguous. Further analysis yields the sentences that contain the ambiguous words, together with the relative position of each ambiguous word in its sentence. After manual judgment, the results can be used as test material. They are intended for researchers working on segmentation disambiguation algorithms and can effectively address two problems in that research: test corpora are hard to obtain, and the results of different disambiguation algorithms are hard to compare.
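The pipeline in the abstract can be illustrated with a minimal sketch. The abstract does not say how the bidirectional scan works, so this sketch assumes it compares forward and backward maximum matching against a word list and treats any disagreement as an ambiguity. All names (`strip_html`, `forward_mm`, `backward_mm`, `ambiguous_span`, `collect`), the toy lexicon, and the sentence delimiters are hypothetical and not taken from the paper; crawling is omitted.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text that lies outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Return the plain text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def forward_mm(text, lexicon, max_len):
    """Forward maximum matching: scan left to right, longest word first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            # A single character is always accepted as a fallback.
            if size == 1 or text[i:i + size] in lexicon:
                words.append(text[i:i + size])
                i += size
                break
    return words


def backward_mm(text, lexicon, max_len):
    """Backward maximum matching: scan right to left, longest word first."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            if size == 1 or text[j - size:j] in lexicon:
                words.append(text[j - size:j])
                j -= size
                break
    words.reverse()
    return words


def ambiguous_span(fwd, bwd):
    """Return (start, end) character offsets where the two segmentations
    disagree, or None if they are identical."""
    if fwd == bwd:
        return None
    shortest = min(len(fwd), len(bwd))
    p = 0  # number of leading words shared by both segmentations
    while p < shortest and fwd[p] == bwd[p]:
        p += 1
    s = 0  # number of trailing words shared by both segmentations
    while s < shortest - p and fwd[-1 - s] == bwd[-1 - s]:
        s += 1
    start = sum(len(w) for w in fwd[:p])
    end = sum(len(w) for w in fwd[:len(fwd) - s])
    return start, end


def collect(html, lexicon, max_len=4):
    """Return (sentence, start, end) for each sentence with an ambiguity."""
    results = []
    for sentence in re.split(r"[。!?!?;;\n]+", strip_html(html)):
        sentence = sentence.strip()
        if not sentence:
            continue
        span = ambiguous_span(forward_mm(sentence, lexicon, max_len),
                              backward_mm(sentence, lexicon, max_len))
        if span:
            results.append((sentence, span[0], span[1]))
    return results


# Toy lexicon and page, chosen only for illustration.
LEXICON = {"研究", "研究生", "生命", "的", "起源"}
PAGE = ("<html><head><script>var a = 1;</script></head>"
        "<body><p>研究生命的起源。</p></body></html>")
```

With this toy lexicon, the forward scan segments the sentence as 研究生 / 命 / 的 / 起源 and the backward scan as 研究 / 生命 / 的 / 起源, so `collect(PAGE, LEXICON)` reports the span covering 研究生命. Deciding which segmentation is correct is left to the manual judgment step the abstract mentions.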
Source
Journal of Modern Information (《现代情报》)
CSSCI
2010, No. 6, pp. 125-127 (3 pages)
Keywords
search engine
Chinese ambiguous words
corpus collection