Abstract
[Objective] To collect and organize new words on a large scale in order to expand existing dictionaries, improve the accuracy of Chinese word segmentation, and advance Chinese information processing. [Methods] Based on the characteristics of query strings in search logs and of new words, a new word recognition method that extends the context of search logs is proposed. First, a seed word set is obtained by analyzing the characteristics of query strings, and the seeds are used for full-text extension over the search log to extract candidate new words. Second, new word strings are identified according to the temporal attributes of new words. Finally, an improved left-right entropy method based on word boundary information is used to extract the new words present in the corpus. [Results] Experiments on the Sogou log show that the average precision of P@100 reaches 89.60%. [Limitations] The size of the comparison string set affects the precision of new word recognition to some extent. [Conclusions] The experiments show that the method is suitable for new word recognition in texts that lack context information, such as search logs.
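The final filtering step relies on left-right entropy, which measures how varied the characters adjacent to a candidate string are: a string whose left and right neighbors are diverse is more likely to be a complete word. The abstract does not say what the improvement consists of, so the following is only a minimal sketch of the standard (unimproved) measure, not the authors' method. The function names, the `<BOS>`/`<EOS>` boundary symbols, and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter


def branching_entropy(neighbors):
    """Shannon entropy (in nats) of a neighbor-symbol frequency table."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log(total / c) for c in neighbors.values())


def left_right_entropy(candidate, queries):
    """Return (left_entropy, right_entropy) of `candidate` over `queries`."""
    left, right = Counter(), Counter()
    for q in queries:
        start = q.find(candidate)
        while start != -1:
            end = start + len(candidate)
            # Assumption: a query boundary counts as one special neighbor symbol.
            left[q[start - 1] if start > 0 else "<BOS>"] += 1
            right[q[end] if end < len(q) else "<EOS>"] += 1
            start = q.find(candidate, start + 1)
    return branching_entropy(left), branching_entropy(right)


def is_word_like(candidate, queries, threshold=1.0):
    """Keep a candidate only if both boundaries are sufficiently varied.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    left_h, right_h = left_right_entropy(candidate, queries)
    return min(left_h, right_h) >= threshold
```

On a toy log such as `["abXYcd", "eeXYff", "XYgg"]`, the string `"XY"` has three distinct neighbors on each side, while `"X"` is always followed by `"Y"` and therefore has zero right entropy, so it would be rejected as an incomplete fragment.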
Source
New Technology of Library and Information Service (《现代图书情报技术》)
CSSCI
PKU Core Journal (北大核心)
2014, No. 11, pp. 59-65 (7 pages)
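Funding
National Natural Science Foundation of China project "Research on Ontology-Based Automatic Patent Indexing" (Grant No. 61271304)
Key Project of the Science and Technology Development Program of the Beijing Municipal Education Commission and Class B Key Project of the Beijing Natural Science Foundation, "Research on Domain-Oriented Precise Search Methods for Multimodal Internet Information" (Grant No. KZ201311232037)
Innovation Team Building and Teacher Professional Development Program for Beijing Municipal Universities (Grant No. IDHT20130519). This work is one of the research outcomes of these projects.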
Keywords
Search log
Full-text extension
New words
Boundary
Improved left-right entropy