Abstract
[Purpose/significance] This study identifies the health terms that the public actually uses from Web query data, providing a basis for mapping consumer health terms to professional medical terms and, in turn, for improving knowledge organization and management on health knowledge service platforms. [Method/process] A new-term identification model combining rules with N-Gram statistics is designed. Public query data are collected and used in repeated experiments in which the filtering corpus is refined step by step and, together with manual review, the scheme is optimized and its effectiveness verified. [Result/conclusion] Rules are extracted from questions posed by the public on the Web and combined with statistical methods to extract new health terms used by the public. The approach is reasonably general for identifying consumer health terms and can supply the data needed to map consumer terms to medical terms. The experiments show that rule-based preprocessing of public query logs yields well-prepared text for the subsequent steps, and that combining N-Gram with filtering rules identifies new terms in short texts fairly well.
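The abstract gives no implementation details, so the following is only a minimal sketch of how rule-based preprocessing and N-Gram counting might be combined. The question patterns, the function names (`preprocess`, `char_ngrams`, `candidate_terms`), the frequency threshold, and the known-term filter are all assumptions made for illustration, not the authors' actual rules or model.

```python
import re
from collections import Counter

# Hypothetical question phrases; the paper's actual rules are not given in the abstract.
QUESTION_PATTERNS = ["怎么办", "吃什么药", "吃什么"]


def preprocess(query, patterns=QUESTION_PATTERNS):
    """Rule step (assumed): cut question phrases out of a query, keep the remaining segments."""
    # Longer patterns first so "吃什么药" is matched before its prefix "吃什么".
    ordered = sorted(patterns, key=len, reverse=True)
    regex = "|".join(re.escape(p) for p in ordered)
    return [seg for seg in re.split(regex, query) if seg]


def char_ngrams(text, n_min=2, n_max=4):
    """Yield every character n-gram of length n_min..n_max in text."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def candidate_terms(queries, known_terms=frozenset(), min_freq=2):
    """Statistical step (assumed): count n-grams, then filter by frequency and a known-term list."""
    counts = Counter()
    for query in queries:
        for segment in preprocess(query):
            counts.update(char_ngrams(segment))
    return {
        gram: freq
        for gram, freq in counts.items()
        if freq >= min_freq and gram not in known_terms
    }


queries = ["拉肚子怎么办", "小孩拉肚子吃什么", "拉肚子吃什么药"]
candidates = candidate_terms(queries, known_terms={"肚子"})
```

In this toy run the surviving candidates still include fragments such as "拉肚" alongside the colloquial term "拉肚子", which illustrates why the abstract pairs the statistics with filtering rules and manual review.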
Source
《图书情报工作》
CSSCI
Peking University Core Journal (北大核心)
2015, Issue 23, pp. 115-123 (9 pages)
Library and Information Service
Funding
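National Social Science Fund of China project "Research on the Construction of a Public Health Knowledge Organization System Oriented to Knowledge Services" (Grant No. 14BTQ032)
A research output of the National Key Technology R&D Program of the 12th Five-Year Plan project "Research and Application of Public Health Knowledge Integration and Service Technology" (Grant No. 2013BAI06B01)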