Abstract
[Objective] To extract ontology concepts from Chinese user-generated content (UGC) information sources. [Methods] Given the characteristics of UGC sources, this paper proposes a hybrid ontology concept extraction method that uses linguistic rules to extract fine-grained words and combine them into candidate concepts, then applies statistical filtering to those candidates. A concept extraction model for UGC sources is established, and a prototype system is built to validate the method. [Results] In concept extraction experiments on UGC sources, the method outperforms the four other concept extraction methods used for comparison, reaching an accuracy rate of 68.42% and a recall rate of 85.35%. [Limitations] The test set comes from UGC sources of relatively high information quality, part of it was filtered manually, and the corpus is limited in size. [Conclusions] The proposed concept extraction method and techniques are of some value for ontology concept extraction from UGC sources.
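The abstract does not give the scoring formulas or thresholds used in the statistical filtering step; the keywords only name mutual information and information entropy. The following is a minimal sketch of one common way such a filter can be built, assuming pointwise mutual information over adjacent word pairs and the entropy of each pair's left and right neighbors. The function names, the thresholds, and the toy corpus are hypothetical and are not taken from the paper, and the linguistic (part-of-speech rule) step is omitted.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of a frequency distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def score_candidates(sentences):
    """Score each adjacent word pair by PMI and left/right neighbor entropy."""
    word_freq = Counter()
    pair_freq = Counter()
    left_ctx = defaultdict(Counter)   # words seen just before each pair
    right_ctx = defaultdict(Counter)  # words seen just after each pair

    for words in sentences:
        word_freq.update(words)
        for i in range(len(words) - 1):
            pair = (words[i], words[i + 1])
            pair_freq[pair] += 1
            if i > 0:
                left_ctx[pair][words[i - 1]] += 1
            if i + 2 < len(words):
                right_ctx[pair][words[i + 2]] += 1

    n_words = sum(word_freq.values())
    n_pairs = sum(pair_freq.values())
    scores = {}
    for (a, b), count in pair_freq.items():
        p_ab = count / n_pairs
        p_a = word_freq[a] / n_words
        p_b = word_freq[b] / n_words
        pmi = math.log2(p_ab / (p_a * p_b))
        scores[(a, b)] = (pmi, entropy(left_ctx[(a, b)]), entropy(right_ctx[(a, b)]))
    return scores


def filter_concepts(scores, min_pmi=1.0, min_entropy=0.5):
    """Keep pairs that are cohesive (high PMI) and appear in varied contexts.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    return sorted(
        pair
        for pair, (pmi, h_left, h_right) in scores.items()
        if pmi >= min_pmi and min(h_left, h_right) >= min_entropy
    )


# Toy, already-segmented corpus (hypothetical data).
corpus = [
    ["we", "use", "mutual", "information", "here"],
    ["they", "like", "mutual", "information", "too"],
    ["i", "use", "it"],
]
concepts = filter_concepts(score_candidates(corpus))
print(concepts)
```

On this toy corpus only the pair ("mutual", "information") survives: it is the only pair that occurs with more than one distinct neighbor on both sides. A real UGC corpus would need word segmentation and part-of-speech tagging first, and thresholds tuned on held-out data.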
Source
《现代图书情报技术》
CSSCI
Peking University Core Journal
2014, No. 5, pp. 41-49 (9 pages)
New Technology of Library and Information Service
Funding
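One of the research outcomes of the National Natural Science Foundation of China project "Research on Integrated Retrieval and Semantic Analysis Methods for Social Media" (Grant No. 71273194)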
Keywords
Concept extraction
Part-of-speech rules
Seed word (head word)
Mutual information
Information entropy