Abstract
Mining named entities from large-scale query logs is an important research topic in data mining. Previous work proposed a seed-based extraction framework that mines named entities from query logs by leveraging the distributional similarity between entities. However, that approach works well only when each seed entity belongs to a single semantic class, whereas in practice named entities often belong to multiple classes. In this paper, we introduce a weakly supervised topic model that resolves the class ambiguity of named entities using a small amount of human supervision, which improves the effectiveness of mining. Experimental results show that our approach significantly outperforms the previous method.
Source
《中文信息学报》
CSCD
Peking University Core Journal
2010, No. 1, pp. 71-76, 116 (7 pages in total)
Journal of Chinese Information Processing