Abstract
The goal of Chinese short-text aggregation is to match and link short texts in two data sets that refer to the same entity, while avoiding matches between texts that refer to different entities. This problem is of theoretical and practical importance for integrating multi-source, heterogeneous short-text data resources. This article proposes an effective aggregation model for Chinese short texts. Through two key steps, fast matching and refined matching, the model sharply reduces the number of candidate pairs to be compared while preserving matching precision. To address the shortcomings of traditional short-text similarity algorithms, a new generalized Jaro-Winkler similarity algorithm is proposed, and the properties of its parameters are analyzed theoretically. Aggregation experiments on merchant information from different data sets show that, compared with the traditional algorithms, the new algorithm achieves the best matching accuracy and stability.
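As a rough illustration of the two-step idea described in the abstract, the sketch below pairs a cheap candidate filter with the classic Jaro-Winkler similarity. It is not the generalized Jaro-Winkler algorithm proposed in the article, whose definition is not given in this record. The bigram-based filter, the function names, the 0.85 threshold, and the sample merchant names are all assumptions made for illustration.

```python
from collections import defaultdict


def jaro(s1: str, s2: str) -> float:
    """Classic Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    a = [s1[i] for i in range(len1) if matched1[i]]
    b = [s2[j] for j in range(len2) if matched2[j]]
    transpositions = sum(x != y for x, y in zip(a, b)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Classic Jaro-Winkler: boosts the Jaro score for a shared prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1, s2):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)


def bigrams(text: str) -> set:
    """Character bigrams; very short strings fall back to the string itself."""
    return {text[i:i + 2] for i in range(len(text) - 1)} or {text}


def aggregate(left: list, right: list, threshold: float = 0.85) -> list:
    """Hypothetical two-step matcher: bigram filter, then Jaro-Winkler."""
    # Step 1 (fast matching, assumed here): index the right-hand names by bigram
    # so that only names sharing at least one bigram become candidates.
    index = defaultdict(set)
    for j, name in enumerate(right):
        for gram in bigrams(name):
            index[gram].add(j)
    pairs = []
    for name in left:
        candidates = set()
        for gram in bigrams(name):
            candidates |= index.get(gram, set())
        # Step 2 (refined matching): keep the best candidate above the threshold.
        best = max(candidates, key=lambda j: jaro_winkler(name, right[j]),
                   default=None)
        if best is not None and jaro_winkler(name, right[best]) >= threshold:
            pairs.append((name, right[best]))
    return pairs


# Made-up merchant names, used only to exercise the sketch.
left_names = ["星巴克咖啡人民广场店", "肯德基南京路店"]
right_names = ["星巴克咖啡(人民广场店)", "麦当劳南京路店"]
matched_pairs = aggregate(left_names, right_names)
```

In this toy run, the filter leaves each left-hand name with a single candidate instead of comparing every pair. The refined step then rejects the two different merchants that merely share a street name, which is the kind of false match the abstract says aggregation must avoid.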
Authors
LIU Zhen (刘震)
CHEN Jing (陈晶)
ZHENG Jian-Bin (郑建宾)
HUA Jin-Zhi (华锦芝)
XIAO Lin-Feng (肖淋峰)
Affiliations
Web Sciences Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Institute of Electronic Payment, China Unionpay Limited Liability Company, Shanghai 201201, China
Big Data Research Center, University of Electronic Science and Technology of China, Chengdu 611731, China
Source
Journal of Software (《软件学报》)
EI
CSCD
Peking University Core Journals (北大核心)
2017, No. 10, pp. 2674-2692 (19 pages)
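Funding
National Natural Science Foundation of China (61300018)
China UnionPay - University of Electronic Science and Technology of China Financial Big Data Research Project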