期刊文献+

中文短文本聚合模型研究 被引量:11

Research on Aggregation Model for Chinese Short Texts
下载PDF
导出
摘要 中文短文本聚合的目的是将两个数据集中属于同一对象的短文本信息进行匹配关联,同时要避免匹配不属于同一对象的短文本信息,这项研究对于多源异构的短文本数据资源整合具有重要的理论和现实意义.提出了一种有效的中文短文本聚合模型,通过快速匹配和精细匹配两个关键步骤可以大幅度降低匹配的候选对数量,并保证匹配的精度.针对传统短文本相似度算法的不足,提出了一种新颖的广义Jaro-Winkler相似度算法,并从理论上分析了该算法的参数特性.通过对不同数据集上的商户信息数据进行聚合实验,结果表明,新算法与传统算法相比,在匹配准确率和稳定性上具有最优的性能. Aggregation task for Chinese short texts is to associate a pair of similar short texts together. The pair needs to belong to same entity in two data sets. Such study has important theoretical and practical interests for data resource integration across different fields. In this article, an effective aggregation model is devised for Chinese short text. The model is able to decrease the volume of candidate pairs sharply for matching and ensure the matching accuracy via two key steps, namely fast matching and refined matching. Meanwhile, aiming to the deficiency of the traditional similarity algorithms for short text, an improved similarity algorithm, called generalized Jaro-Winkler is proposed. The aggregation experiments performed on different merchant data sets suggest that the new algorithm has the best performance both in matching accuracy and stability compared with those traditional algorithms.
作者 刘震 陈晶 郑建宾 华锦芝 肖淋峰 LIU Zhen CHEN Jing ZHENG Jian-Bin HUA Jin-Zhi XIAO Lin-Feng(Web Sciences Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China Institute of Electronic Payment, China Unionpay Limited Liability Company, Shanghai 201201, China Big Data Research Center, University of Electronic Science and Technology of China, Chengdu 611731, China)
出处 《软件学报》 EI CSCD 北大核心 2017年第10期2674-2692,共19页 Journal of Software
基金 国家自然科学基金(61300018) 中国银联-电子科技大学-金融大数据研究项目~~
关键词 中文短文本 聚合模型 文本相似度 广义Jaro—Winkler算法 快速匹配 精细匹配 Chinese short text aggregation model similarity of text generalized Jaro-Winkler fast matching refined matching
  • 相关文献

参考文献4

二级参考文献70

  • 1赵作鹏,尹志民,王潜平,许新征,江海峰.一种改进的编辑距离算法及其在数据处理中的应用[J].计算机应用,2009,29(2):424-426. 被引量:51
  • 2车万翔,刘挺,秦兵,李生.基于改进编辑距离的中文相似句子检索[J].高技术通讯,2004,14(7):15-19. 被引量:63
  • 3[OL].<http://hadoop.apache.org.>.
  • 4WinterCorp: 2005 TopTen Program Summary. http:// www. wintercorp, com/WhitePapers/WC TopTenWP. pdf.
  • 5TDWI Checklist Report: Big Data Analytics. http://tdwi. org/research/2010/08/Big-Data-Analytics, aspx.
  • 6Chaudhuri S, Dayal U. An overview of data warehousing and OLAP technology. SIGMOD Rec, 1997,26(1): 65-74.
  • 7Madden S, DeWitt D J, Stonebraker M. Database parallelism choices greatly impact scalability. DatabaseColumn Blog. http://www, databasecolumn, com/2007/10/database-parallelism-choices, html.
  • 8Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters//Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI ' 04). San Francisco, California, USA, 2004: 137-150.
  • 9DeWitt D J, Gerber R H, Graefe G, Heytens M L, Kumar K B, Muralikrishna M. GAMMA--A high performance dataflow database machine//Proceedings of the 12th International Conference on Very Large Data Bases (VLDB' 86). Kyoto, Japan, 1986:228-237.
  • 10Fushimi S, Kitsuregawa M, Tanaka H. An overview of the system software of a parallel relational database machine// Proceedings of the 12th International Conference on Very Large DataBases(VLDB'86). Kyoto, Japan, 1986:209-219.

共引文献776

同被引文献65

引证文献11

二级引证文献17

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部