
Hypergraph Partitioning for Entity Identification in Big Data

Cited by: 4
Abstract: Current information systems hold massive amounts of complex, heterogeneous data, which greatly reduces data availability. To "clean" such data effectively and improve the consistency of data entities, this paper designs and implements a hypergraph-based entity identification algorithm on the Hadoop cloud computing platform. The algorithm has three stages: data preprocessing, hypergraph model construction, and entity identification. In the preprocessing stage, it builds attribute-value inverted index tables and mines frequent itemsets to give the data an initial pass. In the construction stage, it refines the definition of hyperedge weight, builds a hypergraph model with weighted hyperedges, and transforms all data into hypergraph form. In the identification stage, it applies an improved hypergraph partitioning algorithm on the cloud platform to identify records that refer to the same entity. Experiments on real data sets on a Hadoop platform indicate that the algorithm identifies entities accurately and efficiently.
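Authors: HU Zhi-gang (胡志刚); LIU Jia (刘佳) (College of Software Engineering, Central South University, Changsha 410073, China)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list (北大核心), 2018, No. 7, pp. 1542-1547 (6 pages)
Funding: Supported by the General Program of the National Natural Science Foundation of China (61572525) and the Central South University Graduate Independent Exploration and Innovation Project (2017zzts618)
Keywords: entity identification; big data; cloud computing; MapReduce; hypergraph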
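
The three stages named in the abstract can be illustrated with a small single-machine sketch. This is not the authors' implementation: the abstract does not give the improved hyperedge weight or the improved partitioning algorithm, so the weight used here (1 divided by the number of records sharing a value) and the grouping step (threshold on summed shared-hyperedge weight, then union-find) are stand-in assumptions, and the function names, sample records, and threshold are hypothetical. Frequent itemset mining and the Hadoop/MapReduce distribution are omitted.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(records):
    """Stage 1 (sketch): map each (attribute, value) pair to the record ids holding it."""
    index = defaultdict(set)
    for rid, record in records.items():
        for attr, value in record.items():
            index[(attr, value)].add(rid)
    return dict(index)


def build_hyperedges(index, min_support=2):
    """Stage 2 (sketch): every value shared by >= min_support records becomes a hyperedge.

    ASSUMPTION: weight = 1 / (number of records on the edge), so rarer shared
    values count more. The paper's own weight definition is not in the abstract.
    """
    edges = {}
    for key, rids in index.items():
        if len(rids) >= min_support:
            edges[key] = (frozenset(rids), 1.0 / len(rids))
    return edges


def group_records(record_ids, edges, threshold=0.5):
    """Stage 3 (stand-in): merge records whose shared-hyperedge weight reaches threshold.

    ASSUMPTION: this simple threshold-plus-union-find step replaces the paper's
    improved hypergraph partitioning algorithm, which the abstract does not describe.
    """
    score = defaultdict(float)
    for members, weight in edges.values():
        for a, b in combinations(sorted(members), 2):
            score[(a, b)] += weight

    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), s in score.items():
        if s >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for rid in record_ids:
        groups[find(rid)].add(rid)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sample data: records 1 and 2 describe the same person.
records = {
    1: {"name": "J. Smith", "city": "Boston", "email": "js@example.com"},
    2: {"name": "John Smith", "city": "Boston", "email": "js@example.com"},
    3: {"name": "Ann Lee", "city": "Boston", "email": "al@example.com"},
}
index = build_inverted_index(records)
edges = build_hyperedges(index)
groups = group_records(records.keys(), edges)
print(groups)
```

In this toy data the shared city alone is too weak a signal to merge any pair, while the shared email plus city pushes records 1 and 2 over the threshold. In a MapReduce setting the inverted index would naturally come from a map step emitting (attribute, value) keys and a reduce step collecting record ids, but that mapping is an inference from the abstract, not a description of the paper's jobs.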
