WR Approach: Determining Accurate Attribute Values in Big Data Integration

Cited by: 1

Abstract: Big data integration is the foundation for providing high-quality data for decision making. A key step in integration is to determine the accurate attribute values of an entity from the different tuples that describe it in the database. The state-of-the-art R-topK approach applies manually designed rules to the data to decide the relative accuracy among attribute values, and thereby obtains relatively accurate values. However, when multiple candidate accurate values exist or the designed rules conflict, it requires considerable human interaction. This paper therefore proposes a weighted-rule (WR) approach for determining accurate attribute values in big data integration. The approach augments each rule for judging relative accuracy with a weight, which avoids the human intervention that R-topK needs when accurate values conflict. A constraint inference algorithm is designed on top of the chase procedure and is proved to derive the weighted relative accuracy between every pair of attribute values in O(n／＋2), yielding the constraints from which accurate attribute values are derived. To handle possible conflicts among these constraints, a target-solving algorithm is proposed that searches all combinations of attribute values for the most likely accurate values in O(n) time. Extensive experiments on real-world and synthetic datasets verify the effectiveness and efficiency of the WR approach: compared with R-topK, it improves performance by a factor of 3 to 15 and effectiveness by 7% to 80%.
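The abstract does not give the algorithms themselves, so the following is only a toy sketch of the general idea of letting weights settle conflicting accuracy judgments without human input. It is not the paper's inference or target-solving algorithm. The function name `pick_most_accurate`, the triple format `(less_accurate, more_accurate, weight)`, and the sample values and weights are all assumptions made for illustration.

```python
from collections import defaultdict


def pick_most_accurate(candidates, weighted_prefs):
    """Return the candidate value with the highest net preference weight.

    weighted_prefs is an iterable of (less_accurate, more_accurate, weight)
    triples. This hypothetical format stands in for weighted constraints on
    the relative accuracy of two attribute values.
    """
    score = defaultdict(float)
    for worse, better, weight in weighted_prefs:
        score[better] += weight  # evidence that `better` is accurate
        score[worse] -= weight   # evidence against `worse`
    # max() returns the first maximum, so ties follow the order of `candidates`.
    return max(candidates, key=lambda value: score[value])


# Made-up example: two constraints conflict on "Beijing" vs "Peking".
candidates = ["Beijing", "Peking", "Shanghai"]
prefs = [
    ("Peking", "Beijing", 0.9),
    ("Beijing", "Peking", 0.2),
    ("Shanghai", "Beijing", 0.6),
]
best = pick_most_accurate(candidates, prefs)
print(best)
```

In this sketch the conflict between the first two constraints is resolved by their weights alone: the heavier constraint dominates, so no manual decision is needed. A single pass over the constraints suffices here, but that says nothing about the complexity bounds claimed in the abstract.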

Authors: Zhou Ningnan (周宁南), Sheng Wanxing (盛万兴), Liu Keyan (刘科研), Zhang Xiao (张孝), Wang Shan (王珊)

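Affiliations: China Electric Power Research Institute; School of Information, Renmin University of China; Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), Ministry of Education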

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core list; 2016, Issue 2, pp. 449-458 (10 pages)

Funding: National Basic Research Program of China (973 Program) (2014CB340403); Research Project of State Grid Corporation of China (EPRIPDKJ[2014]3763)

Keywords: big data integration; data quality; data accuracy; data cleaning; weighted rules

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]


