
大数据环境下的电子商务商品实体同一性识别 (cited by: 11)

Recognizing the Same Commodity Entities in Big Data
Abstract: Identifying the same commodity entity across multi-source, heterogeneous, autonomous, diverse and inconsistent e-commerce data is a major current challenge. After analyzing the characteristics of the commodity data crawled from different e-commerce platforms, we first build an index model based on commodity attribute-value pairs, construct a global schema graph of attributes and values, and integrate the schemas to obtain commodity information with a unified schema and high quality. We then measure commodity identity at multiple levels with a hierarchical probabilistic model: identifying the possible candidate commodity set, filtering that set by similarity, and filtering again by similarity on the special items of the candidate commodities. Finally, we complete entity recognition and output, in normalized and ranked form, the set of commodities that satisfy identity together with their associated attributes, as the same commodity set in the inverted index list. We evaluate the method on the Hadoop platform using commodities collected from three mainstream Chinese B2C e-commerce data sources and compare it with traditional methods and products; the experimental results show the feasibility, accuracy and efficiency of the framework.
Source: 《计算机研究与发展》 (Journal of Computer Research and Development), indexed in EI, CSCD and the Peking University Core Journals list; 2015, No. 8, pp. 1794-1805 (12 pages)
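Funding: National Natural Science Foundation of China (61272109); Fundamental Research Funds for the Central Universities (2042014KF0057); Natural Science Foundation of Hubei Province (2014CFB289); Youth Innovation Fund of the Air Force Early Warning Academy (2013ZDJC0101)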
Keywords: Web big data; e-commerce; hierarchical probabilistic model; commodity; Hadoop
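
The pipeline outlined in the abstract (index by attribute-value pair, retrieve a candidate set, filter by similarity, rank the output) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not give the hierarchical probabilistic model, so plain Jaccard similarity over attribute-value pairs stands in for it here, and the function names, the sample records and the 0.5 threshold are all hypothetical.

```python
from collections import defaultdict


def build_index(items):
    """Map each (attribute, value) pair to the ids of the items that carry it."""
    index = defaultdict(set)
    for item_id, attrs in items.items():
        for pair in attrs.items():
            index[pair].add(item_id)
    return index


def candidates(item_id, items, index):
    """Items sharing at least one (attribute, value) pair with item_id."""
    found = set()
    for pair in items[item_id].items():
        found |= index[pair]
    found.discard(item_id)
    return found


def jaccard(a, b):
    """Jaccard similarity of two attribute dicts, viewed as sets of pairs.

    Assumption: a stand-in for the paper's hierarchical probabilistic model.
    """
    pairs_a, pairs_b = set(a.items()), set(b.items())
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


def same_entities(item_id, items, index, threshold=0.5):
    """Candidates whose similarity reaches the threshold, best match first."""
    scored = [
        (jaccard(items[item_id], items[other]), other)
        for other in candidates(item_id, items, index)
    ]
    return [other for score, other in sorted(scored, reverse=True) if score >= threshold]


# Hypothetical records from three sources, assumed already mapped to one schema.
ITEMS = {
    "a1": {"brand": "acme", "model": "x100", "color": "black"},
    "b7": {"brand": "acme", "model": "x100", "color": "black", "storage": "64gb"},
    "c3": {"brand": "acme", "model": "z9", "color": "white"},
}
INDEX = build_index(ITEMS)
print(same_entities("a1", ITEMS, INDEX))
```

The inverted index keeps the expensive pairwise comparison confined to items that share at least one attribute-value pair, which is the usual motivation for a candidate-generation step before similarity filtering.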


