Journal Articles
2 articles found
1. Duplicate identification model for deep web (Cited by: 4)
Authors: 刘丽楠, 寇月, 孙高尚, 申德荣, 于戈. Journal of Southeast University (English Edition), EI CAS, 2008, No. 3, pp. 315-317 (3 pages).
A duplicate identification model is presented to deal with semi-structured or unstructured data extracted from multiple data sources in the deep web. First, the extracted data is transformed into entity records in the data preprocessing module; then, the heterogeneous records processing module calculates the similarity degree of the entity records to obtain the duplicate records, based on the weights calculated in the homogeneous records processing module. Unlike traditional methods, the proposed approach is implemented without schema matching in advance, and multiple estimators with selection algorithms are adopted to achieve better matching efficiency. The experimental results show that the duplicate identification model is feasible and efficient.
Keywords: duplicate records; deep web; data cleaning; semi-structured data
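The abstract sketches a pipeline: entity records are built in a preprocessing module, field weights come from the homogeneous records processing module, and weighted similarity over the fields two records share flags duplicates without prior schema matching. Below is a minimal sketch of that weighted-similarity step; the field names, weights, threshold, and the use of difflib as the similarity estimator are illustrative assumptions, not the paper's implementation.

```python
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; stands in for the paper's estimators."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(rec1: dict, rec2: dict, weights: dict) -> float:
    """Weighted similarity over the fields two entity records share,
    so no global schema matching is required up front."""
    shared = set(rec1) & set(rec2) & set(weights)
    if not shared:
        return 0.0
    total_w = sum(weights[f] for f in shared)
    score = sum(weights[f] * field_similarity(rec1[f], rec2[f]) for f in shared)
    return score / total_w

# Illustrative records and field weights (the paper derives weights in its
# homogeneous records processing module; these values are made up).
weights = {"title": 0.5, "author": 0.3, "year": 0.2}
r1 = {"title": "Duplicate identification model", "author": "Liu", "year": "2008"}
r2 = {"title": "Duplicate identification model for deep web",
      "author": "Liu L.", "year": "2008"}

THRESHOLD = 0.75  # assumed cutoff for flagging a pair as duplicates
print(record_similarity(r1, r2, weights) >= THRESHOLD)
```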
2. Random Forests Algorithm Based Duplicate Detection in On-Site Programming Big Data Environment (Cited by: 1)
Authors: Qianqian Li, Meng Li, Lei Guo, Zhen Zhang. Journal of Information Hiding and Privacy Protection, 2020, No. 4, pp. 199-205 (7 pages).
On-site programming big data refers to the massive data generated in the process of software development, characterized by real-time arrival, complexity, and high processing difficulty; data cleaning is therefore essential for on-site programming big data. Duplicate data detection is an important step in data cleaning, which saves storage resources and enhances data consistency. To address the shortcomings of the traditional Sorted Neighborhood Method (SNM) and the difficulty of detecting duplicates in high-dimensional data, an optimized algorithm based on random forests with a dynamic, adaptive window size is proposed. The efficiency of the algorithm is improved by refining the key-selection method, reducing the dimensionality of the data set, and using an adaptive variable-size sliding window. Experimental results show that the improved SNM algorithm exhibits better performance and achieves higher accuracy.
Keywords: on-site programming big data; duplicate record detection; random forests; adaptive sliding window
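This abstract describes an SNM variant whose sliding window grows and shrinks adaptively as it scans the sorted records. A minimal sketch of that adaptive-window pass follows; the window-adaptation policy, the size bounds, and the difflib-based pair test standing in for the paper's random-forest match classifier and key selection are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Pair-level match test; the paper trains a random forest for this step."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def adaptive_snm(records: list[str], w_min: int = 2, w_max: int = 8) -> set[tuple[int, int]]:
    """Sorted Neighborhood Method with a window that expands while matches
    keep appearing and contracts once they stop (assumed adaptation policy)."""
    # Plain lexicographic sort stands in for the RF-based key selection.
    order = sorted(range(len(records)), key=lambda i: records[i])
    duplicates, w = set(), w_min
    for pos, i in enumerate(order):
        matched = False
        # Compare the current record with the next w - 1 records in sort order.
        for j in order[pos + 1 : pos + w]:
            if similar(records[i], records[j]):
                duplicates.add((min(i, j), max(i, j)))
                matched = True
        # Adapt the window: grow on hits, shrink on misses, within bounds.
        w = min(w + 1, w_max) if matched else max(w - 1, w_min)
    return duplicates

data = ["print hello", "print helo", "import os", "import sys", "print hello "]
print(adaptive_snm(data))  # near-identical "print ..." lines pair up
```

A fixed-size SNM can miss duplicates that cluster densely or waste comparisons in sparse stretches; letting the window size track the local match rate, as sketched here, is one simple way to realize the adaptive behavior the abstract claims.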