Funding: Supported in part by the National Natural Science Foundation of China under Grant Nos. 61472429, 61070192, 91018008, 61303074, and 61170240; the Beijing Natural Science Foundation under Grant No. 4122041; the National High-Tech Research and Development Program of China under Grant No. 2007AA01Z414; and the National Science and Technology Major Project of China under Grant No. 2012ZX01039-004.
Abstract: Copy-Move Forgery (CMF) is one of the simplest and most effective operations for creating forged digital images. Recently, techniques based on the Scale Invariant Feature Transform (SIFT) have been widely used to detect CMF, and approaches under the SIFT-based framework are among the most accepted for CMF detection because of their robust performance. However, for some CMF images these approaches cannot produce satisfactory detection results. For instance, the number of matched keypoints may be too small to prove that an image is a CMF image or to generate an accurate result, and sometimes these approaches may even produce erroneous results. According to our observations, one reason is that the detection results produced by the SIFT-based framework depend heavily on parameters whose values are often set empirically; such values are applicable only to a few images, which limits their applicability. To solve this problem, a novel approach named CMF Detection with Particle Swarm Optimization (CMFD-PSO) is proposed in this paper. CMFD-PSO integrates the Particle Swarm Optimization (PSO) algorithm into the SIFT-based framework: it uses PSO to generate customized parameter values for each image, which are then used for CMF detection under the SIFT-based framework. Experimental results show that CMFD-PSO performs well.
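To illustrate the idea of letting PSO pick per-image parameter values rather than fixing them by experience, the following sketch runs a standard particle swarm over two hypothetical matching parameters (a SIFT ratio-test threshold and a keypoint-clustering distance). The fitness function is a placeholder: the paper's CMFD-PSO would instead run the SIFT-based detection step with each candidate parameter vector and score the resulting copy-move evidence.

```python
import random

def cmf_fitness(params):
    """Placeholder score for a candidate parameter vector (ratio threshold,
    clustering distance). A real implementation would run SIFT keypoint
    matching with these values and score the copy-move evidence it finds;
    here we simply prefer mid-range values so the sketch is runnable."""
    ratio, dist = params
    return -((ratio - 0.6) ** 2 + (dist / 100.0 - 0.5) ** 2)

def pso(fitness, bounds, n_particles=20, iters=50, w=0.7, c1=1.5, c2=1.5):
    """Standard global-best particle swarm optimization over box constraints."""
    dim = len(bounds)
    pos = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the search bounds.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]), bounds[d][1])
            val = fitness(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Search a per-image ratio-test threshold in [0.3, 0.9] and a clustering
# distance in [10, 200] pixels (both ranges are illustrative assumptions).
best_params, best_score = pso(cmf_fitness, bounds=[(0.3, 0.9), (10.0, 200.0)])
print("customized parameters:", best_params)
```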
Funding: Supported by the National Key R&D Program of China (No. 2018YFB1003905), the National Natural Science Foundation of China under Grant No. 61971032, and the Fundamental Research Funds for the Central Universities (No. FRF-TP-18-008A3).
Abstract: On-site programming big data refers to the massive data generated in the process of software development, characterized by real-time arrival, complexity, and difficulty of processing; data cleaning is therefore essential for it. Duplicate data detection is an important step in data cleaning that saves storage resources and improves data consistency. To address the shortcomings of the traditional Sorted Neighborhood Method (SNM) and the difficulty of detecting duplicates in high-dimensional data, an optimized algorithm based on random forests with a dynamic, adaptive window size is proposed. The efficiency of the algorithm is improved by refining key selection, reducing the dimensionality of the data set, and using an adaptive, variable-size sliding window. Experimental results show that the improved SNM algorithm exhibits better performance and achieves higher accuracy.
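As a rough illustration of a sorted-neighborhood pass with an adaptive window, the sketch below sorts records by a key, compares each record only with its neighbors inside a sliding window, and widens or narrows the window depending on whether duplicates are still being found near its edge. The pair comparison is a simple string ratio rather than the paper's random-forest model, and the window bounds are illustrative.

```python
from difflib import SequenceMatcher

def is_duplicate(a, b, threshold=0.85):
    """Pairwise comparison on the record key; a stand-in for the paper's
    random-forest based comparison of high-dimensional records."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def adaptive_snm(records, key=lambda r: r, w_min=3, w_max=10):
    """Sorted Neighborhood Method whose window grows while duplicates keep
    appearing near the window edge and shrinks otherwise."""
    recs = sorted(records, key=key)
    duplicates, w = set(), w_min
    for i in range(len(recs)):
        hit_near_edge = False
        for j in range(i + 1, min(i + w, len(recs))):
            if is_duplicate(key(recs[i]), key(recs[j])):
                duplicates.add((recs[i], recs[j]))
                if j - i >= w - 1:          # a match at the far edge of the window
                    hit_near_edge = True
        # Adapt the window size for the next record.
        w = min(w + 1, w_max) if hit_near_edge else max(w - 1, w_min)
    return duplicates

names = ["john smith", "jon smith", "alice wong", "alise wong", "bob stone"]
print(adaptive_snm(names))
```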
Funding: Supported by the National Basic Research Program of China under Grant No. 2009CB320505.
Abstract: Discovery of service nodes in flows is a challenging task, especially in large ISPs or campus networks where the amount of traffic across the network is massive. We propose an effective data structure called Round-robin Buddy Bloom Filters (RBBF) to detect duplicate elements in flows. A two-stage approximate algorithm based on RBBF that can be used to detect service nodes from NetFlow data is also given, and the performance of the algorithm is analyzed. In our case, the proposed algorithm uses about 1% of the memory of a hash table, with a false positive error rate of less than 5%. A prototype system based on the proposed data structure and algorithm, compatible with both IPv4 and IPv6, is introduced, and several real-world case studies based on the prototype system are discussed.
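The sketch below shows the basic Bloom-filter membership test that underlies this kind of duplicate detection. The RBBF described above additionally keeps several "buddy" filters and rotates them round-robin so that old flow records expire; this minimal version omits that expiry.

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter for approximate duplicate detection. The paper's RBBF
    keeps several such filters as round-robin "buddies" to age out old flows;
    this minimal sketch has no expiry."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from slices of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.size

    def seen_before(self, item):
        """Return True if item was (probably) added already, then record it."""
        hit = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not ((self.bits[byte] >> bit) & 1):
                hit = False
                self.bits[byte] |= 1 << bit
        return hit

bf = BloomFilter()
for flow in ["10.0.0.1:80", "10.0.0.2:443", "10.0.0.1:80"]:
    print(flow, "duplicate" if bf.seen_before(flow) else "new")
```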
Funding: This research project was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project Number (PNURSP2022R234), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: Systems based on EHRs (electronic health records) have been in use for many years, and their adoption has grown markedly in recent years; they continue to accumulate massive volumes of health data. Duplicate detection involves discovering records that refer to the same real-world entity, a task that generally depends on several input parameters supplied by experts. Record linkage refers to the problem of finding matching records across different data sources, where the similarity between two records is characterized by domain-based similarity functions over different features. De-duplication of a single dataset, or the linkage of multiple datasets, has become a highly significant operation in the data processing stages of many data mining programmes; the objective is to match all the records associated with the same entity. Various measures have been used to represent the quality and complexity of data linkage algorithms, and many other novel metrics have been introduced. An outline of the problems in measuring data linkage and de-duplication quality and complexity is presented. This article focuses on the reprocessing of health data that is horizontally partitioned among data custodians, with the custodians providing similar features for their sets of patients. The first step of the technique automatically selects high-quality training examples from the compared record pairs, and the second step trains the reciprocal neuro-fuzzy inference system (RANFIS) classifier. For the optimal-threshold classifier, it is presumed that the true match status of all compared record pairs is known, so an optimal threshold can be computed for the respective RANFIS (via Ant Lion Optimization). The Febrl, Clinical Decision (CD), and Cork Open Research Archive (CORA) data repositories are used to analyze the proposed method and benchmark it against current techniques.
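A minimal sketch of the record-pair comparison step follows: each pair of records is turned into a vector of domain-based field similarities, which a trained classifier then labels as match or non-match. The average-plus-threshold rule here is only a stand-in for the RANFIS classifier and the ALO-tuned threshold described above, and the field names are hypothetical.

```python
from difflib import SequenceMatcher

def field_similarities(rec_a, rec_b, fields):
    """Domain-based similarity vector over the shared features of two records.
    The paper feeds such vectors to a RANFIS classifier; here we only build them."""
    return [SequenceMatcher(None, str(rec_a[f]), str(rec_b[f])).ratio() for f in fields]

def classify(sim_vector, threshold=0.8):
    """Stand-in for the trained classifier: average similarity compared against
    a threshold (the paper tunes this threshold with Ant Lion Optimization)."""
    return sum(sim_vector) / len(sim_vector) >= threshold

# Hypothetical patient records with illustrative field names.
a = {"name": "Jon A. Smith", "dob": "1980-02-11", "city": "Cork"}
b = {"name": "John Smith",   "dob": "1980-02-11", "city": "Cork"}
print(classify(field_similarities(a, b, ["name", "dob", "city"])))
```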
Abstract: Wireless sensor devices are widely used in various monitoring applications such as environmental monitoring, industrial sensing, habitat modeling, healthcare, and enemy-movement detection systems. Researchers have found that for a 16-byte packet payload, the Media Access Control (MAC) and globally unique network address overheads can be as large as the payload itself, which is unreasonable in most situations; the approach of using globally unique addresses is likewise not preferable for most Wireless Sensor Network (WSN) applications. Motivated by these drawbacks, the current work aims to fill the gap in this field by providing two strategies. First, a naming/addressing solution that assigns locally unique addresses to sensor devices in a clustered topology, reuses them in a spatial manner, and reduces the name/address size by a noticeable amount (2.9, based on the conducted simulation test). Second, a naming/addressing solution that reuses names/addresses in a location-unaware spanning-tree topology for event-driven WSNs, providing minimal latency and delivering addressing packets efficiently. To avoid needing both MAC and network addresses separately, the work also shows how locally unique sensor device names can be reused in a spatial manner in both contexts, yielding an energy-efficient protocol for location-unaware, cluster-based WSNs. In an experimental simulation test, the proposed addressing solution incurred less header overhead and achieved 62 percent payload efficiency, outperforming globally unique addresses, which were 34 percent less effective. Furthermore, the proposed work provides network-level address uniqueness without using a network-wide Duplicate Address Detection (DAD) algorithm. Consequently, the current study provides a roadmap for addressing/naming schemes to help researchers in this field. Some assumptions, which constitute the limitations of the study, were made during its work phases: the number of Cluster Head (CH) nodes is 6% of all sensor nodes, the entire sensor network is location-unaware, and the address space is 4 bits per node.
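The sketch below illustrates the first strategy in miniature: a cluster head hands out short addresses that are unique only within its cluster, so the same 4-bit address space (the assumption stated above) can be reused spatially by other clusters without a network-wide DAD pass. The class and method names are illustrative, not from the paper.

```python
class ClusterHead:
    """Hands out node addresses that are unique only within this cluster; the
    same small address space can be reused spatially by other cluster heads."""
    def __init__(self, cluster_id, address_bits=4):
        self.cluster_id = cluster_id
        self.free = list(range(2 ** address_bits))   # 4-bit local address space
        self.assigned = {}                           # node hardware id -> address

    def join(self, node_uid):
        """Assign (or return the already-assigned) local address for a node."""
        if node_uid in self.assigned:
            return self.assigned[node_uid]
        if not self.free:
            raise RuntimeError("cluster address space exhausted")
        addr = (self.cluster_id, self.free.pop(0))   # globally: (cluster, local)
        self.assigned[node_uid] = addr
        return addr

ch = ClusterHead(cluster_id=3)
print(ch.join("sensor-A"), ch.join("sensor-B"), ch.join("sensor-A"))
```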
基金supported by the "Hundred Talents Program" of CAS and the National Natural Science Foundation of China under Grant No. 60772034.
Abstract: Detecting duplicates in data streams is an important problem with a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios, and the elements in data streams are usually time sensitive. This makes it particularly significant to approximately detect duplicates among the newly arrived elements of a data stream within a fixed time frame. In this paper, we present a novel data structure, the Decaying Bloom Filter (DBF), an extension of the Counting Bloom Filter that effectively removes stale elements as new elements continuously arrive over sliding windows. On the basis of the DBF we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positive errors, but not false negative errors as in many previous results. We analyze its time complexity and detection accuracy, and give a tight upper bound on the false positive rate. For a given space of G bits and sliding window size W, our algorithm has an amortized time complexity of O(√(G/W)). Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior to previous results in both execution time and detection accuracy.
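As a simplified illustration of sliding-window duplicate detection with Bloom-filter-style hashing, the sketch below stores, for each hashed cell, the index of the last arrival that touched it, and treats an element as a duplicate only if all of its cells were refreshed within the last W arrivals. For clarity it replaces the DBF's decaying counters (and the O(√(G/W)) amortized update trick) with explicit timestamps.

```python
import hashlib
from itertools import count

class SlidingWindowDetector:
    """Simplified sketch of duplicate detection over the last `window` arrivals.
    Instead of the paper's counter-decay scheme, each cell stores the index of
    the last arrival that set it, and stale cells are simply ignored."""
    def __init__(self, num_cells=4096, num_hashes=3, window=100):
        self.cells = [-1] * num_cells     # last-arrival index per cell, -1 = empty
        self.k, self.window = num_hashes, window
        self.clock = count()              # global arrival counter

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % len(self.cells)
                for i in range(self.k)]

    def arrive(self, item):
        """Return True if item (probably) appeared within the last `window` arrivals."""
        now = next(self.clock)
        pos = self._positions(item)
        duplicate = all(self.cells[p] >= 0 and now - self.cells[p] <= self.window
                        for p in pos)
        for p in pos:
            self.cells[p] = now           # refresh: the element is recent again
        return duplicate

det = SlidingWindowDetector(window=2)
print([det.arrive(x) for x in ["a", "b", "a", "c", "a"]])
```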
Abstract: Deduplication technology has been increasingly used to reduce storage costs. Though it has been successfully applied to backup and archival systems, existing techniques can hardly be deployed in primary storage systems because of the latency cost of detecting duplicated data, where every unit has to be checked against a substantially large fingerprint index before it is written. In this paper we introduce Leach, a self-learning in-memory fingerprint cache for inline primary storage that reduces the write cost in deduplication systems. Leach is motivated by a characteristic of real-world I/O workloads: the access patterns of duplicated data are highly skewed. Leach adopts a splay tree to organize the on-disk fingerprint index, automatically learns the access patterns, and maintains hot working sets in cache memory, with the goal of serving the majority of duplicate-data detections. Leveraging the working-set property, Leach also optimizes the cost of splay operations on the fingerprint index and of cache updates. In comprehensive experiments on several real-world datasets, Leach outperforms the conventional LRU (least recently used) cache policy by reducing the number of cache misses, and significantly improves write performance without greatly affecting cache hits.
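The sketch below shows the general shape of an in-memory fingerprint cache in front of a large on-disk index: hits are served from memory and misses fall through to the expensive index lookup. It uses the LRU policy that Leach is compared against; Leach itself organizes the on-disk index as a splay tree and learns hot working sets instead.

```python
from collections import OrderedDict

class FingerprintCache:
    """In-memory cache in front of the full on-disk fingerprint index. This
    sketch uses the LRU policy the paper compares against, not Leach's
    splay-tree and working-set scheme."""
    def __init__(self, capacity, on_disk_index):
        self.capacity = capacity
        self.cache = OrderedDict()        # fingerprint -> block location
        self.on_disk = on_disk_index      # stand-in for the large on-disk index

    def lookup(self, fingerprint):
        """Return the stored block location if the chunk is a duplicate, else None."""
        if fingerprint in self.cache:
            self.cache.move_to_end(fingerprint)          # cheap cache hit
            return self.cache[fingerprint]
        location = self.on_disk.get(fingerprint)         # expensive miss path
        if location is not None:
            self.cache[fingerprint] = location
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)           # evict least recently used
        return location

index = {"fp-a": 0, "fp-b": 1}                           # toy on-disk index
cache = FingerprintCache(capacity=1, on_disk_index=index)
print(cache.lookup("fp-a"), cache.lookup("fp-b"), cache.lookup("fp-c"))
```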
Funding: Supported by the Trans-Atlantic Platform for the Social Sciences and Humanities, through the Digging into Data project with reference HJ-253525, and through the Reassembling the Republic of Letters networking programme (EU COST Action IS1310). The researchers from INESC-ID also had financial support from Fundação para a Ciência e a Tecnologia (FCT), through project grants with references PTDC/EEI-SCR/1743/2014 (Saturn) and CMUP-ERI/TIC/0046/2014 (GoLocal), as well as through the INESC-ID multi-annual funding from the PIDDAC programme (UID/CEC/50021/2013).
Abstract: Several tasks related to geographical information retrieval and to the geographical information sciences involve toponym matching, that is, the problem of matching place names that share a common referent. In this article, we present the results of a wide-ranging evaluation of the performance of different string similarity metrics on the toponym matching task. We also report on experiments involving the use of supervised machine learning to combine multiple similarity metrics, which has the natural advantage of avoiding the manual tuning of similarity thresholds. Experiments with a very large dataset show that the performance differences between the individual similarity metrics are relatively small, and that carefully tuning the similarity threshold is important for achieving good results. The methods based on supervised machine learning, particularly ensembles of decision trees, can achieve good results on this task, significantly outperforming the individual similarity metrics.
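A minimal sketch of the supervised combination idea, assuming scikit-learn is available: each toponym pair is mapped to a small vector of string-similarity features and a random forest (an ensemble of decision trees) is trained on labelled pairs, so no single similarity threshold has to be tuned by hand. The feature set and the toy training pairs are illustrative only, not the metrics evaluated in the article.

```python
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def features(a, b):
    """A few string-similarity features for a toponym pair (a small stand-in
    for the wider metric set evaluated in the article)."""
    a, b = a.lower(), b.lower()
    jac = len(bigrams(a) & bigrams(b)) / max(len(bigrams(a) | bigrams(b)), 1)
    return [SequenceMatcher(None, a, b).ratio(), jac, abs(len(a) - len(b))]

# Tiny illustrative training set: label 1 if the two names share a referent.
pairs = [("Lisboa", "Lisbon", 1), ("Köln", "Cologne", 1), ("Porto", "Oporto", 1),
         ("Paris", "Parma", 0), ("Madrid", "Madeira", 0), ("Braga", "Prague", 0)]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

# Ensemble of decision trees combining the similarity features.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(model.predict([features("Firenze", "Florence")]))
```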
Funding: This work was supported by the National Key Research and Development Program of China under Grant No. 2018YFB1004202 and the National Natural Science Foundation of China under Grant No. 61702534.
Abstract: Communication and coordination between OSS developers who do not work in the same physical location have always been challenging. The pull-based development model, the state-of-the-art collaborative development mechanism, provides high openness and transparency to improve the visibility of contributors' work. However, duplicate contributions may still be submitted by more than one contributor to solve the same problem, owing to the parallel and uncoordinated nature of this model. If not detected in time, duplicate pull requests cause contributors and reviewers to waste time and energy on redundant work. In this paper, we propose an approach that combines textual and change similarities to automatically detect duplicate contributions in the pull-based model at submission time. For a newly arrived contribution, we first compute the textual similarity and the change similarity between it and each existing contribution. Our method then returns a list of candidate duplicate contributions that are most similar to the new contribution in terms of the combined textual and change similarity. The evaluation shows that 83.4% of the duplicates can be found on average when we use the combined textual and change similarity, compared with 54.8% using only textual similarity and 78.2% using only change similarity.
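The sketch below mirrors the overall flow described above: for a newly arrived contribution it computes a textual similarity and a change similarity against each existing contribution and ranks candidates by a weighted combination. The specific metrics, the weight alpha, and the field names are illustrative assumptions, not the paper's exact definitions.

```python
from difflib import SequenceMatcher

def textual_similarity(pr_a, pr_b):
    """Similarity of the title/description text (a simple stand-in metric)."""
    return SequenceMatcher(None, pr_a["text"].lower(), pr_b["text"].lower()).ratio()

def change_similarity(pr_a, pr_b):
    """Jaccard overlap of the sets of changed file paths."""
    fa, fb = set(pr_a["files"]), set(pr_b["files"])
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

def candidate_duplicates(new_pr, existing_prs, alpha=0.5, top_k=3):
    """Rank existing contributions by a weighted combination of the two signals;
    alpha and the metrics themselves are illustrative choices."""
    scored = [(alpha * textual_similarity(new_pr, pr)
               + (1 - alpha) * change_similarity(new_pr, pr), pr["id"])
              for pr in existing_prs]
    return sorted(scored, reverse=True)[:top_k]

existing = [{"id": 101, "text": "fix crash when config file is missing",
             "files": ["src/config.py"]},
            {"id": 102, "text": "add dark mode to settings page",
             "files": ["ui/settings.py"]}]
new_pr = {"id": 103, "text": "handle missing config file crash",
          "files": ["src/config.py"]}
print(candidate_duplicates(new_pr, existing))
```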