With the advent of the IoT era, the amount of real-time data processed in data centers has increased explosively. As a result, stream mining, the extraction of useful knowledge from huge amounts of data in real time, is attracting more and more attention. It is said, however, that real-time stream processing will become more difficult in the near future, because the performance of processing applications improves by only 10%-15% each year, while the amount of data to be processed grows exponentially. In this study, we focused on a promising stream mining algorithm, specifically a Frequent Itemset Mining (FIsM) algorithm, and improved its performance using an FPGA. FIsM algorithms are important, basic data-mining techniques used to discover association rules from transactional databases. We improved a recently proposed approximate FIsM algorithm so that it maps efficiently onto a hardware architecture, and then ran experiments on an FPGA. As a result, our algorithm achieved a speed 400% faster than the original algorithm implemented on a CPU. Moreover, our FPGA prototype showed a 20-fold speed improvement over the CPU version.
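To make the underlying task concrete, here is a minimal, software-only sketch of frequent itemset counting over a small batch of transactions. It illustrates the FIsM problem (finding itemsets whose support meets a threshold) but is not the approximate, FPGA-oriented algorithm evaluated in the paper.

```python
# A minimal sketch of frequent itemset mining over a batch of transactions.
# This is NOT the paper's approximate, hardware-friendly algorithm; it only
# shows the basic task that the FPGA design accelerates.
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, max_size=2):
    """Count itemsets of size <= max_size and keep those meeting min_support."""
    counts = Counter()
    for tx in transactions:
        items = sorted(set(tx))
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

stream = [["milk", "bread"], ["milk", "beer"], ["milk", "bread", "beer"]]
print(frequent_itemsets(stream, min_support=2))
# {('bread',): 2, ('milk',): 3, ('bread', 'milk'): 2, ('beer',): 2, ('beer', 'milk'): 2}
```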
The increasing penetration of renewable energy resources with highly fluctuating outputs has raised increasing concern about the accuracy and timeliness of electric power system state estimation (SE). Meanwhile, we note that only a fraction of system states fluctuate at the millisecond level and require updating. As such, refreshing only those states with significant variation would enhance the computational efficiency of SE and make fast, continuous updating of states possible. However, this is difficult to achieve with conventional SE methods, which generally refresh the states of the entire system every 4–5 s. In this context, we propose a local hybrid linear SE framework using stream processing, in which synchronized measurements received from phasor measurement units (PMUs) and trigger/timing-mode measurements received from remote terminal units (RTUs) are used to update the associated local states. Moreover, the efficiency and timeliness of the measurement update process are enhanced by a trigger-measurement-based fast dynamic partitioning algorithm that determines the areas of the system whose states require recalculation. In particular, non-iterative hybrid linear formulations with both RTUs and PMUs are employed to solve the local SE problem. The timeliness, accuracy, and computational efficiency of the proposed method are demonstrated by extensive simulations based on the IEEE 118-, 300-, and 2383-bus systems.
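For reference, a generic non-iterative linear SE step can be written as the weighted least-squares estimate below. This is the standard textbook form, not the paper's specific hybrid PMU/RTU formulation or its partitioned local variant.

```latex
\[
\begin{aligned}
\mathbf{z} &= \mathbf{H}\mathbf{x} + \mathbf{e}, \qquad \mathbf{e}\sim\mathcal{N}(\mathbf{0},\mathbf{R}),\\
\hat{\mathbf{x}} &= \bigl(\mathbf{H}^{\top}\mathbf{R}^{-1}\mathbf{H}\bigr)^{-1}\mathbf{H}^{\top}\mathbf{R}^{-1}\mathbf{z},
\end{aligned}
\]
```

Here z stacks the available measurements, H is the linear measurement matrix, and R is the measurement error covariance; with a fixed H, the gain matrix can be pre-factorized so that each local update remains non-iterative.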
Most distributed stream processing engines (DSPEs) do not support online task management and cannot adapt to time-varying data flows. Recently, some studies have proposed online task deployment algorithms to solve this problem. However, these approaches do not guarantee Quality of Service (QoS) when the task deployment changes at runtime, because the task migrations caused by changes of task deployment impose an exorbitant cost. We study one of the most popular DSPEs, Apache Storm, and find that when a task needs to be migrated, Storm has to stop the resource (implemented as a Worker process in Storm) where the task is deployed. This leads to the stop and restart of all tasks in that resource, resulting in poor task migration performance. To solve this problem, in this paper we propose N-Storm (Nonstop Storm), a task-resource decoupling DSPE. N-Storm allows the tasks allocated to resources to be changed at runtime, implemented by a thread-level scheme for task migrations. In particular, we add a local shared key/value store on each node to make resources aware of changes in the allocation plan, so that each resource can manage its tasks at runtime. Based on N-Storm, we further propose Online Task Deployment (OTD). Differing from traditional task deployment algorithms that deploy all tasks at once without considering the cost of task migrations caused by a re-deployment, OTD gradually adjusts the current task deployment to an optimized one based on the communication cost and the runtime states of resources. We demonstrate that OTD can adapt to different kinds of applications, including computation- and communication-intensive applications. Experimental results on a real DSPE cluster show that N-Storm can avoid system stops and save up to 87% of the performance degradation time compared with Apache Storm and other state-of-the-art approaches. In addition, OTD can increase the average CPU usage by 51% for computation-intensive applications and reduce network communication costs by 88% for communication-intensive applications.
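The thread-level migration idea can be pictured with the toy sketch below: a worker reconciles its running task threads against a node-local allocation map, so moving one task does not restart the whole worker. All names and the key/value layout are hypothetical and only illustrate the decoupling, not N-Storm's actual implementation.

```python
# Hypothetical sketch of thread-level task migration driven by a node-local
# key/value store of {task_id: worker_id}. Illustrative only.
import threading
import time

class Worker:
    def __init__(self, worker_id, local_kv):
        self.worker_id = worker_id
        self.local_kv = local_kv      # node-local view of the allocation plan
        self.running = {}             # task_id -> (thread, stop_event)

    def _run_task(self, task_id, stop_event):
        while not stop_event.is_set():
            time.sleep(0.1)           # placeholder for processing one tuple

    def reconcile(self):
        """Align running task threads with the current allocation plan."""
        assigned = {t for t, w in self.local_kv.items() if w == self.worker_id}
        current = set(self.running)
        for task_id in assigned - current:        # newly assigned: start a thread
            stop = threading.Event()
            th = threading.Thread(target=self._run_task, args=(task_id, stop), daemon=True)
            th.start()
            self.running[task_id] = (th, stop)
        for task_id in current - assigned:        # migrated away: stop only that thread
            th, stop = self.running.pop(task_id)
            stop.set()
            th.join()

kv = {"task-1": "w1", "task-2": "w1"}
w = Worker("w1", kv)
w.reconcile()
kv["task-2"] = "w2"   # the scheduler moves task-2 to another worker
w.reconcile()         # only task-2's thread stops; the worker process keeps running
```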
Implementations of metadata tend to favor centralized, static metadata. This depiction is at variance with the past decade of focus on big data, cloud native architectures and streaming platforms. Big data velocity can demand a correspondingly dynamic view of metadata. These trends, which include DevOps, CI/CD, DataOps and data fabric, are surveyed. Several specific cloud native tools are reviewed and weaknesses in their current metadata use are identified. Implementations are suggested which better exploit the capabilities of streaming platform paradigms, in which metadata is continuously collected in dynamic contexts. Future cloud native software features are identified which could enable streamed metadata to power real-time data fusion or fine-tune automated reasoning through real-time ontology updates.
Data stream processing frameworks process stream data based on event time to ensure that requests can be responded to in real time. In reality, streaming data usually arrives out of order due to factors such as network delay. Data stream processing frameworks commonly adopt the watermark mechanism to handle this out-of-order arrival. A watermark is a special kind of record inserted into the data stream with a timestamp, which helps the framework decide whether received data is late and should thus be discarded. Traditional watermark generation strategies are periodic; they cannot dynamically adjust the watermark distribution to balance responsiveness and accuracy. This paper proposes an adaptive watermark generation mechanism based on a time series prediction model to address this limitation. The mechanism dynamically adjusts the frequency and timing of watermark distribution using the disordered data ratio and other lateness properties of the data stream, improving system responsiveness while ensuring acceptable result accuracy. We implement the proposed mechanism on top of Flink and evaluate it with real-world datasets. The experimental results show that our mechanism is superior to existing watermark distribution strategies in terms of both system responsiveness and result accuracy.
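As a point of reference, the sketch below shows the basic, non-adaptive watermark logic that such a mechanism builds on: track the largest event time seen and emit a watermark that lags it by an allowed out-of-orderness bound. The adaptive mechanism described above would adjust this bound and the emission timing from observed lateness; the fixed bound here is purely illustrative.

```python
# Minimal sketch of bounded-out-of-orderness watermarking.
class BoundedOutOfOrdernessWatermarks:
    def __init__(self, max_out_of_orderness_ms):
        self.bound = max_out_of_orderness_ms
        self.max_event_time = float("-inf")

    def on_event(self, event_time_ms):
        self.max_event_time = max(self.max_event_time, event_time_ms)

    def current_watermark(self):
        # Everything with event time <= watermark is considered complete;
        # records arriving later with smaller timestamps are treated as late.
        return self.max_event_time - self.bound

gen = BoundedOutOfOrdernessWatermarks(max_out_of_orderness_ms=2000)
for t in (1000, 3000, 2500, 6000):
    gen.on_event(t)
print(gen.current_watermark())   # 4000: windows up to event time 4000 can be closed
```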
Purpose - The purpose of this paper is to propose a data prediction framework for scenarios that require forecasting demand for large-scale data sources, e.g., sensor networks, securities exchanges, and electric power secondary systems. Concretely, the proposed framework should handle several difficult requirements, including the management of gigantic data sources, the need for a fast self-adaptive algorithm, relatively accurate prediction of multiple time series, and real-time demand. Design/methodology/approach - First, the autoregressive integrated moving average (ARIMA)-based prediction algorithm is introduced. Second, the processing framework is designed, which includes a time-series data storage model based on HBase and a real-time distributed prediction platform based on Storm. Then, the working principle of this platform is described. Finally, a proof-of-concept testbed is illustrated to verify the proposed framework. Findings - Several tests based on Power Grid monitoring data are provided for the proposed framework. The experimental results indicate that prediction data are basically consistent with actual data, processing efficiency is relatively high, and resource consumption is reasonable. Originality/value - This paper provides a distributed real-time data prediction framework for large-scale time-series data, which meets the requirements of effective management, prediction efficiency, accuracy, and high concurrency for massive data sources.
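A minimal sketch of the per-series prediction step is shown below, assuming the statsmodels library. The paper's contribution is the surrounding distributed platform (HBase storage model plus a Storm topology), which is not reproduced here, and the ARIMA order is an arbitrary illustrative choice.

```python
# Sketch of forecasting one monitored time series with ARIMA (statsmodels).
from statsmodels.tsa.arima.model import ARIMA

history = [102.0, 104.5, 101.3, 99.8, 103.2, 105.1, 104.0, 106.7,
           108.2, 107.5, 109.9, 111.0]           # one monitored series

model = ARIMA(history, order=(2, 1, 1))          # illustrative (p, d, q) choice
fitted = model.fit()
forecast = fitted.forecast(steps=3)              # predict the next 3 points
print(list(forecast))
```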
Stream processing has emerged as a useful technology for applications that require continuous and low-latency computation on infinite streaming data. Since stream processing systems (SPSs) usually require distributed deployment on clusters of servers in the face of large-scale data, failures of processing nodes or communication networks are especially common, and they must be handled seriously considering service quality. A failed system may produce wrong results or become unavailable, resulting in a decline in user experience or even significant financial loss. Hence, a large number of fault tolerance approaches have been proposed for SPSs. These approaches often have their own priorities on specific performance concerns, e.g., runtime overhead and recovery efficiency. Nevertheless, there is a lack of a systematic overview and classification of the state-of-the-art fault tolerance approaches in SPSs, which has become an obstacle to the development of SPSs. Therefore, we investigate the existing achievements and develop a taxonomy of fault tolerance in SPSs. Furthermore, we propose an evaluation framework tailored for fault tolerance, demonstrate experimental results on two representative open-source SPSs, and discuss the possible disadvantages of current designs. Finally, we specify future research directions in this domain.
In recent years, the demand for real-time data processing has been increasing, and various stream processing systems have emerged. When the amount of data input to a stream processing system fluctuates, the computing resources required by the stream processing job also change. The resources used by stream processing jobs therefore need to be adjusted according to load changes to avoid wasting computing resources. Existing works adjust stream processing jobs based on the assumption that there is a linear relationship between operator parallelism and operator resource consumption (e.g., throughput), which leads to significant deviations as operator parallelism increases. This paper proposes a nonlinear model to represent operator performance. We divide operator performance into three stages: the Non-competition stage, the Non-full competition stage, and the Full competition stage. Using our proposed performance model, given the parallelism of an operator, we can accurately predict the CPU utilization and operator throughput. Evaluated in actual experiments, the prediction error of our model is below 5%. We also propose a quick accurate auto-scaling (QAAS) method that uses the operator performance model to implement auto-scaling of the operator parallelism of Flink jobs. Compared to previous work, QAAS maintains stable job performance under load changes, minimizing the number of job adjustments and reducing data backlogs by 50%.
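The three-stage idea can be illustrated with the hypothetical piecewise model below. The stage boundaries and functional forms are assumptions made purely for demonstration; the paper fits its own non-linear model to measured CPU utilization and throughput.

```python
# Illustrative, hypothetical three-stage throughput model (not the paper's fitted model).
def predicted_throughput(parallelism, per_task_rate, cores, contention=0.15):
    """Predict operator throughput (records/s) for a given parallelism."""
    if parallelism <= cores // 2:
        # Non-competition stage: tasks scale almost linearly.
        return parallelism * per_task_rate
    elif parallelism <= cores:
        # Non-full competition stage: each extra task contributes less
        # because of shared-resource contention.
        base = (cores // 2) * per_task_rate
        extra = (parallelism - cores // 2) * per_task_rate * (1 - contention)
        return base + extra
    else:
        # Full competition stage: the CPU is saturated; adding tasks mostly
        # adds scheduling overhead, so throughput plateaus or degrades.
        saturated = cores * per_task_rate * (1 - contention / 2)
        return saturated * (1 - 0.02 * (parallelism - cores))

for p in (2, 6, 10, 16):
    print(p, round(predicted_throughput(p, per_task_rate=5000, cores=8)))
```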
The integration of cloud and IoT edge devices is significant for reducing the latency of IoT stream data processing by moving services closer to the edge. In this connection, a key issue is to determine when and where services should be deployed. Common service deployment strategies have been static, based on rules defined at design time. However, dynamically changing IoT environments bring about unexpected situations, such as out-of-range stream fluctuations, in which static service deployment solutions are not efficient. In this paper, we propose a dynamic service deployment mechanism based on the prediction of upcoming stream data. To effectively predict upcoming workloads, we combine online machine learning methods with an online optimization algorithm for service deployment. A simulation-based evaluation demonstrates that, compared with state-of-the-art approaches, the approach proposed in this paper achieves lower stream processing latency.
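A highly simplified sketch of the predict-then-deploy loop is given below, with exponential smoothing standing in for the paper's online machine learning methods and a single threshold standing in for its online optimization; all names and numbers are illustrative.

```python
# Toy predict-then-deploy loop: forecast next-window stream rate online,
# then choose edge or cloud placement for the service.
class OnlineForecaster:
    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.level = None

    def update(self, observed_rate):
        if self.level is None:
            self.level = observed_rate
        else:
            self.level = self.alpha * observed_rate + (1 - self.alpha) * self.level
        return self.level          # forecast for the next window

def choose_placement(predicted_rate, edge_threshold=800.0):
    return "edge" if predicted_rate >= edge_threshold else "cloud"

forecaster = OnlineForecaster()
for rate in (300, 450, 900, 1200, 1100):
    prediction = forecaster.update(rate)
    print(f"observed={rate:5.0f}  predicted={prediction:7.1f}  place on {choose_placement(prediction)}")
```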
A high-performance, distributed, complex-event processing engine with improved scalability is proposed. In this new engine, stateless processing nodes are combined with distributed storage so that scale and performance can be expanded linearly. This design eliminates single points of failure and makes the system highly reliable.
In our previous work, the reactive dividing wall column (RDWC) was proposed and proved to be effective for the selective hydrogenation and separation of a C3 stream. In the present paper, the dynamics and control of the proposed RDWC are investigated. Four control structures, including composition and temperature controls, are proposed. Feedforward controllers are employed in the four control strategies to shorten the dynamic response time, reduce the maximum deviations, and offer an immediate adjustment. The control structures are compared by applying them to the RDWC system with 20% disturbances in both the feed flow rate and the feed compositions, and the results are discussed.
Big data applications in healthcare have provided a variety of solutions to reduce costs, errors, and waste. This work aims to develop a real-time system based on big medical data processing in the cloud for the prediction of health issues. In the proposed scalable system, medical parameters are sent to Apache Spark to extract attributes from the data and apply the proposed machine learning algorithm. In this way, healthcare risks can be predicted and sent as alerts and recommendations to users and healthcare providers. The proposed work also aims to provide an effective recommendation system by using streaming medical data, historical data from a user's profile, and a knowledge database to make the most appropriate real-time recommendations and alerts based on the sensors' measurements. The proposed scalable system works by tweeting the health status attributes of users. Their cloud profile receives the streaming healthcare data in real time, and the health attributes are extracted and fed to a machine learning prediction algorithm to predict the users' health status. Subsequently, their status can be sent on demand to healthcare providers. Therefore, machine learning algorithms can be applied to streamed healthcare data from wearables to provide users with insights into their health status. These algorithms can help healthcare providers and individuals focus on health risks and health status changes and consequently improve quality of life.
The analytical and monitoring capabilities of central event repositories, such as log servers and intrusion detection systems, are limited by the amount of structured information extracted from the events they receive. Diverse networks and applications log their events in many different formats, and this makes it difficult to identify the type of logs being received by the central repository. The way events are logged by IT systems is problematic for developers of intrusion-detection systems (specifically, host-based systems), developers of security-information systems, and developers of event-management systems. These problems preclude the development of more accurate, intrusive security solutions that obtain results from the data included in the logs being processed. We propose a new method for dynamically normalizing events into a unified super-event that is loosely based on the Common Event Expression standard developed by the Mitre Corporation. We explain how our solution can normalize seemingly unrelated events into a single, unified format.
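The normalization idea can be illustrated with the toy sketch below, in which format-specific adapters parse heterogeneous log lines into one unified record. The field names and formats are invented for illustration and do not follow the authors' CEE-based super-event schema.

```python
# Toy event normalization: dispatch each raw log line to a format-specific
# parser and emit a single unified record.
import re

def parse_apache(line):
    m = re.match(r'(\S+) - - \[(.*?)\] "(\S+) (\S+)[^"]*" (\d{3})', line)
    return {"source": "apache", "host": m.group(1), "time": m.group(2),
            "action": m.group(3), "object": m.group(4), "status": int(m.group(5))}

def parse_syslog(line):
    m = re.match(r"(\w{3} +\d+ [\d:]+) (\S+) (\S+): (.*)", line)
    return {"source": "syslog", "host": m.group(2), "time": m.group(1),
            "action": m.group(3), "object": m.group(4), "status": None}

ADAPTERS = [(re.compile(r'^\S+ - - \['), parse_apache),
            (re.compile(r"^\w{3} +\d+ [\d:]+ "), parse_syslog)]

def normalize(line):
    for pattern, parser in ADAPTERS:
        if pattern.match(line):
            return parser(line)
    return {"source": "unknown", "raw": line}

print(normalize('10.0.0.5 - - [12/Mar/2024:10:01:02 +0000] "GET /index.html HTTP/1.1" 200'))
print(normalize("Mar 12 10:01:03 webhost sshd[311]: Failed password for root"))
```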
In this paper, we study the skyline group problem over a data stream. An object dominates another object if it is not worse than the other object on all attributes and is better than the other object on at least one attribute. If an object cannot be dominated by any other object, it is a skyline object. The skyline group problem involves finding k-item groups that cannot be dominated by any other k-item group. Existing algorithms designed to find skyline groups can only process static data. However, in many applications data changes over time as a stream, and algorithms should be designed to support skyline group queries on dynamic data. In this paper, we propose new algorithms to find skyline groups over a data stream. We use data structures, namely a hash table, a dominance graph, and a matrix, to store dominance information and update results incrementally. We conduct experiments on synthetic datasets to evaluate the performance of the proposed algorithms. The experimental results show that our algorithms can efficiently find skyline groups over a data stream.
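The dominance definition above translates directly into code. The sketch below checks dominance and computes a plain (single-object) skyline naively, whereas the paper's algorithms maintain skyline groups incrementally over a stream using a hash table, dominance graph, and matrix.

```python
# Direct transcription of the dominance definition, plus a naive skyline.
def dominates(a, b):
    """a dominates b if a is no worse on all attributes and strictly better
    on at least one; here 'better' means a larger value."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(objects):
    return [o for o in objects
            if not any(dominates(other, o) for other in objects if other != o)]

points = [(5, 3), (4, 4), (3, 5), (2, 2), (5, 1)]
print(skyline(points))   # [(5, 3), (4, 4), (3, 5)]; (2, 2) and (5, 1) are dominated
```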
Despite long-standing interest, the mechanisms driving aquatic macroinvertebrate drift in tropical streams remain poorly understood. Therefore, the objective of this study was to evaluate which environmental metrics drive macroinvertebrate drift in neotropical sky island streams. We evaluated whether altitude, the abundance of food resources, and variations in water quality influenced macroinvertebrate drift density, diversity, richness, and functional feeding groups. A hypothesis was developed to test whether increased altitude, lower food availability (particulate organic matter), and discharge would increase the density, taxonomic richness, and diversity of drifting invertebrates. Nine headwater stream sites were sampled in the rainy and dry seasons in the Espinhaço Meridional Mountain Range (EMMR) of southeast Brazil. Samples were collected using drift nets deployed from 5:00 p.m. to 8:00 p.m. The abundance of food resources was assessed through estimates of coarse (CPOM) and fine (FPOM) particulate organic matter, and of primary producers. CPOM availability was an important explanatory variable for Gathering-Collectors and Scrapers, altitude was important for Shredders and Predators, and Filtering-Collectors were linked to water discharge, suggesting that the drift of different functional groups was linked to different ecosystem components. Water temperature, conductivity, dissolved oxygen, current velocity, FPOM biomass, and microbasin elevation range exerted little influence on macroinvertebrate drift. Regarding taxa composition, this study also found that Baetidae and Leptohyphidae (Ephemeroptera) and Chironomidae and Simuliidae (Diptera) were the most abundant drifting groups.
Real-life events are emerging and evolving in social and news streams. Recent methods have succeeded in capturing designed features of monolingual events, but lack interpretability and multi-lingual considerations. To this end, we propose a multi-lingual event mining model, namely MLEM, to automatically detect events and generate an evolution graph in multi-lingual, hybrid-length text streams including English, Chinese, French, German, Russian and Japanese. Specifically, we merge the same entities and similar phrases and present multiple similarity measures based on an incremental word2vec model. We propose an 8-tuple to describe an event for correlation analysis and evolution graph generation. We evaluate the MLEM model using a massive human-generated dataset containing real-world events. Experimental results show that our new model MLEM outperforms the baseline method in both efficiency and effectiveness.
This paper describes a dynamically reconfigurable data-flow hardware architecture optimized for the computation of image and video. It is a scalable, hierarchically organized parallel architecture that consists of data-flow clusters and finite-state machine (FSM) controllers. Each cluster contains various kinds of cells that are optimized for video processing. Furthermore, to facilitate the design process, we provide a C-like language for design specification and associated design tools. Some video applications have been implemented on the architecture to demonstrate its applicability and flexibility. Experimental results show that the architecture, along with its video applications, can be used in many real-time video processing scenarios.
For aircraft manufacturing industries, the analysis and prediction of part machining error during the machining process are very important for controlling and improving part machining quality. In order to effectively control machining error, a method integrating multivariate statistical process control (MSPC) and stream of variations (SoV) is proposed. First, machining error is modeled by multi-operation approaches for the part machining process. SoV is adopted to establish the mathematical model of the relationship between the error of upstream operations and the error of downstream operations. Here, error sources include not only the influence of upstream operations but also many other error sources. The standard model and the predicted model of SoV are built according to whether the operation has been completed, to satisfy different requirements during the part machining process. Second, the method of one-step-ahead forecast error (OSFE) is used to eliminate the autocorrelation of the sample data from the SoV model, and the T2 control chart in MSPC is built to realize machining error detection according to the data characteristics of the above error model, which can judge whether an operation is out of control. If it is, feedback is sent to the operations; the error model is modified by adjusting the out-of-control operation, and it continues to be used to monitor operations. Finally, a machining instance containing two operations demonstrates the effectiveness of the machining error control method presented in this paper.
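For reference, a standard multivariate T2 control chart monitors the Hotelling statistic below, written here in its generic textbook form with the mean vector and covariance estimated from in-control data; the paper applies this type of chart to the OSFE-filtered SoV error data.

```latex
\[
T_{t}^{2} = \left(\mathbf{x}_{t}-\bar{\mathbf{x}}\right)^{\top}\mathbf{S}^{-1}\left(\mathbf{x}_{t}-\bar{\mathbf{x}}\right),
\qquad
\mathrm{UCL} = \frac{p\,(n+1)(n-1)}{n\,(n-p)}\,F_{\alpha}(p,\;n-p),
\]
```

where x_t is the current p-dimensional error observation, x̄ and S are the mean vector and covariance matrix estimated from n in-control samples, and an out-of-control signal is raised whenever T2_t exceeds the upper control limit UCL.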
Fish processing towards the production of fillets gives rise to wastewater streams that are ultimately directed to biogas production and/or wastewater treatment. However, these wastewater streams are rich in minerals, fat, and proteins that can be converted into protein-rich feed ingredients through submerged cultivation of edible filamentous fungi. In this study, the origin of the wastewater stream, initial pH, cultivation time, and extent of washing during sieving were found to influence the amount of material recovered from the wastewater streams and its protein content, following cultivation with Aspergillus oryzae. Through cultivation of the filamentous fungus in sludge, 330 kg of material per ton of COD were recovered by sieving, corresponding to 121 kg of protein per ton of COD, while through its cultivation in salt brine, 210 kg of material were recovered per ton of COD, corresponding to 128 kg of protein per ton of COD. Removal ranges of 12-43%, 39-92%, and 32-66% for COD, total solids, and nitrogen, respectively, were obtained after A. oryzae growth and harvesting in the wastewater streams. Therefore, the present study shows the versatility that the integration of fungal cultivation provides to fish processing industries, and it should be complemented by economic, environmental, and feeding studies in order to reveal the most promising valorization strategy.