In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data stream...In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1 - ε)-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches.展开更多
With the increasing number of web services, it becomes a difficult task for an ordinary user to select an appropriate service. Hence, it is conventional that users in a digital community network take part in a collabo...With the increasing number of web services, it becomes a difficult task for an ordinary user to select an appropriate service. Hence, it is conventional that users in a digital community network take part in a collaborative mechanism for the purpose of service selection. The participation usually brings unnecessary burdens for users, such as giving opinions, storing service information. Extra communication overhead hinders the performance of the network. Thus, the community administrators are facing a problem of how to obtain an overall service selection result for the whole community readily and effectively. To address this problem, we present a k-median facility location agent model. The model analyzes the procedure of service selection through five entities and six types of messages. Two algorithms are elaborated in pursuit of a global optimization concerning connection costs between users and facilities where services are deployed. To evaluate our model, we conduct extensive simulations and present a detailed analysis of the simulation results.展开更多
The k-median problem has attracted a number of researchers. However,few of them have considered both the dynamic environment and the issue of accuracy. In this paper,a new type of query is studied,called continuous me...The k-median problem has attracted a number of researchers. However,few of them have considered both the dynamic environment and the issue of accuracy. In this paper,a new type of query is studied,called continuous median monitoring (CMM) query. It considers the k-median problem under dynamic environment with an accuracy guarantee. A continuous group nearest neighbor based (CGB) algorithm and an average distance medoid (ADM) algorithm are proposed to solve the CMM problem. ADM is a hill climbing schemed algorithm and achieves a rapid converging speed by checking only qualified candidates. Experiments show that ADM is more efficient than CGB and outperforms the classical PAM (partitioning around medoids) and CLARANS (clustering large applications based on randomized search) algorithms with various parameter settings.展开更多
We study the generalizedk-median version of the warehouse-retailer network design problem(kWRND).We formulate the k-WRND as a binary integer program and propose a 6-approximation randomized algorithm based on Lagrangi...We study the generalizedk-median version of the warehouse-retailer network design problem(kWRND).We formulate the k-WRND as a binary integer program and propose a 6-approximation randomized algorithm based on Lagrangian relaxation.展开更多
文摘In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1 - ε)-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches.
基金This work is supported by Program for the Key Program of NSFC-Guangdong Union Foundation (U1135002), Major national S&T program (2011ZX03005-002), National Natural Science Foundation of China (60872041, 61072066), the Fundamental Research Funds for the Central Universities (JY10000903001, JY10000901034, K5051203010) and the GAD Pre-Research Foundation (9140A 15040210HK61 ).
文摘With the increasing number of web services, it becomes a difficult task for an ordinary user to select an appropriate service. Hence, it is conventional that users in a digital community network take part in a collaborative mechanism for the purpose of service selection. The participation usually brings unnecessary burdens for users, such as giving opinions, storing service information. Extra communication overhead hinders the performance of the network. Thus, the community administrators are facing a problem of how to obtain an overall service selection result for the whole community readily and effectively. To address this problem, we present a k-median facility location agent model. The model analyzes the procedure of service selection through five entities and six types of messages. Two algorithms are elaborated in pursuit of a global optimization concerning connection costs between users and facilities where services are deployed. To evaluate our model, we conduct extensive simulations and present a detailed analysis of the simulation results.
文摘The k-median problem has attracted a number of researchers. However,few of them have considered both the dynamic environment and the issue of accuracy. In this paper,a new type of query is studied,called continuous median monitoring (CMM) query. It considers the k-median problem under dynamic environment with an accuracy guarantee. A continuous group nearest neighbor based (CGB) algorithm and an average distance medoid (ADM) algorithm are proposed to solve the CMM problem. ADM is a hill climbing schemed algorithm and achieves a rapid converging speed by checking only qualified candidates. Experiments show that ADM is more efficient than CGB and outperforms the classical PAM (partitioning around medoids) and CLARANS (clustering large applications based on randomized search) algorithms with various parameter settings.
基金supported by National Basic Research Program of China(973 Program)(Grant No.2010CB732501)National Natural Science Foundation of China(Grant No.11071268)China Scholarship Council Scientific Research Common Program of Beijing Municipal Commission of Education(Grant No.KM201210005033)
文摘We study the generalizedk-median version of the warehouse-retailer network design problem(kWRND).We formulate the k-WRND as a binary integer program and propose a 6-approximation randomized algorithm based on Lagrangian relaxation.