The scarcity of bilingual parallel corpus imposes limitations on exploiting the state-of-the-art supervised translation technology.One of the research directions is employing relations among multi-modal data to enhanc...The scarcity of bilingual parallel corpus imposes limitations on exploiting the state-of-the-art supervised translation technology.One of the research directions is employing relations among multi-modal data to enhance perfor-mance.However,the reliance on manually annotated multi-modal datasets results in a high cost of data labeling.In this paper,the topic semantics of images is proposed to alleviate the above problem.First,topic-related images can be auto-matically collected from the Internet by search engines.Second,topic semantics is sufficient to encode the relations be-tween multi-modal data such as texts and images.Specifically,we propose a visual topic semantic enhanced translation(VTSE)model that utilizes topic-related images to construct a cross-lingual and cross-modal semantic space,allowing the VTSE model to simultaneously integrate the syntactic structure and semantic features.In the above process,topic similar texts and images are wrapped into groups so that the model can extract more robust topic semantics from a set of similar images and then further optimize the feature integration.The results show that our model outperforms competitive base-lines by a large margin on the Multi30k and the Ambiguous COCO datasets.Our model can use external images to bring gains to translation,improving data efficiency.展开更多
In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data stream...In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1 - ε)-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches.展开更多
Location privacy receives considerable attentions in emerging location based services.Most current practices however either ignore users' preferences or incompletely fulfill privacy preferences.In this paper,we propo...Location privacy receives considerable attentions in emerging location based services.Most current practices however either ignore users' preferences or incompletely fulfill privacy preferences.In this paper,we propose a privacy protection solution to allow users' preferences in the fundamental query of k nearest neighbors (kNN).Particularly,users are permitted to choose privacy preferences by specifying minimum inferred region.Via Hilbert curve based transformation,the additional workload from users' preferences is alleviated.Furthermore,this transformation reduces time-expensive region queries in 2-D space to range the ones in 1-D space.Therefore,the time efficiency,as well as communication efficiency,is greatly improved due to clustering properties of Hilbert curve.Further,details of choosing anchor points are theoretically elaborated.The empirical studies demonstrate that our implementation delivers both flexibility for users' preferences and scalability for time and communication costs.展开更多
基金supported by the National Natural Science Foundation of China under Grant No.52178034.
文摘The scarcity of bilingual parallel corpus imposes limitations on exploiting the state-of-the-art supervised translation technology.One of the research directions is employing relations among multi-modal data to enhance perfor-mance.However,the reliance on manually annotated multi-modal datasets results in a high cost of data labeling.In this paper,the topic semantics of images is proposed to alleviate the above problem.First,topic-related images can be auto-matically collected from the Internet by search engines.Second,topic semantics is sufficient to encode the relations be-tween multi-modal data such as texts and images.Specifically,we propose a visual topic semantic enhanced translation(VTSE)model that utilizes topic-related images to construct a cross-lingual and cross-modal semantic space,allowing the VTSE model to simultaneously integrate the syntactic structure and semantic features.In the above process,topic similar texts and images are wrapped into groups so that the model can extract more robust topic semantics from a set of similar images and then further optimize the feature integration.The results show that our model outperforms competitive base-lines by a large margin on the Multi30k and the Ambiguous COCO datasets.Our model can use external images to bring gains to translation,improving data efficiency.
文摘In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1 - ε)-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches.
基金Supported by the National Natural Science Foundation of China under Grant Nos. 61003057 and 60973023
文摘Location privacy receives considerable attentions in emerging location based services.Most current practices however either ignore users' preferences or incompletely fulfill privacy preferences.In this paper,we propose a privacy protection solution to allow users' preferences in the fundamental query of k nearest neighbors (kNN).Particularly,users are permitted to choose privacy preferences by specifying minimum inferred region.Via Hilbert curve based transformation,the additional workload from users' preferences is alleviated.Furthermore,this transformation reduces time-expensive region queries in 2-D space to range the ones in 1-D space.Therefore,the time efficiency,as well as communication efficiency,is greatly improved due to clustering properties of Hilbert curve.Further,details of choosing anchor points are theoretically elaborated.The empirical studies demonstrate that our implementation delivers both flexibility for users' preferences and scalability for time and communication costs.