Abstract: In many database applications, ranking queries may reference both text and numeric attributes, where the ranking functions are based on semantic distances/similarities for the text attributes and numeric distances for the numeric attributes. In this paper, we propose a new method for evaluating this type of ranking query over a relational database. Using statistics and training, the method builds a mechanism that combines the semantic and numeric distances, and this mechanism balances the effects of the text attributes and numeric attributes when matching a given query against tuples in a database search. The basic idea of the method is to create an index that uses WordNet to expand the tuple words semantically for the text attributes and that also incorporates the information of the numeric attributes. The candidate results for a query are retrieved with the index and a simple SQL selection statement, and the top-N answers are then obtained from the candidates. The results of extensive experiments indicate that the new strategy is both efficient and effective.
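The abstract does not spell out the combined ranking function, so the following is a minimal sketch of the general idea only: a weighted combination of a semantic closeness for a text attribute and a normalized closeness for a numeric attribute, followed by a top-N selection over the candidate tuples. The balancing weight `alpha`, the stubbed `semantic_similarity` (WordNet-based in the paper), and the toy relation are all illustrative assumptions.

```python
import heapq

def semantic_similarity(query_word, tuple_word):
    """Placeholder for a semantic similarity in [0, 1]; the paper derives
    this from WordNet, here it is stubbed with exact matching so the
    sketch stays self-contained."""
    return 1.0 if query_word.lower() == tuple_word.lower() else 0.0

def numeric_closeness(q_val, t_val, value_range):
    """Map a numeric distance into [0, 1]; 1 means identical values."""
    if value_range == 0:
        return 1.0
    return 1.0 - min(abs(q_val - t_val) / value_range, 1.0)

def tuple_score(query, row, alpha=0.5, price_range=100.0):
    """Combine text and numeric closeness with a balancing weight alpha
    (the paper learns this balance by statistics and training; 0.5 is
    only an illustrative default)."""
    text_part = semantic_similarity(query["make"], row["make"])
    num_part = numeric_closeness(query["price"], row["price"], price_range)
    return alpha * text_part + (1 - alpha) * num_part

def top_n(query, candidate_rows, n=3):
    """Rank the candidate tuples (e.g., fetched by a SQL selection) and
    keep the N best-scoring ones."""
    return heapq.nlargest(n, candidate_rows, key=lambda r: tuple_score(query, r))

# Hypothetical toy relation with one text and one numeric attribute.
rows = [
    {"make": "Toyota", "price": 120},
    {"make": "Honda", "price": 95},
    {"make": "Toyota", "price": 60},
]
print(top_n({"make": "toyota", "price": 100}, rows, n=2))
```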
Abstract: The smart grid has attracted great attention in recent years; it is poised to transform a centralized, producer-controlled network into a decentralized, consumer-interactive network supported by fine-grained monitoring. Large-scale wireless sensor networks (WSNs) are considered one of the most promising technologies for implementing the smart grid. WSNs are applied in almost every aspect of the smart grid, including power generation, transmission, distribution, utilization and dispatch, and data query processing for WSNs in the power grid has become a hot research issue because the volume of power-grid data is very large and the response-time requirements are stringent. To meet these demands, top-k query processing is a good choice: it performs a cooperative query by aggregating each database object's degree of match for the different query predicates and returning the k best-matching objects. In this paper, we propose a framework that effectively applies top-k queries to wireless sensor networks in the smart grid, based on a cluster-topology sensor network. In the new method, local indices are used to optimize the necessary query routing, and intermediate results are processed inside each cluster to cut down the data traffic; the hierarchical join query is then executed over the local results. In addition, the top-k query results are verified by a clean-up process, and two schemes are adopted to handle node dynamicity, which further reduces the communication cost. Case studies and experimental results show that our algorithm outperforms the existing one, producing higher-quality results with better efficiency.
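The abstract leaves the details of the local indices, hierarchical join, clean-up verification and dynamicity handling to the paper, so the sketch below only illustrates the basic traffic-saving idea under simplifying assumptions: each cluster head prunes its readings to a local top-k before forwarding, and the sink merges the local lists, which is sufficient when every object's score lives at a single node. The node readings and cluster layout are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

def local_top_k(cluster_readings: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """A cluster head keeps only its k best readings, so only k tuples
    per cluster travel toward the sink."""
    return heapq.nlargest(k, cluster_readings.items(), key=lambda kv: kv[1])

def sink_top_k(cluster_results: List[List[Tuple[str, float]]], k: int) -> List[Tuple[str, float]]:
    """The sink merges the local top-k lists; for per-node scores the
    global top-k is guaranteed to lie in this union."""
    merged = [item for partial in cluster_results for item in partial]
    return heapq.nlargest(k, merged, key=lambda kv: kv[1])

# Hypothetical readings (e.g., line temperatures) gathered per cluster.
cluster_a = {"node1": 71.2, "node2": 80.5, "node3": 66.0}
cluster_b = {"node4": 90.1, "node5": 58.3}

k = 2
partials = [local_top_k(cluster_a, k), local_top_k(cluster_b, k)]
print(sink_top_k(partials, k))   # -> [('node4', 90.1), ('node2', 80.5)]
```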
Funding: Supported by the 111 Project of China under Grant No. B08004.
Abstract: Top-k ranking of websites according to traffic volume is important for Internet Service Providers (ISPs) to understand network status and optimize network resources. However, the ranking result often deviates considerably from the actual rank because of unknown web traffic, which cannot be identified accurately with current techniques. In this paper, we introduce a novel method to approximate the actual rank. The method associates unknown web traffic with websites according to statistical probabilities. We then construct a probabilistic top-k query model to rank the websites. We conduct several experiments using real HTTP traffic traces collected from a commercial ISP covering an entire city in northern China. Experimental results show that the proposed techniques greatly reduce the deviation between the ranking results and the ground truth. In addition, we find that websites providing video services have a higher ratio of unknown IPs as well as a higher ratio of unknown traffic than websites providing text web page services. Specifically, the top-3 video websites have more than 90% unknown web traffic. These findings help ISPs understand network status and deploy Content Delivery Networks (CDNs).
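The statistical association model itself is not given in the abstract, so the snippet below is a hedged sketch of the underlying idea only: each unknown flow's volume is split across candidate websites in proportion to an association probability, and sites are then ranked by their expected total traffic. The flow identifiers, probabilities and volumes are invented toy values.

```python
import heapq

def expected_traffic(known_bytes, unknown_flows, assoc_prob):
    """Add to each site's known volume the probabilistically attributed
    share of every unknown flow; assoc_prob[flow][site] is a hypothetical
    association probability (the paper derives such probabilities
    statistically from the traces)."""
    totals = dict(known_bytes)
    for flow_id, volume in unknown_flows.items():
        for site, p in assoc_prob.get(flow_id, {}).items():
            totals[site] = totals.get(site, 0.0) + p * volume
    return totals

def probabilistic_top_k(known_bytes, unknown_flows, assoc_prob, k):
    """Rank sites by expected (known + attributed) traffic volume."""
    totals = expected_traffic(known_bytes, unknown_flows, assoc_prob)
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

# Toy example: two identified sites plus one unknown flow that is
# attributed 70/30 between them.
known = {"video.example": 500.0, "news.example": 420.0}
unknown = {"flow42": 300.0}
probs = {"flow42": {"video.example": 0.7, "news.example": 0.3}}
print(probabilistic_top_k(known, unknown, probs, k=2))
# -> [('video.example', 710.0), ('news.example', 510.0)]
```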
Funding: Taif University Researchers Supporting Project number (TURSP-2020/98), Taif University, Taif, Saudi Arabia.
Abstract: Web crawlers have evolved from performing the meagre task of collecting statistics to supporting security testing, web indexing and numerous other applications. The size and dynamism of the web make crawling an interesting and challenging task. Researchers have tackled various issues and challenges related to web crawling; one such issue is efficiently discovering hidden web data. Web crawlers' inability to work with form-based data, together with the lack of benchmarks and standards for both performance measures and evaluation datasets, keeps hidden web crawling an immature research domain. Applications such as vertical portals and data integration require hidden web crawling. Most existing methods are based on returning the top-k matches, which makes exhaustive crawling difficult: highly ranked documents are returned repeatedly, while low-ranked documents have slim chances of being retrieved. Discovering hidden web sources and ranking them by relevance is a core component of hidden web crawlers, yet ranking bias, heuristic approaches and the saturation of ranking algorithms lead to low coverage. This research presents an enhanced ranking algorithm based on a triplet formula for prioritizing hidden websites, which increases the coverage of the hidden web crawler.
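As a small illustration of the coverage problem described above (high-ranked documents keep reappearing in top-k result lists while low-ranked ones are rarely surfaced), the following toy simulation draws repeated top-k result lists from a source whose ranking mixes a fixed document "prestige" with query-specific noise, and reports what fraction of the corpus is ever retrieved. All parameters are illustrative assumptions and are unrelated to the paper's triplet formula.

```python
import random

def simulate_coverage(num_docs=1000, k=10, num_queries=200, seed=0):
    """Toy simulation: every query returns the k highest-scoring
    documents, and the score mixes a fixed per-document prestige with
    query-specific noise, so highly ranked documents keep reappearing
    while low-ranked ones are rarely surfaced."""
    rng = random.Random(seed)
    prestige = [rng.random() for _ in range(num_docs)]
    seen = set()
    for _ in range(num_queries):
        scores = [(0.8 * prestige[d] + 0.2 * rng.random(), d) for d in range(num_docs)]
        for _, doc in sorted(scores, reverse=True)[:k]:
            seen.add(doc)
    return len(seen) / num_docs

print(f"coverage after repeated top-k queries: {simulate_coverage():.1%}")
```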
Funding: This work was supported in part by the National Key R&D Program of China (No. 2021ZD0110303).
Abstract: While China has become the largest online market in the world, with approximately 1 billion internet users, Baidu runs the world's largest Chinese search engine, serving hundreds of millions of daily active users and responding to billions of queries per day. To handle the diverse query requests from users at web scale, Baidu has made tremendous efforts in understanding users' queries, retrieving relevant content from a pool of trillions of webpages, and ranking the most relevant webpages at the top of the results. Among the components used in Baidu search, learning to rank (LTR) plays a critical role, and an extremely large number of queries together with their relevant webpages must be labelled in a timely manner to train and update the online LTR models. To reduce the cost and time consumption of query/webpage labelling, in this work we study the problem of active learning to rank (active LTR), which selects unlabeled queries for annotation and training. Specifically, we first investigate the criterion of ranking entropy (RE), which characterizes the entropy of the relevant webpages under a query as produced by a sequence of online LTR models saved at different checkpoints, using a query-by-committee (QBC) method. We then explore a new criterion, prediction variance (PV), which measures the variance of the prediction results over all relevant webpages under a query. Our empirical studies find that RE tends to favor low-frequency queries from the pool for labelling, while PV prioritizes high-frequency queries. Finally, we combine these two complementary criteria as the sample selection strategy for active learning. Extensive experiments with comparisons to baseline algorithms show that the proposed approach can train LTR models that achieve a higher discounted cumulative gain (i.e., a relative improvement in DCG4 of 1.38%) with the same labelling budget.
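The abstract does not define RE and PV precisely, so the sketch below encodes one plausible reading of each: RE as the vote entropy over which webpage each committee checkpoint ranks first for a query, and PV as the variance of the committee-averaged scores across the query's webpages, combined with an illustrative weight `beta`. The checkpoint scores are hypothetical inputs, not the paper's data.

```python
import math
from statistics import pvariance
from typing import List

def ranking_entropy(committee_scores: List[List[float]]) -> float:
    """One query-by-committee reading of RE: entropy of the vote
    distribution over which document each checkpoint ranks first.
    committee_scores[m][d] is checkpoint m's score for document d."""
    votes = {}
    for scores in committee_scores:
        top_doc = max(range(len(scores)), key=lambda d: scores[d])
        votes[top_doc] = votes.get(top_doc, 0) + 1
    n = len(committee_scores)
    return -sum((c / n) * math.log(c / n) for c in votes.values())

def prediction_variance(committee_scores: List[List[float]]) -> float:
    """One reading of PV: variance of the committee-averaged scores
    across the query's candidate documents."""
    num_docs = len(committee_scores[0])
    mean_scores = [
        sum(scores[d] for scores in committee_scores) / len(committee_scores)
        for d in range(num_docs)
    ]
    return pvariance(mean_scores)

def selection_score(committee_scores: List[List[float]], beta: float = 0.5) -> float:
    """Combine the two complementary criteria with an illustrative weight."""
    return beta * ranking_entropy(committee_scores) + (1 - beta) * prediction_variance(committee_scores)

# Hypothetical committee of 3 checkpoints scoring 4 candidate webpages
# for one unlabeled query.
scores = [
    [0.9, 0.2, 0.4, 0.1],
    [0.3, 0.8, 0.5, 0.2],
    [0.7, 0.6, 0.9, 0.1],
]
print(selection_score(scores))
```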