Stratified sampling for data mining on the deep web (cited by: 4)
Authors: Tantan Liu, Fan Wang, Gagan Agrawal. Frontiers of Computer Science (SCIE, EI, CSCD), 2012, Issue 2, pp. 179-196 (18 pages)
In recent years, the deep web has become extremely popular. Like any other data source, data mining on the deep web can produce important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, and therefore, data mining must be performed by sampling the datasets. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems, association mining and differential rule mining. These are proposed to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source, and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs, compared with a stratified sampling method that only considers estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.
Keywords: deep web; association rule mining; stratified sampling
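The abstract mentions a sample allocation that weighs estimation error against sampling cost. The paper's own method is not reproduced in this record, so the sketch below uses the textbook cost-aware optimal allocation for stratified sampling instead (sample size in stratum h proportional to N_h * S_h / sqrt(c_h), scaled to a fixed budget). The function names, the per-stratum inputs, and the example numbers are all illustrative assumptions.

```python
import math


def cost_aware_allocation(sizes, stddevs, costs, budget):
    """Textbook cost-aware optimal allocation (not the paper's algorithm).

    sizes[h]   -- assumed number of records in stratum h (N_h)
    stddevs[h] -- assumed standard deviation within stratum h (S_h)
    costs[h]   -- assumed cost of drawing one sample from stratum h (c_h)
    budget     -- total sampling cost available

    Returns fractional sample sizes n_h whose total cost equals the budget.
    """
    if any(c <= 0 for c in costs):
        raise ValueError("per-sample costs must be positive")
    strata = list(zip(sizes, stddevs, costs))
    # n_h is proportional to N_h * S_h / sqrt(c_h).
    weights = [n * s / math.sqrt(c) for n, s, c in strata]
    # Normalizer chosen so that sum(c_h * n_h) == budget.
    denom = sum(n * s * math.sqrt(c) for n, s, c in strata)
    if denom == 0:
        raise ValueError("all strata have zero size or zero variance")
    return [budget * w / denom for w in weights]


def stratified_variance(sizes, stddevs, alloc):
    """Variance of the stratified mean estimator, ignoring the
    finite-population correction for simplicity."""
    total = sum(sizes)
    return sum(
        (n / total) ** 2 * s ** 2 / a
        for n, s, a in zip(sizes, stddevs, alloc)
        if a > 0
    )


# Hypothetical example: two equally sized strata with equal spread,
# where the second stratum is four times as expensive to query.
example_alloc = cost_aware_allocation(
    sizes=[1000, 1000], stddevs=[1.0, 1.0], costs=[1.0, 4.0], budget=300
)
```

In this hypothetical setting the cheaper stratum receives twice as many samples as the expensive one, which illustrates the general trade-off the abstract describes: spending the budget where each unit of cost buys the most reduction in estimation error.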