11 articles found
iHDFS: A Distributed File System Supporting Incremental Computing
1
Authors: Zhenhua Wang, Qingsong Ding, Fuxiang Gao, Derong Shen, Ge Yu. 《国际计算机前沿大会会议论文集》, 2015, No. 1, pp. 44-45 (2 pages)
Big data are often processed repeatedly with only small changes between runs, and this is a major form of big data processing. Because the data change incrementally, an incremental computing mode can greatly improve performance. HDFS, the distributed file system of Hadoop, the most popular platform for big data analytics, adopts a fixed-size chunking policy, which is inefficient for incremental computing. This paper therefore proposes iHDFS (incremental HDFS), a distributed file system that provides a basic guarantee for parallel processing of big data. iHDFS is implemented as an extension to HDFS. It applies the Rabin fingerprint algorithm to achieve content-defined chunking. This policy makes chunk boundaries much more stable, so intermediate processing results can be reused efficiently and the performance of incremental data processing improves significantly. Experimental results demonstrate the effectiveness and efficiency of iHDFS.
Keywords: incremental computing; distributed file system; big data; HDFS
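The content-defined chunking mentioned in the abstract above can be illustrated with a minimal Python sketch. This is not the paper's implementation: it substitutes a simple polynomial rolling hash for a true Rabin fingerprint, and the names `cdc_chunks`, `fixed_chunks`, `WINDOW`, `BASE`, `MOD`, and `MASK`, along with their values, are assumptions chosen for the example.

```python
import random

# All constants below are illustrative assumptions, not values from the paper.
WINDOW = 16            # number of trailing bytes covered by the rolling hash
BASE = 257             # polynomial base of the rolling hash
MOD = (1 << 31) - 1    # modulus keeping the hash bounded
MASK = (1 << 6) - 1    # cut when the low 6 bits are zero (about 64-byte chunks)


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data wherever the hash of the last WINDOW bytes matches the mask.

    A polynomial rolling hash stands in for a Rabin fingerprint here.
    """
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that just slid out of the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        # The cut decision depends only on the window content, not on offsets.
        if i >= WINDOW - 1 and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Fixed-size chunking, shown only for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


if __name__ == "__main__":
    rng = random.Random(0)
    data = bytes(rng.randrange(256) for _ in range(4000))
    edited = b"X" + data  # a one-byte insertion at the front
    for name, split in (("content-defined", cdc_chunks), ("fixed-size", fixed_chunks)):
        old, new = split(data), set(split(edited))
        reused = sum(chunk in new for chunk in old)
        print(f"{name}: {reused} of {len(old)} chunks reused")
```

Because each cut in this sketch depends only on the last `WINDOW` bytes, a one-byte insertion at the front changes only the first chunk, whereas fixed-size chunking shifts every boundary. That stability is what lets intermediate results keyed by chunk content be reused.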
A link prediction method based on hierarchical hybrid feature graphs (cited by: 6)
2
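Authors: Dong Li (李冬), Derong Shen (申德荣), Yue Kou (寇月), Meng'er Lin (林梦儿), Tiezheng Nie (聂铁铮), Ge Yu (于戈). Scientia Sinica Informationis (《中国科学:信息科学》) (CSCD, PKU Core), 2020, No. 2, pp. 221-238 (18 pages)
Real-world entities, together with the relationships among them, form a network structure known as a heterogeneous information network. Link prediction can identify links that exist in such a network but have not been observed, or that may appear in the future, which helps users understand how the network's structure forms and evolves. However, current link prediction techniques do not fuse multiple kinds of features effectively, which hurts prediction accuracy, and they adapt poorly to the heterogeneity and dynamics of heterogeneous information networks. This paper proposes a hierarchical hybrid feature graph (HHFG) model that takes full account of the topological, semantic, and temporal features of heterogeneous information networks. It also proposes an HHFG-based link prediction algorithm that performs random walks on the HHFG using the hybrid features and learns parameters such as feature weights and transition coefficients by gradient descent, which effectively ensures prediction accuracy. Experiments verify the feasibility and effectiveness of the proposed key techniques.
Keywords: link prediction; hierarchical hybrid feature graph; heterogeneous information network; random walk; parameter learning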
A genetic algorithm based entity resolution approach with active learning (cited by: 1)
3
Authors: Chenchen SUN, Derong SHEN, Yue KOU, Tiezheng NIE, Ge YU. Frontiers of Computer Science (SCIE, EI, CSCD), 2017, No. 1, pp. 147-159 (13 pages)
Entity resolution is a key aspect of data quality and data integration: it identifies which records in data sources correspond to the same real-world entity. Many existing approaches require manually designed match rules, which demands domain knowledge and is time-consuming. We propose a novel genetic-algorithm-based entity resolution approach via active learning. It learns effective match rules by logically combining comparisons of several different attributes with proper thresholds. We use active learning to reduce the amount of manually labeled data and speed up the learning process. An extensive evaluation shows that the proposed approach outperforms state-of-the-art entity resolution approaches in accuracy.
Keywords: entity resolution; genetic algorithm; active learning; data quality; data integration
Discovering context-aware conditional functional dependencies (cited by: 1)
4
Authors: Yuefeng DU, Derong SHEN, Tiezheng NIE, Yue KOU, Ge YU. Frontiers of Computer Science (SCIE, EI, CSCD), 2017, No. 4, pp. 688-701 (14 pages)
Conditional functional dependencies (CFDs) are important techniques for data consistency. However, CFDs are limited in their ability to 1) provide reasonable values for consistency repairing and 2) detect potential errors. This paper presents context-aware conditional functional dependencies (CCFDs), which help provide reasonable values and detect potential errors. In particular, we focus on automatically discovering minimal CCFDs. We present context relativity to measure the relationship between CFDs. The overlap of related CFDs can provide reasonable values, which results in more accurate consistency repairing, and some related CFDs are combined into CCFDs. Moreover, we prove that discovering minimal CCFDs is NP-complete, and we design a precise method and a heuristic method. We also present the dominating value to facilitate the process in both methods. Additionally, the context relativity of the CFDs affects the cleaning results, so we suggest an approximate threshold of context relativity according to the data distribution. Our empirical evaluation shows that the repairing results are more accurate.
Keywords: conditional functional dependencies; context-aware; rules discovery
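To make the notion of a conditional functional dependency concrete, the sketch below checks one plain CFD against a small table. It covers standard CFDs with constant conditions on the left-hand side only, not the CCFDs or the discovery methods from the paper. The function name `cfd_violations`, the row format, and the sample data are all hypothetical.

```python
from collections import defaultdict


def cfd_violations(rows, lhs, rhs, pattern):
    """Find violations of the CFD (lhs -> rhs) restricted to rows matching pattern.

    rows    : list of dicts, one per tuple (hypothetical format)
    lhs     : attribute names on the left-hand side
    rhs     : single attribute name on the right-hand side
    pattern : constants a row must carry for the dependency to apply
    Returns the groups of rows that agree on lhs but disagree on rhs.
    """
    groups = defaultdict(list)
    for row in rows:
        # The dependency is conditional: it applies only to matching rows.
        if all(row[attr] == value for attr, value in pattern.items()):
            groups[tuple(row[attr] for attr in lhs)].append(row)
    return [g for g in groups.values() if len({r[rhs] for r in g}) > 1]


# Made-up sample data: within the UK, zip is assumed to determine city.
rows = [
    {"country": "UK", "zip": "EH1", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH1", "city": "London"},
    {"country": "US", "zip": "10001", "city": "New York"},
    {"country": "US", "zip": "10001", "city": "Newark"},
]

violations = cfd_violations(rows, lhs=["zip"], rhs="city", pattern={"country": "UK"})
for group in violations:
    print("conflicting rows:", group)
```

The two US rows also disagree on `city`, but they are not flagged because the condition `country = "UK"` does not select them. This is the sense in which the dependency is conditional.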
Strongly connected components based efficient computation of page rank (cited by: 3)
5
Authors: Hongguo YANG, Derong SHEN, Yue KOU, Tiezheng NIE, Ge YU. Frontiers of Computer Science (SCIE, EI, CSCD), 2018, No. 6, pp. 1208-1219 (12 pages)
Information networks fusion based on multi-task coordination
6
Authors: Dong LI, Derong SHEN, Yue KOU, Tiezheng NIE. Frontiers of Computer Science (SCIE, EI, CSCD), 2021, No. 4, pp. 121-132 (12 pages)
Information networks provide a powerful representation of entities and the relationships between them. Information networks fusion is a technique for information fusion that jointly reasons about entities, links and relations in the presence of various sources. However, existing methods for information networks fusion tend to rely on a single task, which might not yield enough evidence for reasoning. To solve this issue, we present a novel model called MC-INFM (information networks fusion model based on multi-task coordination). Unlike traditional models, MC-INFM casts the fusion problem as a probabilistic inference problem and collectively performs multiple tasks (including entity resolution, link prediction and relation matching) to infer the final result of fusion. First, we define the intra-features and the inter-features and model them as factor graphs, which provide abundant evidence for inference. Then, we use a conditional random field (CRF) to learn the weight of each feature and infer the results of these tasks simultaneously by performing maximum probabilistic inference. Experiments demonstrate the effectiveness of our proposed model.
Keywords: information networks fusion; multi-task coordination; conditional random field; inference
Disk based pay-as-you-go record linkage
7
Authors: Chenchen Sun, Derong Shen. Frontiers of Computer Science (SCIE, EI, CSCD), 2022, No. 4, pp. 227-229 (3 pages)
1 Introduction. Record linkage (RL) groups records corresponding to the same entities in datasets, and is a long-standing topic in the data management and mining communities [1-2]. In the big data era, real-time data applications have become popular and call for pay-as-you-go RL (PRL), which produces as many match pairs as possible in a very limited time (much shorter than the overall RL runtime).
Keywords: record linkage; mining
Biomedical entity linking based on less labeled data
8
Authors: Yu HU, Derong SHEN, Tiezheng NIE, Yue KOU, Ge YU. Frontiers of Computer Science (SCIE, EI, CSCD), 2022, No. 3, pp. 219-221 (3 pages)
1 Introduction. Biomedical entity linking aims to align a natural language entity with the knowledge base concept that refers to the same real-world object. Recent solutions in the biomedical field focus on embedding-based methods that jointly model the texts and the knowledge graph in a multi-dimensional entity space. However, biomedical entity linking still faces tough challenges: 1) The ambiguity of natural language descriptions, including polysemy and abbreviation.
Keywords: abbreviation; tough; ambiguity
An integrated pipeline model for biomedical entity alignment
9
Authors: Yu HU, Tiezheng NIE, Derong SHEN, Yue KOU, Ge YU. Frontiers of Computer Science (SCIE, EI, CSCD), 2021, No. 3, pp. 81-95 (15 pages)
Biomedical entity alignment, composed of two subtasks, entity identification and entity-concept mapping, is of great research value in biomedical text mining, and these techniques are widely used for named entity standardization, information retrieval, knowledge acquisition and ontology construction. Previous works put much effort into feature engineering in order to employ feature-based models for entity identification and alignment. However, models that depend on subjective feature selection may suffer from error propagation and cannot utilize hidden information. With the rapid development of health-related research, researchers need an effective method to explore the large amount of available biomedical literature. Therefore, we propose a two-stage entity alignment process, the biomedical entity exploring model, to identify biomedical entities and align them to the knowledge base interactively. The model aims to automatically obtain semantic information for extracting biomedical entities and mining semantic relations through the standard biomedical knowledge base. The experiments show that the proposed method achieves better performance on entity alignment. The proposed model improves the F1 scores of the task by about 4.5% in entity identification and 2.5% in entity-concept mapping.
Keywords: entity alignment; biomedical text mining; neural network model
Efficient private multi-party numerical records matching
10
Authors: Shumin Han, Derong Shen, Tiezheng Nie, Yue Kou, Ge Yu. Frontiers of Computer Science (SCIE, EI, CSCD), 2020, No. 5, pp. 233-235 (3 pages)
1 Introduction and main contributions. Private entity matching (PEM) [1] finds records from two or more data sources that refer to the same or similar individuals, without revealing any information other than the matched records. Numerous works have addressed PEM.
Keywords: matching; matched; revealing
HPPQ: A Parallel Package Queries Processing Approach for Large-Scale Data
11
Authors: Meihui Shi, Derong Shen, Tiezheng Nie, Yue Kou, Ge Yu. Big Data Mining and Analytics, 2018, No. 2, pp. 146-159 (14 pages)
Many scholars have focused on developing effective techniques for package queries, and many excellent approaches have been proposed. Unfortunately, most of the existing methods focus on a small volume of data. The rapid increase in data volume means that traditional methods of package queries find it difficult to meet the increasing requirements. To solve this problem, a novel optimization method for package queries (HPPQ) is proposed in this paper. First, the data is preprocessed into regions. Data preprocessing segments the dataset into multiple subsets, and the centroid of each subset is used for package queries, which effectively reduces the volume of candidate results. Furthermore, an efficient heuristic algorithm (namely IPOL-HS) is proposed based on the preprocessing results. This improves the quality of the candidate results in the iterative stage and improves the convergence rate of the heuristic algorithm. Finally, a strategy called HPR is proposed, which relies on a greedy algorithm and parallel processing to accelerate queries. The experimental results show that our method can significantly reduce time consumption compared with existing methods.
Keywords: package queries; heuristic algorithms; parallel processing; opposition-based learning