The leaching performance and leaching kinetics of LiFePO₄ (LFP) and Al in Al-bearing spent LFP cathode powder were systematically studied. The effects of temperature (273−368 K), stirring speed (200−950 r/min), reaction time (0−240 min), acid-to-material ratio (0.1:1−1:1 mL/g) and liquid-to-solid ratio (3:1−9:1 mL/g) on the leaching process were investigated. The results show that the reactant concentration and the temperature have a greater impact on the leaching of Al. Under the optimal conditions, the leaching efficiencies of LFP and Al are 91.53% and 15.98%, respectively. The kinetic study shows that the leaching of LFP is controlled by a mixed surface-reaction and diffusion mechanism, with an activation energy of 22.990 kJ/mol, whereas the leaching of Al is controlled by surface chemical reaction alone, with an activation energy of 46.581 kJ/mol. A low leaching temperature can therefore effectively suppress the dissolution of Al during acid leaching of the spent LFP cathode material.
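The closing conclusion follows directly from the Arrhenius behaviour of the two rate constants: because Al leaching has the larger activation energy, cooling slows it far more strongly than it slows LFP leaching. A minimal sketch of that comparison, assuming only a temperature-independent pre-exponential factor and the activation energies quoted above:

```python
import math

R = 8.314  # universal gas constant, J/(mol*K)
Ea = {"LFP": 22.990e3, "Al": 46.581e3}  # activation energies from the study, J/mol

def rate_ratio(Ea_J_per_mol, T_high, T_low):
    """Arrhenius ratio k(T_low)/k(T_high), assuming a constant pre-exponential factor."""
    return math.exp(-Ea_J_per_mol / R * (1.0 / T_low - 1.0 / T_high))

T_high, T_low = 368.0, 298.0  # example temperatures (K) within the studied range
for species, Ea_val in Ea.items():
    print(f"{species}: cooling {T_high:.0f} K -> {T_low:.0f} K slows the rate to "
          f"{rate_ratio(Ea_val, T_high, T_low):.3f} of its value")
```

With these numbers, cooling from 368 K to 298 K leaves LFP leaching at roughly 17% of its high-temperature rate but Al leaching at only about 3%, consistent with the suppression reported above.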
A heterogeneous computing (HC) environment utilizes diverse resources with different computational capabilities to solve computing-intensive applications that have diverse computational requirements and constraints. The task assignment problem in an HC environment can be formally defined as follows: given a set of tasks and a set of machines, assign the tasks to machines so that the makespan is minimized. In this paper we propose a new task scheduling heuristic, high standard deviation first (HSTDF), which uses the standard deviation of a task's expected execution time as the selection criterion. The standard deviation of the expected execution time of a task represents the amount of variation in its execution time across different machines. Our conclusion is that tasks with high standard deviation should be scheduled first. A large number of experiments were carried out to check the effectiveness of the proposed heuristic in different scenarios, and the comparison with existing heuristics (Max-min, Sufferage, Segmented Min-average, Segmented Min-min, and Segmented Max-min) clearly shows that the proposed heuristic outperforms them all in terms of average makespan.
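A minimal sketch of the selection rule described above, assuming an expected-time-to-compute (ETC) matrix as input: tasks are ordered by the standard deviation of their row, and each is then placed greedily on the machine that would finish it earliest. The toy matrix and the greedy placement step are illustrative assumptions rather than the paper's exact procedure.

```python
import statistics

def hstdf_schedule(etc):
    """etc[t][m]: expected execution time of task t on machine m.
    Returns (assignment, makespan) under a high-standard-deviation-first order."""
    n_machines = len(etc[0])
    ready = [0.0] * n_machines              # time at which each machine becomes free
    # order tasks by the spread of their execution times, largest spread first
    order = sorted(range(len(etc)), key=lambda t: statistics.pstdev(etc[t]), reverse=True)
    assignment = {}
    for t in order:
        # place the task on the machine that completes it earliest
        m = min(range(n_machines), key=lambda j: ready[j] + etc[t][j])
        assignment[t] = m
        ready[m] += etc[t][m]
    return assignment, max(ready)

# toy ETC matrix: 4 tasks x 3 machines (illustrative numbers)
etc = [[10, 50, 90], [20, 22, 21], [5, 80, 30], [40, 38, 42]]
print(hstdf_schedule(etc))
```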
It is of great significance to automatically generate code from structured flowcharts. Existing research has some deficiencies: key algorithms and technologies are not elaborated, and there are very few full-featured integrated development platforms that can generate code automatically from structured flowcharts. By analyzing the characteristics of structured flowcharts, a structure identification algorithm for structured flowcharts is put forward, and its correctness is verified by enumeration. Then, taking the identified flowchart as input, an automatic code generation algorithm is proposed, whose correctness is verified in the same way. Finally, an integrated development platform is built on these algorithms, covering flowchart modeling, automatic code generation, and CDT/GCC/GDB integration. The correctness and effectiveness of the proposed algorithms are verified through practical operation.
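As a toy illustration of the code-generation step only (not the identification or generation algorithms of the paper), once a flowchart has been recognized as nested sequence, selection and loop structures, emitting code reduces to a recursive walk over those structures:

```python
# Toy sketch: emit C-like code from an already-identified structured flowchart.
# The node classes and the output style are illustrative assumptions.
class Stmt:
    def __init__(self, text): self.text = text
    def emit(self, ind=0): return " " * ind + self.text + ";"

class Seq:
    def __init__(self, *parts): self.parts = parts
    def emit(self, ind=0): return "\n".join(p.emit(ind) for p in self.parts)

class If:
    def __init__(self, cond, then, other=None): self.cond, self.then, self.other = cond, then, other
    def emit(self, ind=0):
        pad = " " * ind
        out = f"{pad}if ({self.cond}) {{\n{self.then.emit(ind + 4)}\n{pad}}}"
        if self.other:
            out += f" else {{\n{self.other.emit(ind + 4)}\n{pad}}}"
        return out

class While:
    def __init__(self, cond, body): self.cond, self.body = cond, body
    def emit(self, ind=0):
        pad = " " * ind
        return f"{pad}while ({self.cond}) {{\n{self.body.emit(ind + 4)}\n{pad}}}"

# flowchart: read n; while n > 1, halve it or apply 3n+1 depending on parity
flow = Seq(Stmt("scanf(\"%d\", &n)"),
           While("n > 1", If("n % 2 == 0", Stmt("n = n / 2"), Stmt("n = 3 * n + 1"))))
print(flow.emit())
```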
Henoch-Schönlein purpura (HSP) is a small-vessel vasculitis mediated by IgA immune complex deposition. It is characterized by the clinical tetrad of non-thrombocytopenic palpable purpura, abdominal pain, arthritis and renal involvement. The diagnosis of HSP is difficult, especially when abdominal symptoms precede the cutaneous lesions. We report a rare case of HSP presenting as paroxysmal, drastic abdominal pain with gastrointestinal bleeding. The diagnosis was confirmed by the renal damage and the subsequent appearance of purpura.
Myeloid sarcomas (MS) involve extramedullary blast proliferation from one or more myeloid lineages that replaces the original tissue architecture; these neoplasms are also called granulocytic sarcomas, chloromas or extramedullary myeloid tumors. Such tumors develop in lymphoid organs, bones (e.g., the skull and orbits), skin, soft tissue, various mucosae, organs, and the central nervous system. Gastrointestinal (GI) involvement is rare, and the occurrence of myeloid sarcoma in patients without leukemia is even rarer. Here, we report the case of a 38-year-old man who presented with epigastric pain and progressive jaundice. Upper GI endoscopy showed extensive multifocal hyperemic fold thickening and spreading nodular lesions in the body of the stomach. Biopsies of the gastric lesions indicated myeloid sarcoma of the stomach, whereas concurrent peripheral blood and bone marrow examinations showed no evidence of acute myeloid leukemia. For diagnosis, immunohistochemical markers must be checked when evaluating a suspected myeloid sarcoma. Accurate MS diagnosis determines the appropriate therapy and prognosis.
Primary natural killer/T-cell (NK/T-cell) lymphoma of the gastrointestinal tract is a very rare disease with a poor prognosis, and the duodenum is quite extraordinary as a primary lesion site. Here, we describe a unique case of primary duodenal NK/T-cell lymphoma in a 26-year-old man who presented with abdominal pain and weight loss. An abdominal computed tomography scan demonstrated a hypodense tumor in the duodenum. Because of massive upper gastrointestinal tract bleeding during hospitalization, the patient underwent emergency upper gastrointestinal endoscopy. Under endoscopy, an irregular ulcer with mucosal edema, destruction, necrosis, a hyperplastic nodule and active bleeding was observed on the posterior duodenal wall. Following endoscopic hemostasis, a biopsy was obtained for pathological evaluation, and the lesion was subsequently confirmed to be a duodenal NK/T-cell lymphoma. The presenting symptoms in this patient were abdominal pain and gastrointestinal bleeding, and endoscopy was important for the diagnosis. Despite aggressive treatment, the prognosis was very poor.
Angioimmunoblastic T-cell lymphoma (AITL) is a unique type of peripheral T-cell lymphoma with a constellation of clinical symptoms and signs, including weight loss, fever, chills, anemia, skin rash, hepatosplenomegaly, lymphadenopathy, thrombocytopenia and polyclonal hypergammaglobulinemia. The histological features of AITL are also distinctive. Pure red cell aplasia is a bone marrow failure syndrome characterized by progressive normocytic anemia and reticulocytopenia without leucopenia or thrombocytopenia. AITL presenting with abdominal pain and pure red cell aplasia has rarely been reported. Here, we report a rare case of AITL-associated pure red cell aplasia with abdominal pain. The diagnosis was verified by biopsy of the enlarged abdominal lymph nodes with immunohistochemical staining.
Recently there has been an increasing need for interactive, human-driven analysis of large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data before a long wait for the final accurate query result, has drawn significant research attention. However, running OLA directly on duplicated data leads to incorrect query answers, since sampling from duplicate records over-represents the duplicated data in the sample, violating the uniform-distribution prerequisite of most statistical theory. In this paper, we propose CrowdOLA, a novel framework that integrates online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA retrieves block-level samples continuously from the dataset and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After cleaning the sample, an unbiased estimator is provided to correct the error bias introduced by the duplication. We evaluate CrowdOLA on both real-world and synthetic workloads, and the experimental results show that it provides a good balance between efficiency and accuracy.
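The paper's unbiased estimator is not reproduced here; the sketch below only illustrates the bias it has to correct, under the hypothetical assumption that entity resolution on the sample reveals how many duplicate records each sampled entity has. A naive sample mean over-weights heavily duplicated entities, while reweighting each record by the reciprocal of its duplicate count (a simple ratio estimator) recovers the entity-level mean.

```python
import random
random.seed(7)

# Hypothetical dirty table: duplication is correlated with the value,
# so a naive record-level sample mean is visibly biased upward.
entities = []
for _ in range(2000):
    v = random.gauss(100, 20)
    entities.append({"value": v, "dups": 6 if v > 100 else 1})
records = [e for e in entities for _ in range(e["dups"])]   # duplicated records

true_mean = sum(e["value"] for e in entities) / len(entities)

sample = random.sample(records, 400)                        # uniform record-level sample
naive = sum(r["value"] for r in sample) / len(sample)       # over-weights duplicated entities

# Reweight each sampled record by 1/dups (assumes the duplicate count is known
# after crowd-based resolution); this ratio estimator targets the entity-level mean.
corrected = (sum(r["value"] / r["dups"] for r in sample) /
             sum(1.0 / r["dups"] for r in sample))

print(f"entity mean {true_mean:.1f} | naive estimate {naive:.1f} | corrected {corrected:.1f}")
```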
Data quality is important in many data-driven applications, such as decision making, data analysis, and data mining. Recent studies focus on data cleaning techniques that delete or repair the dirty data, which may cause information loss and introduce new inconsistencies. To avoid these problems, we propose EntityManager, a general system that manages dirty data without data cleaning. The system takes the real-world entity as the basic storage unit, retrieves query results according to the quality requirements of users, and is able to handle all kinds of inconsistencies recognized by entity resolution. We describe the EntityManager system, covering its architecture, data model, and query processing techniques. To process queries efficiently, the system adopts novel indices, a similarity operator and query optimization techniques. Finally, we verify the efficiency and effectiveness of the system and present future research challenges.
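A hedged sketch of what an entity-as-storage-unit data model might look like, with conflicting attribute values kept side by side and filtered by the caller's quality requirement; the schema and confidence scores are illustrative assumptions, not EntityManager's actual design.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    key: str
    attrs: dict = field(default_factory=dict)   # attr -> list of (value, confidence)

    def get(self, attr, min_conf):
        """Return every stored value whose confidence meets the quality requirement."""
        return [v for v, c in self.attrs.get(attr, []) if c >= min_conf]

# one real-world entity holding the conflicting values produced by entity resolution
e = Entity("person:42", {"city": [("Harbin", 0.9), ("Haerbin", 0.4)],
                         "age": [(34, 0.7), (35, 0.3)]})
print(e.get("city", min_conf=0.5))   # strict requirement -> ['Harbin']
print(e.get("city", min_conf=0.3))   # relaxed requirement -> both spellings
```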
Incomplete data has been a longstanding issue in the database community, and the subject is still poorly handled by both theory and practice. One common way to cope with missing values is to impute (fill in) them as a preprocessing step before analysis. Unfortunately, no single imputation method can impute all missing values correctly in all cases, so users can hardly trust query results on such completed data without any confidence guarantee. In this paper, we propose to directly estimate the aggregate query result on incomplete data rather than impute the missing values. An interval estimate, composed of the upper and lower bounds of the aggregate query result over all possible interpretations of the missing values, is presented to the end user, and the ground-truth aggregate result is guaranteed to lie within the interval. We believe that decision support applications can benefit significantly from this estimation, since they tolerate inexact answers as long as there are clearly defined semantics and guarantees associated with the results. Our main techniques are parameter-free and do not assume prior knowledge about the distribution or the missingness mechanism. Experimental results are consistent with the theoretical results and suggest that the estimation is invaluable for better assessing the results of aggregate queries on incomplete data.
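A deliberately naive sketch of the interval idea for SUM and AVG, assuming the missing values range over a known attribute domain [lo, hi]; the paper's technique is parameter-free, so the known domain here is purely a simplification for illustration.

```python
# Bounds that enclose the aggregate under every interpretation of the missing values,
# given an assumed attribute domain [lo, hi]. None marks a missing value.
def sum_bounds(values, lo, hi):
    known = [v for v in values if v is not None]
    missing = sum(1 for v in values if v is None)
    return sum(known) + missing * lo, sum(known) + missing * hi

def avg_bounds(values, lo, hi):
    s_lo, s_hi = sum_bounds(values, lo, hi)
    n = len(values)
    return s_lo / n, s_hi / n

ages = [23, None, 41, 35, None, 29]
print(sum_bounds(ages, lo=0, hi=120))   # (128, 368)
print(avg_bounds(ages, lo=0, hi=120))   # (~21.3, ~61.3)
```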
Time intervals are often associated with tuples to represent their valid time in temporal relations, where the overlap join is crucial for various kinds of queries. Many existing overlap join algorithms use indices based on tree structures such as the quad-tree, B+-tree and interval tree. These algorithms usually have high CPU cost since deep path traversals are unavoidable, which makes them less competitive than data-partition or plane-sweep based algorithms. This paper proposes an efficient overlap join algorithm based on a new two-layer flat index named the Overlap Interval Inverted Index (O2i Index). The first layer uses an array to record the end points of intervals and approximates the nesting structure of intervals via two functions; the second layer uses inverted lists to trace all intervals satisfying the approximated nesting structure. With the help of the new index, the join algorithm visits only the lists that must be scanned and skips all others. Analyses and experiments on both real and synthetic datasets show that the proposed algorithm is as competitive as the state-of-the-art algorithms.
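For reference, a sketch of the kind of plane-sweep overlap join the index is compared against (this is the baseline, not the O2i algorithm): intervals from both relations are turned into start/end events, and every newly opened interval is joined with the currently open intervals of the other side.

```python
def overlap_join(r, s):
    """All (r_id, s_id) pairs whose half-open intervals [start, end) overlap."""
    # Build start/end events; at equal coordinates, ends sort before starts,
    # so merely touching intervals (end == start) are not reported.
    events = []
    for side, rel in ((0, r), (1, s)):
        for ident, (start, end) in rel.items():
            events.append((start, 1, side, ident))
            events.append((end, 0, side, ident))
    events.sort()
    active = ({}, {})          # currently open intervals on each side
    out = []
    for _, is_start, side, ident in events:
        if is_start:
            # a newly opened interval overlaps every open interval on the other side
            for other in active[1 - side]:
                out.append((ident, other) if side == 0 else (other, ident))
            active[side][ident] = True
        else:
            del active[side][ident]
    return out

r = {"r1": (1, 5), "r2": (6, 9)}
s = {"s1": (4, 7), "s2": (9, 12)}
print(overlap_join(r, s))      # [('r1', 's1'), ('r2', 's1')]
```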
A new scolopendra-type polymer, polydodecyloxybenzoyl[1,5]-diazocine (PDBD), was designed and prepared from 2,5-bis(4-(dodecyloxy)benzoyl)terephthaloyl azide with trifluoroacetic acid (TFA) via a one-pot reaction in good yield. The structure of the polymer was characterized by ¹H NMR, ¹³C NMR and MALDI-TOF spectra. PDBD exhibits good thermal stability, as measured by TGA and DSC, and dissolves well in common organic solvents such as chloroform and tetrahydrofuran. In addition, UV-Vis spectral studies indicate that PDBD shows unique optical property changes (protonation/deprotonation) in different trifluoroacetic acid environments. The new polymer is expected to serve as an optical functional material for fabricating optical sensors in environmental and biological fields.
Missing value imputation with crowdsourcing is a novel data cleaning method for capturing missing values that can hardly be filled by automatic approaches. However, the time cost and overhead of crowdsourcing are high, so we have to reduce the cost while guaranteeing the accuracy of crowdsourced imputation. To achieve this optimization goal, we present COSSET+, a crowdsourced framework optimized by a knowledge base, which combines the advantages of a knowledge-based filter and a crowdsourcing platform to capture missing values. Since the number of crowdsourced values affects the cost of COSSET+, we aim to select only a subset of the missing values to be crowdsourced. We prove that the crowd value selection problem is NP-hard and develop an approximation algorithm for it. Extensive experimental results demonstrate the efficiency and effectiveness of the proposed approaches.
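One generic way to realize such a selection, shown only as an illustration (the abstract does not describe the actual approximation algorithm): rank candidate missing cells by estimated benefit per unit crowdsourcing cost and take them greedily until a budget is exhausted.

```python
# Generic benefit/cost greedy for deciding which missing cells to send to the crowd
# under a budget. The benefit and cost numbers, and the greedy rule itself, are
# illustrative assumptions rather than the paper's algorithm.
def select_for_crowd(cells, budget):
    """cells: list of (cell_id, benefit, cost). Greedy by benefit-per-cost."""
    chosen, spent = [], 0.0
    for cell_id, benefit, cost in sorted(cells, key=lambda c: c[1] / c[2], reverse=True):
        if spent + cost <= budget:
            chosen.append(cell_id)
            spent += cost
    return chosen, spent

cells = [("t3.city", 0.9, 0.05), ("t7.age", 0.2, 0.05), ("t9.employer", 0.8, 0.10)]
print(select_for_crowd(cells, budget=0.12))   # picks 't3.city' and 't7.age'
```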
Low data quality is a serious problem in the new era of big data; it can severely reduce the usability of data, mislead or bias querying, analysis and mining, and lead to huge losses. Incomplete data is common among low-quality data, and it is necessary to determine the completeness of a dataset to provide hints for follow-up operations on it. Little existing work focuses on the completeness of a dataset, and such work views all missing values as unknown values. In this paper, we study how to determine the real data completeness of a relational dataset. By taking advantage of given functional dependencies, we aim to determine some missing attribute values from other tuples and to identify the truly missing attribute cells. We propose a data completeness model, formalize the problem of determining the real data completeness of a relational dataset, and give a lower bound on the time complexity of this problem. Two optimal algorithms to determine the data completeness of a dataset in different cases are proposed. We empirically show the effectiveness and scalability of our algorithms on both real-world data and synthetic data.
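The core idea can be illustrated directly: under a functional dependency X → Y, a tuple whose Y is missing is still determinable whenever another tuple agrees with it on X and has Y present. The schema and FD below are illustrative; the paper's optimal algorithms are more involved.

```python
# Under the FD lhs -> rhs, a missing rhs value is determined whenever another tuple
# shares the same lhs value and has rhs filled; everything else is truly missing.
def determinable_cells(tuples, lhs, rhs):
    """Return (index, value) pairs for tuples whose missing `rhs` is determined by the FD."""
    known = {}
    for t in tuples:
        if t.get(rhs) is not None and all(t.get(a) is not None for a in lhs):
            known[tuple(t[a] for a in lhs)] = t[rhs]
    filled = []
    for i, t in enumerate(tuples):
        if t.get(rhs) is None and tuple(t.get(a) for a in lhs) in known:
            filled.append((i, known[tuple(t.get(a) for a in lhs)]))
    return filled

rows = [{"zip": "150001", "city": "Harbin"},
        {"zip": "150001", "city": None},        # determinable via zip -> city
        {"zip": "100080", "city": None}]        # truly missing
print(determinable_cells(rows, lhs=["zip"], rhs="city"))   # [(1, 'Harbin')]
```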
String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and-refine framework; they cannot capture the dissimilarity between string subsets and do not fully exploit statistics such as character frequencies. We investigate a partition-based algorithm that uses such statistics. Frequency vectors are used to partition the dataset into data chunks between which dissimilarity is easy to capture, and a novel algorithm is designed to accelerate SSJ on the partitioned data. A new filter is proposed that leverages the statistics to avoid computing edit distances for a noticeable proportion of the candidate pairs that survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
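One frequency-based filter of this flavor is the classic count filter, shown here as an example of statistic-based pruning rather than as the paper's specific filter: since every edit operation adds or removes at most one character occurrence, the surplus and deficit of character counts lower-bound the edit distance.

```python
from collections import Counter

def count_filter_lower_bound(s, t):
    """Lower bound on ed(s, t) from character frequencies."""
    cs, ct = Counter(s), Counter(t)
    surplus = sum((cs - ct).values())   # occurrences s has but t lacks
    deficit = sum((ct - cs).values())   # occurrences t has but s lacks
    return max(surplus, deficit)

def may_be_similar(s, t, tau):
    """False means ed(s, t) > tau for sure; True means the pair must still be refined."""
    return count_filter_lower_bound(s, t) <= tau

print(may_be_similar("database", "databases", tau=1))    # True  -> refine
print(may_be_similar("database", "mitochondria", tau=2)) # False -> pruned
```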
Array partitioning is an important research problem in array management, since the partitioning strategy strongly influences storage, query evaluation, and other components of array management systems. Meanwhile, compression is highly needed for array data due to its growing volume. Observing that array partitioning can affect compression performance significantly, this paper aims to design an efficient partitioning method for array data that optimizes compression performance. To the best of our knowledge, there has been little research effort on this problem. In this paper, the problem of array partitioning for optimizing compression performance (PPCP for short) is first formulated. We adopt a popular compression technique that allows queries to be processed on the compressed data without decompression. Because the problem is NP-hard, two essential principles for exploring the partitioning solution are introduced, which explain the core idea of the proposed partitioning algorithms. The first principle shows that compression performance can be improved if an array can be partitioned into two parts with different sparsities. The second principle introduces a greedy strategy that supports the heuristic selection of partitioning positions. Guided by these two principles, two greedy partitioning algorithms are designed for the independent case and the dependent case, respectively. Since the algorithm for the dependent case is expensive, a further optimization based on random sampling and dimension grouping is proposed to achieve linear time cost. Finally, experiments are conducted on both synthetic and real-life data, and the results show that the two proposed partitioning algorithms achieve better performance in both compression and query evaluation.
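A toy one-dimensional reading of the first principle, showing only the selection criterion (real arrays are multi-dimensional and the paper's algorithms choose many positions greedily): scan the candidate split positions and keep the one that separates the array into the most differently sparse parts.

```python
# Pick the 1-D split that maximizes the sparsity gap between the two parts.
def sparsity(chunk):
    return sum(1 for x in chunk if x == 0) / len(chunk)

def best_split(arr):
    best_pos, best_gap = None, -1.0
    for pos in range(1, len(arr)):            # split into arr[:pos] and arr[pos:]
        gap = abs(sparsity(arr[:pos]) - sparsity(arr[pos:]))
        if gap > best_gap:
            best_pos, best_gap = pos, gap
    return best_pos, best_gap

arr = [7, 3, 9, 4, 0, 0, 0, 1, 0, 0]          # dense head, sparse tail
print(best_split(arr))                         # (4, ~0.83): split right after the dense prefix
```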
A kernel is a data summary carefully extracted from a large dataset: given a problem, the solution obtained from the kernel approximates the solution obtained from the whole dataset with a provable approximation ratio. Kernels are widely used in geometric optimization, clustering, approximate query processing, etc., to scale these tasks up to massive data. In this paper, we focus on the minimum ε-kernel (MK) computation, which asks for a kernel of the smallest size for large-scale data processing. Regarding the open problem posed by Wang et al. of whether the minimum ε-coreset (MC) problem and the MK problem can be reduced to each other, we first formalize the MK problem and analyze its complexity. Due to the NP-hardness of the MK problem in three or more dimensions, an approximation algorithm, the Set Cover-Based Minimum ε-Kernel algorithm (SCMK), is developed to solve it, and we prove that the MC problem and the MK problem can be Turing-reduced to each other. We then discuss updating the MK under insertion and deletion operations. Finally, a randomized algorithm, the Randomized Set Cover-Based Minimum ε-Kernel algorithm (RA-SCMK), is used to further reduce the complexity of SCMK. The efficiency and effectiveness of SCMK and RA-SCMK are verified by experiments on real-world and synthetic datasets. The kernel sizes found by SCMK are 2x and 17.6x smaller than those of an ANN-based method on real-world and synthetic datasets, respectively, and the speedup ratio of SCMK over the ANN-based method is 5.67 on synthetic datasets. RA-SCMK runs up to three times faster than SCMK on synthetic datasets.
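A set-cover-based kernel algorithm presumably relies on the classic greedy set-cover approximation as a subroutine; the sketch below shows that generic greedy step on illustrative sets and is not the SCMK algorithm itself.

```python
# Classic greedy set cover: repeatedly pick the set covering the most still-uncovered
# elements. The sets below are illustrative placeholders.
def greedy_set_cover(universe, sets):
    """sets: dict name -> frozenset of covered elements. Returns chosen set names."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        name = max(sets, key=lambda n: len(sets[n] & uncovered))   # max marginal coverage
        if not sets[name] & uncovered:
            raise ValueError("universe cannot be covered by the given sets")
        chosen.append(name)
        uncovered -= sets[name]
    return chosen

universe = range(1, 8)
sets = {"A": frozenset({1, 2, 3, 4}), "B": frozenset({4, 5, 6}),
        "C": frozenset({6, 7}), "D": frozenset({1, 5})}
print(greedy_set_cover(universe, sets))   # ['A', 'B', 'C']
```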