Funding: Outstanding Youth Foundation of Hunan Provincial Department of Education (Grant No. 22B0911).
Abstract: In this paper, we introduce the censored composite conditional quantile coefficient (cC-CQC) to rank the relative importance of each predictor in high-dimensional censored regression. The cC-CQC takes advantage of all useful information across quantiles and can effectively detect nonlinear effects, including interactions and heterogeneity. Furthermore, the proposed screening method based on the cC-CQC is robust to outliers and enjoys the sure screening property. Simulation results demonstrate that the proposed method performs competitively on survival datasets with high-dimensional predictors, particularly when the variables are highly correlated.
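As a rough illustration of quantile-based marginal screening (not the paper's censored cC-CQC, whose exact definition is given in the paper), the sketch below ranks predictors by an uncensored marginal quantile correlation averaged over several quantile levels; the data, the quantile levels, and the scoring rule are placeholders.

```python
import numpy as np

def quantile_correlation(y, x, tau):
    """Marginal quantile correlation of y on x at level tau (uncensored sketch)."""
    q_tau = np.quantile(y, tau)                       # unconditional tau-th quantile of y
    psi = tau - (y < q_tau).astype(float)             # derivative of the check loss
    qcov = np.mean(psi * (x - x.mean()))              # quantile covariance
    return qcov / np.sqrt((tau - tau**2) * np.var(x))

def composite_screen(X, y, taus=(0.25, 0.5, 0.75)):
    """Rank predictors by the average absolute quantile correlation across levels."""
    scores = np.array([
        np.mean([abs(quantile_correlation(y, X[:, j], t)) for t in taus])
        for j in range(X.shape[1])
    ])
    return np.argsort(scores)[::-1], scores            # indices sorted by importance

# Toy example: only the first two of 200 predictors matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=300)
order, scores = composite_screen(X, y)
print(order[:5])
```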
Abstract: The estimation of covariance matrices is very important in many fields, such as statistics. In real applications, data are frequently affected by high dimensionality and noise, yet most relevant studies are based on complete data. This paper studies the optimal estimation of high-dimensional covariance matrices based on missing and noisy samples under the norm. First, the model with sub-Gaussian additive noise is presented. The generalized sample covariance is then modified to define a hard thresholding estimator, and the minimax upper bound is derived. After that, the minimax lower bound is derived, and it is concluded that the estimator presented in this article is rate-optimal. Finally, numerical simulation analysis is performed. The results show that for missing samples with sub-Gaussian noise, if the true covariance matrix is sparse, the hard thresholding estimator outperforms the traditional estimation method.
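The core idea of hard thresholding a covariance estimate can be sketched as follows. This is a generic version that zeroes small off-diagonal entries of the ordinary sample covariance, not the paper's generalized estimator for missing, noisy data, and the threshold constant is an illustrative choice.

```python
import numpy as np

def hard_threshold_cov(X, c=1.0):
    """Hard-threshold the sample covariance: keep entries whose magnitude
    exceeds c * sqrt(log(p) / n), zeroing the rest (diagonal kept)."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)                 # ordinary sample covariance
    lam = c * np.sqrt(np.log(p) / n)            # common sparsity threshold level
    T = np.where(np.abs(S) >= lam, S, 0.0)      # hard thresholding
    np.fill_diagonal(T, np.diag(S))             # never threshold the diagonal
    return T

# Toy example: a nearly diagonal (sparse) truth recovered from noisy samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
Sigma_hat = hard_threshold_cov(X, c=0.8)
print(np.count_nonzero(Sigma_hat), "nonzero entries")
```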
Funding: the National Natural Science Foundation of China (Nos. 60533090 and 60603096), the National Hi-Tech Research and Development Program (863) of China (No. 2006AA010107), the Key Technology R&D Program of China (No. 2006BAH02A13-4), the Program for Changjiang Scholars and Innovative Research Team in University of China (No. IRT0652), and the Cultivation Fund of the Key Scientific and Technical Innovation Project of MOE, China (No. 706033)
Abstract: Recently a new clustering algorithm called 'affinity propagation' (AP) has been proposed, which efficiently clusters sparsely related data by passing messages between data points. However, in many cases we want to cluster large-scale data whose similarities are not sparse. This paper presents two variants of AP for grouping large-scale data with a dense similarity matrix. The local approach is partition affinity propagation (PAP) and the global method is landmark affinity propagation (LAP). PAP passes messages within subsets of the data first and then merges the results as the initialization for subsequent iterations; it can effectively reduce the number of iterations of clustering. LAP passes messages between the landmark data points first and then clusters the non-landmark data points; it is a global approximation method to speed up clustering. Experiments are conducted on many datasets, such as random data points, manifold subspaces, images of faces, and Chinese calligraphy, and the results demonstrate that the two approaches are feasible and practicable.
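A minimal landmark-style approximation in the spirit of LAP can be sketched with scikit-learn's AffinityPropagation: run AP on a random subset of landmarks, then assign every remaining point to its most similar exemplar. This is an illustrative approximation, not the authors' exact message-passing scheme, and the landmark count is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))                      # large dataset with dense similarities

# Step 1: run affinity propagation on a small set of landmark points.
landmarks = rng.choice(len(X), size=500, replace=False)
ap = AffinityPropagation(random_state=0).fit(X[landmarks])
exemplars = X[landmarks][ap.cluster_centers_indices_]

# Step 2: assign all points (landmark and non-landmark) to the nearest exemplar.
labels = pairwise_distances(X, exemplars).argmin(axis=1)
print(len(exemplars), "clusters found")
```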
Funding: Supported by the National Natural Science Foundation of China (No. 61502475) and the Importation and Development of High-Caliber Talents Project of the Beijing Municipal Institutions (No. CIT&TCD201504039)
Abstract: The performance of conventional similarity measurement methods is seriously affected by the curse of dimensionality of high-dimensional data. The reason is that the data difference in sparse and noisy dimensions occupies a large proportion of the similarity, leading to unreliable dissimilarities between results. A similarity measurement method for high-dimensional data based on a normalized net lattice subspace is proposed. The data range of each dimension is divided into several intervals, and the components in different dimensions are mapped onto the corresponding intervals. Only the components in the same or adjacent intervals are used to calculate the similarity. To validate this method, three data types are used, and seven common similarity measurement methods are compared. The experimental results indicate that the relative difference of the method increases with the dimensionality and is approximately two or three orders of magnitude higher than that of the conventional methods. In addition, the similarity range of this method in different dimensions is [0, 1], which is suitable for similarity analysis after dimensionality reduction.
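The interval-mapping idea can be sketched as below: each dimension is split into a fixed number of equal-width bins, and only dimensions whose bin indices are equal or adjacent contribute to the similarity. The bin count and the per-dimension contribution (one minus the normalized distance) are illustrative assumptions, not the paper's exact formula.

```python
import numpy as np

def lattice_similarity(a, b, lo, hi, n_bins=10):
    """Similarity in [0, 1] that ignores dimensions whose bins are far apart."""
    width = (hi - lo) / n_bins
    bins_a = np.clip(((a - lo) / width).astype(int), 0, n_bins - 1)
    bins_b = np.clip(((b - lo) / width).astype(int), 0, n_bins - 1)
    usable = np.abs(bins_a - bins_b) <= 1          # same or adjacent interval only
    if not usable.any():
        return 0.0
    diff = np.abs(a[usable] - b[usable]) / (hi[usable] - lo[usable])
    return float(np.mean(1.0 - diff))              # average per-dimension agreement

rng = np.random.default_rng(2)
X = rng.uniform(size=(100, 64))
lo, hi = X.min(axis=0), X.max(axis=0)
print(lattice_similarity(X[0], X[1], lo, hi))
```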
Funding: Supported by the National Major Scientific and Technological Special Project for "Significant New Drugs Development" (No. 2018ZX09201008) and the Special Fund Project for Information Development from Shanghai Municipal Commission of Economy and Information (No. 201701013)
Abstract: Regional healthcare platforms collect clinical data from hospitals in specific areas for the purpose of healthcare management. It is a common requirement to reuse these data for clinical research. However, we have to face challenges such as the inconsistency of terminology in electronic health records (EHR) and the complexities of data quality and data formats on a regional healthcare platform. In this paper, we propose a methodology and process for constructing large-scale cohorts, which form the basis of causality and comparative-effectiveness studies in epidemiology. We first constructed a Chinese terminology knowledge graph to deal with the diversity of vocabularies on the regional platform. Second, we built special disease case repositories (e.g., a heart failure repository) that use the graph to search for related patients and to normalize the data. Based on the requirements of a clinical study that aimed to explore the effect of statin use on 180-day readmission in patients with heart failure, we built a large-scale retrospective cohort with 29647 heart failure cases from the heart failure repository. After propensity score matching, a study group (n=6346) and a control group (n=6346) with parallel clinical characteristics were obtained. Logistic regression analysis showed that taking statins was negatively correlated with 180-day readmission in heart failure patients. This paper presents the workflow and an application example of big data mining based on regional EHR data.
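Propensity score matching of the kind described above can be sketched as follows: fit a logistic model for treatment (statin use) on baseline covariates, then greedily pair each treated patient with the nearest-score untreated patient. The column names and the caliper-free greedy matching are illustrative assumptions, not the study's exact protocol.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def match_by_propensity(df, treatment_col, covariate_cols):
    """Greedy 1:1 nearest-neighbor matching on the estimated propensity score."""
    model = LogisticRegression(max_iter=1000).fit(df[covariate_cols], df[treatment_col])
    df = df.assign(ps=model.predict_proba(df[covariate_cols])[:, 1])
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0].copy()
    pairs = []
    for idx, row in treated.iterrows():
        if control.empty:
            break
        j = (control["ps"] - row["ps"]).abs().idxmin()   # closest remaining control
        pairs.append((idx, j))
        control = control.drop(index=j)                  # match without replacement
    return pairs

# Hypothetical toy data standing in for the heart-failure cohort.
rng = np.random.default_rng(3)
df = pd.DataFrame({"age": rng.normal(70, 10, 500),
                   "ef": rng.normal(40, 8, 500),
                   "statin": rng.integers(0, 2, 500)})
print(len(match_by_propensity(df, "statin", ["age", "ef"])), "matched pairs")
```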
Funding: Science and Technology Innovation 2030 Major Project of "New Generation Artificial Intelligence", granted by the Ministry of Science and Technology, Grant Number 2020AAA0109300.
Abstract: In the process of constructing domain-specific knowledge graphs, the task of relational triple extraction plays a critical role in transforming unstructured text into structured information. Existing relational triple extraction models face multiple challenges when processing domain-specific data, including insufficient utilization of semantic interaction information between entities and relations, difficulties in handling challenging samples, and the scarcity of domain-specific datasets. To address these issues, our study introduces three innovative components: relation semantic enhancement, data augmentation, and a voting strategy, all designed to significantly improve the model's performance on domain-specific relational triple extraction tasks. We first propose an innovative attention interaction module, which significantly enhances the semantic interaction capabilities between entities and relations by integrating semantic information from relation labels. Second, we propose a voting strategy that effectively combines the strengths of large language models (LLMs) and fine-tuned small pre-trained language models (SLMs) to reevaluate challenging samples, thereby improving the model's adaptability to specific domains. Additionally, we explore the use of LLMs for data augmentation, aiming to generate domain-specific datasets that alleviate the scarcity of domain data. Experiments conducted on three domain-specific datasets demonstrate that our model outperforms existing comparative models in several aspects, with F1 scores exceeding the state-of-the-art models by 2%, 1.6%, and 0.6%, respectively, validating the effectiveness and generalizability of our approach.
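One way such a voting strategy might look is sketched below: the fine-tuned SLM's low-confidence triples are re-scored by an LLM, and a triple is kept only if the SLM is confident or the LLM confirms it. The functions slm_extract and llm_judge and the confidence cutoff are hypothetical placeholders; the paper's actual voting rule may differ.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

def slm_extract(sentence: str) -> List[Tuple[Triple, float]]:
    """Placeholder for the fine-tuned small model: returns triples with confidences."""
    return [(("aspirin", "treats", "headache"), 0.95),
            (("aspirin", "causes", "headache"), 0.41)]

def llm_judge(sentence: str, triple: Triple) -> bool:
    """Placeholder for an LLM prompt that answers whether the triple is supported."""
    return triple[1] == "treats"

def vote(sentence: str, threshold: float = 0.6) -> List[Triple]:
    """Keep confident SLM triples directly; defer uncertain ones to the LLM."""
    accepted = []
    for triple, conf in slm_extract(sentence):
        if conf >= threshold or llm_judge(sentence, triple):
            accepted.append(triple)
    return accepted

print(vote("Aspirin is commonly used to treat headache."))
```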
Abstract: Large Language Models (LLMs) have revolutionized Generative Artificial Intelligence (GenAI) tasks, becoming an integral part of various applications in society, including text generation, translation, summarization, and more. However, their widespread usage emphasizes the critical need to enhance their security posture to ensure the integrity and reliability of their outputs and to minimize harmful effects. Prompt injection and training data poisoning attacks are two of the most prominent vulnerabilities in LLMs; they could potentially lead to unpredictable and undesirable behaviors, such as biased outputs, misinformation propagation, and even malicious content generation. The Common Vulnerability Scoring System (CVSS) framework provides a standardized approach to capturing the principal characteristics of vulnerabilities, facilitating a deeper understanding of their severity within the security and AI communities. By extending the current CVSS framework, we generate scores for these vulnerabilities so that organizations can prioritize mitigation efforts, allocate resources effectively, and implement targeted security measures to defend against potential risks.
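For reference, the standard CVSS v3.1 base score (which the paper extends) combines exploitability and impact sub-scores; a sketch for an unchanged-scope vector is below. The metric weights and formula follow the published FIRST specification as I recall them, and the example vector for a prompt-injection finding is purely illustrative.

```python
import math

# CVSS v3.1 metric weights (scope unchanged), per the FIRST specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR = {"N": 0.85, "L": 0.62, "H": 0.27}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}

def base_score(av, ac, pr, ui, c, i, a):
    """CVSS v3.1 base score for an unchanged-scope vulnerability."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return math.ceil(min(impact + exploitability, 10) * 10) / 10  # round up to 0.1

# Illustrative vector for a network-reachable prompt injection requiring user interaction.
print(base_score("N", "L", "N", "R", "H", "H", "N"))
```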
Funding: Supported by the National Natural Science Foundation of China (60675039), the National High Technology Research and Development Program of China (863 Program) (2006AA04Z217), and the Hundred Talents Program of the Chinese Academy of Sciences
Funding: Supported by the National Science Foundation of China (1107116)
Abstract: In this paper, we study the large-time behavior of periodic solutions for parabolic conservation laws. There is no smallness assumption on the initial data. We first obtain the local existence of the solution by an iterative scheme, then we obtain exponential decay estimates for the solution by the energy method and the maximum principle, and thereby obtain the global solution at the same time.
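For orientation, a scalar parabolic (viscous) conservation law with periodic initial data typically takes the form below; this generic statement is my assumption about the class of equations studied, since the abstract does not display the equation.

```latex
\[
  u_t + \operatorname{div} f(u) = \Delta u, \qquad
  u(x,0) = u_0(x), \qquad u_0(x + L e_i) = u_0(x), \quad i = 1,\dots,n,
\]
```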
Funding: Supported by the National Natural Science Foundation of China (No. 61300078), the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. CIT&TCD201504039), the Funding Project for Academic Human Resources Development in Beijing Union University (No. BPHR2014A03, Rk100201510), and the "New Start" Academic Research Projects of Beijing Union University (No. Hzk10201501)
Abstract: Problems exist in similarity measurement and index tree construction that affect the performance of nearest neighbor search over high-dimensional data. The equidistance problem is solved by using the NPsim function to calculate similarity, and a sequential NPsim matrix is built to improve indexing performance. Combining these innovations, a nearest neighbor search algorithm for high-dimensional data based on the sequential NPsim matrix is proposed and compared with nearest neighbor search algorithms based on the KD-tree or SR-tree on the Munsell spectral data set. Experimental results show that the similarity of the proposed algorithm is better than that of the other algorithms, and its search speed is thousands of times faster. In addition, the slow construction of the sequential NPsim matrix can be accelerated by parallel computing.
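The idea of answering nearest-neighbor queries from a precomputed, row-sorted similarity matrix can be sketched as below. The npsim placeholder is a stand-in for the paper's NPsim function (whose definition is not given here); any symmetric similarity could be plugged in.

```python
import numpy as np

def npsim(a, b):
    """Placeholder similarity; the actual NPsim definition is in the cited paper."""
    return 1.0 / (1.0 + np.linalg.norm(a - b))

def build_sequential_matrix(X):
    """Precompute all pairwise similarities and sort each row's neighbor indices."""
    n = len(X)
    S = np.array([[npsim(X[i], X[j]) for j in range(n)] for i in range(n)])
    order = np.argsort(-S, axis=1)       # most similar first; column 0 is the point itself
    return S, order

def k_nearest(order, i, k):
    """k nearest neighbors of point i are read directly from the sorted row."""
    return order[i, 1:k + 1]

X = np.random.default_rng(4).normal(size=(200, 30))
S, order = build_sequential_matrix(X)
print(k_nearest(order, 0, 5))
```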
Funding: Supported by the Natural Science Foundation of China (No. 60373061).
Abstract: With volume sizes increasing, it is necessary to develop a highly efficient compression algorithm that is suitable for progressive refinement between the data server and the browsing client. For three-dimensional large volume data, an efficient hierarchical algorithm based on wavelet compression was presented, using intra-band dependencies of wavelet coefficients. Firstly, after applying blockwise hierarchical wavelet decomposition to the large volume data, the block significance map was obtained by using one bit to indicate the significance or insignificance of each block. Secondly, a coefficient block was further subdivided into eight sub-blocks if any significant coefficient existed in it, and the process was repeated, resulting in an incomplete octree. One bit was used to indicate significance or insignificance, and only significant coefficients were stored in the data stream. Finally, the significant coefficients were quantized and compressed by arithmetic coding. The experimental results show that the proposed algorithm achieves good compression ratios and is suited for random access of data blocks. The results also show that the proposed algorithm can be applied to progressive transmission of 3D volume data.
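The recursive significance test behind the incomplete octree can be sketched as follows: a block of wavelet coefficients is emitted as a single 0 bit if all magnitudes fall below the threshold; otherwise a 1 bit is emitted and the block is split into eight octants, down to a minimum size (here 2, an illustrative choice), whose significant coefficients are collected for later quantization and arithmetic coding. The threshold and block sizes are assumptions for illustration.

```python
import numpy as np

def encode_block(block, threshold, bits, coeffs, min_size=2):
    """Emit significance bits and gather significant coefficients, octree style."""
    if np.max(np.abs(block)) < threshold:
        bits.append(0)                      # whole block insignificant: one bit suffices
        return
    bits.append(1)
    if min(block.shape) <= min_size:        # leaf block: store its coefficients
        coeffs.extend(block.ravel().tolist())
        return
    hx, hy, hz = (s // 2 for s in block.shape)
    for ox in (0, hx):                      # recurse into the eight octants
        for oy in (0, hy):
            for oz in (0, hz):
                encode_block(block[ox:ox + hx, oy:oy + hy, oz:oz + hz],
                             threshold, bits, coeffs, min_size)

rng = np.random.default_rng(5)
volume_coeffs = rng.laplace(scale=0.5, size=(16, 16, 16))   # stand-in wavelet coefficients
bits, coeffs = [], []
encode_block(volume_coeffs, threshold=2.0, bits=bits, coeffs=coeffs)
print(len(bits), "significance bits,", len(coeffs), "stored coefficients")
```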
Abstract: This paper makes a study of interactive digital generalization, in which map generalization is divided into an intellective reasoning procedure and an operational procedure, carried out by the human and the computer, respectively. An interactive map generalization environment for large-scale topographic maps is then designed and realized. This research focuses on: ① the significance of researching an interactive map generalization environment, ② the features of large-scale topographic maps and interactive map generalization, and ③ the construction of a map-generalization-oriented database platform.
Abstract: A method is presented in this work that integrates both emerging and mature data sources to estimate operational travel demand at fine spatial and temporal resolutions. By analyzing individuals' mobility patterns revealed by their mobile phones, researchers and practitioners are now equipped to derive the largest trip samples for a region. Because of the ubiquitous use of mobile phones, the extensive coverage of telecommunication services, and high penetration rates, travel demand can be studied continuously at fine spatial and temporal resolutions. The derived sample or seed trip matrices are coupled with surveyed commute flow data and prevalent travel demand modeling techniques to provide estimates of the total regional travel demand in the form of origin-destination (OD) matrices. The methodology is evaluated in a series of real-world transportation planning studies and has proved its potential in application areas such as dynamic traffic assignment modeling, integrated corridor management, and online traffic simulation.
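One common way to expand a phone-derived seed OD matrix to match surveyed totals is iterative proportional fitting (Furness balancing); the sketch below assumes this standard technique as an illustration, since the abstract does not name the exact expansion method, and the seed matrix and margins are toy values.

```python
import numpy as np

def ipf(seed, row_totals, col_totals, iters=50):
    """Scale a seed OD matrix so its row/column sums match surveyed totals."""
    od = seed.astype(float).copy()
    for _ in range(iters):
        od *= (row_totals / od.sum(axis=1))[:, None]   # match trip productions
        od *= (col_totals / od.sum(axis=0))[None, :]   # match trip attractions
    return od

seed = np.array([[5.0, 2.0, 1.0],
                 [3.0, 8.0, 2.0],
                 [1.0, 2.0, 6.0]])          # phone-derived seed trips between 3 zones
productions = np.array([1200.0, 1800.0, 900.0])
attractions = np.array([1000.0, 1500.0, 1400.0])
print(np.round(ipf(seed, productions, attractions)))
```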
Abstract: Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines' leaves/petioles and their associated spectral reflectance in order to understand the fruit ripening rate, water status, nutrient levels, and disease risk. In this paper, we use imaging spectroscopy (hyperspectral) reflectance data for the reflective 330 - 2510 nm wavelength region (986 total spectral bands) to assess vineyard nutrient status; this constitutes a high-dimensional dataset with a covariance matrix that is ill-conditioned. The identification of the variables (wavelength bands) that contribute useful information for nutrient assessment and prediction plays a pivotal role in multivariate statistical modeling. In recent years, researchers have successfully developed many continuous, nearly unbiased, sparse, and accurate variable selection methods to overcome this problem. This paper compares four regularized and one functional regression methods for wavelength variable selection: Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis. Thereafter, the predictive performance of these regularized sparse models is enhanced using stepwise regression. This comparative study of regression methods on a high-dimensional and highly correlated grapevine hyperspectral dataset revealed that Elastic Net variable selection yields the best predictive ability.
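Elastic Net wavelength selection of the kind compared above can be sketched with scikit-learn; the nutrient response vector, the band matrix, and the l1_ratio grid are placeholders rather than the study's actual data or tuning.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 986))                  # stand-in reflectance matrix: 986 bands
beta = np.zeros(986)
beta[[40, 300, 700]] = [1.5, -2.0, 1.0]          # only three bands carry signal
y = X @ beta + rng.normal(scale=0.5, size=120)   # stand-in nutrient concentration

# Cross-validated Elastic Net over a small grid of l1/l2 mixing ratios.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=50_000)
model.fit(StandardScaler().fit_transform(X), y)

selected = np.flatnonzero(model.coef_)           # bands with nonzero coefficients
print(f"{selected.size} bands selected, e.g. indices {selected[:10]}")
```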
Funding: The work was supported in part by the Research Fund for the Doctoral Program of Higher Education of China (No. 20060255006)
Abstract: Large high-dimensional data have posed great challenges to existing algorithms for frequent itemset mining. To solve the problem, a hybrid method consisting of a novel row enumeration algorithm and a column enumeration algorithm is proposed. The intention of the hybrid method is to decompose the mining task into two subtasks and then choose appropriate algorithms to solve each of them. The novel algorithm, Inter-transaction, is based on the characteristic that there are few common items between or among long transactions. In addition, an optimization technique is adopted to improve the performance of the intersection of bit-vectors. Experiments on synthetic data show that our method achieves high performance on large high-dimensional data.
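Bit-vector intersection for support counting can be sketched as below: each item's occurrences are stored as a Python integer bitset over transaction ids, and the support of an itemset is the popcount of the AND of its items' bitsets. This is a generic illustration of the data structure, not the paper's Inter-transaction algorithm.

```python
from itertools import combinations

transactions = [
    {"a", "b", "c"}, {"a", "c"}, {"b", "c", "d"}, {"a", "b", "c", "d"}, {"c", "d"},
]

# Build one bit-vector per item: bit t is set if transaction t contains the item.
bitvec = {}
for t, items in enumerate(transactions):
    for item in items:
        bitvec[item] = bitvec.get(item, 0) | (1 << t)

def support(itemset):
    """Number of transactions containing every item: popcount of the AND."""
    bits = ~0
    for item in itemset:
        bits &= bitvec[item]
    return bin(bits & ((1 << len(transactions)) - 1)).count("1")

# Frequent pairs with minimum support 3.
pairs = [(p, support(p)) for p in combinations(sorted(bitvec), 2) if support(p) >= 3]
print(pairs)
```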
Funding: Supported by the National Basic Research Program of China (973 Program) (2009CB320601), the National Natural Science Foundation of China (60774048, 60821063), the Program for Cheung Kong Scholars, and the Research Fund for the Doctoral Program of China Higher Education (20070145015)
Abstract: This paper studies the problem of sampled-data reliable H∞ control for uncertain continuous-time fuzzy large-scale systems with time-varying delays. First, the fuzzy hyperbolic model (FHM) is used to model certain complex large-scale systems. Then, based on the Lyapunov direct method and the decentralized control theory of large-scale systems, linear matrix inequality (LMI)-based conditions are derived to guarantee H∞ performance not only when all control components are operating well, but also in the face of some possible actuator failures. Moreover, the exact failure parameters of the actuators are not required; only the lower and upper bounds of the failure parameters are needed. The conditions depend on the upper bound of the time delay and do not depend on the derivative of the time-varying delay. Therefore, the obtained results are less conservative. Finally, two examples are provided to illustrate the design procedure and its effectiveness.
Abstract: Large-scale data pose a great challenge to data storage, management, and data analysis. This article analyzes the basic concepts of large data and makes a simple comparison of the main technologies used for large data. The paper then puts forward a regionally characterized platform based on an electronic business information publishing system. Finally, the paper gives the general model of the platform and the realization of its structure, key technologies, and process. The platform uses the conversion technology of the StrutsCX framework based on the J2EE platform and XSLT parsing templates over the XML document tree to generate sites with automated platform construction features for the user; it can quickly set up a tourism industry application component in a plug-in manner.
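XSLT template transformation of an XML document tree, as mentioned above, can be sketched with lxml; the stylesheet and the XML fragment are made-up examples, not the platform's actual templates.

```python
from lxml import etree

xml_doc = etree.XML("""
<tours>
  <tour><name>West Lake</name><price>120</price></tour>
  <tour><name>Yellow Mountain</name><price>300</price></tour>
</tours>""")

xslt_doc = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/tours">
    <ul>
      <xsl:for-each select="tour">
        <li><xsl:value-of select="name"/> - <xsl:value-of select="price"/> RMB</li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt_doc)          # compile the stylesheet once, reuse per request
print(str(transform(xml_doc)))            # renders the XML tree as an HTML list
```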
Abstract: With the rapid development of China's economy and the accelerating pace of economic globalization, trade in goods and materials is expanding rapidly, and the spatial movement of goods is expanding in breadth and depth; this places higher requirements on the efficiency, rapid response capability, and information level of logistics activities. Meanwhile, the personalization, diversification, and sophistication of logistics needs require that logistics service companies constantly improve and optimize their business models and develop new logistics services to adapt to changes in the logistics market and improve their competitiveness. Modern logistics enterprises take the concept of modern logistics as a guide and use modern logistics organization and technology to help customers reduce logistics costs and improve the level of integrated logistics services, thereby rationalizing logistics. Modern logistics enterprises have higher requirements in terms of philosophy, mode of operation, services, degree of information technology, logistics technology, enterprise systems, and other aspects; rapid reaction, serialized services, standardized operations, systematic targets, and modern means are the features that distinguish them from traditional logistics enterprises.
Abstract: Movie trailers originate from the movies themselves. Compared with two-hour-long films, trailers are short, and it takes certain skills to include moderate spoilers and sufficient gimmicks in this short period of time while whetting the audience's appetite. Films are a result of cutting, and so are trailers, but it is obviously insufficient to analyze their relationship only from the angle of editing techniques and art. This study takes the perspective of large data: the clips of the trailers are abstracted into visual scenes by comparing the shots on different tracks in editing software. In this way, we can get the time codes of the shots in the feature. Furthermore, by putting the time codes of the trailers back into the movies, we can draw a vivid diagram of the trailer and the film and directly reveal the relationship between the trailers and the features.
Funding: Supported by grants from CAS, the National Key R&D Program of China, and the National Natural Science Foundation of China
Abstract: Making accurate forecasts or predictions is a challenging task in the big data era, in particular for datasets involving high-dimensional variables but short-term time series points, which are what real-world systems generally provide. To address this issue, Prof.