Online learning is a very important means of study, and has been adopted in many countries worldwide. However, only recently are researchers able to collect and analyze massive online learning datasets due to the COVI...Online learning is a very important means of study, and has been adopted in many countries worldwide. However, only recently are researchers able to collect and analyze massive online learning datasets due to the COVID-19 epidemic. In this article, we analyze the difference between online learner groups by using an unsupervised machine learning technique, i.e., k-prototypes clustering. Specifically, we use questionnaires designed by domain experts to collect various online learning data, and investigate students’ online learning behavior and learning outcomes through analyzing the collected questionnaire data. Our analysis results suggest that students with better learning media generally have better online learning behavior and learning result than those with poor online learning media. In addition, both in economically developed or undeveloped regions, the number of students with better learning media is less than the number of students with poor learning media. Finally, the results presented here show that whether in an economically developed or an economically undeveloped region, the number of students who are enriched with learning media available is an important factor that affects online learning behavior and learning outcomes.展开更多
The clustering of objects(individuals or variables)is one of the most used approaches to exploring multivariate data.The two most common unsupervised clustering strategies are hierarchical ascending clustering(HAC)and...The clustering of objects(individuals or variables)is one of the most used approaches to exploring multivariate data.The two most common unsupervised clustering strategies are hierarchical ascending clustering(HAC)and k-means partitioning used to identify groups of similar objects in a dataset to divide it into homogeneous groups.The proposed topological clustering of variables,called TCV,studies an homogeneous set of variables defined on the same set of individuals,based on the notion of neighborhood graphs,some of these variables are more-or-less correlated or linked according to the type quantitative or qualitative of the variables.This topological data analysis approach can then be useful for dimension reduction and variable selection.It’s a topological hierarchical clustering analysis of a set of variables which can be quantitative,qualitative or a mixture of both.It arranges variables into homogeneous groups according to their correlations or associations studied in a topological context of principal component analysis(PCA)or multiple correspondence analysis(MCA).The proposed TCV is adapted to the type of data considered,its principle is presented and illustrated using simple real datasets with quantitative,qualitative and mixed variables.The results of these illustrative examples are compared to those of other variables clustering approaches.展开更多
Data mining and analytics involve inspecting and modeling large pre-existing datasets to discover decision-making information.Precision agriculture uses datamining to advance agricultural developments.Many farmers are...Data mining and analytics involve inspecting and modeling large pre-existing datasets to discover decision-making information.Precision agriculture uses datamining to advance agricultural developments.Many farmers aren’t getting the most out of their land because they don’t use precision agriculture.They harvest crops without a well-planned recommendation system.Future crop production is calculated by combining environmental conditions and management behavior,yielding numerical and categorical data.Most existing research still needs to address data preprocessing and crop categorization/classification.Furthermore,statistical analysis receives less attention,despite producing more accurate and valid results.The study was conducted on a dataset about Karnataka state,India,with crops of eight parameters taken into account,namely the minimum amount of fertilizers required,such as nitrogen,phosphorus,potassium,and pH values.The research considers rainfall,season,soil type,and temperature parameters to provide precise cultivation recommendations for high productivity.The presented algorithm converts discrete numerals to factors first,then reduces levels.Second,the algorithm generates six datasets,two fromCase-1(dataset withmany numeric variables),two from Case-2(dataset with many categorical variables),and one from Case-3(dataset with reduced factor variables).Finally,the algorithm outputs a class membership allocation based on an extended version of the K-means partitioning method with lambda estimation.The presented work produces mixed-type datasets with precisely categorized crops by organizing data based on environmental conditions,soil nutrients,and geo-location.Finally,the prepared dataset solves the classification problem,leading to a model evaluation that selects the best dataset for precise crop prediction.展开更多
Today, Linear Mixed Models (LMMs) are fitted, mostly, by assuming that random effects and errors have Gaussian distributions, therefore using Maximum Likelihood (ML) or REML estimation. However, for many data sets, th...Today, Linear Mixed Models (LMMs) are fitted, mostly, by assuming that random effects and errors have Gaussian distributions, therefore using Maximum Likelihood (ML) or REML estimation. However, for many data sets, that double assumption is unlikely to hold, particularly for the random effects, a crucial component </span></span><span style="font-family:Verdana;"><span style="font-family:Verdana;"><span style="font-family:Verdana;">in </span></span></span><span style="font-family:Verdana;"><span style="font-family:Verdana;"><span style="font-family:Verdana;">which assessment of magnitude is key in such modeling. Alternative fitting methods not relying on that assumption (as ANOVA ones and Rao</span></span></span><span style="font-family:Verdana;"><span style="font-family:Verdana;"><span style="font-family:Verdana;">’</span></span></span><span style="font-family:Verdana;"><span style="font-family:Verdana;"><span style="font-family:Verdana;">s MINQUE) apply, quite often, only to the very constrained class of variance components models. In this paper, a new computationally feasible estimation methodology is designed, first for the widely used class of 2-level (or longitudinal) LMMs with only assumption (beyond the usual basic ones) that residual errors are uncorrelated and homoscedastic, with no distributional assumption imposed on the random effects. A major asset of this new approach is that it yields nonnegative variance estimates and covariance matrices estimates which are symmetric and, at least, positive semi-definite. Furthermore, it is shown that when the LMM is, indeed, Gaussian, this new methodology differs from ML just through a slight variation in the denominator of the residual variance estimate. The new methodology actually generalizes to LMMs a well known nonparametric fitting procedure for standard Linear Models. Finally, the methodology is also extended to ANOVA LMMs, generalizing an old method by Henderson for ML estimation in such models under normality.展开更多
近年来,混合型数据的聚类问题受到广泛关注。作为处理混合型数据的一种有效方法,K-prototype聚类算法在初始化聚类中心时通常采用随机选取的策略,然而这种策略在很多实际应用中难以保证聚类结果的质量。针对上述问题,采用基于离群点检...近年来,混合型数据的聚类问题受到广泛关注。作为处理混合型数据的一种有效方法,K-prototype聚类算法在初始化聚类中心时通常采用随机选取的策略,然而这种策略在很多实际应用中难以保证聚类结果的质量。针对上述问题,采用基于离群点检测的策略来为K-prototype算法选择初始中心,并提出一种新的混合型数据聚类初始化算法(initialization of K-prototype clustering based on outlier detection and density,IKP-ODD)。给定一个候选对象,IKP-ODD通过计算其距离离群因子、加权密度以及与已有初始中心之间的加权距离来判断候选对象是否是一个初始中心。IKP-ODD通过采用距离离群因子和加权密度,防止选择离群点作为初始中心。在计算对象的加权密度以及对象之间的加权距离时,采用邻域粗糙集中的粒度邻域熵来计算每一个属性的重要性,并根据属性重要性的大小为不同属性赋予不同的权重,有效地反映不同属性之间的差异性。在多个UCI数据集上的实验表明,相对于现有的初始化方法,IKP-ODD能够更好地解决K-prototype聚类的初始化问题。展开更多
文摘Online learning is a very important means of study, and has been adopted in many countries worldwide. However, only recently are researchers able to collect and analyze massive online learning datasets due to the COVID-19 epidemic. In this article, we analyze the difference between online learner groups by using an unsupervised machine learning technique, i.e., k-prototypes clustering. Specifically, we use questionnaires designed by domain experts to collect various online learning data, and investigate students’ online learning behavior and learning outcomes through analyzing the collected questionnaire data. Our analysis results suggest that students with better learning media generally have better online learning behavior and learning result than those with poor online learning media. In addition, both in economically developed or undeveloped regions, the number of students with better learning media is less than the number of students with poor learning media. Finally, the results presented here show that whether in an economically developed or an economically undeveloped region, the number of students who are enriched with learning media available is an important factor that affects online learning behavior and learning outcomes.
文摘The clustering of objects(individuals or variables)is one of the most used approaches to exploring multivariate data.The two most common unsupervised clustering strategies are hierarchical ascending clustering(HAC)and k-means partitioning used to identify groups of similar objects in a dataset to divide it into homogeneous groups.The proposed topological clustering of variables,called TCV,studies an homogeneous set of variables defined on the same set of individuals,based on the notion of neighborhood graphs,some of these variables are more-or-less correlated or linked according to the type quantitative or qualitative of the variables.This topological data analysis approach can then be useful for dimension reduction and variable selection.It’s a topological hierarchical clustering analysis of a set of variables which can be quantitative,qualitative or a mixture of both.It arranges variables into homogeneous groups according to their correlations or associations studied in a topological context of principal component analysis(PCA)or multiple correspondence analysis(MCA).The proposed TCV is adapted to the type of data considered,its principle is presented and illustrated using simple real datasets with quantitative,qualitative and mixed variables.The results of these illustrative examples are compared to those of other variables clustering approaches.
基金This research work was funded by the Institutional Fund Projects under Grant No.(IFPIP:959-611-1443)The authors gratefully acknowledge the technical and financial support provided by the Ministry of Education and King Abdulaziz University,DSR,Jeddah,Saudi Arabia.
文摘Data mining and analytics involve inspecting and modeling large pre-existing datasets to discover decision-making information.Precision agriculture uses datamining to advance agricultural developments.Many farmers aren’t getting the most out of their land because they don’t use precision agriculture.They harvest crops without a well-planned recommendation system.Future crop production is calculated by combining environmental conditions and management behavior,yielding numerical and categorical data.Most existing research still needs to address data preprocessing and crop categorization/classification.Furthermore,statistical analysis receives less attention,despite producing more accurate and valid results.The study was conducted on a dataset about Karnataka state,India,with crops of eight parameters taken into account,namely the minimum amount of fertilizers required,such as nitrogen,phosphorus,potassium,and pH values.The research considers rainfall,season,soil type,and temperature parameters to provide precise cultivation recommendations for high productivity.The presented algorithm converts discrete numerals to factors first,then reduces levels.Second,the algorithm generates six datasets,two fromCase-1(dataset withmany numeric variables),two from Case-2(dataset with many categorical variables),and one from Case-3(dataset with reduced factor variables).Finally,the algorithm outputs a class membership allocation based on an extended version of the K-means partitioning method with lambda estimation.The presented work produces mixed-type datasets with precisely categorized crops by organizing data based on environmental conditions,soil nutrients,and geo-location.Finally,the prepared dataset solves the classification problem,leading to a model evaluation that selects the best dataset for precise crop prediction.
文摘Today, Linear Mixed Models (LMMs) are fitted, mostly, by assuming that random effects and errors have Gaussian distributions, therefore using Maximum Likelihood (ML) or REML estimation. However, for many data sets, that double assumption is unlikely to hold, particularly for the random effects, a crucial component </span></span><span style="font-family:Verdana;"><span style="font-family:Verdana;"><span style="font-family:Verdana;">in </span></span></span><span style="font-family:Verdana;"><span style="font-family:Verdana;"><span style="font-family:Verdana;">which assessment of magnitude is key in such modeling. Alternative fitting methods not relying on that assumption (as ANOVA ones and Rao</span></span></span><span style="font-family:Verdana;"><span style="font-family:Verdana;"><span style="font-family:Verdana;">’</span></span></span><span style="font-family:Verdana;"><span style="font-family:Verdana;"><span style="font-family:Verdana;">s MINQUE) apply, quite often, only to the very constrained class of variance components models. In this paper, a new computationally feasible estimation methodology is designed, first for the widely used class of 2-level (or longitudinal) LMMs with only assumption (beyond the usual basic ones) that residual errors are uncorrelated and homoscedastic, with no distributional assumption imposed on the random effects. A major asset of this new approach is that it yields nonnegative variance estimates and covariance matrices estimates which are symmetric and, at least, positive semi-definite. Furthermore, it is shown that when the LMM is, indeed, Gaussian, this new methodology differs from ML just through a slight variation in the denominator of the residual variance estimate. The new methodology actually generalizes to LMMs a well known nonparametric fitting procedure for standard Linear Models. Finally, the methodology is also extended to ANOVA LMMs, generalizing an old method by Henderson for ML estimation in such models under normality.
文摘近年来,混合型数据的聚类问题受到广泛关注。作为处理混合型数据的一种有效方法,K-prototype聚类算法在初始化聚类中心时通常采用随机选取的策略,然而这种策略在很多实际应用中难以保证聚类结果的质量。针对上述问题,采用基于离群点检测的策略来为K-prototype算法选择初始中心,并提出一种新的混合型数据聚类初始化算法(initialization of K-prototype clustering based on outlier detection and density,IKP-ODD)。给定一个候选对象,IKP-ODD通过计算其距离离群因子、加权密度以及与已有初始中心之间的加权距离来判断候选对象是否是一个初始中心。IKP-ODD通过采用距离离群因子和加权密度,防止选择离群点作为初始中心。在计算对象的加权密度以及对象之间的加权距离时,采用邻域粗糙集中的粒度邻域熵来计算每一个属性的重要性,并根据属性重要性的大小为不同属性赋予不同的权重,有效地反映不同属性之间的差异性。在多个UCI数据集上的实验表明,相对于现有的初始化方法,IKP-ODD能够更好地解决K-prototype聚类的初始化问题。