This paper provides a brief introduction to the methods for generating fuzzy categorical maps from remotely sensed images (in graphical and digital forms). This is followed by a description of the slicing process for deriving fuzzy boundaries from fuzzy categorical maps, which can be based on the maximum fuzzy membership values, the confusion index, or a measure of entropy. Results from an empirical test performed in an Edinburgh suburb show that fuzzy boundaries of land cover can be derived from aerial photographs and satellite images using the three criteria with small differences, and that slicing based on the maximum fuzzy membership values is the easiest and most straightforward solution. This, in turn, implies the suitability of maintaining both a crisp classification and its underlying certainty map for deriving fuzzy boundaries at different thresholds, which is a flexible and compact way to manage categorical map data and their uncertainty.
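The three slicing criteria named in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.6 threshold, the confusion-index form (one minus the gap between the two largest memberships), and the normalization of entropy are all assumptions made for illustration.

```python
import math

def max_membership(memberships):
    """Largest fuzzy membership value of one pixel."""
    return max(memberships)

def confusion_index(memberships):
    """Assumed form: 1 - (largest - second largest membership)."""
    top = sorted(memberships, reverse=True)
    return 1.0 - (top[0] - top[1])

def normalized_entropy(memberships):
    """Shannon entropy divided by log(number of classes), so it lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in memberships if p > 0)
    return h / math.log(len(memberships))

def on_fuzzy_boundary(memberships, threshold=0.6):
    """Slice by maximum membership: a pixel counts as boundary
    when no class reaches the (hypothetical) threshold."""
    return max_membership(memberships) < threshold

# A pixel split between two land-cover classes versus a nearly pure pixel.
mixed = [0.5, 0.3, 0.2]
pure = [0.9, 0.05, 0.05]
print(on_fuzzy_boundary(mixed), on_fuzzy_boundary(pure))
```

Changing `threshold` yields boundaries of different widths from the same certainty values, which is the flexibility the abstract attributes to storing a crisp classification together with its certainty map.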
This paper focuses on the issues of categorical database generalization and emphasizes the roles of the supporting data model, the integrated data model, spatial analysis, and semantic analysis in database generalization. The framework contents of categorical database generalization transformation are defined. This paper presents an integrated spatial supporting data structure, a semantic supporting model, and a similarity model for categorical database generalization. The concept of a transformation unit is proposed for generalization.
In order to provide predictable runtime performance for text categorization (TC) systems, an innovative system design method is proposed for soft real-time TC systems. An analyzable mathematical model is established to approximately describe the nonlinear and time-varying TC systems. Based on this mathematical model, feedback control theory is adopted to prove the system's stability and zero steady-state error. The experimental results show that the error of the deadline satisfied ratio in the system is kept within 4 of the desired value, and that the number of classifiers can be dynamically adjusted by the system itself to save computation resources. The proposed methodology enables theoretical analysis and evaluation of TC systems, leading to a high-quality and low-cost implementation approach.
Simple linear regression analysis has been used to map QTL for quantitative traits. Many traits of biological interest and/or economic importance in various species show binary phenotypic distributions (e.g., presence or absence). It has been shown that such a binary trait can also be analyzed with simple linear regression, with virtually no loss in power compared to generalized linear model analysis. A binary trait is a special case of a multiple categorical trait (e.g., low, medium, or high). We propose a mechanism to decompose a multiple categorical trait into an array of correlated binary variables. The resulting multiple binary traits are analyzed with a multivariate linear regression method. Turning the problem of categorical trait mapping into that of multivariate mapping allows the exploration of pleiotropic effects of QTL for different categories. The efficiency of the method is verified through a series of simulation experiments.
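As a rough illustration of decomposing a multiple categorical trait into binary variables, the sketch below uses plain indicator (one-per-category) coding. The abstract does not specify the authors' decomposition, so this coding, the function name `decompose_trait`, and the example categories are assumptions.

```python
def decompose_trait(phenotypes, categories):
    """Turn each categorical phenotype into a row of 0/1 indicators,
    one column per category (assumed indicator coding)."""
    return [[1 if p == c else 0 for c in categories] for p in phenotypes]

# Three individuals scored on a three-level trait.
categories = ["low", "medium", "high"]
binary_traits = decompose_trait(["low", "high", "medium"], categories)
for row in binary_traits:
    print(row)
```

Each row sums to 1, so the columns are necessarily correlated; that correlation is why the binary traits call for a multivariate rather than a separate univariate analysis.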
In this paper, a novel coupled attribute similarity learning method based on multi-label categorical data (CASonMLCD) is proposed. The CASonMLCD method not only computes the correlations between different attributes and multi-label sets using information gain, which can be regarded as the degree of importance of each attribute in the attribute learning method, but also analyzes the intra-coupled and inter-coupled interactions between an attribute value pair for different attributes and multiple labels. The paper compares the CASonMLCD method with the OF distance and Jaccard similarity, based on the MLKNN algorithm, according to five common evaluation criteria. The experimental results demonstrate that the CASonMLCD method can mine the similarity relationship more accurately and comprehensively, and that it obtains better performance than the compared methods.
In this paper, a new approach for visualizing multivariate categorical data is presented. The approach uses a graph to represent multivariate categorical data and draws the graph in such a way that patterns, trends, and relationships within the data can be identified. A mathematical model for the graph layout problem is derived, and a spectral graph drawing algorithm for visualizing multivariate categorical data is proposed. The experiments show that the drawings produced by the algorithm capture the structures of multivariate categorical data well and that the computing speed is fast.
Clustering on categorical variables has received intensive attention. In a dataset with categorical features, some features show superior performance in the clustering procedure. In this paper, we propose a simple method to find such distinctive features by comparing the pooled within-cluster mean relative difference; we then partition the data on these features and give the subspace of each subgroup. Applications to the zoo data and soybean data illustrate the performance of the proposed method.
On the basis of extension architectonics, this paper studies the process of extension categorical data mining for extension interior design. In accordance with the theory of extension data mining, extension categorical data mining for extension interior design can be divided into data preparation, the mining operation, and knowledge application. The paper expounds the main content and cohesive relations of each link, and emphatically discusses extension acquisition, analysis extension, categorical mining extension, knowledge application extension, and several other core nodes related to data. Through the knowledge fusion of extension architectonics and data mining, the paper discusses the process of knowledge requirements with multiple classification under different mining targets. The purpose of this paper is to explore a complete categorical data mining process for interior design, from extension design data to design knowledge discovery and extension application.
Clustering categorical data, an integral part of data mining, has attracted much attention recently. In this paper, the authors formally define the categorical data clustering problem as an optimization problem from the viewpoint of cluster ensembles, and apply a cluster ensemble approach to clustering categorical data. Experimental results on real datasets show that this approach obtains better clustering accuracy than existing categorical data clustering algorithms.
BACKGROUND: Premenstrual syndrome (PMS) is the constellation of physical and psychological symptoms before menstruation. Premenstrual dysphoric disorder (PMDD) is a severe form of PMS with more depressive and anxiety symptoms. The Mini International Neuropsychiatric Interview, module U (MINI-U), assesses the diagnostic criteria for probable PMDD. The Premenstrual Symptoms Screening Tool (PSST) measures the severity of these symptoms. AIM: To compare the PSST ordinal scores with the corresponding dichotomous MINI-U answers. METHODS: Arab women (n = 194) residing in Doha, Qatar, received the MINI-U and PSST. Receiver operating characteristic (ROC) analyses provided the cut-off scores on the PSST using the MINI-U as the gold standard. RESULTS: All PSST ratings were higher in participants with positive responses on the MINI-U. In addition, ROC analyses showed that all areas under the curves were significant with the cut-off scores on the PSST. CONCLUSION: This study confirms that the severity measures from the PSST can recognize patients with moderate/severe PMS and PMDD who would benefit from immediate treatment.
Among the huge diversity of ideas that arise in graph theory, one that has gained a lot of popularity is the concept of graph labelings. Graph labelings give valuable mathematical models for a wide range of applications in high technology (cryptography, astronomy, data security, various coding theory problems, communication networks, etc.). A labeling or valuation of a graph is any mapping that sends a certain set of graph elements, i.e., vertices and/or edges, to a certain set of numbers (usually positive integers), called labels, subject to certain conditions. If the domain is the vertex set or the edge set, the labelings are called vertex labelings or edge labelings, respectively. Similarly, if the domain is V(G) ∪ E(G), then the labeling is called a total labeling. A reflexive edge irregular k-labeling of a graph, introduced by Tanna et al., is a total labeling x such that for any two different edges ab and a'b' of the graph, their weights wt_x(ab) = x(a) + x(ab) + x(b) and wt_x(a'b') = x(a') + x(a'b') + x(b') are distinct. The smallest value of k for which such a labeling exists is called the reflexive edge strength of the graph and is denoted by res(G). In this paper, we find the exact value of the reflexive edge irregularity strength of the categorical product of two paths, P_a × P_b, for any choice of a ≥ 3 and b ≥ 3.
To classify DNA sequences, k-mer frequency is widely used, since it can convert variable-length sequences into fixed-length numerical feature vectors. However, in the case of fixed-length DNA sequence classification, subsequences starting at a specific position of the given sequence can also be used as categorical features. In a performance evaluation on six datasets of fixed-length DNA sequences, our algorithm based on this idea achieved comparable or better performance than other state-of-the-art algorithms.
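The contrast between the two feature types described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, raw counts stand in for frequencies, and the alphabet is assumed to be A, C, G, T with no ambiguous bases.

```python
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """Fixed-length numerical vector: count of every possible k-mer,
    ordered lexicographically over the assumed alphabet ACGT."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]

def positional_features(seq, k):
    """Categorical features: the k-length subsequence starting at each
    position. Only comparable across sequences of equal length."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_counts("AAC", 2)[:4])        # counts for AA, AC, AG, AT
print(positional_features("ACGT", 2))   # one categorical value per position
```

The k-mer vector always has 4^k entries regardless of sequence length, while the positional features keep the location of each subsequence, which is only meaningful when all sequences share the same length.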
This paper proposes two new algorithms for classifying objects with categorical attributes. These algorithms are derived from the assumption that the attributes of different object classes have different probability distributions. One algorithm classifies objects based on the distribution of the attribute frequencies, and the other classifies objects based on the distribution of the pairwise attribute frequencies, described using a matrix of pairwise frequencies. Both algorithms are based on the method of invariants, which offers the simplest dependencies for estimating the probabilities of objects in each class by an average frequency of their attributes. The estimated object class corresponds to the maximum probability. This method reflects the sensory process models of animals and is aimed at recognizing an object class by searching for a prototype in information accumulated in the brain. Because these matrices may be sparse, the solution cannot be determined for some objects. For these objects, an analog of the k-nearest neighbors method is provided in which, for each attribute value, the class to which the majority of the k nearest objects in the training sample belong is determined, and the most likely class value is calculated. The efficiencies of these two algorithms were confirmed on five databases.
Statistics is a powerful tool for data measurement. Statistical techniques, properly planned and executed, give meaning to otherwise meaningless data. The difficulty some practitioners encounter is that, although numerous statistical methods are available for analysis, their understanding of these tools and their ease in using them is limited. This study has a twofold purpose: first, the literature on categorical data commonly used in research was reviewed; next, we report the results of a survey we designed and executed. Categorical data were collected via questionnaire and analyzed to demonstrate the robustness of categorical data. Several conjectures about the independence of socio-economic variables and e-commerce were tested, and some of the factors influencing patronage of e-commerce were identified. The literature suggests that as one's academic qualification improves, there is an associated improvement in preference for e-commerce, but the results revealed otherwise. Size of family was found to influence e-commerce. Both income and social status positively affected patronage of e-commerce. Gender also appeared to affect patronage of e-commerce. 62.3% of staff had patronized e-commerce, which shows that e-commerce patronage was gradually increasing. It is therefore our considered view that policy documents regulating and monitoring the use of e-commerce be developed to increase e-commerce participation across the globe. It is also recommended that the bottlenecks that obstruct patronage of e-commerce be addressed so that many more staff will develop a positive attitude towards e-commerce.
Machine learning methods are increasingly used for spatially predicting a categorical target variable when spatially exhaustive predictor variables are available within the study region. Even though these methods exhibit competitive spatial prediction performance, by construction they do not exactly honor the categorical target variable's observed values at sampling locations. By contrast, competing geostatistical methods inherently match the categorical target variable's observed values at sampling locations. In many geoscience applications, it is desirable to perfectly match the observed values of the categorical target variable at sampling locations, especially when the measurements can reasonably be considered error-free. This paper addresses the problem of exact conditioning of machine learning methods for the spatial prediction of categorical variables. It introduces a classification random forest-based approach in which the categorical target variable is exactly conditioned to the data, thus having the exact conditioning property of competing geostatistical methods. The proposed method extends previous work dedicated to continuous target variables by using an implicit representation of the categorical target variable. The basic idea consists of transforming the ensemble of (categorical) classification tree predictors resulting from the traditional classification random forest into an ensemble of (continuous) signed distances associated with each category of the categorical target variable. Then, an orthogonal representation of the ensemble of signed distances is created through principal component analysis, which allows the exact conditioning problem to be reformulated as a system of linear inequalities on principal component scores. Next, new principal component scores that ensure the data's exact conditioning are sampled via randomized quadratic programming. The resulting conditional signed distances are turned into an ensemble of categorical outputs, which perfectly honor the categorical target variable's observed values at sampling locations. Finally, the majority vote is used to aggregate the ensemble of categorical outputs. The effectiveness of the proposed method is illustrated on a simulated dataset for which ground truth is available and showcased on a real-world dataset that includes geochemical data. A comparison with geostatistical and traditional machine learning methods shows that the proposed technique can perfectly match the categorical target variable's observed values at sampling locations while maintaining competitive out-of-sample predictive performance.
This article presents an innovative approach to automatic rule discovery for data transformation tasks, leveraging XGBoost, a machine learning algorithm renowned for its efficiency and performance. The proposed framework fuses diversified feature formats, specifically metadata, textual, and pattern features. The goal is to enhance the system's ability to discern and generalize transformation rules from source to destination formats in varied contexts. First, the article describes the methodology for extracting these distinct features from raw data and the pre-processing steps undertaken to prepare the data for the model. Subsequent sections explain the mechanism of feature optimization using Recursive Feature Elimination (RFE) with linear regression, which aims to retain the most contributive features and eliminate redundant or less significant ones. The core of the research is the deployment of the XGBoost model for training, using the prepared and optimized feature sets. The article presents a detailed overview of the mathematical model and algorithmic steps behind this procedure. Finally, the process of rule discovery (the prediction phase) by the trained XGBoost model is explained, underscoring its role in real-time, automated data transformations. By employing machine learning, and particularly the XGBoost model, in the context of Business Rule Engine (BRE) data transformation, the article underscores a paradigm shift towards more scalable, efficient, and less human-dependent data transformation systems. This research opens doors for further exploration into automated rule discovery systems and their applications in various sectors.
As digital technologies have advanced, the number of paper documents converted into digital format has increased exponentially. To respond to the urgent need to categorize the growing number of digitized documents, real-time classification of digitized documents was identified as the primary goal of our study. Document classification is the first stage in automating document control and efficient knowledge discovery with little or no human involvement. Artificial intelligence methods such as deep learning are now combined with segmentation to study and interpret document traits, which was not conceivable ten years ago. Deep learning aids in comprehending input patterns so that object classes may be predicted. The segmentation process divides the input image into separate segments for a more thorough image study. This study proposes a deep learning-enabled framework for automated document classification, which can be implemented in higher education. To this end, a dataset was developed that includes seven categories: Diplomas, Personal documents, Journal of Accounting of higher education diplomas, Service letters, Orders, Production orders, and Student orders. Subsequently, a deep learning model based on Conv2D layers is proposed for the document classification process. In the final part of this research, the proposed model is evaluated and compared with other machine learning techniques. The results demonstrate that the proposed deep learning model performs well in document categorization, outperforming the other machine learning models by reaching 94.84%, 94.79%, 94.62%, 94.43%, and 94.07% in accuracy, precision, recall, F-score, and AUC-ROC, respectively. The achieved results indicate that the proposed deep model is acceptable for use in practice as an assistant to an office worker.
The school placement process of students from immigrant backgrounds considered to be in "difficulty" is an international concern at the intersection of works relating to special education and those concerning the school experiences of students from immigrant backgrounds or racialized groups. The research problem of this article concerns the identification of these students as disabled or as having adjustment or learning difficulties. From a perspective anchored in Disability Critical Race Studies, this ethnographic study documents different interpretations of perceived difficulties made by school actors with regard to seven primary school students from immigrant backgrounds. Five interpretation types are presented: (1) medicalization by dismissal of cultural markers, (2) medicalization by professional constraint, (3) medicalization by cultural deficit, (4) precautionary wait, and (5) cultural differentialism. Our results help to shed light on the overrepresentation of these students in special education and to understand how ableism and (neo)racism contribute to it.
To promote behavioral change among adolescents in Zambia, the National HIV/AIDS/STI/TB Council, in collaboration with UNICEF, developed the Zambia U-Report platform. This platform provides young people with improved access to information on various sexual reproductive health topics through Short Messaging Service (SMS) messages. Over the years, the platform has accumulated millions of incoming and outgoing messages, which need to be categorized into key thematic areas for better tracking of sexual reproductive health knowledge gaps among young people. The current manual categorization of these text messages is inefficient and time-consuming, and this study aims to automate the process for improved analysis using text-mining techniques. First, the study investigates the current text message categorization process and identifies a list of categories adopted by counselors over time, which are then used to build and train a categorization model. Second, the study presents a proof-of-concept tool that automates the categorization of U-Report messages into key thematic areas using the developed categorization model. Finally, it compares the performance and effectiveness of the proof-of-concept tool against the manual system. The study used a dataset comprising 206,625 text messages. The current process would take roughly 2.82 years to categorize this dataset, whereas the trained SVM model would require only 6.4 minutes while achieving an accuracy of 70.4%, demonstrating that the automated method is significantly faster, more scalable, and more consistent than the current manual categorization. These advantages make the SVM model a more efficient and effective tool for categorizing large unstructured text datasets. These results and the proof-of-concept tool demonstrate the potential for enhancing the efficiency and accuracy of message categorization on the Zambia U-Report platform and other similar text message-based platforms.
Abstract: This paper provides a brief introduction to the methods for generating fuzzy categorical maps from remotely sensed images (in graphical and digital forms). This is followed by a description of the slicing process for deriving fuzzy boundaries from fuzzy categorical maps, which can be based on the maximum fuzzy membership values, the confusion index, or a measure of entropy. Results from an empirical test performed in an Edinburgh suburb show that fuzzy boundaries of land cover can be derived from aerial photographs and satellite images using the three criteria with small differences, and that slicing based on the maximum fuzzy membership values is the easiest and most straightforward solution. This, in turn, implies the suitability of maintaining both a crisp classification and its underlying certainty map for deriving fuzzy boundaries at different thresholds, which is a flexible and compact way to manage categorical map data and their uncertainty.
Funding: the National Natural Science Foundation (No. 40271088); the Research Fund of International Institute of Geo-information Science and Earth Observation.
Abstract: This paper focuses on the issues of categorical database generalization and emphasizes the roles of the supporting data model, the integrated data model, spatial analysis, and semantic analysis in database generalization. The framework contents of categorical database generalization transformation are defined. This paper presents an integrated spatial supporting data structure, a semantic supporting model, and a similarity model for categorical database generalization. The concept of a transformation unit is proposed for generalization.
Funding: Supported by the National Natural Science Foundation of China (90104032) and the National High-Tech Research and Development Plan of China (2003AA1Z2090).
Abstract: In order to provide predictable runtime performance for text categorization (TC) systems, an innovative system design method is proposed for soft real-time TC systems. An analyzable mathematical model is established to approximately describe the nonlinear and time-varying TC systems. Based on this mathematical model, feedback control theory is adopted to prove the system's stability and zero steady-state error. The experimental results show that the error of the deadline satisfied ratio in the system is kept within 4 of the desired value, and that the number of classifiers can be dynamically adjusted by the system itself to save computation resources. The proposed methodology enables theoretical analysis and evaluation of TC systems, leading to a high-quality and low-cost implementation approach.
Funding: Supported by the National Natural Science Foundation (No. 30471236).
Abstract: Simple linear regression analysis has been used to map QTL for quantitative traits. Many traits of biological interest and/or economic importance in various species show binary phenotypic distributions (e.g., presence or absence). It has been shown that such a binary trait can also be analyzed with simple linear regression, with virtually no loss in power compared to generalized linear model analysis. A binary trait is a special case of a multiple categorical trait (e.g., low, medium, or high). We propose a mechanism to decompose a multiple categorical trait into an array of correlated binary variables. The resulting multiple binary traits are analyzed with a multivariate linear regression method. Turning the problem of categorical trait mapping into that of multivariate mapping allows the exploration of pleiotropic effects of QTL for different categories. The efficiency of the method is verified through a series of simulation experiments.
Funding: Supported by Australian Research Council Discovery (DP130102691), the National Science Foundation of China (61302157), the China National 863 Project (2012AA12A308), and the China Pre-research Project of Nuclear Industry (FZ1402-08).
Abstract: In this paper, a novel coupled attribute similarity learning method based on multi-label categorical data (CASonMLCD) is proposed. The CASonMLCD method not only computes the correlations between different attributes and multi-label sets using information gain, which can be regarded as the degree of importance of each attribute in the attribute learning method, but also analyzes the intra-coupled and inter-coupled interactions between an attribute value pair for different attributes and multiple labels. The paper compares the CASonMLCD method with the OF distance and Jaccard similarity, based on the MLKNN algorithm, according to five common evaluation criteria. The experimental results demonstrate that the CASonMLCD method can mine the similarity relationship more accurately and comprehensively, and that it obtains better performance than the compared methods.
Funding: Supported by the National Natural Science Foundation of China (601133010).
Abstract: In this paper, a new approach for visualizing multivariate categorical data is presented. The approach uses a graph to represent multivariate categorical data and draws the graph in such a way that patterns, trends, and relationships within the data can be identified. A mathematical model for the graph layout problem is derived, and a spectral graph drawing algorithm for visualizing multivariate categorical data is proposed. The experiments show that the drawings produced by the algorithm capture the structures of multivariate categorical data well and that the computing speed is fast.
Abstract: Clustering on categorical variables has received intensive attention. In datasets with categorical features, some features perform better than others in the clustering procedure. In this paper, we propose a simple method to find such distinctive features by comparing the pooled within-cluster mean relative difference, then partition the data on these features and give the subspace of each subgroup. Applications to the zoo data and the soybean data illustrate the performance of the proposed method.
Funding: Sponsored by the National Natural Science Foundation of China (Grant No. 51178132) and the "Thirteenth Five-year" Social Science Research Project of the Education Department in Jilin Province (Grant No. Ji UNESCO co word[2016]No.382th)
Abstract: On the basis of extension architectonics, this paper studies the process of extension categorical data mining for extension interior design. In accordance with the theory of extension data mining, extension categorical data mining for extension interior design can be divided into data preparation, the mining operation, and knowledge application. The paper sets out the main content of each stage and the relations that link them, and discusses in particular extension acquisition, analysis extension, categorical mining extension, knowledge application extension, and several other core data-related nodes. By fusing knowledge from extension architectonics and data mining, the paper discusses the process of knowledge requirements with multiple classification under different mining targets. The purpose of this paper is to explore a complete categorical data mining process for interior design, from extension design data to design knowledge discovery and extension application.
Abstract: Clustering categorical data, an integral part of data mining, has attracted much attention recently. In this paper, the authors formally define the categorical data clustering problem as an optimization problem from the viewpoint of cluster ensembles, and apply a cluster ensemble approach to clustering categorical data. Experimental results on real datasets show that the approach obtains better clustering accuracy than existing categorical data clustering algorithms.
Funding: Supported by the Qatar National Research Fund, No. UREP 10-022-3-005
Abstract: BACKGROUND: Premenstrual syndrome (PMS) is the constellation of physical and psychological symptoms before menstruation. Premenstrual dysphoric disorder (PMDD) is a severe form of PMS with more depressive and anxiety symptoms. The Mini International Neuropsychiatric Interview, module U (MINI-U), assesses the diagnostic criteria for probable PMDD. The Premenstrual Symptoms Screening Tool (PSST) measures the severity of these symptoms. AIM: To compare the PSST ordinal scores with the corresponding dichotomous MINI-U answers. METHODS: Arab women (n = 194) residing in Doha, Qatar, received the MINI-U and PSST. Receiver Operating Characteristic (ROC) analyses provided the cut-off scores on the PSST using the MINI-U as a gold standard. RESULTS: All PSST ratings were higher in participants with positive responses on the MINI-U. In addition, ROC analyses showed that all areas under the curves were significant at the cut-off scores on the PSST. CONCLUSION: This study confirms that the severity measures from the PSST can recognize patients with moderate/severe PMS and PMDD who would benefit from immediate treatment.
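A small sketch may help show how a cut-off on an ordinal score can be derived against a dichotomous gold standard. The abstract does not say which criterion the authors used to pick their cut-offs; maximizing Youden's J (sensitivity + specificity - 1) is an assumption here, and the scores and answers below are invented toy values, not study data.

```python
def roc_points(scores, gold):
    """Return (cutoff, sensitivity, specificity) for every observed score.

    A case is predicted positive when its score is >= cutoff.
    `gold` holds 1 for a positive gold-standard answer and 0 otherwise.
    """
    positives = sum(gold)
    negatives = len(gold) - positives
    points = []
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, g in zip(scores, gold) if s >= cutoff and g == 1)
        tn = sum(1 for s, g in zip(scores, gold) if s < cutoff and g == 0)
        points.append((cutoff, tp / positives, tn / negatives))
    return points


def best_cutoff(scores, gold):
    """Cutoff maximizing Youden's J (an assumed selection rule)."""
    return max(roc_points(scores, gold), key=lambda p: p[1] + p[2] - 1)[0]


# Toy ordinal scores (0-3) and dichotomous gold-standard answers.
scores = [0, 1, 1, 2, 2, 2, 3, 3]
gold = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_cutoff(scores, gold))
```

With real data, an established statistics library would normally be preferred over this hand-rolled loop.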
Abstract: Among the huge diversity of ideas that arise in the study of graph theory, one that has gained a lot of popularity is the concept of graph labelings. Graph labelings give valuable mathematical models for a wide range of applications in high technologies (cryptography, astronomy, data security, various coding theory problems, communication networks, etc.). A labeling, or valuation, of a graph is any mapping that sends a certain set of graph elements, i.e., vertices and/or edges, to a certain set of numbers (usually positive integers), called labels, subject to certain conditions. If the domain is the vertex set or the edge set, the labelings are called vertex labelings or edge labelings, respectively. Similarly, if the domain is V(G) ∪ E(G), the labeling is called a total labeling. A reflexive edge irregular k-labeling of a graph, introduced by Tanna et al., is a total labeling x of the graph such that for any two different edges ab and a'b' of the graph, their weights wt_x(ab) = x(a) + x(ab) + x(b) and wt_x(a'b') = x(a') + x(a'b') + x(b') are distinct. The smallest value of k for which such a labeling exists is called the reflexive edge strength of the graph and is denoted by res(G). In this paper, we find the exact value of the reflexive edge irregularity strength of the categorical product of two paths (P_a × P_b) for any choice of a ≥ 3 and b ≥ 3.
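The edge-weight condition in this abstract can be checked mechanically. The sketch below builds the categorical product of two paths using the standard definition (two vertices are adjacent when both coordinates are adjacent in their respective paths) and tests whether a given total labeling yields pairwise distinct edge weights. It does not enforce the constraints on the label sets of a reflexive labeling, and it does not compute res(G); the function names are hypothetical.

```python
from itertools import product


def categorical_product_of_paths(a, b):
    """Edges of P_a x P_b: (i, j) ~ (i + 1, j - 1) and (i, j) ~ (i + 1, j + 1)."""
    edges = []
    for i, j in product(range(a - 1), range(b)):
        for dj in (-1, 1):
            if 0 <= j + dj < b:
                edges.append(((i, j), (i + 1, j + dj)))
    return edges


def edge_weights(edges, vertex_label, edge_label):
    """wt(uv) = x(u) + x(uv) + x(v) for every edge uv."""
    return [vertex_label[u] + edge_label[(u, v)] + vertex_label[v] for u, v in edges]


def has_distinct_weights(edges, vertex_label, edge_label):
    weights = edge_weights(edges, vertex_label, edge_label)
    return len(set(weights)) == len(weights)


# Trivial illustrative labeling of P_3 x P_3 (not an optimal one):
# every vertex gets 0 and the edges get 1, 2, 3, ...
edges = categorical_product_of_paths(3, 3)
vertex_label = {(i, j): 0 for i in range(3) for j in range(3)}
edge_label = {e: n for n, e in enumerate(edges, start=1)}
print(len(edges), has_distinct_weights(edges, vertex_label, edge_label))
```

The labeling used here only demonstrates the check; finding the smallest admissible k is the actual subject of the paper.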
Abstract: To classify DNA sequences, k-mer frequency is widely used, since it can convert variable-length sequences into fixed-length numerical feature vectors. However, in the case of fixed-length DNA sequence classification, subsequences starting at a specific position of the given sequence can also be used as categorical features. In a performance evaluation on six datasets of fixed-length DNA sequences, our algorithm based on this idea achieved comparable or better performance than other state-of-the-art algorithms.
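The two feature representations contrasted in this abstract can be sketched as follows. The abstract does not specify the subsequence length or stride used by the authors, so the overlapping windows of length k starting at every position are an assumption, as are the function names.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Fixed-length numerical vector: counts of each of the 4**k possible k-mers."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        sub = seq[i:i + k]
        if sub in counts:  # skip windows containing symbols other than A, C, G, T
            counts[sub] += 1
    return [counts[m] for m in kmers]


def positional_features(seq, k):
    """Categorical features: the subsequence of length k starting at each position.

    Feature i only has the same meaning across samples when all sequences
    have the same length, which is the setting the abstract describes.
    """
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seq = "AACG"
print(kmer_frequencies(seq, 2))
print(positional_features(seq, 2))
```

The k-mer vector discards where a pattern occurs, whereas the positional features keep it, which is the distinction the abstract draws.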
Abstract: This paper proposes two new algorithms for classifying objects with categorical attributes. These algorithms are derived from the assumption that the attributes of different object classes have different probability distributions. One algorithm classifies objects based on the distribution of the attribute frequencies, and the other classifies objects based on the distribution of the pairwise attribute frequencies, described using a matrix of pairwise frequencies. Both algorithms are based on the method of invariants, which offers the simplest dependencies for estimating the probabilities of objects in each class from the average frequency of their attributes. The estimated object class corresponds to the maximum probability. This method reflects models of the sensory processes of animals and aims to recognize an object class by searching for a prototype in information accumulated in the brain. Because these matrices may be sparse, the solution cannot be determined for some objects. For these objects, an analog of the k-nearest neighbors method is provided in which, for each attribute value, the class to which the majority of the k nearest objects in the training sample belong is determined, and the most likely class value is calculated. The efficiency of the two algorithms was confirmed on five databases.
Abstract: Statistics is a powerful tool for data measurement. Statistical techniques, properly planned and executed, give meaning to meaningless data. The difficulty some practitioners encounter is that, although numerous statistical methods are available for analysis, their understanding of these tools and their ease in using them are limited. This study has a twofold purpose: first, the literature on categorical data commonly used in research was reviewed; next, we report the results of a survey we designed and executed. Categorical data was collected via questionnaire and analyzed to serve as a backbone of the robustness of categorical data. Several conjectures about the independence of the socio-economic variables and e-commerce were tested. Some of the factors influencing patronage of e-commerce were identified. The literature suggests that as one's academic qualification improves, there is an associated improvement in one's preference for e-commerce, but the results revealed otherwise. Size of family was found to influence e-commerce. Both income and social status positively affected patronage of e-commerce. Gender also appeared to affect patronage of e-commerce. 62.3% of staff had patronized e-commerce. This shows that e-commerce patronage was gradually increasing. It is therefore our considered view that policy documents regulating and monitoring the use of e-commerce should be developed to increase e-commerce participation across the globe. It is also recommended that the bottlenecks which obstruct patronage of e-commerce be addressed so that many more staff will develop a positive attitude towards e-commerce.
Abstract: Machine learning methods are increasingly used for spatially predicting a categorical target variable when spatially exhaustive predictor variables are available within the study region. Even though these methods exhibit competitive spatial prediction performance, by construction they do not exactly honor the categorical target variable's observed values at sampling locations. By contrast, competing geostatistical methods match the observed values at sampling locations exactly by design. In many geoscience applications, it is often desirable to perfectly match the observed values of the categorical target variable at sampling locations, especially when its measurements can reasonably be considered error-free. This paper addresses the problem of exact conditioning of machine learning methods for the spatial prediction of categorical variables. It introduces a classification random forest-based approach in which the categorical target variable is exactly conditioned to the data, thus having the same exact conditioning property as competing geostatistical methods. The proposed method extends previous work dedicated to continuous target variables by using an implicit representation of the categorical target variable. The basic idea consists of transforming the ensemble of (categorical) classification tree predictors resulting from the traditional classification random forest into an ensemble of (continuous) signed distances associated with each category of the categorical target variable. Then, an orthogonal representation of the ensemble of signed distances is created through principal component analysis, which allows the exact conditioning problem to be reformulated as a system of linear inequalities on principal component scores. New principal component scores that ensure exact conditioning to the data are then sampled via randomized quadratic programming. The resulting conditional signed distances are turned into an ensemble of categorical outputs, which perfectly honor the categorical target variable's observed values at sampling locations. The majority vote is then used to aggregate the ensemble of categorical outputs. The effectiveness of the proposed method is illustrated on a simulated dataset for which ground truth is available and showcased on a real-world dataset that includes geochemical data. A comparison with geostatistical and traditional machine learning methods shows that the proposed technique can perfectly match the categorical target variable's observed values at sampling locations while maintaining competitive out-of-sample predictive performance.
Abstract: This article presents an innovative approach to automatic rule discovery for data transformation tasks leveraging XGBoost, a machine learning algorithm renowned for its efficiency and performance. The proposed framework fuses diversified feature formats, specifically metadata, textual, and pattern features. The goal is to enhance the system's ability to discern and generalize transformation rules from source to destination formats in varied contexts. First, the article describes the methodology for extracting these distinct features from raw data and the pre-processing steps undertaken to prepare the data for the model. Subsequent sections explain the mechanism of feature optimization using Recursive Feature Elimination (RFE) with linear regression, which aims to retain the most contributive features and eliminate redundant or less significant ones. The core of the research revolves around the deployment of the XGBoost model for training, using the prepared and optimized feature sets. The article presents a detailed overview of the mathematical model and algorithmic steps behind this procedure. Finally, the process of rule discovery (the prediction phase) by the trained XGBoost model is explained, underscoring its role in real-time, automated data transformations. By employing machine learning, and particularly the XGBoost model, in the context of Business Rule Engine (BRE) data transformation, the article underscores a paradigm shift towards more scalable, efficient, and less human-dependent data transformation systems. This research opens doors for further exploration into automated rule discovery systems and their applications in various sectors.
Abstract: As digital technologies have advanced more rapidly, the number of paper documents converted into a digital format has increased exponentially. To respond to the urgent need to categorize the growing number of digitized documents, the real-time classification of digitized documents was identified as the primary goal of our study. Paper classification is the first stage in automating document control and efficient knowledge discovery with little or no human involvement. Artificial intelligence methods such as deep learning are now combined with segmentation to study and interpret document traits, which was not conceivable ten years ago. Deep learning aids in comprehending input patterns so that object classes may be predicted. The segmentation process divides the input image into separate segments for a more thorough image study. This study proposes a deep learning-enabled framework for automated document classification, which can be implemented in higher education. To further this goal, a dataset was developed that includes seven categories: Diplomas, Personal documents, Journal of Accounting of higher education diplomas, Service letters, Orders, Production orders, and Student orders. Subsequently, a deep learning model based on Conv2D layers is proposed for the document classification process. In the final part of this research, the proposed model is evaluated and compared with other machine learning techniques. The results demonstrate that the proposed deep learning model performs well in document categorization, overtaking the other machine learning models by reaching 94.84%, 94.79%, 94.62%, 94.43%, and 94.07% in accuracy, precision, recall, F-score, and AUC-ROC, respectively. These results indicate that the proposed deep model is acceptable for use in practice as an assistant to an office worker.
Abstract: The school placement of students from immigrant backgrounds considered to be in "difficulty" is an international concern at the intersection of work on special education and work on the school experiences of students from immigrant backgrounds or racialized groups. The research problem of this article concerns the identification of these students as disabled or as having adjustment or learning difficulties. From a perspective anchored in Disability Critical Race Studies, this ethnographic study documents the different interpretations that school actors make of the perceived difficulties of seven primary school students from immigrant backgrounds. Five interpretation types are presented: (1) medicalization by dismissal of cultural markers, (2) medicalization by professional constraint, (3) medicalization by cultural deficit, (4) precautionary wait, and (5) cultural differentialism. Our results help to shed light on the overrepresentation of these students in special education and to understand how ableism and (neo)racism contribute to it.
Abstract: To promote behavioral change among adolescents in Zambia, the National HIV/AIDS/STI/TB Council, in collaboration with UNICEF, developed the Zambia U-Report platform. This platform provides young people with improved access to information on various sexual reproductive health topics through Short Messaging Service (SMS) messages. Over the years, the platform has accumulated millions of incoming and outgoing messages, which need to be categorized into key thematic areas for better tracking of sexual reproductive health knowledge gaps among young people. The current manual categorization of these text messages is inefficient and time-consuming, and this study aims to automate the process for improved analysis using text-mining techniques. First, the study investigates the current text message categorization process and identifies a list of categories adopted by counselors over time, which are then used to build and train a categorization model. Second, the study presents a proof-of-concept tool that automates the categorization of U-Report messages into key thematic areas using the developed categorization model. Finally, it compares the performance and effectiveness of the proof-of-concept tool against the manual system. The study used a dataset comprising 206,625 text messages. The current process would take roughly 2.82 years to categorize this dataset, whereas the trained SVM model would require only 6.4 minutes while achieving an accuracy of 70.4%, demonstrating that the automated method is significantly faster, more scalable, and more consistent than the current manual categorization. These advantages make the SVM model a more efficient and effective tool for categorizing large unstructured text datasets. These results and the proof-of-concept tool demonstrate the potential for enhancing the efficiency and accuracy of message categorization on the Zambia U-Report platform and other similar text message-based platforms.