Mandarin 在 (pinyin: zài) is the most frequently used character for representing spatial and temporal relationships. Current studies mostly focus on its lexical meaning and syntactic structure, while the cognitive features of its grammatical categories have been neglected. This paper investigates the categorization of zài by conducting a morphosyntactic test among college English majors in China. The results show that: prototypes organize the grammatical categories of zài at all levels in terms of intra-categorial gradience; the semantic construal of the zài construction can significantly influence the accuracy of the grammatical categorization of zài; syntactic structure can provide a viable cue for identifying the grammatical categories of zài; and spatiality, temporality and the status of existing are three essential semantic features encoded by zài, the concurrence of which leads to various degrees of inter-categorial vagueness. This indicates a conflict between rigid grammatical classification and the indeterminate nature of the grammatical functions of zài, and suggests the need to reconsider the efficacy of indiscriminately applying Anglo-Saxon grammar to the study of Chinese spatial-temporal constructions.
In cognitive linguistics, the status and functions of categorization have been hotly debated. In semantics and second language acquisition, scholars have discussed vocabulary acquisition from different perspectives and at different academic levels. Vocabulary learning plays a fundamental role in second language vocabulary acquisition (SLVA) and is closely related to learners' cognitive competence. However, studies of second language vocabulary acquisition under the categorization theory of cognitive linguistics have received less attention from linguists than other studies. This paper employs two representative dimensions of categorization theory, the basic-level effect and the prototype effect, to examine the implications for second language vocabulary acquisition. The article first provides a comprehensive introduction to the nature and approaches of categorization theory, and then analyzes its relations to and implications for second language vocabulary acquisition from the perspective of the basic-level and prototype effects. The results show that the basic-level effect on SLVA lies mainly in the classification of word categories, as distinguished from the superordinate and subordinate categories, while the prototype effect bears more on understanding the complexity and use of word meaning.
In this paper, we discuss several issues related to the automated classification of web pages, especially their text classification. We analyze feature selection and categorization algorithms for web pages and give some suggestions for web page categorization.
This paper proposes a new approach to feature selection for text categorization (TC), based on a measure of independence between features. It discusses a fundamental hypothesis widely used in probabilistic models for TC: that the occurrences of terms in documents are independent of each other. However, this basic hypothesis is incomplete with respect to the independence of the feature set. From the viewpoint of feature selection, a new measure of independence between features is designed, from which a feature selection algorithm is derived to obtain a feature subset. The selected subset is high in relevance to the category and strong in independence between features, so it satisfies the basic hypothesis to the maximum degree. Compared with traditional feature selection methods in TC, which take only relevance into account, the feature subset selected by our method performs better in experiments on the 20 Newsgroups benchmark dataset.
This paper summarizes several automatic text categorization algorithms in common use recently, and analyzes and compares their advantages and disadvantages. It provides guidance for choosing appropriate automatic classification algorithms in different fields. Finally, some evaluations and summaries of these algorithms are discussed, and directions for further research are pointed out. Key words: text categorization; naive Bayes; KNN; SVM; neural network. CLC number: TP 391. Foundation item: Supported by the National Natural Science Foundation of China (70031010) and the Research Foundation of Beijing Institute of Technology. Biography: SHI Yong-feng (1980-), male, Master candidate, research direction: web information mining.
To avoid the curse of dimensionality, text categorization (TC) algorithms based on machine learning (ML) have to use a feature selection (FS) method to reduce the dimensionality of the feature space. Although widely used, the FS process generally causes information loss and therefore has a considerable side effect on the overall performance of TC algorithms. Based on the sparsity of text vectors, a new TC algorithm based on lazy feature selection (LFS) is presented. As a new type of embedded feature selection approach, the LFS method can greatly reduce the dimension of features without any information loss, which can greatly improve both the efficiency and the performance of algorithms. Experiments show that the new algorithm achieves both much higher performance and much higher efficiency than some other classical TC algorithms.
In order to provide predictable runtime performance for text categorization (TC) systems, an innovative system design method is proposed for soft real-time TC systems. An analyzable mathematical model is established to approximately describe the nonlinear and time-varying TC systems. Based on this mathematical model, feedback control theory is used to prove the system's stability and zero steady-state error. The experimental results show that the error of the deadline satisfied ratio in the system is kept within 4 of the desired value, and that the number of classifiers can be dynamically adjusted by the system itself to save computation resources. The proposed methodology enables theoretical analysis and evaluation of TC systems, leading to a high-quality and low-cost implementation approach.
A study examining affective information processing in persons with multiple sclerosis (MS) and healthy adults was carried out. It was hypothesized that individual characteristics could modulate participants' emotional categorization and reaction times for categorization decisions. For example, individuals with a negatively valenced emotional profile (e.g. anxious) should choose negative emotional alternatives faster and more frequently. Participants came from two different populations: 80 right-handed healthy French speakers and 40 right-handed French speakers with multiple sclerosis. The results showed a positive correlation between a high level of negative emotional sensibility and emotional categorization (decision and decision speed) for affective information presented on the right side of the screen. For all participants, emotional choices were more frequent and decisions faster for emotional alternatives presented on the left side. It seems that individuals' emotional differences, in general and in MS populations, modulate the hemispheric asymmetry of processing emotional judgments.
A hierarchical system to perform automatic categorization and reorientation of images using content analysis is presented. The proposed system first assigns images to a priori defined categories using rotation-invariant features. In the second stage, it detects their correct orientation out of {0°, 90°, 180°, 270°} using a category-specific model. The system has been specially designed for embedded device applications and uses only low-level color and edge features. Machine learning algorithms optimized to suit embedded implementation, such as support vector machines (SVMs) and scalable boosting, have been used to develop classifiers for categorization and orientation detection. Results are presented on a collection of about 7000 consumer images collected from open resources. The proposed system finds applications in various digital media products and brings pattern recognition solutions to the consumer electronics domain.
In the current study, behavioral measures were used to investigate clothing color. The purpose was to identify how color brightness influences positive-negative emotional categorization. Results showed that the effect of brightness on the emotional categorization of clothing color was significant. As brightness increases, the variation curve of positive emotion is U-shaped, whereas that of negative emotion is an inverted U. Compared with low-brightness colors, the emotional reaction to high-brightness colors was more positive; most of the colors across the different brightness scales were classified as positive emotions and a minority as negative emotions; and positive colors were categorized much faster than negative ones.
In this paper, the role of rare or infrequent terms in enhancing the accuracy of English text categorization using Polynomial Networks (PNs) is investigated. To study the impact of rare terms on the accuracy of PN-based text categorization, different term reduction criteria and different term weighting schemes were tested on the Reuters Corpus using PNs. Each term weighting scheme on each reduced term set was tested twice: once keeping the rare terms and once removing them. All the experiments conducted in this research show that keeping rare terms substantially improves the performance of Polynomial Networks in text categorization, regardless of the term reduction method, the number of terms used in classification, or the term weighting scheme adopted.
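The "keep versus remove rare terms" comparison described above can be illustrated with a minimal sketch. It is not the authors' setup: the tf-idf weighting, the `min_df` document-frequency cutoff, the function name `tfidf_vectors` and the three toy documents are all assumptions chosen for illustration, and no Polynomial Network is trained here.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_df=1):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Terms that occur in fewer than `min_df` documents are dropped, so
    min_df=1 keeps rare terms and a larger value removes them.
    (Hypothetical helper; the cutoff rule is an assumption.)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vocab = {t for t, count in df.items() if count >= min_df}
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in vocab)
        vectors.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
    return vectors

# Toy corpus (invented for illustration only).
docs = [
    "oil prices rise opec".split(),
    "oil exports fall".split(),
    "wheat harvest rise".split(),
]
with_rare = tfidf_vectors(docs, min_df=1)     # rare terms kept
without_rare = tfidf_vectors(docs, min_df=2)  # terms seen in one document removed
```

Feeding both sets of vectors to the same classifier and comparing accuracy would reproduce the shape of the experiment, though not its data or its model.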
Content-Based Image Retrieval (CBIR) performs an automated classification task for a queried image. It can relieve a user of the laborious and time-consuming task of assigning metadata to images when working with a massive image collection. A user's definition or description of an image is subjective, so the same image may belong to different categories as defined by different users. Human-based and computer-based categorization may produce different results because of different categorization criteria, which depend on the dataset structure and the clustering technique. This paper presents an approach to planning the dataset structure and choosing the clustering algorithm for a CBIR implementation. The paper has five sections: Section 1 introduces the CBIR and QBE concepts, Section 2 lists related image categorization research, Section 3 describes five types of image clustering, Section 4 gives a comparative analysis, and Section 5 concludes the study. The outcome of this paper should benefit CBIR developers across various applications.
The study of the metaphor of "xin", or heart, can be traced back to the beginning of the 21st century, and an excellent text for such study is the famous work Caigentan, whose distinguished English version was translated by Paul Whiter. This paper analyzes the metaphors of "xin" in both the Chinese and English versions in terms of the categorization process, especially their vehicle categories and mapping processes, so that similarities and differences can be identified and the Chinese and Western cultural contexts behind them can be inferred and analyzed.
Text categorization (TC) is one of the most widely studied branches of text mining and has many applications in different domains. It tries to automatically assign a text document to one of a set of predefined categories, often by using machine learning (ML) techniques. Choosing the best classifier is the most important step in this task; k-Nearest Neighbor (KNN) is widely employed, as are several other well-known classifiers such as Support Vector Machine, Multinomial Naive Bayes, and Logistic Regression. KNN has been used extensively for TC tasks and is one of the oldest and simplest methods for pattern classification. Its performance relies crucially on the distance metric used to identify nearest neighbors, since the most frequently observed label among these neighbors is used to classify an unseen test instance. Hence, in this paper, a comparative analysis of the KNN classifier is performed on a subset (R8) of the Reuters-21578 benchmark dataset for TC. Experimental results are obtained using different distance metrics, as well as recently proposed distance learning metrics, under different cases in which the feature model and term weighting scheme vary. Our comparative evaluation of the results shows that Bray-Curtis and Linear Discriminant Analysis (LDA) are often superior to the other metrics and work well with raw term frequency weights.
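As a rough illustration of the combination reported to work well above (KNN, the Bray-Curtis distance, raw term frequencies), here is a minimal sketch. The function names and the four toy training documents are assumptions made for this example; it does not use the R8 data and is not the paper's experimental code.

```python
from collections import Counter

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two sparse term-frequency dicts:
    sum(|u_i - v_i|) / sum(|u_i + v_i|)."""
    terms = set(u) | set(v)
    num = sum(abs(u.get(t, 0) - v.get(t, 0)) for t in terms)
    den = sum(abs(u.get(t, 0) + v.get(t, 0)) for t in terms)
    return num / den if den else 0.0

def knn_predict(train, query, k=3, distance=bray_curtis):
    """Label `query` by majority vote among its k nearest training documents."""
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def tf(text):
    """Raw term-frequency vector of a whitespace-tokenized text."""
    return Counter(text.split())

# Toy training set (invented for illustration only).
train = [
    (tf("crude oil barrel price"), "crude"),
    (tf("oil opec output barrel"), "crude"),
    (tf("wheat grain harvest tonnes"), "grain"),
    (tf("grain corn export tonnes"), "grain"),
]
prediction = knn_predict(train, tf("opec oil price"), k=3)
```

Because `distance` is a parameter, another metric can be swapped in without touching the voting logic, which is the kind of comparison the abstract describes.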
To promote behavioral change among adolescents in Zambia, the National HIV/AIDS/STI/TB Council, in collaboration with UNICEF, developed the Zambia U-Report platform. This platform gives young people improved access to information on various sexual reproductive health topics through Short Messaging Service (SMS) messages. Over the years, the platform has accumulated millions of incoming and outgoing messages, which need to be categorized into key thematic areas for better tracking of sexual reproductive health knowledge gaps among young people. The current manual categorization of these text messages is inefficient and time-consuming, and this study aims to automate the process for improved analysis using text-mining techniques. First, the study investigates the current text message categorization process and identifies a list of categories adopted by counselors over time, which are then used to build and train a categorization model. Second, the study presents a proof-of-concept tool that automates the categorization of U-Report messages into key thematic areas using the developed categorization model. Finally, it compares the performance and effectiveness of the proof-of-concept tool against the manual system. The study used a dataset comprising 206,625 text messages. The current process would take roughly 2.82 years to categorize this dataset, whereas the trained SVM model would require only 6.4 minutes while achieving an accuracy of 70.4%, demonstrating that the automated method is significantly faster, more scalable, and more consistent than the current manual categorization. These advantages make the SVM model a more efficient and effective tool for categorizing large unstructured text datasets. These results and the proof-of-concept tool demonstrate the potential for enhancing the efficiency and accuracy of message categorization on the Zambia U-Report platform and other similar text-message-based platforms.
The school placement processes of students from immigrant backgrounds considered to be in "difficulty" are an international concern at the intersection of work on special education and work on the school experiences of students from immigrant backgrounds or racialized groups. The research problem of this article concerns the identification of these students as disabled or as having adjustment or learning difficulties. From a perspective anchored in Disability Critical Race Studies, this ethnographic study documents different interpretations of perceived difficulties made by school actors with regard to seven primary school students from immigrant backgrounds. Five interpretation types are presented: (1) medicalization by dismissal of cultural markers, (2) medicalization by professional constraint, (3) medicalization by cultural deficit, (4) precautionary wait, and (5) cultural differentialism. Our results help to shed light on the overrepresentation of these students in special education and to understand how ableism and (neo)racism contribute to it.
Text categorization is a significant technique for managing the surging volume of text data on the Internet. The k-nearest neighbors (kNN) algorithm is an effective, but not efficient, classification model for text categorization. In this paper, we propose an effective strategy to accelerate the standard kNN, based on a simple principle: points that are near in space are usually also near when projected onto a direction, which means that points that are distant along the projection direction are also distant in the original space. Using the proposed strategy, most of the irrelevant points can be removed when searching for the k nearest neighbors of a query point, which greatly decreases the computation cost. Experimental results show that the proposed strategy greatly improves the time performance of the standard kNN, with little degradation in accuracy. It is particularly advantageous in applications with large, high-dimensional datasets.
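The projection principle above can be sketched with a simple exact variant, which is a swapped-in illustration and not the strategy proposed in the paper (the abstract reports a small loss of accuracy, so the authors' pruning is presumably more aggressive). The sketch relies on the fact that, for a unit vector d, |d·x − d·q| ≤ ‖x − q‖; the function name, the choice of direction and the toy points are all assumptions.

```python
import heapq
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_with_projection_pruning(points, query, k, direction):
    """Exact k-NN that skips points whose 1-D projection is already too far.

    A projection gap larger than the current k-th best distance rules a
    point out, because the gap is a lower bound on the true distance.
    Returns (neighbor indices sorted by distance, number of full distances computed).
    """
    norm = math.sqrt(dot(direction, direction))
    unit = [c / norm for c in direction]
    q_proj = dot(unit, query)
    gaps = [abs(dot(unit, p) - q_proj) for p in points]
    # Visit candidates in order of increasing projection gap.
    order = sorted(range(len(points)), key=gaps.__getitem__)
    heap = []  # max-heap via negated distances: (-distance, index)
    computed = 0
    for i in order:
        if len(heap) == k and gaps[i] > -heap[0][0]:
            break  # every remaining point has an even larger gap
        d = euclidean(points[i], query)
        computed += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i))
    ranked = sorted((-neg_d, i) for neg_d, i in heap)
    return [i for _, i in ranked], computed

# Toy data (invented for illustration only).
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (9, 9)]
query = (0.2, 0.1)
neighbors, computed = knn_with_projection_pruning(points, query, k=2, direction=(1, 1))
```

On this toy input only three of the six full distances are computed before the loop stops. How much is saved in practice depends on how well the chosen direction spreads the data, which the sketch does not attempt to optimize.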
Chinese text categorization differs from English text categorization in its much larger term set (of words or character n-grams), which makes the training and operation of modern high-performance classifiers very slow. This study assumes that this high-dimensionality problem is related to redundancy in the term set, which cannot be solved by traditional term selection methods. A greedy algorithm framework named "non-independent term selection" is presented, which reduces the redundancy according to string-level correlations. Several preliminary implementations of this idea are demonstrated. Experimental results show that a good tradeoff can be reached between performance and the size of the term set.
To improve the accuracy of text categorization and reduce the dimension of the feature space, this paper proposes a two-stage feature selection method based on a novel category correlation degree (CCD) method and latent semantic indexing (LSI). In the first stage, a novel CCD method is proposed to select the most effective features for text classification; it is more effective than traditional feature selection methods. In the second stage, the problem is that document representation requires a high-dimensional feature space and does not take into account the semantic relations between features, which leads to poor categorization accuracy. LSI is therefore used to solve these problems by replacing individual terms with statistically derived conceptual indices, which can uncover the important correlations between features and reduce the dimension of the feature space. First, each feature in our algorithm is ranked by its importance for classification using the CCD method. Second, we construct a new semantic space among features based on the LSI method. The experimental results show that our method can effectively reduce the dimension of the text vector and improve the performance of text categorization.
Discovering the hierarchical structures of different classes of object behaviors can satisfy the requirements of various degrees of abstraction in association analysis, behavior modeling, data preprocessing, pattern recognition, decision making, and so on. In this paper, we call this process associative categorization, which is different from classical clustering, associative classification and associative clustering. Focusing on representing the associations of behaviors and the corresponding uncertainties, we propose a method for constructing a Markov network (MN) from the results of frequent pattern mining, called the item-associative Markov network (IAMN), in which nodes and edges represent the frequent patterns and their associations respectively. We further discuss the properties of a probabilistic graphical model to guarantee the IAMN's correctness theoretically. Then, we adopt the concept of chordality to reflect the closeness of nodes in the IAMN. Adopting the algorithm for constructing join trees from an MN, we give an algorithm for IAMN-based associative categorization by hierarchical bottom-up aggregation of nodes. Experimental results show the effectiveness, efficiency and correctness of our methods.
Funding (listed with the abstract on second language vocabulary acquisition): "Research on the Development Path of Ideological Leadership of Ideological and Political Education in Colleges and Universities in the New Era", Counselor Special Research Projects of Furlong College, Hunan University of Science and Arts, 2023 (Project number: FRfdy2307).
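Funding (listed with the abstract on feature selection based on the independence measure between features): Supported by the National Natural Science Foundation of China (60373066, 60503020), the Outstanding Young Scientist's Fund (60425206), and the Doctor Foundation of Nanjing University of Posts and Telecommunications (2003-02).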
Funding (listed with the abstract on soft real-time text categorization systems): Supported by the National Natural Science Foundation of China (90104032) and the National High-Tech Research and Development Plan of China (2003AA1Z2090).
文摘In order to provide predictable runtime performante for text categorization (TC) systems, an innovative system design method is proposed for soft real time TC systems. An analyzable mathematical model is established to approximately describe the nonlinear and time-varying TC systems. According to this mathematical model, the feedback control theory is adopted to prove the system's stableness and zero steady state error. The experiments result shows that the error of deadline satisfied ratio in the system is kept within 4 of the desired value. And the number of classifiers can be dynamically adjusted by the system itself to save the computa tion resources. The proposed methodology enables the theo retical analysis and evaluation to the TC systems, leading to a high-quality and low cost implementation approach.
文摘A study examining affective information processing in persons with Multiple Sclerosis and healthy adults was carried out. It was hypothesized that individual characteristics could modulate participants’ emotional categorization and reaction times for categorization decisions. For example, individuals with negative valenced emotional profile (e.g. anxious) should choose negative emotional alternatives faster and more frequently. Participants consisted of two different populations: 80 right-handed healthy French-speakers, and 40 right-handed French- speakers with multiple sclerosis. The results showed a positive correlation between high- level of negative emotional sensibility and emotional categorization (decision and decision speed) for affective information presented on the right-side of the screen. For all participants there were more frequent emotional choices and faster decisions for left-side presented emotional alternatives. It seems individuals’ emotional differences in general and in MS populations modulate hemispheric asymmetry of processing emotional judgments.
Abstract: A hierarchical system that performs automatic categorization and reorientation of images using content analysis is presented. The proposed system first assigns images to a set of a priori defined categories using rotation-invariant features. At the second stage, it detects their correct orientation out of {0°, 90°, 180°, 270°} using a category-specific model. The system has been specially designed for embedded device applications and uses only low-level color and edge features. Machine learning algorithms optimized to suit the embedded implementation, such as support vector machines (SVMs) and scalable boosting, have been used to develop classifiers for categorization and orientation detection. Results are presented on a collection of about 7000 consumer images collected from open resources. The proposed system finds applications in various digital media products and brings pattern recognition solutions to the consumer electronics domain.
Abstract: In the current study, behavioral measures were used to investigate clothing color. The purpose was to examine how color brightness influences positive-negative emotional categorization. Results showed that the effect of brightness on clothing color emotion categorization was significant. As brightness increases, the variation curve of positive emotion is U-shaped, whereas that of negative emotion is an inverted U. Compared with low-brightness colors, the emotional reaction to high-brightness colors was more positive; most colors across the brightness scales were classified as positive emotions and a minority as negative emotions; and positive colors were categorized much faster than negative ones.
Abstract: In this paper, the role of rare or infrequent terms in enhancing the accuracy of English text categorization using Polynomial Networks (PNs) is investigated. To study the impact of rare terms on the accuracy of PN-based text categorization, different term reduction criteria as well as different term weighting schemes were tested on the Reuters Corpus using PNs. Each term weighting scheme on each reduced term set was tested twice: once keeping the rare terms and once removing them. All the experiments conducted in this research show that keeping rare terms substantially improves the performance of Polynomial Networks in text categorization, regardless of the term reduction method, the number of terms used in classification, or the term weighting scheme adopted.
Abstract: Content-Based Image Retrieval (CBIR) performs an automated classification task for a queried image. It can relieve a user from laborious and time-consuming metadata assignment when working on massive image collections. A user's definition or description of an image is subjective, so the same image could belong to different categories as defined by different users. Human-based categorization and computer-based categorization might produce different results due to different categorization criteria that rely on the dataset structure and the clustering techniques. This paper aims to present an idea for planning the dataset structure and choosing the clustering algorithm for CBIR implementation. The paper is arranged in 5 sections: CBIR and QBE concepts are introduced in Section 1, related image categorization research is listed in Section 2, the 5 types of image clustering are described in Section 3, a comparative analysis is given in Section 4, and Section 5 concludes the study. The outcome of this paper will benefit CBIR developers working on various applications.
Abstract: The study of the metaphor of "xin", or heart, can be traced back to the beginning of the 21st century, and an excellent text for such study is the famous work Caigentan, whose distinguished English version was translated by Paul Whiter. In this paper, the metaphors of "xin" in both the Chinese and English versions are analyzed from the perspective of the categorization process, especially their vehicle categories and mapping processes, so that similarities and differences can be identified. Finally, the Chinese and Western cultural contexts behind them are deduced and analyzed.
Abstract: Text categorization (TC) is one of the widely studied branches of text mining and has many applications in different domains. It tries to automatically assign a text document to one of the predefined categories, often by using machine learning (ML) techniques. Choosing the best classifier is the most important step in this task, and k-Nearest Neighbor (KNN) is widely employed alongside several other well-known classifiers such as Support Vector Machine, Multinomial Naive Bayes, and Logistic Regression. KNN has been extensively used for TC tasks and is one of the oldest and simplest methods for pattern classification. Its performance crucially relies on the distance metric used to identify nearest neighbors, since the most frequently observed label among these neighbors is used to classify an unseen test instance. Hence, in this paper, a comparative analysis of the KNN classifier is performed on a subset (i.e., R8) of the Reuters-21578 benchmark dataset for TC. Experimental results are obtained by using different distance metrics as well as recently proposed distance learning metrics under different cases where the feature model and term weighting scheme are different. Our comparative evaluation of the results shows that Bray-Curtis and Linear Discriminant Analysis (LDA) are often superior to the other metrics and work well with raw term frequency weights.
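As a rough illustration of the setting described in this abstract, the sketch below classifies a document by majority vote among its k nearest neighbours under the Bray-Curtis dissimilarity, computed on raw term-frequency vectors. The four-word vocabulary, the toy vectors, the labels, and the function names `bray_curtis` and `knn_predict` are assumptions made for illustration only; they are not taken from the paper or from the R8 data.

```python
from collections import Counter


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum|u_i + v_i|."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(abs(a + b) for a, b in zip(u, v))
    # Two all-zero vectors are treated as identical (an assumption of this sketch).
    return num / den if den else 0.0


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the most frequent label among the k training vectors closest to query."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: bray_curtis(train_vectors[i], query),
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical raw term-frequency vectors over the vocabulary
# ["oil", "price", "wheat", "crop"].
train_vectors = [[3, 2, 0, 0], [2, 1, 0, 1], [0, 0, 4, 2], [0, 1, 3, 3]]
train_labels = ["crude", "crude", "grain", "grain"]

predicted = knn_predict(train_vectors, train_labels, [1, 2, 0, 0], k=3)
print(predicted)
```

Swapping `bray_curtis` for another distance function is the only change needed to compare metrics in this sketch, which mirrors the kind of comparison the abstract reports.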
Abstract: To promote behavioral change among adolescents in Zambia, the National HIV/AIDS/STI/TB Council, in collaboration with UNICEF, developed the Zambia U-Report platform. This platform provides young people with improved access to information on various sexual reproductive health topics through Short Messaging Service (SMS) messages. Over the years, the platform has accumulated millions of incoming and outgoing messages, which need to be categorized into key thematic areas for better tracking of sexual reproductive health knowledge gaps among young people. The current manual categorization of these text messages is inefficient and time-consuming, and this study aims to automate the process for improved analysis using text-mining techniques. Firstly, the study investigates the current text message categorization process and identifies a list of categories adopted by counselors over time, which are then used to build and train a categorization model. Secondly, the study presents a proof-of-concept tool that automates the categorization of U-Report messages into key thematic areas using the developed categorization model. Finally, it compares the performance and effectiveness of the proof-of-concept tool against the manual system. The study used a dataset comprising 206,625 text messages. The current process would take roughly 2.82 years to categorize this dataset, whereas the trained SVM model would require only 6.4 minutes while achieving an accuracy of 70.4%, demonstrating that the automated method is significantly faster, more scalable, and more consistent than the current manual categorization. These advantages make the SVM model a more efficient and effective tool for categorizing large unstructured text datasets. These results and the proof-of-concept tool demonstrate the potential for enhancing the efficiency and accuracy of message categorization on the Zambia U-Report platform and other similar text-message-based platforms.
Abstract: The school placement process of students from immigrant backgrounds considered to be in "difficulty" is an international concern at the intersection of work on special education and work on the school experiences of students from immigrant backgrounds or racialized groups. The research problem of this article concerns the identification of these students as disabled or as having adjustment or learning difficulties. From a perspective anchored in Disability Critical Race Studies, this ethnographic study documents different interpretations of perceived difficulties made by school actors with regard to seven primary school students from immigrant backgrounds. Five interpretation types are presented: (1) medicalization by dismissal of cultural markers, (2) medicalization by professional constraint, (3) medicalization by cultural deficit, (4) precautionary wait, and (5) cultural differentialism. Our results help to shed light on the phenomenon of special education overrepresentation among these students and to understand how ableism and (neo)racism contribute to it.
Funding: Project (No. 2012BAH18B05) supported by the National Key Technology R&D Program of China
Abstract: Text categorization is a significant technique for managing the surging text data on the Internet. The k-nearest neighbors (kNN) algorithm is an effective, but not efficient, classification model for text categorization. In this paper, we propose an effective strategy to accelerate the standard kNN, based on a simple principle: usually, points that are near in space are also near when they are projected onto a direction, which means that points that are distant along the projection direction are also distant in the original space. Using the proposed strategy, most of the irrelevant points can be removed when searching for the k-nearest neighbors of a query point, which greatly decreases the computation cost. Experimental results show that the proposed strategy greatly improves the time performance of the standard kNN, with little degradation in accuracy. Specifically, it is superior in applications that have large and high-dimensional datasets.
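The pruning principle in this abstract rests on a standard inequality: for a unit direction d, |d·x − d·q| ≤ ‖x − q‖, so a point whose projection is far from the query's projection cannot be a near neighbour. The sketch below is one simple, exact way to exploit that bound and is not necessarily the authors' algorithm, which reports a small loss in accuracy and may therefore prune more aggressively. The single projection direction, the helper names `build_index` and `knn_pruned`, and the toy points are all assumptions for illustration.

```python
import bisect
import heapq
import math


def project(x, d):
    """Scalar projection of x onto direction d (d is assumed to be a unit vector)."""
    return sum(a * b for a, b in zip(x, d))


def build_index(points, direction):
    """Normalise the direction and sort the points by their projection onto it."""
    norm = math.sqrt(sum(c * c for c in direction))
    unit = [c / norm for c in direction]
    entries = sorted((project(p, unit), i) for i, p in enumerate(points))
    return unit, entries


def knn_pruned(points, index, query, k):
    """Exact k-NN search that stops once the projection gap exceeds the k-th best distance.

    Returns (neighbour ids ordered by distance, number of full distances computed).
    """
    unit, entries = index
    keys = [key for key, _ in entries]
    q_proj = project(query, unit)
    right = bisect.bisect_left(keys, q_proj)
    left = right - 1
    heap = []  # max-heap of the best k so far, stored as (-distance, point id)
    examined = 0
    while left >= 0 or right < len(entries):
        gap_left = q_proj - keys[left] if left >= 0 else math.inf
        gap_right = keys[right] - q_proj if right < len(entries) else math.inf
        # Always visit the unvisited point whose projection is closest to the query's.
        if gap_left <= gap_right:
            gap, pos = gap_left, left
            left -= 1
        else:
            gap, pos = gap_right, right
            right += 1
        # The gap is a lower bound on the true distance; every remaining point has a
        # gap at least this large, so none of them can improve the current k best.
        if len(heap) == k and gap > -heap[0][0]:
            break
        idx = entries[pos][1]
        dist = math.dist(points[idx], query)
        examined += 1
        if len(heap) < k:
            heapq.heappush(heap, (-dist, idx))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, idx))
    ordered = sorted((-neg_dist, i) for neg_dist, i in heap)
    return [i for _, i in ordered], examined


# Hypothetical toy data: seven 2-D points, projected onto the x-axis.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 1), (12, 0), (50, 50)]
index = build_index(points, (1, 0))
neighbours, examined = knn_pruned(points, index, (1.2, 0.1), k=2)
print(neighbours, examined)
```

In this sketch the index is built once and reused across queries, so the sorting cost is amortised; how much work is saved depends on how well the chosen direction spreads the data.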
Funding: Supported by the National Natural Science Foundation of China (Nos. 60573187 and 60321002) and the National High-Tech Research and Development (863) Program of China (No. 2007AA01Z148)
Abstract: Chinese text categorization differs from English text categorization due to its much larger term set (of words or character n-grams), which makes training and running modern high-performance classifiers very slow. This study assumes that this high-dimensionality problem is related to redundancy in the term set, which cannot be solved by traditional term selection methods. A greedy algorithm framework named "non-independent term selection" is presented, which reduces the redundancy according to string-level correlations. Several preliminary implementations of this idea are demonstrated. Experimental results show that a good tradeoff can be reached between the performance and the size of the term set.
Funding: the National Natural Science Foundation of China (Nos. 61073193 and 61300230), the Key Science and Technology Foundation of Gansu Province (No. 1102FKDA010), the Natural Science Foundation of Gansu Province (No. 1107RJZA188), and the Science and Technology Support Program of Gansu Province (No. 1104GKCA037)
Abstract: To improve the accuracy of text categorization and reduce the dimension of the feature space, this paper proposes a two-stage feature selection method based on a novel category correlation degree (CCD) method and latent semantic indexing (LSI). In the first stage, a novel CCD method is proposed to select the most effective features for text classification, and it is more effective than traditional feature selection methods. In the second stage, the concern is that document representation requires a high-dimensional feature space and does not take into account the semantic relations between features, which leads to poor categorization accuracy. The LSI method is therefore used to address these problems: it replaces individual terms with statistically derived conceptual indices, which can discover the important correlations between features and reduce the dimension of the feature space. Firstly, each feature in our algorithm is ranked according to its importance for classification using the CCD method. Secondly, we construct a new semantic space among features based on the LSI method. The experimental results show that our method can effectively reduce the dimension of the text vector and improve the performance of text categorization.
Abstract: Discovering the hierarchical structures of different classes of object behaviors can satisfy the requirements of various degrees of abstraction in association analysis, behavior modeling, data preprocessing, pattern recognition, decision making, etc. In this paper, we call this process associative categorization, which is different from classical clustering, associative classification, and associative clustering. Focusing on representing the associations of behaviors and the corresponding uncertainties, we propose a method for constructing a Markov network (MN) from the results of frequent pattern mining, called the item-associative Markov network (IAMN), where nodes and edges represent the frequent patterns and their associations, respectively. We further discuss the properties of a probabilistic graphical model to guarantee the IAMN's correctness theoretically. Then, we adopt the concept of chordality to reflect the closeness of nodes in the IAMN. Adopting the algorithm for constructing join trees from an MN, we give an algorithm for IAMN-based associative categorization by hierarchical bottom-up aggregation of nodes. Experimental results show the effectiveness, efficiency, and correctness of our methods.