A rough set based corner classification neural network, the Rough-CC4, is presented to solve document classification problems such as representing documents of different sizes, document feature selection and document feature encoding. In the Rough-CC4, documents are described by the equivalence classes of approximate words. This reduces the number of dimensions representing the documents, which resolves the precision problems caused by differing document sizes and blurs the differences caused by approximate words. The Rough-CC4 also introduces a binary encoding method that encodes the importance of documents relative to each equivalence class. This encoding greatly improves the precision of the Rough-CC4 and reduces its space complexity. The Rough-CC4 can be used for automatic document classification.
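The abstract does not say how the equivalence classes are built or how the binary code is laid out, so the following is only a rough sketch of the general idea. It assumes a hand-written word-to-class mapping (`word_to_class`, hypothetical) in place of the rough-set construction, and a plain fixed-width binary code of capped per-class counts in place of the paper's encoding.

```python
def class_counts(tokens, word_to_class, n_classes):
    """Count how many tokens fall into each equivalence class.

    Words outside the mapping are ignored, so the vector length depends
    on the number of classes, not on the vocabulary or document size.
    """
    counts = [0] * n_classes
    for token in tokens:
        cls = word_to_class.get(token)
        if cls is not None:
            counts[cls] += 1
    return counts


def binary_encode(counts, bits=3):
    """Encode each class count as a fixed-width binary code (assumed layout).

    Counts are capped at 2**bits - 1 so every class uses exactly `bits` inputs.
    """
    cap = (1 << bits) - 1
    vector = []
    for count in counts:
        count = min(count, cap)
        vector.extend((count >> i) & 1 for i in reversed(range(bits)))
    return vector


# Hypothetical mapping: near-synonyms share one equivalence class.
word_to_class = {"car": 0, "automobile": 0, "vehicle": 0, "bank": 1, "loan": 1}
tokens = ["car", "automobile", "bank", "car"]

counts = class_counts(tokens, word_to_class, n_classes=2)
print(counts)                 # [3, 1]
print(binary_encode(counts))  # [0, 1, 1, 0, 0, 1]
```

Five distinct words collapse into two classes here, which illustrates the dimension reduction the abstract describes; the cap and bit width are arbitrary choices made for the example.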
This paper proposes a method for automatically constructing an effective domain ontology. The main idea is to use Formal Concept Analysis to establish the domain ontology automatically. The ontology then serves as the basis for a Naive Bayes classifier in order to demonstrate its effectiveness for document classification. A total of 1752 documents divided into 10 categories are used to assess the ontology, of which 1252 are training documents and 500 are testing documents. The F1-measure is the assessment criterion, and three results are obtained. The average recall of the Naive Bayes classifier is 0.94, so its recall based on the automatically constructed ontology is excellent. The average precision is 0.81, so its precision is good. The average F1-measure over the 10 categories is 0.86, so the classifier is effective in terms of the F1-measure. Thus, the automatically constructed domain ontology can indeed serve as the document categories for effective document classification.
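As a quick check on how these three figures relate, the sketch below computes the F1-measure as the harmonic mean of precision and recall. The per-category values are not given in the abstract, so the `toy_categories` pairs are invented purely to show why an average of per-category F1 scores (0.86 in the abstract) need not equal the F1 of the averaged precision and recall (about 0.87).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(pairs):
    """Average of per-category F1 scores for (precision, recall) pairs."""
    return sum(f1_score(p, r) for p, r in pairs) / len(pairs)


# F1 computed from the averaged precision and recall quoted in the abstract.
print(round(f1_score(0.81, 0.94), 2))  # 0.87

# Invented two-category example: the two ways of averaging disagree.
toy_categories = [(1.0, 0.5), (0.5, 1.0)]
print(round(macro_f1(toy_categories), 4))  # 0.6667
print(round(f1_score(0.75, 0.75), 4))      # 0.75
```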
Many algorithms have been implemented for the problem of document categorization. Most work in this area has addressed English text, while very few approaches have been introduced for Arabic text. The nature of Arabic text differs from that of English text, and preprocessing Arabic text is more challenging, because Arabic is a highly inflectional and derivational language, which makes document mining a hard and complex task. In this paper, we present an automatic Arabic document classification system based on the kNN algorithm. We also develop an approach to keyword extraction and reduction using the Document Frequency (DF) threshold method. The results indicate that kNN handles Arabic text better than other existing systems. The proposed system reached a micro-recall of 0.95 with 850 Arabic texts in 6 categories.
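The combination of a DF threshold with kNN can be sketched roughly as follows. This is a minimal illustration and not the authors' system: it assumes documents are already tokenized, uses raw term counts with cosine similarity and majority voting, and runs on invented English toy data. The threshold `min_df`, the value of `k` and all function names are assumptions, and Arabic-specific preprocessing such as stemming is left out.

```python
from collections import Counter
import math


def select_by_df(docs, min_df):
    """Keep only terms that occur in at least `min_df` documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term for term, n in df.items() if n >= min_df}


def vectorize(tokens, vocab):
    """Bag-of-words counts restricted to the selected vocabulary."""
    return Counter(t for t in tokens if t in vocab)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, train_docs, train_labels, vocab, k=3):
    """Label a query by majority vote among its k most similar documents."""
    q = vectorize(query, vocab)
    scored = sorted(
        ((cosine(q, vectorize(d, vocab)), label)
         for d, label in zip(train_docs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Invented toy corpus, for illustration only.
train_docs = [
    ["goal", "match", "team"],
    ["team", "win", "match"],
    ["bank", "loan", "rate"],
    ["rate", "bank", "market"],
]
train_labels = ["sport", "sport", "finance", "finance"]

vocab = select_by_df(train_docs, min_df=2)
print(sorted(vocab))  # ['bank', 'match', 'rate', 'team']
print(knn_classify(["team", "match", "score"], train_docs, train_labels, vocab))  # sport
```

Dropping terms below the threshold shrinks the vocabulary from eight terms to four in this toy case, which is the reduction step the abstract refers to.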
The paper considers the problem of semantic processing of web documents by designing an approach that combines an extracted semantic document model with a domain-related knowledge base. The knowledge base is populated with learnt classification rules that categorize documents into topics. Classification reduces the dimensionality of the document feature space. The semantic model of retrieved web documents is semantically labeled by querying a domain ontology and processed with a content-based classification method. The resulting model is mapped to the existing knowledge base by an inference algorithm, which enables models of the same semantic type to be recognized and integrated into the knowledge base. The approach supports domain knowledge integration and assists the extraction and modeling of web document semantics. Implementation results of the proposed approach are presented.
Although the corporate relationship manager seems to be the key enabler in commercial banking, the personal relationship sales model is not a sustainable model for the paradigm shift in digital financial markets. In this research, we propose a knowledge-driven decision analytics approach to improve the decision process. However, most of the corporate client documents processed in banks are not well structured, and the traditional analysis approach does not consider the document structure, which carries rich semantic information. We propose a document structure-based text representation approach that incorporates auxiliary information in the predictive analytics of unstructured data to improve performance in the document classification task. The proposed approach significantly outperforms the traditional whole-document approach, which does not take the document structure into consideration. With the proposed approach, knowledge can be used effectively and efficiently for business decisions and planning to improve the competitive advantage and substantiality of banks.
Funding: The National Natural Science Foundation of China (No. 60503020, 60373066, 60403016, 60425206), the Natural Science Foundation of Jiangsu Higher Education Institutions (No. 04KJB520096), the Doctoral Foundation of Nanjing University of Posts and Telecommunication (No. 0302).