Abstract: Automatic web page classification has become inevitable for web directories due to the multitude of web pages on the World Wide Web. In this paper, an improved term weighting technique is proposed for automatic and effective classification of web pages. The web documents are represented as sets of features. The proposed method selects and extracts the most prominent features, reducing the high-dimensionality problem of the classifier. The proper selection of features from the large set improves the performance of the classifier. The proposed algorithm is implemented and tested on a benchmark dataset. The results show better performance than most of the existing term weighting techniques.
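The abstract does not spell out the improved weighting scheme, so the sketch below uses plain TF-IDF as a stand-in to illustrate the general pipeline it describes: weight the terms, keep only the most prominent features, and train a classifier. The feature cap and the toy data are assumptions for illustration, not the paper's method.

```python
# Minimal sketch: weight terms with TF-IDF (a stand-in for the paper's
# improved scheme), keep only the most prominent features, then classify.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy web-page texts and directory labels (illustrative only).
pages = [
    "football match score league goal",
    "python code compiler programming bug",
    "stock market shares trading price",
    "tennis player tournament match win",
]
labels = ["sports", "technology", "finance", "sports"]

# max_features caps dimensionality, mimicking prominent-feature selection;
# the value 1000 is an assumption, not a published setting.
model = make_pipeline(
    TfidfVectorizer(max_features=1000, sublinear_tf=True),
    MultinomialNB(),
)
model.fit(pages, labels)
print(model.predict(["goal scored in the final match"]))  # -> ['sports']
```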
Abstract: Machine intelligence is the intelligence exhibited by artificial systems, and it is usually realized on ordinary computers. Rough sets and information granules, as tools for uncertainty management, soft computing, and granular computing, are widely used in many fields, such as protein sequence analysis and bio-basis determination, TSM, and Web service classification.
Abstract: In any organization where SOA has been implemented, all of the web services are registered in UDDI and users' needs are served by using appropriate web services. In this paper, we first try to discover a service from the repository that can provide the required output to the user. The process becomes difficult when a single service cannot fulfill a user's need and a combination of services is required to answer complex needs. We suggest a simple approach for dynamic service composition using a graph-based methodology; this is a design-time service composition. The approach uses the functional and non-functional parameters of the services to select the most suitable services for composition according to the user's need. It involves "service classification" on the basis of functional parameters, "service discovery" on the basis of the user's need, and then "service composition" using the selected services on the basis of non-functional parameters such as response time, cost, security, and availability. Another challenge in SOA implementation is that, once the composition has been performed, some services may become faulty at runtime and may halt the entire process of serving a user's need. We therefore also describe a method of "dynamic service reconfiguration" that identifies and replaces a faulty service that violates the SLA or is no longer accessible. This reconfiguration is done without redoing the entire composition. Finally, to simulate the proposed approach, we present a prototype application built on PHP 5.4 with a MySQL database at the back end.
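The abstract outlines selection by functional parameters and ranking by non-functional (QoS) parameters, followed by graph-based composition. The sketch below illustrates one plausible reading: filter candidate services by whether their inputs are satisfied, score them on assumed QoS fields, and greedily chain outputs to inputs until the user's required outputs are covered. All field names, weights, and the greedy strategy are assumptions, not the paper's specification.

```python
# Sketch: pick services by functional match, rank by assumed QoS fields,
# and greedily chain outputs to inputs. Fields and weights are illustrative.
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    inputs: set
    outputs: set
    response_time: float  # ms, lower is better
    cost: float           # lower is better
    availability: float   # 0..1, higher is better

def qos_score(s: Service) -> float:
    # Weighted QoS utility; the weights are assumptions for illustration.
    return 0.4 * s.availability - 0.3 * (s.response_time / 1000) - 0.3 * s.cost

def compose(services, have: set, want: set):
    """Greedy design-time composition: repeatedly add the best-scoring
    service whose inputs are already satisfied, until `want` is covered."""
    plan, have = [], set(have)
    while not want <= have:
        ready = [s for s in services if s.inputs <= have and s not in plan]
        if not ready:
            return None  # no composition exists for this request
        best = max(ready, key=qos_score)
        plan.append(best)
        have |= best.outputs
    return plan

catalog = [
    Service("GeoCode", {"address"}, {"lat", "lon"}, 120, 0.01, 0.99),
    Service("Weather", {"lat", "lon"}, {"forecast"}, 200, 0.02, 0.97),
]
plan = compose(catalog, have={"address"}, want={"forecast"})
print([s.name for s in plan])  # ['GeoCode', 'Weather']
```

A faulty service detected at runtime could be handled in the same frame by re-running `compose` only for the broken link's inputs and outputs, which matches the abstract's idea of reconfiguring without redoing the entire composition.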
Abstract: The number of Internet users and the number of web pages being added to the World Wide Web increase dramatically every day. It is therefore necessary to automatically and efficiently classify web pages into web directories, which helps search engines provide users with relevant and quick retrieval results. As web pages are represented by thousands of features, feature selection helps web page classifiers resolve this large-scale dimensionality problem. This paper proposes a new feature selection method using Ward's minimum variance measure. The measure is first used to identify clusters of redundant features in a web page. In each cluster, the best representative features are retained and the others are eliminated. Removing such redundant features helps minimize resource utilization during classification. The proposed method is compared with other common feature selection methods. Experiments on a benchmark dataset, namely WebKB, show that the proposed method performs better than most of the other feature selection methods in terms of reducing the number of features and the classifier modeling time.
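A minimal sketch of the clustering step, assuming SciPy's Ward linkage as the minimum-variance measure: feature columns are clustered hierarchically and one representative per cluster is kept. The representative criterion (highest variance) and the cluster count are assumptions for illustration, not the paper's exact rule.

```python
# Sketch: cluster redundant features with Ward's minimum-variance linkage
# and keep one representative per cluster. The representative criterion
# (highest variance) and n_clusters=5 are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import fcluster, ward

rng = np.random.default_rng(0)
X = rng.random((100, 20))          # 100 pages x 20 term features (toy data)

# Ward linkage over feature columns (transpose so rows = features).
Z = ward(X.T)
cluster_ids = fcluster(Z, t=5, criterion="maxclust")

keep = []
for c in np.unique(cluster_ids):
    members = np.where(cluster_ids == c)[0]
    # Retain the member with the highest variance as the representative.
    keep.append(members[np.argmax(X[:, members].var(axis=0))])

X_reduced = X[:, sorted(keep)]     # 100 x 5 matrix of representative features
print(X_reduced.shape)
```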
Funding: Project supported by the National Natural Science Foundation of China (No. 61471314) and the Welfare Technology Research Project of Zhejiang Province, China (No. LGG18F010003).
Abstract: Precise web page classification can be achieved by evaluating the features of web pages, and structural features are effective complements to textual features. Different classifiers have different characteristics, and multiple classifiers can be combined so that they complement one another. In this study, a web page classification method based on heterogeneous features and a combination of multiple classifiers is proposed. Rather than computing the frequency of HTML tags, we exploit the tree-like structure of HTML tags to characterize the structural features of a web page. Heterogeneous textual features and the proposed tree-like structural features are converted into vectors and fused. Confidence is proposed as a criterion for comparing the classification results of different classifiers, computed from the classification accuracy on a set of samples. Multiple classifiers are combined based on confidence with different decision strategies, such as voting, confidence comparison, and direct output, to give the final classification results. Experimental results demonstrate that on the Amazon, 7-web-genres, and DMOZ datasets, the accuracies are increased to 94.2%, 95.4%, and 95.7%, respectively. Fusing the textual features with the proposed structural features is a comprehensive approach, and its accuracy is higher than that obtained using textual features alone. At the same time, combining multiple classifiers further improves the accuracy of web page classification, surpassing related web page classification algorithms.
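The combination step can be read as: estimate each classifier's confidence from its accuracy on a held-out sample set, then decide by direct output, voting, or trusting the most confident model. The sketch below implements that reading with scikit-learn; the base models and the 0.95 threshold are assumptions, not the paper's settings.

```python
# Sketch: combine classifiers using per-model "confidence" estimated as
# held-out accuracy. Threshold and base models are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

models = [LogisticRegression(max_iter=1000), GaussianNB(), DecisionTreeClassifier()]
confidence = []
for m in models:
    m.fit(X_tr, y_tr)
    confidence.append(m.score(X_val, y_val))  # accuracy on validation samples

def predict(x):
    preds = np.array([m.predict(x.reshape(1, -1))[0] for m in models])
    best = int(np.argmax(confidence))
    if confidence[best] > 0.95:        # "direct output" of a very confident model
        return int(preds[best])
    votes = np.bincount(preds)
    if votes.max() > 1:                # majority "voting"
        return int(np.argmax(votes))
    return int(preds[best])            # fall back to "confidence comparison"

print(predict(X_val[0]), y_val[0])
```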
Funding: Project supported by the National High-Tech R&D Program (863) of China (No. 2009AA011900), the Zhejiang Provincial Natural Science Foundation of China (No. 2011Y1110960), and the Zhejiang Provincial Nonprofit Technology and Application Research Program of China (Nos. 2011C31045 and 2012C21020).
Abstract: Image classification is an essential task in content-based image retrieval. However, due to the semantic gap between low-level visual features and high-level semantic concepts, and the diversification of Web images, the performance of traditional classification approaches is far from users' expectations. In an attempt to reduce the semantic gap and satisfy the urgent requirements for dimensionality reduction, high-quality retrieval results, and batch-based processing, we propose a hierarchical image manifold with novel distance measures. Assuming that the images in an image set describe the same or similar object but have various scenes, we formulate two kinds of manifolds, an object manifold and a scene manifold, at different levels of semantic granularity. The object manifold is developed for object-level classification using an algorithm named extended locally linear embedding (ELLE), based on intra- and inter-object difference measures. The scene manifold is built for scene-level classification using an algorithm named locally linear submanifold extraction (LLSE), which combines linear perturbation and region growing. Experimental results show that our method is effective in improving the performance of classifying Web images.
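ELLE and LLSE are the paper's own algorithms and are not reproduced here. As a rough illustration of the underlying manifold-learning step, the sketch below applies standard locally linear embedding (LLE) from scikit-learn to image feature vectors before nearest-neighbor classification; the dataset and all parameter values are stand-in assumptions.

```python
# Sketch: standard LLE as a loose stand-in for the paper's ELLE/LLSE.
# Embeds high-dimensional image feature vectors on a low-dimensional
# manifold, then classifies with nearest neighbors. Parameters are assumed.
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)   # small images as a stand-in dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = make_pipeline(
    LocallyLinearEmbedding(n_neighbors=10, n_components=15, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_tr, y_tr)
print(f"manifold + kNN accuracy: {model.score(X_te, y_te):.3f}")
```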
Abstract: Web page classification is an important application in many fields of Internet information retrieval, such as directory classification and vertical search. Methods based on query logs, a lightweight alternative to full Web page classification, can avoid crawling Web content and are therefore relatively efficient, but the sparsity of user click data makes the logs difficult to use directly for constructing a classifier. To solve this problem, we explore the semantic relations among different queries through word embedding and propose three improved graph-structure classification algorithms. To reflect the semantic relevance between queries, we first map each user query into a low-dimensional space according to its query vector. Then, we calculate the uniform resource locator (URL) vector according to the relationship between the query and the URL. Finally, we use an improved label propagation algorithm (LPA) and a bipartite graph expansion algorithm to classify the unlabeled Web pages. Experiments show that our methods achieve about a 20% higher F1-value than other Web page classification methods based on query logs.
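A minimal sketch of the label propagation step, assuming scikit-learn's LabelSpreading over URL embedding vectors: a few pages carry known category labels (the rest are marked -1, reflecting sparse click data), and labels diffuse along the similarity graph. The toy vectors and kernel parameters are illustrative assumptions, not the paper's improved LPA.

```python
# Sketch: propagate sparse category labels through an embedding-similarity
# graph. sklearn's LabelSpreading stands in for the paper's improved LPA;
# the toy vectors and kernel parameters are illustrative assumptions.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
# Toy URL vectors: two clusters standing in for two page categories.
X = np.vstack([rng.normal(0, 0.3, (20, 8)), rng.normal(2, 0.3, (20, 8))])
y = np.full(40, -1)          # -1 marks unlabeled pages (click data is sparse)
y[0], y[20] = 0, 1           # one labeled example per category

lpa = LabelSpreading(kernel="knn", n_neighbors=7)
lpa.fit(X, y)
print(lpa.transduction_[:5], lpa.transduction_[20:25])  # propagated labels
```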