Web usage mining,content mining,and structure mining comprise the web mining process.Web-Page Recommendation(WPR)development by incor-porating Data Mining Techniques(DMT)did not include end-users with improved perform...Web usage mining,content mining,and structure mining comprise the web mining process.Web-Page Recommendation(WPR)development by incor-porating Data Mining Techniques(DMT)did not include end-users with improved performance in the obtainedfiltering results.The cluster user profile-based clustering process is delayed when it has a low precision rate.Markov Chain Monte Carlo-Dynamic Clustering(MC2-DC)is based on the User Behavior Profile(UBP)model group’s similar user behavior on a dynamic update of UBP.The Reversible-Jump Concept(RJC)reviews the history with updated UBP and moves to appropriate clusters.Hamilton’s Filtering Framework(HFF)is designed tofilter user data based on personalised information on automatically updated UBP through the Search Engine(SE).The Hamilton Filtered Regime Switching User Query Probability(HFRSUQP)works forward the updated UBP for easy and accuratefiltering of users’interests and improves WPR.A Probabilistic User Result Feature Ranking based on Gaussian Distribution(PURFR-GD)has been developed to user rank results in a web mining process.PURFR-GD decreases the delay time in the end-to-end workflow for SE personalization in various meth-ods by using the Gaussian Distribution Function(GDF).The theoretical analysis and experiment results of the proposed MC2-DC method automatically increase the updated UBP accuracy by 18.78%.HFRSUQP enabled extensive Maximize Log-Likelihood(ML-L)increases to 15.28%of User Personalized Information Search Retrieval Rate(UPISRT).For feature ranking,the PURFR-GD model defines higher Classification Accuracy(CA)and Precision Ratio(PR)while uti-lising minimum Execution Time(ET).Furthermore,UPISRT's ranking perfor-mance has improved by 20%.展开更多
To alleviate the scalability problem caused by the increasing Web using and changing users’ interests, this paper presents a novel Web Usage Mining algorithm-Incremental Web Usage Mining algorithm based on Active Ant...To alleviate the scalability problem caused by the increasing Web using and changing users’ interests, this paper presents a novel Web Usage Mining algorithm-Incremental Web Usage Mining algorithm based on Active Ant Colony Clustering. Firstly, an active movement strategy about direction selection and speed, different with the positive strategy employed by other Ant Colony Clustering algorithms, is proposed to construct an Active Ant Colony Clustering algorithm, which avoid the idle and “flying over the plane” moving phenomenon, effectively improve the quality and speed of clustering on large dataset. Then a mechanism of decomposing clusters based on above methods is introduced to form new clusters when users’ interests change. Empirical studies on a real Web dataset show the active ant colony clustering algorithm has better performance than the previous algorithms, and the incremental approach based on the proposed mechanism can efficiently implement incremental Web usage mining.展开更多
Cluster analysis is a crucial technique in unsupervised machine learning,pattern recognition,and data analysis.However,current clustering algorithms suffer from the need for manual determination of parameter values,lo...Cluster analysis is a crucial technique in unsupervised machine learning,pattern recognition,and data analysis.However,current clustering algorithms suffer from the need for manual determination of parameter values,low accuracy,and inconsistent performance concerning data size and structure.To address these challenges,a novel clustering algorithm called the fully automated density-based clustering method(FADBC)is proposed.The FADBC method consists of two stages:parameter selection and cluster extraction.In the first stage,a proposed method extracts optimal parameters for the dataset,including the epsilon size and a minimum number of points thresholds.These parameters are then used in a density-based technique to scan each point in the dataset and evaluate neighborhood densities to find clusters.The proposed method was evaluated on different benchmark datasets andmetrics,and the experimental results demonstrate its competitive performance without requiring manual inputs.The results show that the FADBC method outperforms well-known clustering methods such as the agglomerative hierarchical method,k-means,spectral clustering,DBSCAN,FCDCSD,Gaussian mixtures,and density-based spatial clustering methods.It can handle any kind of data set well and perform excellently.展开更多
This paper proposes a novel phishing web image segmentation algorithm which based on improving spectral clustering.Firstly,we construct a set of points which are composed of spatial location pixels and gray levels fro...This paper proposes a novel phishing web image segmentation algorithm which based on improving spectral clustering.Firstly,we construct a set of points which are composed of spatial location pixels and gray levels from a given image.Secondly,the data is clustered in spectral space of the similar matrix of the set points,in order to avoid the drawbacks of K-means algorithm in the conventional spectral clustering method that is sensitive to initial clustering centroids and convergence to local optimal solution,we introduce the clone operator,Cauthy mutation to enlarge the scale of clustering centers,quantum-inspired evolutionary algorithm to find the global optimal clustering centroids.Compared with phishing web image segmentation based on K-means,experimental results show that the segmentation performance of our method gains much improvement.Moreover,our method can convergence to global optimal solution and is better in accuracy of phishing web segmentation.展开更多
The content-ignorant clustering method takes advantages in time complexity and space complexity than the content based methods.In this paper,the authors introduce a unified expanding method for content-ignorant web pa...The content-ignorant clustering method takes advantages in time complexity and space complexity than the content based methods.In this paper,the authors introduce a unified expanding method for content-ignorant web page clustering by mining the "click-through" log,which tries to solve the problem that the "click-through" log is sparse.The relationship between two nodes which have been expanded is also defined and optimized.Analysis and experiment show that the performance of the new method has improved,by the comparison with the standard content-ignorant method.The new method can also work without iterative clustering.展开更多
With the advent of the era of big data and the development and construction of smart campuses,the campus is gradually moving towards digitalization,networking and informationization.The campus card is an important par...With the advent of the era of big data and the development and construction of smart campuses,the campus is gradually moving towards digitalization,networking and informationization.The campus card is an important part of the construction of a smart campus,and the massive data it generates can indirectly reflect the living conditions of students at school.In the face of the campus card,how to quickly and accurately obtain the information required by users from the massive data sets has become an urgent problem that needs to be solved.This paper proposes a data mining algorithm based on K-Means clustering and time series.It analyzes the consumption data of a college student’s card to deeply mine and analyze the daily life consumer behavior habits of students,and to make an accurate judgment on the specific life consumer behavior.The algorithm proposed in this paper provides a practical reference for the construction of smart campuses in universities,and has important theoretical and application values.展开更多
In this paper, an improved algorithm, named STC\|I, is proposed for Chinese Web page clustering based on Chinese language characteristics, which adopts a new unit choice principle and a novel suffix tree construction ...In this paper, an improved algorithm, named STC\|I, is proposed for Chinese Web page clustering based on Chinese language characteristics, which adopts a new unit choice principle and a novel suffix tree construction policy. The experimental results show that the new algorithm keeps advantages of STC, and is better than STC in precision and speed when they are used to cluster Chinese Web page.展开更多
The backdoor or information leak of Web servers can be detected by using Web Mining techniques on some abnormal Web log and Web application log data. The security of Web servers can be enhanced and the damage of illeg...The backdoor or information leak of Web servers can be detected by using Web Mining techniques on some abnormal Web log and Web application log data. The security of Web servers can be enhanced and the damage of illegal access can be avoided. Firstly, the system for discovering the patterns of information leakages in CGI scripts from Web log data was proposed. Secondly, those patterns for system administrators to modify their codes and enhance their Web site security were provided. The following aspects were described: one is to combine web application log with web log to extract more information,so web data mining could be used to mine web log for discovering the information that firewall and Information Detection System cannot find. Another approach is to propose an operation module of web site to enhance Web site security. In cluster server session, Density -Based Clustering technique is used to reduce resource cost and obtain better efficiency.展开更多
The integration of the two fast-developing scientific research areas Semantic Web and Web Mining is known as Semantic Web Mining. The huge increase in the amount of Semantic Web data became a perfect target for many r...The integration of the two fast-developing scientific research areas Semantic Web and Web Mining is known as Semantic Web Mining. The huge increase in the amount of Semantic Web data became a perfect target for many researchers to apply Data Mining techniques on it. This paper gives a detailed state-of-the-art survey of on-going research in this new area. It shows the positive effects of Semantic Web Mining, the obstacles faced by researchers and propose number of approaches to deal with the very complex and heterogeneous information and knowledge which are produced by the technologies of Semantic Web.展开更多
A new method for Web users fuzzy clustering based on analysis of user interest characteristic is proposed in this article. The method first defines page fuzzy categories according to the links on the index page of the...A new method for Web users fuzzy clustering based on analysis of user interest characteristic is proposed in this article. The method first defines page fuzzy categories according to the links on the index page of the site, then computes fuzzy degree of cross page through aggregating on data of Web log. After that, by using fuzzy comprehensive evaluation method, the method constructs user interest vectors according to page viewing times and frequency of hits, and derives the fuzzy similarity matrix from the interest vectors for the Web users. Finally, it gets the clustering result through the fuzzy clustering method. The experimental results show the effectiveness of the method.展开更多
Rough set theory is a new soft computing tool, and has received much attention of researchers around the world. It can deal with incomplete and uncertain information. Now, it has been applied in many areas successfull...Rough set theory is a new soft computing tool, and has received much attention of researchers around the world. It can deal with incomplete and uncertain information. Now, it has been applied in many areas successfully. This paper introduces the basic concepts of rough set and discusses its applications in Web mining. In particular, some applications of rough set theory to intelligent information processing are emphasized.展开更多
Traditional machine-learning algorithms are struggling to handle the exceedingly large amount of data being generated by the internet. In real-world applications, there is an urgent need for machine-learning algorithm...Traditional machine-learning algorithms are struggling to handle the exceedingly large amount of data being generated by the internet. In real-world applications, there is an urgent need for machine-learning algorithms to be able to handle large-scale, high-dimensional text data. Cloud computing involves the delivery of computing and storage as a service to a heterogeneous community of recipients. Recently, it has aroused much interest in industry and academia. Most previous works on cloud platforms only focus on the parallel algorithms for structured data. In this paper, we focus on the parallel implementation of web-mining algorithms and develop a parallel web-mining system that includes parallel web crawler; parallel text extract, transform and load (ETL) and modeling; and parallel text mining and application subsystems. The complete system enables variable real-world web-mining applications for mass data.展开更多
In this paper, visualization of special features in “The Tale of Genji”, which is a typical Japanese classical literature, is studied by text mining the auxiliary verbs and examining the similarity in the sentence s...In this paper, visualization of special features in “The Tale of Genji”, which is a typical Japanese classical literature, is studied by text mining the auxiliary verbs and examining the similarity in the sentence style by the correspondence analysis with clustering. The result shows that the text mining error in the number of auxiliary verbs can be as small as 15%. The extracted feature in this study supports the multiple authors of “The Tale of Genji”, which agrees well with the result by Murakami and Imanishi [1]. It is also found that extracted features are robust to the text mining error, which suggests that the classification error is less affected by the text mining error and the possible use of this technique for further statistical study in classical literatures.展开更多
Data mining is the powerful technique, which can be widely used for discovering the customers’ behaviors as well as customer’s preferences. As a result, it has been widely used in top level companies for evaluating ...Data mining is the powerful technique, which can be widely used for discovering the customers’ behaviors as well as customer’s preferences. As a result, it has been widely used in top level companies for evaluating their Customer Relationship Management (CRM) system today. In this study, a new K-means clustering method proposed to evaluate the cluster customers’ profitability in telecommunication industry in Sri Lanka. Furthermore, RFM model mainly used as an input variable for K-means clustering and distortion curve used to identify optimal number of initial clusters. Based on the results, telecommunication customers’ profitability in Sri Lanka mainly categorized into three levels.展开更多
We combine the web usage mining and fuzzy clustering and give the concept of web fuzzy clustering, and then put forward the web fuzzy clustering processing model which is discussed in detail. Web fuzzy clustering can ...We combine the web usage mining and fuzzy clustering and give the concept of web fuzzy clustering, and then put forward the web fuzzy clustering processing model which is discussed in detail. Web fuzzy clustering can be used in the web users clustering and web pages clustering. In the end, a case study is given and the result has proved the feasibility of using web fuzzy clustering in web pages clustering.展开更多
Conceptual clustering is mainly used for solving the deficiency and incompleteness of domain knowledge. Based on conceptual clustering technology and aiming at the institutional framework and characteristic of Web the...Conceptual clustering is mainly used for solving the deficiency and incompleteness of domain knowledge. Based on conceptual clustering technology and aiming at the institutional framework and characteristic of Web theme information, this paper proposes and implements dynamic conceptual clustering algorithm and merging algorithm for Web documents, and also analyses the super performance of the clustering algorithm in efficiency and clustering accuracy.展开更多
Hierarchical clustering analysis based on statistic s is one of the most important mining algorithms, but the traditionary hierarchica l clustering method is based on global comparing, which only takes in Q clusteri n...Hierarchical clustering analysis based on statistic s is one of the most important mining algorithms, but the traditionary hierarchica l clustering method is based on global comparing, which only takes in Q clusteri ng while ignoring R clustering in practice, so it has some limitation especially when the number of sample and index is very large. Furthermore, because of igno ring the association between the different indexes, the clustering result is not good & true. In this paper, we present the model and the algorithm of two-level hierarchi cal clustering which integrates Q clustering with R clustering. Moreover, becaus e two-level hierarchical clustering is based on the respective clustering resul t of each class, the classification of the indexes directly effects on the a ccuracy of the final clustering result, how to appropriately classify the inde xes is the chief and difficult problem we must handle in advance. Although some literatures also have referred to the issue of the classificati on of the indexes, but the articles classify the indexes only according to their superficial signification, which is unscientific. The reasons are as follow s: First, the superficial signification of some indexes usually takes on different meanings and it is easy to be misapprehended by different person. Furthermore, t his classification method seldom make use of history data, the classification re sult is not so objective. Second, for some indexes, its superficial signification didn’t show any mean ings, so simply from the superficial signification, we can’t classify them to c ertain classes. Third, this classification method need the users have higher level knowledge of this field, otherwise it is difficult for the users to understand the signifi cation of some indexes, which sometimes is not available. So in this paper, to this question, we first use R clustering method to cluste ring indexes, dividing p dimension indexes into q classes, then adopt two-level clustering method to get the final result. Obviously, the classification result is more objective and accurate. Moreover, after the first step, we can get the relation of the different indexes and their interaction. We can also know under a certain class indexes, which samples can be clustering to a class. (These semi finished results sometimes are very useful.) The experiments also indicates the effective and accurate of the algorithms. And, the result of R clustering ca n be easily used for the later practice.展开更多
基金Supporting this study through Taif University Researchers Supporting Project number(TURSP-2020/115),Taif University,Taif,Saudi Arabia.
文摘Web usage mining,content mining,and structure mining comprise the web mining process.Web-Page Recommendation(WPR)development by incor-porating Data Mining Techniques(DMT)did not include end-users with improved performance in the obtainedfiltering results.The cluster user profile-based clustering process is delayed when it has a low precision rate.Markov Chain Monte Carlo-Dynamic Clustering(MC2-DC)is based on the User Behavior Profile(UBP)model group’s similar user behavior on a dynamic update of UBP.The Reversible-Jump Concept(RJC)reviews the history with updated UBP and moves to appropriate clusters.Hamilton’s Filtering Framework(HFF)is designed tofilter user data based on personalised information on automatically updated UBP through the Search Engine(SE).The Hamilton Filtered Regime Switching User Query Probability(HFRSUQP)works forward the updated UBP for easy and accuratefiltering of users’interests and improves WPR.A Probabilistic User Result Feature Ranking based on Gaussian Distribution(PURFR-GD)has been developed to user rank results in a web mining process.PURFR-GD decreases the delay time in the end-to-end workflow for SE personalization in various meth-ods by using the Gaussian Distribution Function(GDF).The theoretical analysis and experiment results of the proposed MC2-DC method automatically increase the updated UBP accuracy by 18.78%.HFRSUQP enabled extensive Maximize Log-Likelihood(ML-L)increases to 15.28%of User Personalized Information Search Retrieval Rate(UPISRT).For feature ranking,the PURFR-GD model defines higher Classification Accuracy(CA)and Precision Ratio(PR)while uti-lising minimum Execution Time(ET).Furthermore,UPISRT's ranking perfor-mance has improved by 20%.
基金Supported by the Natural Science Foundation of Jiangsu Province(BK2005046)
文摘To alleviate the scalability problem caused by the increasing Web using and changing users’ interests, this paper presents a novel Web Usage Mining algorithm-Incremental Web Usage Mining algorithm based on Active Ant Colony Clustering. Firstly, an active movement strategy about direction selection and speed, different with the positive strategy employed by other Ant Colony Clustering algorithms, is proposed to construct an Active Ant Colony Clustering algorithm, which avoid the idle and “flying over the plane” moving phenomenon, effectively improve the quality and speed of clustering on large dataset. Then a mechanism of decomposing clusters based on above methods is introduced to form new clusters when users’ interests change. Empirical studies on a real Web dataset show the active ant colony clustering algorithm has better performance than the previous algorithms, and the incremental approach based on the proposed mechanism can efficiently implement incremental Web usage mining.
基金the Deanship of Scientific Research at Umm Al-Qura University,Grant Code:(23UQU4361009DSR001).
文摘Cluster analysis is a crucial technique in unsupervised machine learning,pattern recognition,and data analysis.However,current clustering algorithms suffer from the need for manual determination of parameter values,low accuracy,and inconsistent performance concerning data size and structure.To address these challenges,a novel clustering algorithm called the fully automated density-based clustering method(FADBC)is proposed.The FADBC method consists of two stages:parameter selection and cluster extraction.In the first stage,a proposed method extracts optimal parameters for the dataset,including the epsilon size and a minimum number of points thresholds.These parameters are then used in a density-based technique to scan each point in the dataset and evaluate neighborhood densities to find clusters.The proposed method was evaluated on different benchmark datasets andmetrics,and the experimental results demonstrate its competitive performance without requiring manual inputs.The results show that the FADBC method outperforms well-known clustering methods such as the agglomerative hierarchical method,k-means,spectral clustering,DBSCAN,FCDCSD,Gaussian mixtures,and density-based spatial clustering methods.It can handle any kind of data set well and perform excellently.
基金Supported by the Fundamental Research Funds for the Central Universities in North China Electric Power University(11MG13)the Natural Science Foundation of Hebei Province(F2011502038)
文摘This paper proposes a novel phishing web image segmentation algorithm which based on improving spectral clustering.Firstly,we construct a set of points which are composed of spatial location pixels and gray levels from a given image.Secondly,the data is clustered in spectral space of the similar matrix of the set points,in order to avoid the drawbacks of K-means algorithm in the conventional spectral clustering method that is sensitive to initial clustering centroids and convergence to local optimal solution,we introduce the clone operator,Cauthy mutation to enlarge the scale of clustering centers,quantum-inspired evolutionary algorithm to find the global optimal clustering centroids.Compared with phishing web image segmentation based on K-means,experimental results show that the segmentation performance of our method gains much improvement.Moreover,our method can convergence to global optimal solution and is better in accuracy of phishing web segmentation.
文摘The content-ignorant clustering method takes advantages in time complexity and space complexity than the content based methods.In this paper,the authors introduce a unified expanding method for content-ignorant web page clustering by mining the "click-through" log,which tries to solve the problem that the "click-through" log is sparse.The relationship between two nodes which have been expanded is also defined and optimized.Analysis and experiment show that the performance of the new method has improved,by the comparison with the standard content-ignorant method.The new method can also work without iterative clustering.
基金Science and Technology Project of Guizhou Province of China(Grant QKHJC[2019]1403)and(Grant QKHJC[2019]1041)Guizhou Province Colleges and Universities Top Technology Talent Support Program(Grant QJHKY[2016]068).
文摘With the advent of the era of big data and the development and construction of smart campuses,the campus is gradually moving towards digitalization,networking and informationization.The campus card is an important part of the construction of a smart campus,and the massive data it generates can indirectly reflect the living conditions of students at school.In the face of the campus card,how to quickly and accurately obtain the information required by users from the massive data sets has become an urgent problem that needs to be solved.This paper proposes a data mining algorithm based on K-Means clustering and time series.It analyzes the consumption data of a college student’s card to deeply mine and analyze the daily life consumer behavior habits of students,and to make an accurate judgment on the specific life consumer behavior.The algorithm proposed in this paper provides a practical reference for the construction of smart campuses in universities,and has important theoretical and application values.
文摘In this paper, an improved algorithm, named STC\|I, is proposed for Chinese Web page clustering based on Chinese language characteristics, which adopts a new unit choice principle and a novel suffix tree construction policy. The experimental results show that the new algorithm keeps advantages of STC, and is better than STC in precision and speed when they are used to cluster Chinese Web page.
文摘The backdoor or information leak of Web servers can be detected by using Web Mining techniques on some abnormal Web log and Web application log data. The security of Web servers can be enhanced and the damage of illegal access can be avoided. Firstly, the system for discovering the patterns of information leakages in CGI scripts from Web log data was proposed. Secondly, those patterns for system administrators to modify their codes and enhance their Web site security were provided. The following aspects were described: one is to combine web application log with web log to extract more information,so web data mining could be used to mine web log for discovering the information that firewall and Information Detection System cannot find. Another approach is to propose an operation module of web site to enhance Web site security. In cluster server session, Density -Based Clustering technique is used to reduce resource cost and obtain better efficiency.
文摘The integration of the two fast-developing scientific research areas Semantic Web and Web Mining is known as Semantic Web Mining. The huge increase in the amount of Semantic Web data became a perfect target for many researchers to apply Data Mining techniques on it. This paper gives a detailed state-of-the-art survey of on-going research in this new area. It shows the positive effects of Semantic Web Mining, the obstacles faced by researchers and propose number of approaches to deal with the very complex and heterogeneous information and knowledge which are produced by the technologies of Semantic Web.
文摘A new method for Web users fuzzy clustering based on analysis of user interest characteristic is proposed in this article. The method first defines page fuzzy categories according to the links on the index page of the site, then computes fuzzy degree of cross page through aggregating on data of Web log. After that, by using fuzzy comprehensive evaluation method, the method constructs user interest vectors according to page viewing times and frequency of hits, and derives the fuzzy similarity matrix from the interest vectors for the Web users. Finally, it gets the clustering result through the fuzzy clustering method. The experimental results show the effectiveness of the method.
文摘Rough set theory is a new soft computing tool, and has received much attention of researchers around the world. It can deal with incomplete and uncertain information. Now, it has been applied in many areas successfully. This paper introduces the basic concepts of rough set and discusses its applications in Web mining. In particular, some applications of rough set theory to intelligent information processing are emphasized.
基金supported by the National Natural Science Foundation of China (No. 61175052,60975039, 61203297, 60933004, 61035003)National High-tech R&D Program of China (863 Program) (No.2012AA011003)supported by the ZTE research found of Parallel Web Mining project
文摘Traditional machine-learning algorithms are struggling to handle the exceedingly large amount of data being generated by the internet. In real-world applications, there is an urgent need for machine-learning algorithms to be able to handle large-scale, high-dimensional text data. Cloud computing involves the delivery of computing and storage as a service to a heterogeneous community of recipients. Recently, it has aroused much interest in industry and academia. Most previous works on cloud platforms only focus on the parallel algorithms for structured data. In this paper, we focus on the parallel implementation of web-mining algorithms and develop a parallel web-mining system that includes parallel web crawler; parallel text extract, transform and load (ETL) and modeling; and parallel text mining and application subsystems. The complete system enables variable real-world web-mining applications for mass data.
文摘In this paper, visualization of special features in “The Tale of Genji”, which is a typical Japanese classical literature, is studied by text mining the auxiliary verbs and examining the similarity in the sentence style by the correspondence analysis with clustering. The result shows that the text mining error in the number of auxiliary verbs can be as small as 15%. The extracted feature in this study supports the multiple authors of “The Tale of Genji”, which agrees well with the result by Murakami and Imanishi [1]. It is also found that extracted features are robust to the text mining error, which suggests that the classification error is less affected by the text mining error and the possible use of this technique for further statistical study in classical literatures.
文摘Data mining is the powerful technique, which can be widely used for discovering the customers’ behaviors as well as customer’s preferences. As a result, it has been widely used in top level companies for evaluating their Customer Relationship Management (CRM) system today. In this study, a new K-means clustering method proposed to evaluate the cluster customers’ profitability in telecommunication industry in Sri Lanka. Furthermore, RFM model mainly used as an input variable for K-means clustering and distortion curve used to identify optimal number of initial clusters. Based on the results, telecommunication customers’ profitability in Sri Lanka mainly categorized into three levels.
文摘We combine the web usage mining and fuzzy clustering and give the concept of web fuzzy clustering, and then put forward the web fuzzy clustering processing model which is discussed in detail. Web fuzzy clustering can be used in the web users clustering and web pages clustering. In the end, a case study is given and the result has proved the feasibility of using web fuzzy clustering in web pages clustering.
文摘Conceptual clustering is mainly used for solving the deficiency and incompleteness of domain knowledge. Based on conceptual clustering technology and aiming at the institutional framework and characteristic of Web theme information, this paper proposes and implements dynamic conceptual clustering algorithm and merging algorithm for Web documents, and also analyses the super performance of the clustering algorithm in efficiency and clustering accuracy.
文摘Hierarchical clustering analysis based on statistic s is one of the most important mining algorithms, but the traditionary hierarchica l clustering method is based on global comparing, which only takes in Q clusteri ng while ignoring R clustering in practice, so it has some limitation especially when the number of sample and index is very large. Furthermore, because of igno ring the association between the different indexes, the clustering result is not good & true. In this paper, we present the model and the algorithm of two-level hierarchi cal clustering which integrates Q clustering with R clustering. Moreover, becaus e two-level hierarchical clustering is based on the respective clustering resul t of each class, the classification of the indexes directly effects on the a ccuracy of the final clustering result, how to appropriately classify the inde xes is the chief and difficult problem we must handle in advance. Although some literatures also have referred to the issue of the classificati on of the indexes, but the articles classify the indexes only according to their superficial signification, which is unscientific. The reasons are as follow s: First, the superficial signification of some indexes usually takes on different meanings and it is easy to be misapprehended by different person. Furthermore, t his classification method seldom make use of history data, the classification re sult is not so objective. Second, for some indexes, its superficial signification didn’t show any mean ings, so simply from the superficial signification, we can’t classify them to c ertain classes. Third, this classification method need the users have higher level knowledge of this field, otherwise it is difficult for the users to understand the signifi cation of some indexes, which sometimes is not available. So in this paper, to this question, we first use R clustering method to cluste ring indexes, dividing p dimension indexes into q classes, then adopt two-level clustering method to get the final result. Obviously, the classification result is more objective and accurate. Moreover, after the first step, we can get the relation of the different indexes and their interaction. We can also know under a certain class indexes, which samples can be clustering to a class. (These semi finished results sometimes are very useful.) The experiments also indicates the effective and accurate of the algorithms. And, the result of R clustering ca n be easily used for the later practice.