Traditional machine-learning algorithms are struggling to handle the exceedingly large amount of data being generated by the internet. In real-world applications, there is an urgent need for machine-learning algorithms that can handle large-scale, high-dimensional text data. Cloud computing involves the delivery of computing and storage as a service to a heterogeneous community of recipients, and it has recently aroused much interest in industry and academia. Most previous work on cloud platforms focuses only on parallel algorithms for structured data. In this paper, we focus on the parallel implementation of web-mining algorithms and develop a parallel web-mining system that includes a parallel web crawler; parallel text extract, transform and load (ETL) and modeling; and parallel text mining and application subsystems. The complete system enables a variety of real-world web-mining applications for mass data.
Web logs contain a lot of information about user activities on the Internet. How to mine user browsing interest patterns effectively is an important and challenging research topic. After analyzing the advantages and disadvantages of the present algorithms, we propose a new concept: support-interest. Its key insight is that visitors will backtrack if they do not find the information where they expect it, and the point from which they backtrack is the expected location for the page. We present the User Access Matrix and the corresponding algorithm for discovering such expected locations, which can handle page caching by the browser. Since the URL-URL matrix is a sparse matrix that can be represented by a list of 3-tuples, we can mine user preferred sub-paths by computing on this matrix. All the sub-paths are then merged to form user preferred paths. Experiments showed that the approach is accurate and scalable. It is suitable for website-based applications, such as optimizing a website's topological structure or designing personalized services. Key words: Web mining; user preferred path; Web-log; support-interest; personalized services. CLC number: TP 391. Foundation item: Supported by the National High Technology Development (863 program of China) (2001AA113182). Biography: ZHOU Hong-fang (1976-), female, Ph.D. candidate, research direction: data mining and knowledge discovery in databases.
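The sparse-matrix representation mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' support-interest algorithm: it only shows how a URL-URL transition matrix could be stored as a list of 3-tuples (source page, destination page, count). The function name `url_url_triples` and the sample sessions are hypothetical.

```python
from collections import Counter


def url_url_triples(sessions):
    """Count page-to-page transitions and return them as sorted
    (source, destination, count) 3-tuples, i.e. only the non-zero
    cells of the URL-URL matrix."""
    counts = Counter()
    for session in sessions:
        # Pair each page with the page visited immediately after it.
        for src, dst in zip(session, session[1:]):
            counts[(src, dst)] += 1
    return sorted((src, dst, n) for (src, dst), n in counts.items())


# Hypothetical sessions, each an ordered list of visited URLs.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/home"],
    ["/home", "/about"],
]

triples = url_url_triples(sessions)
for triple in triples:
    print(triple)
```

Only observed transitions are stored, so memory grows with the number of distinct page pairs in the log rather than with the square of the number of pages.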
With the explosive growth of information sources available on the World Wide Web, how to combine the results of multiple search engines has become an important problem. In this paper, a search strategy based on genetic simulated annealing for search engines in Web mining is proposed. According to the proposed strategy, there exist important relationships among Web statistical studies, search engines and optimization techniques. We have shown experimentally the relevance of our approach to the presented queries by comparing the quality of the output pages with that of the original downloaded pages. As the number of iterations increases, better results are obtained with reasonable execution time.
Web usage mining, content mining, and structure mining comprise the web mining process. Web-Page Recommendation (WPR) development by incorporating Data Mining Techniques (DMT) did not include end-users with improved performance in the obtained filtering results. The cluster user profile-based clustering process is delayed when it has a low precision rate. Markov Chain Monte Carlo-Dynamic Clustering (MC2-DC) is based on the User Behavior Profile (UBP) model and groups similar user behavior on a dynamic update of the UBP. The Reversible-Jump Concept (RJC) reviews the history with the updated UBP and moves to appropriate clusters. Hamilton's Filtering Framework (HFF) is designed to filter user data based on personalised information on the automatically updated UBP through the Search Engine (SE). The Hamilton Filtered Regime Switching User Query Probability (HFRSUQP) works forward the updated UBP for easy and accurate filtering of users' interests and improves WPR. A Probabilistic User Result Feature Ranking based on Gaussian Distribution (PURFR-GD) has been developed to rank user results in a web mining process. PURFR-GD decreases the delay time in the end-to-end workflow for SE personalization in various methods by using the Gaussian Distribution Function (GDF). The theoretical analysis and experiment results of the proposed MC2-DC method automatically increase the updated UBP accuracy by 18.78%. HFRSUQP enabled extensive Maximize Log-Likelihood (ML-L) increases to 15.28% of User Personalized Information Search Retrieval Rate (UPISRT). For feature ranking, the PURFR-GD model defines higher Classification Accuracy (CA) and Precision Ratio (PR) while utilising minimum Execution Time (ET). Furthermore, UPISRT's ranking performance has improved by 20%.
Because a data warehouse changes frequently, incremental data can invalidate knowledge that was mined earlier. In order to maintain the discovered knowledge and patterns dynamically, this study presents IPARUC, a novel algorithm for updating global frequent patterns. First, a rapid clustering method is introduced in IPARUC to divide the database into n parts, where the data within each part are similar. Then, the nodes in the tree are adjusted dynamically during insertion by "pruning and laying back" to keep them in descending frequency order, so that node sharing approaches the optimum. Finally, the local frequent itemsets mined from each local dataset are merged into global frequent itemsets. The results of the experimental study are very encouraging: IPARUC is more effective and efficient than the two other methods it was compared with. Furthermore, it has significant application potential in a prototype Web log analyzer for web usage mining, which can help discover useful knowledge effectively and even help managers make decisions.
A semantic session analysis method for partitioning Web usage logs is presented. A semantic Web usage log preparation model enhances usage logs with semantics. A Markov chain model based on ontology semantic measurement is used to identify which active session a request should belong to. The competitive method is applied to determine the end of the sessions. Compared with other algorithms, more successful sessions are additionally detected by semantic outlier analysis.
The structure of Web sites has become more complex than before. During the design period of a Web site, the lack of models and methods results in improper Web structure, which depends on the designer's experience. From the point of view of software engineering, every period in the software life cycle must be evaluated before the next period's work starts. It is therefore essential to find relevant methods for evaluating Web structure before the site is completed. In this work, after studying the related work on Web structure mining and analyzing the major structure mining methods (PageRank and Hub/Authority), a method based on PageRank for Web structure evaluation in the design stage is proposed. A Web structure modeling language, WSML, is designed, and the implementation strategies for a system that evaluates Web site structure are given. Web structure mining has previously been used mainly in search engines. This is the first time that Web structure mining technology is employed to evaluate a Web structure in the design period of a Web site. It contributes to the formalization of the design documents for Web sites and to the improvement of software engineering for large-scale Web sites, and the evaluation system is a practical tool for Web site construction.
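To make the PageRank-based evaluation idea concrete, the following Python sketch runs a plain power-iteration PageRank over a planned site structure. It is only an illustration under stated assumptions: the abstract does not describe WSML or the authors' evaluation procedure in detail, so the link map `site`, the damping factor 0.85 and the iteration count are all hypothetical choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the teleportation share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outgoing links): spread its rank evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical design-stage link structure of a small site.
site = {
    "index": ["news", "products", "contact"],
    "news": ["index"],
    "products": ["index", "contact"],
    "contact": ["index"],
}

ranks = pagerank(site)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In a design review, a page that is meant to be prominent but receives a low score would be a hint that the planned link structure does not match the designer's intent.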
To alleviate the scalability problem caused by increasing Web usage and changing user interests, this paper presents a novel Web usage mining algorithm: an Incremental Web Usage Mining algorithm based on Active Ant Colony Clustering. First, an active movement strategy for direction selection and speed, different from the positive strategy employed by other Ant Colony Clustering algorithms, is proposed to construct an Active Ant Colony Clustering algorithm. It avoids idle movement and the "flying over the plane" phenomenon, and effectively improves the quality and speed of clustering on large datasets. Then, a mechanism for decomposing clusters based on the above methods is introduced to form new clusters when users' interests change. Empirical studies on a real Web dataset show that the active ant colony clustering algorithm performs better than previous algorithms, and that the incremental approach based on the proposed mechanism can efficiently implement incremental Web usage mining.
In this paper, an improved algorithm, named STC-I, is proposed for Chinese Web page clustering based on Chinese language characteristics. It adopts a new unit choice principle and a novel suffix tree construction policy. The experimental results show that the new algorithm keeps the advantages of STC and is better than STC in precision and speed when clustering Chinese Web pages. Key words: clustering; suffix tree; Web mining. CLC number: TP 311. Foundation item: Supported by the National Information Industry Development Foundation of China. Biography: YANG Jian-wu (1973-), male, Ph.D., research direction: information retrieval and text mining.
A Web page typically contains many information blocks. Apart from the main content blocks, it usually has blocks such as navigation panels, copyright and privacy notices, and advertisements. We call these blocks noisy blocks. The noise in Web pages can seriously harm Web data mining. To eliminate this noise, we introduce a new tree structure, called the Style Tree, and study an algorithm for constructing a site style tree. The Style Tree Model is employed to detect and eliminate noise in any Web page of the site. An information-based measure to determine which element node is noisy is also constructed. In addition, the applications of this method are discussed in detail. Experimental results show that our noise elimination technique is able to improve the mining results significantly. Key words: noise elimination; DOM tree; style tree; Web mining. CLC number: TP 339. Foundation item: Supported by the National Natural Science Foundation of China (60003013). Biography: ZHAN Cheng-li (1979-), male, Master candidate, research direction: Intelligent Information System.
The fourth international conference on Web information systems and applications (WISA 2007) received 409 submissions and accepted 37 papers for publication in this issue. The papers cover broad research areas, including Web mining and data warehouse, Deep Web and Web integration, P2P networks, text processing and information retrieval, as well as Web Services and Web infrastructure. After briefly introducing the WISA conference, the survey outlines the current activities and future trends concerning Web information systems and applications based on the papers accepted for publication.
We combine web usage mining and fuzzy clustering to introduce the concept of web fuzzy clustering, and then put forward a web fuzzy clustering processing model, which is discussed in detail. Web fuzzy clustering can be used for clustering both web users and web pages. Finally, a case study is given, and the result demonstrates the feasibility of using web fuzzy clustering for web page clustering. Key words: web mining; web usage mining; web fuzzy clustering; WFCM. CLC number: TP 391. Foundation item: Supported by the National Natural Science Foundation of China (90104005). Biography: LIU Mao-fu (1977-), male, Ph.D. candidate, research direction: artificial intelligence, web mining, image mining.
A large amount of data is present on the web that can be used for purposes such as product recommendation, price comparison and demand forecasting for a particular product. Websites are designed for human understanding and not for machines. Therefore, making the data machine-readable requires techniques to grab data from web pages. Researchers have addressed the problem using two approaches: knowledge engineering and machine learning. State-of-the-art knowledge engineering approaches use the structure of documents, visual cues, clustering of attributes of data records and text processing techniques to identify data records on a web page. Machine learning approaches use annotated pages to learn rules, which are then used to extract data from unseen web pages. The structure of web documents is continuously evolving, so new techniques are needed to handle the emerging requirements of web data extraction. In this paper, we present a novel, simple and efficient technique to extract data from web pages using the visual styles and structure of documents. The proposed technique detects the Rich Data Region (RDR) using the query and correlative words of the query. The RDR is then divided into data records using style similarity. Noisy elements are removed using a Common Tag Sequence (CTS) and formatting entropy. The system is implemented in Java and runs on a dataset of real-world working websites. The effectiveness of the results is evaluated using precision, recall, and F-measure and compared with five existing systems. The comparison of the proposed technique with existing systems has shown encouraging results.
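The evaluation metrics named in this abstract can be computed as in the short Python sketch below. The abstract reports no concrete figures, so the record identifiers and the two sets are made-up illustration data, and the function name `precision_recall_f1` is hypothetical. The F-measure shown is the balanced F1 variant.

```python
def precision_recall_f1(extracted, relevant):
    """Compare extracted data records against the ground-truth records.

    precision = correct extracted / all extracted
    recall    = correct extracted / all relevant
    F1        = harmonic mean of precision and recall
    """
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: 4 records extracted, 5 records actually on the page.
extracted_records = {"r1", "r2", "r3", "r4"}
true_records = {"r1", "r2", "r3", "r5", "r6"}

p, r, f = precision_recall_f1(extracted_records, true_records)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```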
A new method for fuzzy clustering of Web users based on analysis of user interest characteristics is proposed in this article. The method first defines fuzzy page categories according to the links on the index page of the site, then computes the fuzzy degree of cross pages by aggregating Web log data. After that, using the fuzzy comprehensive evaluation method, it constructs user interest vectors according to page viewing times and frequency of hits, and derives the fuzzy similarity matrix for the Web users from the interest vectors. Finally, it obtains the clustering result through the fuzzy clustering method. The experimental results show the effectiveness of the method. Key words: Web log mining; fuzzy similarity matrix; fuzzy comprehensive evaluation; fuzzy clustering. CLC number: TP18; TP311; TP391. Foundation item: Supported by the Natural Science Foundation of Heilongjiang Province of China (F0304). Biography: ZHAN Li-qiang (1966-), male, Lecturer, Ph.D., research direction: the theory and methods of data mining and theory of database.
With the increasing popularity and complexity of Web applications and the emergence of their new characteristics, the testing and maintenance of large, complex Web applications are becoming more difficult. Web applications generally contain many pages and are used by an enormous number of users. Statistical testing is an effective way of ensuring their quality. Web usage can be accurately described by a Markov chain, which has been proved to be an ideal model for software statistical testing. The results of unit testing can be utilized in the later stages, which is an important strategy for bottom-to-top integration testing, and the other improvement of the extended Markov chain model (EMM) is to present the error type vector, which is treated as a part of the page node. This paper also proposes an algorithm for generating test cases of usage paths. Finally, optional usage reliability evaluation methods and an incremental usability regression testing model for testing and evaluation are presented. Key words: statistical testing; evaluation for Web usability; extended Markov chain model (EMM); Web log mining; reliability evaluation. CLC number: TP311.5. Foundation item: Supported by the National Defence Research Project (No. 41315. 9. 2) and National Science and Technology Plan (2001BA102A04-02-03). Biography: MAO Cheng-ying (1978-), male, Ph.D. candidate, research direction: software testing. Research direction: advanced database system, software testing, component technology and data mining.
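The statement that Web usage can be described by a Markov chain can be sketched in Python as follows. This shows only the basic first-order chain estimated from logged usage paths, not the extended model (EMM) or its error type vector, which the abstract does not specify. The page names and paths are hypothetical.

```python
from collections import defaultdict


def transition_probabilities(paths):
    """Estimate first-order Markov transition probabilities from usage paths.

    Returns a nested dict: probs[src][dst] = P(next page is dst | current page is src).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, row in counts.items():
        total = sum(row.values())
        probs[src] = {dst: n / total for dst, n in row.items()}
    return probs


# Hypothetical usage paths reconstructed from a Web log.
paths = [
    ["start", "login", "search", "exit"],
    ["start", "login", "profile", "exit"],
    ["start", "search", "exit"],
    ["start", "login", "search", "exit"],
]

P = transition_probabilities(paths)
for src, row in P.items():
    print(src, row)
```

In statistical testing, such probabilities could be used to sample test paths in proportion to how often real users follow them, so that testing effort concentrates on the most frequently used parts of the application.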
The task of clustering Web sessions is to group Web sessions based on similarity, maximizing the intra-group similarity while minimizing the inter-group similarity. The first and foremost question to be considered in clustering Web sessions is how to measure the similarity between Web sessions. However, traditional measurements have many shortcomings. This paper introduces a new method for measuring similarities between Web pages that takes into account not only the URL but also the viewing time of the visited Web page. We then give a new method to measure the similarity of Web sessions using sequence alignment and the similarity of Web page accesses in detail. Experiments have shown that our method is valid and efficient.
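A rough Python sketch of the idea follows. The abstract does not give the authors' formulas, so everything here is an assumption made for illustration: `page_similarity` multiplies the fraction of shared leading URL path segments by the ratio of viewing times, and `session_similarity` uses standard Needleman-Wunsch global alignment with a hypothetical gap penalty of -0.5, normalized by the length of the longer session.

```python
def page_similarity(a, b):
    """Similarity in [0, 1] of two page accesses given as (url, seconds).

    Hypothetical measure: shared leading path segments times viewing-time ratio.
    """
    (url_a, time_a), (url_b, time_b) = a, b
    seg_a = [s for s in url_a.split("/") if s]
    seg_b = [s for s in url_b.split("/") if s]
    shared = 0
    for x, y in zip(seg_a, seg_b):
        if x != y:
            break
        shared += 1
    longest = max(len(seg_a), len(seg_b))
    url_sim = shared / longest if longest else 1.0
    slowest = max(time_a, time_b)
    time_sim = min(time_a, time_b) / slowest if slowest > 0 else 1.0
    return url_sim * time_sim


def session_similarity(s1, s2, gap=-0.5):
    """Needleman-Wunsch global alignment score of two sessions.

    Aligning two accesses scores 2 * page_similarity - 1 (between -1 and 1);
    the total is divided by the longer session length, so identical sessions give 1.0.
    """
    if not s1 or not s2:
        return 0.0
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            match = score[i - 1][j - 1] + 2 * page_similarity(s1[i - 1], s2[j - 1]) - 1
            score[i][j] = max(match, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1] / max(len(s1), len(s2))


# Hypothetical sessions: lists of (url, viewing time in seconds).
session_a = [("/shop/books", 30), ("/shop/books/python", 60), ("/cart", 10)]
session_b = [("/shop/books", 20), ("/shop/music", 45), ("/cart", 10)]

sim_aa = session_similarity(session_a, session_a)
sim_ab = session_similarity(session_a, session_b)
print(sim_aa, sim_ab)
```

The resulting pairwise scores could then be fed to any clustering method that accepts a similarity matrix.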
The content-ignorant clustering method has advantages in time complexity and space complexity over content-based methods. In this paper, the authors introduce a unified expanding method for content-ignorant web page clustering by mining the "click-through" log, which tries to solve the problem that the "click-through" log is sparse. The relationship between two nodes that have been expanded is also defined and optimized. Analysis and experiments show that the new method performs better than the standard content-ignorant method. The new method can also work without iterative clustering.
In this era of a data-driven society, useful data (Big Data) is often unintentionally ignored due to the lack of convenient tools and the expense of software. For example, web log files can be used to identify explicit information about browsing patterns when users access web sites. Some hidden information, however, cannot be directly derived from the log files. We may need external resources to discover more knowledge from browsing patterns. The purpose of this study is to investigate the application of web usage mining based on web log files. The outcome of this study sets further directions for this investigation on what implicit information embedded in log files can be extracted, and how it can be extracted efficiently and effectively. Further work involves combining the use of social media data to improve business decision quality.
The massive number of web videos creates an imperative demand for efficiently grasping the major events. However, the distinct characteristics of web videos, such as the limited number of features, the noisy text information, and the unavoidable errors in near-duplicate keyframe (NDK) detection, make web video event mining a challenging task. In this paper, we propose a novel four-stage framework to improve the performance of web video event mining. Data preprocessing is the first stage. Multiple Correspondence Analysis (MCA) is then applied to explore the correlation between terms and classes, aiming to bridge the gap between NDKs and high-level semantic concepts. Next, co-occurrence information is used to detect the similarity between NDKs and classes using the NDK-within-video information. Finally, both of them are integrated for web video event mining through negative NDK pruning and positive NDK enhancement. Moreover, both NDKs and terms with relatively low frequencies are treated as useful information in our experiments. Experimental results on large-scale web videos from YouTube demonstrate that the proposed framework outperforms several existing mining methods and obtains good results for web video event mining.
Funding (parallel web-mining system paper): Supported by the National Natural Science Foundation of China (Nos. 61175052, 60975039, 61203297, 60933004, 61035003), the National High-tech R&D Program of China (863 Program) (No. 2012AA011003), and the ZTE research fund of the Parallel Web Mining project.
Funding (genetic simulated annealing search strategy paper): Supported by the National Natural Science Foundation of China (60673093).
Funding (MC2-DC web-page recommendation paper): This study was supported through Taif University Researchers Supporting Project number (TURSP-2020/115), Taif University, Taif, Saudi Arabia.
Funding (IPARUC frequent-pattern updating paper): Supported by the National Natural Science Foundation of China (60472099) and the Ningbo Natural Science Foundation (2006A610017).
Funding (semantic session analysis paper): Supported by the Huo Yingdong Education Foundation of China (91101).
Funding (Active Ant Colony Clustering paper): Supported by the Natural Science Foundation of Jiangsu Province (BK2005046).
文摘To alleviate the scalability problem caused by the increasing Web using and changing users' interests, this paper presents a novel Web Usage Mining algorithm-Incremental Web Usage Mining algorithm based on Active Ant Colony Clustering. Firstly, an active movement strategy about direction selection and speed, different with the positive strategy employed by other Ant Colony Clustering algorithms, is proposed to construct an Active Ant Colony Clustering algorithm, which avoid the idle and "flying over the plane" moving phenomenon, effectively improve the quality and speed of clustering on large dataset. Then a mechanism of decomposing clusters based on above methods is introduced to form new clusters when users' interests change. Empirical studies on a real Web dataset show the active ant colony clustering algorithm has better performance than the previous algorithms, and the incremental approach based on the proposed mechanism can efficiently implement incremental Web usage mining.
Abstract: In this paper, an improved algorithm, named STC-I, is proposed for Chinese Web page clustering based on Chinese language characteristics; it adopts a new unit choice principle and a novel suffix tree construction policy. The experimental results show that the new algorithm keeps the advantages of STC and is better than STC in precision and speed when both are used to cluster Chinese Web pages. Key words: clustering; suffix tree; Web mining. CLC number: TP 311. Foundation item: Supported by the National Information Industry Development Foundation of China. Biography: YANG Jian-wu (1973-), male, Ph. D, research direction: information retrieval and text mining.
Abstract: A Web page typically contains many information blocks. Apart from the main content blocks, it usually has blocks such as navigation panels, copyright and privacy notices, and advertisements. We call these blocks noisy blocks. The noise in Web pages can seriously harm Web data mining. To eliminate this noise, we introduce a new tree structure, called the Style Tree, and study an algorithm for constructing a site style tree. The Style Tree Model is employed to detect and eliminate noise in any Web page of the site. An information-based measure for determining which element nodes are noisy is also constructed. In addition, the applications of this method are discussed in detail. Experimental results show that our noise elimination technique is able to improve the mining results significantly. Key words: noise elimination; DOM tree; style tree; Web mining. CLC number: TP 339. Foundation item: Supported by the National Natural Science Foundation of China (60003013). Biography: ZHAN Cheng-li (1979-), male, Master candidate, research direction: Intelligent Information System.
Abstract: The fourth international conference on Web information systems and applications (WISA 2007) received 409 submissions and accepted 37 papers for publication in this issue. The papers cover broad research areas, including Web mining and data warehouses, the Deep Web and Web integration, P2P networks, text processing and information retrieval, as well as Web Services and Web infrastructure. After briefly introducing the WISA conference, the survey outlines the current activities and future trends concerning Web information systems and applications based on the papers accepted for publication.
Abstract: We combine web usage mining and fuzzy clustering to define the concept of web fuzzy clustering, and then put forward a web fuzzy clustering processing model, which is discussed in detail. Web fuzzy clustering can be used for clustering web users and web pages. Finally, a case study is given, and the result demonstrates the feasibility of using web fuzzy clustering for web page clustering. Key words: web mining; web usage mining; web fuzzy clustering; WFCM. CLC number: TP 391. Foundation item: Supported by the National Natural Science Foundation of China (90104005). Biography: LIU Mao-fu (1977-), male, Ph. D candidate, research direction: artificial intelligence, web mining, image mining.
Abstract: A large amount of data is present on the web that can be used for purposes such as product recommendation, price comparison, and demand forecasting for a particular product. Websites are designed for human understanding and not for machines. Therefore, making the data machine-readable requires techniques for extracting data from web pages. Researchers have addressed the problem using two approaches: knowledge engineering and machine learning. State-of-the-art knowledge engineering approaches use the structure of documents, visual cues, clustering of attributes of data records, and text processing techniques to identify data records on a web page. Machine learning approaches use annotated pages to learn rules, which are then used to extract data from unseen web pages. The structure of web documents is continuously evolving, so new techniques are needed to handle the emerging requirements of web data extraction. In this paper, we present a novel, simple, and efficient technique for extracting data from web pages using the visual styles and structure of documents. The proposed technique detects the Rich Data Region (RDR) using the query and correlative words of the query. The RDR is then divided into data records using style similarity. Noisy elements are removed using a Common Tag Sequence (CTS) and formatting entropy. The system is implemented in Java and runs on a dataset of real-world working websites. The effectiveness of the results is evaluated using precision, recall, and F-measure and compared with five existing systems. The comparison of the proposed technique with existing systems has shown encouraging results.
Abstract: A new method for fuzzy clustering of Web users based on analysis of user interest characteristics is proposed in this article. The method first defines fuzzy page categories according to the links on the index page of the site, then computes the fuzzy degree of cross pages by aggregating Web log data. After that, using the fuzzy comprehensive evaluation method, it constructs user interest vectors according to page viewing times and hit frequencies, and derives the fuzzy similarity matrix for the Web users from the interest vectors. Finally, it obtains the clustering result through the fuzzy clustering method. The experimental results show the effectiveness of the method. Key words: Web log mining; fuzzy similarity matrix; fuzzy comprehensive evaluation; fuzzy clustering. CLC number: TP18; TP311; TP391. Foundation item: Supported by the Natural Science Foundation of Heilongjiang Province of China (F0304). Biography: ZHAN Li-qiang (1966-), male, Lecturer, Ph. D., research direction: the theory and methods of data mining and theory of databases.
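The last two steps of the pipeline (interest vectors to fuzzy similarity matrix, then fuzzy clustering) can be sketched as follows. The abstract does not specify the similarity measure or the clustering procedure, so this sketch assumes a min/max ratio similarity and the textbook transitive-closure method with a lambda-cut; the function names and the sample interest vectors are hypothetical.

```python
def fuzzy_similarity(vectors):
    """Fuzzy similarity matrix using the min/max ratio (an assumed measure)."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            num = sum(min(a, b) for a, b in zip(vectors[i], vectors[j]))
            den = sum(max(a, b) for a, b in zip(vectors[i], vectors[j]))
            sim[i][j] = num / den if den else 1.0
    return sim


def maxmin_compose(r):
    """Max-min composition of a fuzzy relation with itself."""
    n = len(r)
    return [[max(min(r[i][k], r[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(r):
    """Compose repeatedly until the relation stops changing."""
    while True:
        composed = maxmin_compose(r)
        if composed == r:
            return r
        r = composed


def lambda_cut_clusters(closure, lam):
    """Group users whose closure similarity is at least lam."""
    clusters, assigned = [], set()
    for i in range(len(closure)):
        if i in assigned:
            continue
        members = [j for j in range(len(closure)) if closure[i][j] >= lam]
        clusters.append(members)
        assigned.update(members)
    return clusters


# Hypothetical interest vectors: one row per user, one column per page category.
users = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.3],
    [0.0, 0.8, 0.4],
]
closure = transitive_closure(fuzzy_similarity(users))
clusters = lambda_cut_clusters(closure, lam=0.5)
print(clusters)
```

Lowering `lam` merges clusters and raising it splits them, so the threshold controls the granularity of the user groups.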
Abstract: With the increasing popularity and complexity of Web applications and the emergence of their new characteristics, testing and maintaining large, complex Web applications is becoming more difficult. Web applications generally contain many pages and are used by an enormous number of users. Statistical testing is an effective way of ensuring their quality. Web usage can be accurately described by a Markov chain, which has been proved to be an ideal model for software statistical testing. In the extended Markov chain model (EMM), the results of unit testing can be reused in later stages, which is an important strategy for bottom-up integration testing; the other improvement of the EMM is an error type vector that is treated as part of a page node. This paper also proposes an algorithm for generating test cases from usage paths. Finally, optional usage reliability evaluation methods and an incremental usability regression testing model for testing and evaluation are presented. Key words: statistical testing; evaluation for Web usability; extended Markov chain model (EMM); Web log mining; reliability evaluation. CLC number: TP311.5. Foundation item: Supported by the National Defence Research Project (No. 41315.9.2) and National Science and Technology Plan (2001BA102A04-02-03). Biography: MAO Cheng-ying (1978-), male, Ph.D. candidate, research direction: software testing. Research direction: advanced database system, software testing, component technology and data mining.
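A minimal sketch of a plain Markov chain usage model estimated from logged sessions, with test paths sampled from it. It does not implement the paper's EMM extensions (unit-test results or error type vectors); the `START` and `EXIT` pseudo-states, the function names, and the sample sessions are assumptions for illustration.

```python
import random
from collections import defaultdict


def build_usage_chain(sessions):
    """Estimate page-to-page transition probabilities from sessions."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        path = ["START"] + list(session) + ["EXIT"]
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }


def generate_test_path(chain, rng, max_len=50):
    """Sample one usage path to serve as a statistical test case."""
    path, state = [], "START"
    while len(path) < max_len:
        nexts = chain[state]
        state = rng.choices(list(nexts), weights=list(nexts.values()))[0]
        if state == "EXIT":
            break
        path.append(state)
    return path


# Hypothetical sessions reconstructed from a Web log.
sessions = [
    ["index", "search", "item"],
    ["index", "item"],
    ["index", "search", "search", "item"],
]
chain = build_usage_chain(sessions)
test_path = generate_test_path(chain, random.Random(0))
print(chain["index"])
print(test_path)
```

Because paths are sampled according to observed transition frequencies, frequently used navigation sequences are exercised more often than rare ones, which is the point of statistical testing.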
Funding: Supported by the Foundation of Hubei Key Technology Research and Development (2005AA101C18) and the Natural Science Foundation of South-Central University for Nationalities (YZY06009)
Abstract: The task of clustering Web sessions is to group Web sessions based on similarity, maximizing intra-group similarity while minimizing inter-group similarity. The first and foremost question to be considered in clustering Web sessions is how to measure the similarity between Web sessions. However, traditional measurements have many shortcomings. This paper introduces a new method for measuring similarity between Web pages that takes into account not only the URL but also the viewing time of the visited Web page. We then describe in detail a new method for measuring the similarity of Web sessions using sequence alignment and the similarity of Web page accesses. Experiments have shown that our method is valid and efficient.
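A minimal sketch of a session similarity that combines URL and viewing time at the page level and aligns sessions with global sequence alignment. The abstract gives no formulas, so everything specific here is an assumption: URL similarity as the shared path prefix ratio, time similarity as a min/max ratio, their product as page similarity, a Needleman-Wunsch style alignment with a gap penalty of -1, and normalization by the longer session length.

```python
def url_similarity(u1, u2):
    """Fraction of leading path segments shared by two URL paths."""
    p1 = [s for s in u1.strip("/").split("/") if s]
    p2 = [s for s in u2.strip("/").split("/") if s]
    if not p1 and not p2:
        return 1.0
    common = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        common += 1
    return common / max(len(p1), len(p2))


def page_similarity(visit1, visit2):
    """Combine URL similarity with viewing-time similarity (both in [0, 1])."""
    (u1, t1), (u2, t2) = visit1, visit2
    longest = max(t1, t2)
    time_sim = min(t1, t2) / longest if longest > 0 else 1.0
    return url_similarity(u1, u2) * time_sim


def session_similarity(s1, s2, gap=-1.0):
    """Global alignment score of two sessions, scaled to [0, 1]."""
    n, m = len(s1), len(s2)
    if n == 0 or m == 0:
        return 1.0 if n == m else 0.0
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Map page similarity from [0, 1] to a match score in [-1, 1].
            match = score[i - 1][j - 1] + 2.0 * page_similarity(s1[i - 1], s2[j - 1]) - 1.0
            score[i][j] = max(match, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return max(0.0, score[n][m]) / max(n, m)


# Hypothetical sessions: lists of (URL path, viewing time in seconds).
a = [("/shop/books/1", 30), ("/shop/cart", 10)]
b = [("/news/today", 5)]
c = [("/shop/books/2", 20), ("/shop/cart", 10)]
print(session_similarity(a, a), session_similarity(a, b), session_similarity(a, c))
```

The resulting pairwise similarities could feed any clustering method that accepts a similarity matrix.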
Abstract: Content-ignorant clustering methods have advantages in time and space complexity over content-based methods. In this paper, the authors introduce a unified expanding method for content-ignorant web page clustering by mining the "click-through" log, which addresses the problem that the "click-through" log is sparse. The relationship between two nodes that have been expanded is also defined and optimized. Analysis and experiments show that the new method performs better than the standard content-ignorant method. The new method can also work without iterative clustering.
Funding: Supported by Royal Thai Government Scholarship; Faculty of IT, Monash University, Resources Support
Abstract: In this era of a data-driven society, useful data (Big Data) is often unintentionally ignored because convenient tools are lacking and software is expensive. For example, web log files can be used to identify explicit information about browsing patterns when users access web sites. Some hidden information, however, cannot be directly derived from the log files. We may need external resources to discover more knowledge from browsing patterns. The purpose of this study is to investigate the application of web usage mining based on web log files. The outcome of this study sets further directions for investigating what implicit information embedded in log files can be extracted, and how to extract it efficiently and effectively. Further work involves combining the use of social media data to improve business decision quality.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61373121, 61071184, 60972111, 61036008; the Research Funds for the Doctoral Program of Higher Education of China under Grant No. 20100184120009; the Program for Sichuan Provincial Science Fund for Distinguished Young Scholars under Grant Nos. 2012JQ0029, 13QNJJ0149; the Fundamental Research Funds for the Central Universities of China under Grant Nos. SWJTU09CX032, SWJTU10CX08; and the Program of China Scholarships Council under Grant No. 201207000050
Abstract: The massive volume of web videos creates an imperative demand for efficiently grasping major events. However, the distinct characteristics of web videos, such as the limited number of features, noisy text information, and unavoidable errors in near-duplicate keyframe (NDK) detection, make web video event mining a challenging task. In this paper, we propose a novel four-stage framework to improve the performance of web video event mining. Data preprocessing is the first stage. Multiple Correspondence Analysis (MCA) is then applied to explore the correlation between terms and classes, with the aim of bridging the gap between NDKs and high-level semantic concepts. Next, co-occurrence information is used to detect the similarity between NDKs and classes using the NDK-within-video information. Finally, both are integrated for web video event mining through negative NDK pruning and positive NDK enhancement. Moreover, both NDKs and terms with relatively low frequencies are treated as useful information in our experiments. Experimental results on large-scale web videos from YouTube demonstrate that the proposed framework outperforms several existing mining methods and obtains good results for web video event mining.