Funding: Supported by the National Natural Science Foundation of China (60373066, 60503020), the Outstanding Young Scientist's Fund (60425206), and the Doctor Foundation of Nanjing University of Posts and Telecommunications (2003-02).
Abstract: This paper proposes a new feature selection approach for text categorization (TC) based on a measure of independence between features. It first discusses the fundamental hypothesis, widely used in probabilistic TC models, that the occurrences of terms in documents are independent of each other. However, this basic hypothesis is incomplete with respect to the independence of the feature set. From the viewpoint of feature selection, a new measure of independence between features is designed, and a feature selection algorithm built on it is given to obtain a feature subset. The selected subset is highly relevant to the categories and strongly independent among its features, so it satisfies the basic hypothesis to the greatest degree. In experiments on the 20 Newsgroups benchmark dataset, the feature subset selected by our method outperforms those of traditional TC feature selection methods, which take only relevance into account.
Funding: Supported by the National Natural Science Foundation of China (60503020, 60373066), the Outstanding Young Scientist's Fund (60425206), the Natural Science Foundation of Jiangsu Province (BK2005060), and the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University.
Abstract: Feature selection methods have been applied successfully to text categorization but seldom to text clustering, because class label information is unavailable. This paper proposes a new feature selection method for text clustering based on expectation maximization and cluster validity. It applies a supervised feature selection method to the intermediate clustering results generated during iterative clustering, and uses the Davies-Bouldin index to evaluate the intermediate feature subsets indirectly. Feature subsets are then selected according to the curve of the Davies-Bouldin index. Experiments on several popular datasets show the advantages of the proposed method.
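As an illustration of the cluster-validity measure named in this abstract, the following minimal sketch computes the Davies-Bouldin index in plain Python. It is not the paper's implementation: the function names `centroid` and `davies_bouldin` are hypothetical, Euclidean distance is assumed, and the code assumes at least two non-empty clusters with distinct centroids. Lower values indicate more compact, better separated clusters.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def davies_bouldin(clusters):
    """Davies-Bouldin index of a clustering (lower is better).

    clusters: list of clusters, each a non-empty list of numeric vectors.
    Assumes at least two clusters whose centroids are distinct.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter of a cluster: mean distance of its points to its centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity ratio of cluster i against every other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
loose = [[(0, 0), (0, 4)], [(3, 0), (3, 4)]]
print(davies_bouldin(tight) < davies_bouldin(loose))
```

In a setting like the one the abstract describes, such a score could be recomputed for each intermediate feature subset and plotted as a curve; how the paper reads that curve is not specified here.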
Funding: Supported by the National Natural Science Foundation of China (60425206, 60503033), the National Basic Research Program of China (973 Program, 2002CB312000), and the Opening Foundation of State Key Laboratory of Software Engineering in Wuhan University.
Abstract: This paper proposes a data-flow testing method for Web services composition. First, to facilitate data-flow analysis and constraint collection, the existing model representation of the Business Process Execution Language (BPEL) is modified together with an analysis of data dependency, and an exact representation of dead path elimination (DPE) is proposed, which overcomes the difficulties that DPE brings to data-flow analysis. Then definition and use information based on data-flow rules is collected by parsing BPEL and Web Services Description Language (WSDL) documents, and a def-use annotated control flow graph is created. Based on this model, data-flow anomalies that indicate potential errors can be discovered by traversing the paths of the graph, and the all-du-paths used in dynamic data-flow testing of Web services composition are generated automatically. Testers can then design test cases according to the constraints collected for each selected path.
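The idea of finding data-flow anomalies by walking a path of def-use annotations can be sketched as follows. This is a simplified illustration, not the paper's algorithm: the event encoding (`"def"` / `"use"` tuples), the function name `path_anomalies`, and the two anomaly kinds checked here (a redefinition with no intervening use, and a use with no prior definition) are assumptions chosen for brevity, and the variable names are made up.

```python
def path_anomalies(path):
    """Scan one execution path for simple data-flow anomalies.

    path: list of (action, variable) pairs, where action is "def" or "use".
    Returns a list of (kind, variable, position) tuples, where kind is
      "dd": variable defined again before the previous definition was used
      "ur": variable used before any definition on this path
    """
    state = {}  # variable -> "defined" (not yet used) or "used"
    anomalies = []
    for position, (action, var) in enumerate(path):
        if action == "def":
            if state.get(var) == "defined":
                anomalies.append(("dd", var, position))
            state[var] = "defined"
        elif action == "use":
            if var not in state:
                anomalies.append(("ur", var, position))
            else:
                state[var] = "used"
    return anomalies


# Hypothetical path through a composed service.
example_path = [
    ("def", "amount"),
    ("def", "amount"),  # overwritten before being read
    ("use", "amount"),
    ("use", "rate"),    # never defined on this path
]
print(path_anomalies(example_path))
```

A full analysis would run a check of this kind over every path of the def-use annotated control flow graph rather than over a single hand-written path.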
Funding: Supported by the National Natural Science Foundation of China (60503020, 60503033, 60703086), the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University (KJS0714), the Research Foundation of Nanjing University of Posts and Telecommunications (NY207052, NY207082), and the Natural Science Foundation of Jiangsu Province (BK2006094).
Abstract: A new common-phrase scoring method is proposed based on term frequency-inverse document frequency (TFIDF) and the independence of the phrase. Combining the two properties helps identify more reasonable common phrases, which improves clustering accuracy. An equation for measuring the independence of a phrase is also proposed. The new algorithm, which improves on the suffix tree clustering (STC) algorithm, is named improved suffix tree clustering (ISTC). To validate the proposed algorithm, a prototype system is implemented and used to cluster several groups of Web search results obtained from the Google search engine. Experimental results show that the improved algorithm achieves higher accuracy than traditional suffix tree clustering.
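For readers unfamiliar with the TFIDF component of the score, here is a minimal sketch of one common TF-IDF variant applied to phrases. It covers only the TFIDF part; the paper's independence equation is not given in the abstract and is not reproduced. The function name `tfidf`, the representation of a document as a list of phrases, and the sample data are assumptions for illustration.

```python
import math


def tfidf(phrase, doc, corpus):
    """TF-IDF of a phrase in one document (natural log, length-normalized TF).

    doc: list of phrases occurring in the document (with repeats).
    corpus: list of such documents, including doc.
    Returns 0.0 if the phrase appears in no document.
    """
    tf = doc.count(phrase) / len(doc)
    df = sum(1 for d in corpus if phrase in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Hypothetical search-result snippets, already split into phrases.
corpus = [
    ["web search", "suffix tree"],
    ["web search", "clustering"],
    ["web search", "suffix tree", "suffix tree"],
]
print(tfidf("web search", corpus[0], corpus))   # phrase present in every document
print(tfidf("suffix tree", corpus[2], corpus))  # phrase present in some documents
```

A phrase that occurs in every document scores zero under this variant, which matches the intuition that it does not help distinguish clusters.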
Abstract: The Fourth International Conference on Web Information Systems and Applications (WISA 2007) received 409 submissions and accepted 37 papers for publication in this issue. The papers cover broad research areas, including Web mining and data warehousing, the Deep Web and Web integration, P2P networks, text processing and information retrieval, and Web services and Web infrastructure. After briefly introducing the WISA conference, this survey outlines current activities and future trends in Web information systems and applications based on the accepted papers.
Funding: Supported by the National Natural Science Foundation of China (60425206).
Abstract: Many high-performance database servers are becoming idle as tax data is integrated into country-level tax data centers. We use these idle servers to set up a provincial tax Grid based on the Open Grid Services Architecture (OGSA). We put forward practical methods to integrate databases, to define and create basic modular Grid services, and to apply agents to manage Grid services. The scheme follows a service-oriented architecture (SOA) and avoids wasting resources. Tests show that it greatly improves the quality of tax services.
Funding: Supported by the National Science Foundation for Distinguished Young Scholars (60425206), the National Natural Science Foundation of China (90412003, 60373066, 60403016, 60503033), and the National Basic Research Program of China (973 Program, 2002CB312000).
Abstract: This paper proposes an extended system dependence graph, called AspectSDG, to represent control and data dependences in AspectC++ programs, and presents an approach for constructing it. The approach decomposes an aspect-oriented program into three parts: component code, aspect code, and weaving code. It constructs a program dependence graph (PDG) for each part and then connects the PDGs at call sites to form the complete AspectSDG. The AspectSDG handles advice precedence correctly and represents the additional dependences introduced by aspect code. Based on this model, we show how to compute a static slice of an AspectC++ program.
Funding: The National Natural Science Foundation of China (No. 51375217).
Abstract: To achieve sparse sampling of a coded ultrasonic signal, the finite rate of innovation (FRI) sparse sampling technique is applied to a binary frequency-coded (BFC) ultrasonic signal. A framework of FRI-based sparse sampling for an ultrasonic signal pulse is presented. Differences between the pulse and the coded ultrasonic signal are analyzed, and a mathematical response model of the coded ultrasonic signal is established. A time-domain transform algorithm, called the high-order moment method, is applied to obtain a pulse stream signal that assists sparse sampling of the BFC ultrasonic signal. The pulse stream signal is modulated by a sampling kernel, and the output signal is then sampled at a uniform interval. FRI-based sparse sampling is performed with a self-made circuit on an aluminum alloy sample. Experimental results show that the sampling rate is reduced to 0.5 MHz, compared with at least 12.8 MHz in the Nyquist sampling mode. The echo peak amplitude and the time of flight are estimated from the sparse sampling data with maximum errors of 9.324% and 0.031%, respectively. This research can provide a theoretical basis and a practical application reference for reducing the sampling rate and data volume in coded ultrasonic testing.
Funding: Supported by the National Natural Science Foundation of China (60503020, 60503033, 60373066, 60403016) and the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University.
Abstract: Filtering and selectively retrieving the large amount of information on the Internet is of practical significance for improving the quality of users' access to information. Based on an analysis of existing user interest models and of some basic questions about user interest (its representation, derivation, and identification), a user interest model based on a Bayesian network is given. In this model, a user interest reduction algorithm based on the Markov blanket model is used to reduce interest noise, and documents that the user is and is not interested in are then used to train the Bayesian network. Compared with the simple model, this model has the advantages of small space requirements, a simple reasoning method, and a high recognition rate. The experimental results show that this model reflects the user's interest more appropriately and offers higher performance and good usability.
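The Markov blanket mentioned in this abstract has a standard definition in a Bayesian network: a node's parents, its children, and the other parents of its children. The sketch below computes it from a parent map. It illustrates the concept only and is not the paper's reduction algorithm; the function name `markov_blanket` and the toy network are hypothetical.

```python
def markov_blanket(node, parents):
    """Markov blanket of a node in a directed acyclic graph.

    parents: dict mapping every node to a list of its parent nodes.
    Returns the set of the node's parents, children, and co-parents.
    """
    children = {c for c, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]}
    return (set(parents[node]) | children | co_parents) - {node}


# Hypothetical interest network: topic nodes influence observed behavior.
network = {
    "sports": [],
    "news": [],
    "weather": [],
    "clicks_article": ["sports", "news"],
    "bookmarks": ["clicks_article"],
}
print(sorted(markov_blanket("sports", network)))
print(sorted(markov_blanket("clicks_article", network)))
```

Given its Markov blanket, a node is conditionally independent of every other node in the network, which is why restricting attention to the blanket can discard variables that act only as noise.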
Funding: Supported by the National Natural Science Foundation of China (60503020, 60503033, 60703086), the Natural Science Foundation of Jiangsu Province (BK2006094), the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University (KJS0714), and the Research Foundation of Nanjing University of Posts and Telecommunications (NY207052, NY207082).
Abstract: Although K-means is very popular for general clustering, it tends to converge to one of numerous local minima, and its performance depends heavily on the initial cluster centers. This paper proposes a novel initialization scheme for selecting the initial cluster centers for K-means clustering. The algorithm is based on reverse nearest neighbor (RNN) search, which retrieves all points in a given data set whose nearest neighbor is a given query point. The initial cluster centers computed in this way are found to be very close to the desired cluster centers for iterative clustering algorithms. The procedure is applicable to clustering algorithms for continuous data. Its application to the K-means clustering algorithm is demonstrated. Experiments on several popular datasets show the advantages of the proposed method.
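The RNN query defined in this abstract can be illustrated with a brute-force sketch. It shows only the query itself: how the paper turns RNN sets into initial cluster centers is not stated in the abstract and is not reproduced here. The function names, the use of Euclidean distance, and the sample points are assumptions, and a practical implementation would use a spatial index rather than this quadratic scan.

```python
import math


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def reverse_nearest_neighbors(q, points):
    """Indices of all points whose nearest neighbor is points[q]."""
    return [
        i
        for i in range(len(points))
        if i != q and nearest_neighbor(i, points) == q
    ]


points = [(0, 0), (1, 0), (1.5, 0), (10, 0)]
for q in range(len(points)):
    print(q, reverse_nearest_neighbors(q, points))
```

Points with large RNN sets sit where many other points regard them as their closest neighbor, which gives some intuition for why RNN information is useful when looking for dense regions.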
Funding: Supported by the National Natural Science Foundation of China (60803007, 90818027, 10871091, and 60721002), the National High Technology Research and Development Program of China (2009AA01Z147), and the National Basic Research Program of China (2008CB320703).
Abstract: Based on two common properties of logic formulas, equivalence and satisfiability, the concept of variable minimal formulas with property preservation is introduced. A formula is variable minimal if omitting any variable yields a sub-formula for which the given property changes. Some theoretical results on two classes, variable minimal equivalence (VME) and variable minimal satisfiability (VMS), are studied. We prove that VME is NP-complete and that VMS is in DP and coNP-hard.
Funding: Supported by the Major Research Plan of the National Natural Science Foundation of China (90818027), the Key Program of the National Natural Science Foundation of China (60633010), the National Natural Science Foundation of China (60873050, 60873049, 60803008), the Natural Science Foundation of Jiangsu Province of China (BK2006094, BK2008292), and the Opening Foundation of State Key Laboratory of Software Engineering in Wuhan University (SkLSE20080717).
Abstract: Based on the different roles that base flows and alternative flows play in achieving a user's goals, we find that loop structures are frequently used to implement alternative flows and/or to connect different use cases. This paper presents an approach that identifies the base flows and alternative flows of different use cases by traversing a control flow graph from which back edges have been eliminated. The effectiveness of the approach is verified by identifying the use case structure of an ATM system. The approach requires relatively little human intervention, and the manner of intervention closely follows the usual process of software comprehension.
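Eliminating back edges from a control flow graph can be sketched with a depth-first search that flags every edge pointing to a node still on the current DFS path. This is a generic illustration, not the paper's procedure: it treats DFS retreating edges as back edges (the two notions coincide for reducible flow graphs), and the function names and the ATM-like graph are hypothetical.

```python
def back_edges(graph, entry):
    """Edges (u, v) where v is still on the DFS path when u is explored.

    graph: dict mapping every node to a list of successor nodes.
    Uses recursion, so it is suitable only for small graphs.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    found = []

    def visit(u):
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY:
                found.append((u, v))  # edge closes a loop
            elif color[v] == WHITE:
                visit(v)
        color[u] = BLACK

    visit(entry)
    return found


def remove_back_edges(graph, entry):
    """Copy of the graph with all back edges dropped (the result is acyclic)."""
    drop = set(back_edges(graph, entry))
    return {u: [v for v in succ if (u, v) not in drop] for u, succ in graph.items()}


# Hypothetical ATM control flow: PIN retry and return-to-menu form loops.
atm = {
    "start": ["insert_card"],
    "insert_card": ["enter_pin"],
    "enter_pin": ["check_pin"],
    "check_pin": ["menu", "enter_pin"],
    "menu": ["withdraw", "exit"],
    "withdraw": ["menu"],
    "exit": [],
}
print(back_edges(atm, "start"))
print(remove_back_edges(atm, "start"))
```

In this toy graph the two removed edges correspond to a PIN retry and a return to the menu, the kind of loop structure the abstract associates with alternative flows and with connections between use cases.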
Funding: Supported by the National High Technology Research and Development Program of China (863 Program) (2009AA01Z147), the National Natural Science Foundation of China (90818027, 60633010, 60803008), and the National Science Foundation for Distinguished Young Scholars (60425206).
Abstract: To avoid the precision loss caused by combining data-flow facts that cannot occur on the same execution path in dependence analysis for C programs, this paper first proposes a flow-sensitive and context-insensitive points-to analysis algorithm and then presents a new dependence analysis approach based on it. The approach gives fuller consideration to the executable path problem and avoids invalid combinations among points-to relations and between points-to relations and reaching definitions. Its results are therefore more precise than those of ordinary dependence analysis approaches.