In order to improve the accuracy and completeness of mining data records from the web, the concepts of the isomorphic page and the directory page are introduced, together with three algorithms. An isomorphic web page belongs to a set of web pages that share a uniform structure and differ only in their main information. A web page containing many links that point to isomorphic web pages is called a directory page. Algorithm 1 finds directory pages within a website using an adjacent-link similarity analysis: it first sorts the links, then counts the links under each directory; if a count exceeds a given threshold, it searches that directory for similar sub-page links and reports the results. A judgment function for isomorphic web pages is also proposed. Algorithm 2 mines data records from an isomorphic page using a noise-information filter, based on the fact that the noise information in two isomorphic pages is identical and only the main information differs. Algorithm 3 mines data records from an entire website using a web spider. Experiments show that the proposed algorithms mine data records more completely than existing algorithms, and that mining data records from isomorphic pages is an efficient method.
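The paper's pseudocode is not reproduced above, but the threshold test that Algorithm 1 describes can be sketched as follows; the URL-grouping rule, the threshold value, and the path-similarity test are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from urllib.parse import urlparse

def find_directory_pages(links, threshold=10, min_similarity=0.8):
    """Rough sketch of Algorithm 1: group links by parent directory and
    flag directories holding many structurally similar sub-page links."""
    groups = defaultdict(list)
    for url in sorted(links):                          # step 1: sort the links
        path = urlparse(url).path
        groups[path.rsplit("/", 1)[0]].append(url)     # parent directory

    directory_pages = {}
    for directory, urls in groups.items():
        if len(urls) < threshold:                      # step 2: count per directory
            continue
        # step 3: keep links whose paths resemble the first link's path
        # (an assumed stand-in for the paper's similarity analysis)
        base = urlparse(urls[0]).path
        similar = [u for u in urls if SequenceMatcher(
            None, base, urlparse(u).path).ratio() >= min_similarity]
        if len(similar) >= threshold:
            directory_pages[directory] = similar
    return directory_pages
```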
A partition-of-intervals method is adopted in current classification based on associations (CBA), but this method cannot reflect the actual distribution of the data and suffers from the sharp-boundary problem. The classification system based on the longest association rules with linguistic terms is discussed, and its shortcomings are analyzed. Then, a classification system based on short association rules with linguistic terms is presented. An example shows that the accuracy of the classification system based on association rules with linguistic terms is better than that of two popular classification methods, C4.5 and CBA.
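To see why sharp interval boundaries are a problem: with crisp partitions, values of 29.9 and 30.1 can fall into different rule antecedents despite being nearly identical, whereas overlapping linguistic terms give both values graded membership in each term. A minimal sketch with triangular membership functions; the term names and breakpoints are invented for illustration.

```python
def triangular(x, left, peak, right):
    """Triangular membership function for one linguistic term."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Illustrative linguistic terms for an "age" attribute.
terms = {
    "young":       lambda x: triangular(x, 0, 20, 35),
    "middle-aged": lambda x: triangular(x, 25, 40, 55),
    "old":         lambda x: triangular(x, 45, 65, 100),
}

for x in (29.9, 30.1):
    # Nearly identical values receive nearly identical memberships,
    # unlike a crisp split at 30 that would separate them sharply.
    print(x, {name: round(mu(x), 2) for name, mu in terms.items()})
```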
Current technology for frequent itemset mining mostly applies to data stored in a single transaction database. This paper presents a novel algorithm, MultiClose, for frequent itemset mining in data warehouses. MultiClose computes results in the individual dimension tables and then merges those results with a very efficient approach. A closed-itemsets technique is used to improve the performance of the algorithm. The authors propose an efficient implementation for star schemas, in which their algorithm outperforms state-of-the-art single-table algorithms.
Rough set theory is a new soft computing tool that has received much attention from researchers around the world. It can deal with incomplete and uncertain information and has now been applied successfully in many areas. This paper introduces the basic concepts of rough set theory and discusses its applications in Web mining. In particular, applications of rough set theory to intelligent information processing are emphasized.
In order to construct a data mining framework for generic project risk research, basic definitions of the generic project risk element are given, and a new model of the generic project risk element is then built on those definitions. Within this model, a data mining method is used to acquire the risk transmission matrix from analysis of historical databases, which solves the problem of quantitative calculation among generic project risk elements. The method handles risk element transmission problems with a limited number of states well; to obtain the limited states, fuzzy theory is used to discretize the historical data in the databases. In an example, the controlling risk degree is chosen as P(Rs ≥ 2) ≤ 0.1, meaning that the probability that the project's risk state is not less than 2 must not exceed 0.1, and risk element R3 is chosen to control the project. The result shows that three risk transmission matrices can be acquired from the four risk elements, and the frequency histogram and cumulative frequency histogram of each risk element are also given.
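The abstract does not spell out how the transmission matrix is applied. One common reading, sketched below, is to propagate a source element's discrete state distribution through the matrix and test the tail probability against the control threshold; the three-state discretization and all numbers are invented for illustration, not taken from the paper.

```python
import numpy as np

# Assumed: fuzzy discretization yields risk states {0, 1, 2}.
# transmission[i][j] = P(target state = j | source state = i); illustrative only.
transmission = np.array([
    [0.80, 0.15, 0.05],
    [0.20, 0.60, 0.20],
    [0.05, 0.25, 0.70],
])

source_dist = np.array([0.5, 0.4, 0.1])    # P(source state = 0, 1, 2)
target_dist = source_dist @ transmission   # propagated state distribution

p_high = target_dist[2:].sum()             # P(Rs >= 2)
print(f"P(Rs >= 2) = {p_high:.3f}")
print("within threshold" if p_high <= 0.1 else "risk must be controlled")
```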
Through analysis of experimental data on gas explosions in excavation roadways and of the forecast models in the literature, it was found that overpressure is not directly proportional either to the square root of the accumulated gas volume or to the reciprocal of the square root of the propagation distance; moreover, the attenuation computed by the forecast models is faster than that seen in the experimental data. Based on the original forecast models and the experimental data, a relation among the factors was derived by introducing a correlation coefficient tied to the specific gas volume and distance, and this relation was verified against the roadway experimental data. The results show that the new relation is closer to the roadway experimental data and that the overpressure first increases and then decreases with propagation distance.
Background knowledge is important for data mining, especially in complicated situations. Ontological engineering is the successor of knowledge engineering, and the sharable knowledge bases built on ontologies can supply background knowledge to direct the data mining process. This paper gives a general introduction to the method and presents a practical analysis example using an SVM (support vector machine) as the classifier. Gene Ontology and its accompanying annotations compose a large knowledge base on which much research has been carried out. A microarray dataset is the output of a DNA chip. With the help of Gene Ontology, we present a more elaborate analysis of microarray data than previous researchers. The method can also be used in other fields with similar scenarios.
In order to accurately identify the characteristics associated with consumption behavior in online apparel shopping, a typical B2C clothing enterprise in China was chosen. A target experimental database containing 2,000 data records was built from the web service logs of the sample enterprise. Using the clustering algorithms of the Clementine data mining software, a K-means model was set up and eight consumer clusters were obtained, uncovering implicit information about consumers' characteristics and clothing preferences. Finally, 31 valuable association rules among casual wear, formal wear, and tie-in products were discovered using web analysis and the Apriori algorithm. These findings help in better understanding the nature of online apparel consumption behavior and in advancing personalization and intelligent recommendation strategies.
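Clementine (now IBM SPSS Modeler) is a GUI tool, but the clustering step is easy to reproduce in code. A rough equivalent with scikit-learn on synthetic stand-in data; the three log-derived features are invented, since the abstract does not list the real attribute set.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features per customer record: monthly visits, average
# order value, and share of casual-wear purchases.
X = rng.random((2000, 3)) * np.array([30.0, 300.0, 1.0])

X_scaled = StandardScaler().fit_transform(X)   # scale before K-means
model = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_scaled)
print(np.bincount(model.labels_))              # size of each consumer cluster
```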
Because mining the complete set of frequent patterns from a dense database can be impractical, an interesting alternative has recently been proposed: instead of mining the complete set of frequent patterns, the new model finds only the maximal frequent patterns, from which all frequent patterns can be generated. FP-growth is one of the most efficient frequent-pattern mining methods published so far. However, because the FP-tree and the conditional FP-trees must be traversable in both directions, a great deal of memory is needed during mining. This paper proposes Unid_FP-Max, an efficient algorithm for mining maximal frequent patterns based on a unidirectional FP-tree. Owing to the way the unidirectional FP-tree and the conditional unidirectional FP-trees are generated, the algorithm reduces space consumption as far as possible. With two further techniques, single-path pruning and header-table pruning, which cut down the number of conditional unidirectional FP-trees generated recursively during mining, Unid_FP-Max further lowers both time and space costs.
As data services penetrate our daily life ever more rapidly, the mobile network becomes more complicated and the amount of data transmitted keeps growing. In this situation, traditional statistical methods for anomalous cell detection cannot adapt to the evolution of networks, and data mining has become the mainstream. In this paper, we propose a novel kernel density-based local outlier factor (KLOF) to assign each object a degree of being an outlier. Firstly, the notion of KLOF is introduced, which captures exactly the relative degree of isolation. Then, by analyzing its properties, including the tightness of its upper and lower bounds and its sensitivity to density perturbation, we find that KLOF is much greater than 1 for outliers. Lastly, KLOF is applied to a real-world dataset to detect anomalous cells with abnormal key performance indicators (KPIs) and verify its reliability. The experiment shows that KLOF finds outliers efficiently and can guide operators toward faster and more efficient troubleshooting.
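The exact KLOF formula is given in the paper; the sketch below only captures the LOF-style idea, comparing each point's Gaussian kernel density estimate over its k nearest neighbors with the average density of those neighbors. The bandwidth, k, and test data are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def klof_scores(X, k=10, bandwidth=1.0):
    """LOF-style outlier factor built on a Gaussian kernel density
    estimate -- a sketch of the idea, not the paper's exact formula."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]      # drop the self-neighbor

    # Kernel density estimate of each point over its k nearest neighbors.
    density = np.exp(-(dist / bandwidth) ** 2 / 2).mean(axis=1)

    # Average neighbor density relative to own density: values well
    # above 1 mark isolated points, matching the KLOF property cited above.
    return density[idx].mean(axis=1) / density

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[8.0, 8.0]]])  # one planted outlier
print(klof_scores(X).argmax())                             # prints 200
```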
HA (hashing array), a new algorithm for mining frequent itemsets in large databases, is proposed. It employs a hash-array structure, ItemArray(), to store the information of the database and then uses this structure in place of the database in later iterations. With this improvement, only two scans of the whole database are necessary, so the computational cost can be reduced significantly. To overcome the performance bottleneck of frequent 2-itemset mining, a modified version of HA, DHA (direct-addressing hashing and array), is proposed, which combines HA with a direct-addressing hashing technique. The hybrid algorithm DHA not only overcomes the bottleneck but also inherits the advantages of HA. Extensive simulations are conducted to evaluate the performance of the proposed algorithms, and the results show that the new algorithm is more efficient and reasonable.
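Counting candidate 2-itemsets is the classic bottleneck because the candidate set is quadratic in the number of items; direct addressing replaces a collision-prone hash table with a flat array indexed by the item pair. A minimal sketch of that idea; the triangular indexing scheme is an assumption, not necessarily the paper's exact layout.

```python
from itertools import combinations

def count_2itemsets(transactions, n_items):
    """Count all 2-itemsets with a direct-addressed triangular array:
    each pair (i, j), i < j, owns a unique slot, so there are no
    hash collisions to resolve."""
    counts = [0] * (n_items * (n_items - 1) // 2)

    def slot(i, j):   # triangular index of pair (i, j) with i < j
        return i * n_items - i * (i + 1) // 2 + (j - i - 1)

    for t in transactions:
        for i, j in combinations(sorted(set(t)), 2):
            counts[slot(i, j)] += 1
    return counts, slot

transactions = [[0, 1, 2], [0, 2], [1, 2, 3], [0, 1, 2, 3]]
counts, slot = count_2itemsets(transactions, n_items=4)
print(counts[slot(0, 2)])   # support count of itemset {0, 2} -> 3
```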
Querying XML data is a computationally expensive process owing to the complex nature of both the XML data and the XML queries. In this paper we propose an approach that expedites XML query processing by caching the results of frequent queries. We discover frequent query patterns from user-issued queries using an efficient bottom-up mining approach called VBUXMiner, which consists of two main steps. First, all queries are merged into a summary structure named the compressed global tree guide (CGTG). Second, a bottom-up traversal scheme based on the CGTG is employed to generate frequent query patterns. We then use the frequent query patterns in a cache mechanism to improve XML query performance. Experimental results show that our mining approach outperforms previous mining algorithms for XML queries, such as XQPMinerTID and FastXMiner, and that caching the results of frequent query patterns improves XML query performance dramatically.
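The caching side can be pictured in a few lines: results are stored under a normalized form of the query, and only queries matching a mined frequent pattern are admitted. The normalization and the pattern set below are toy placeholders for the CGTG machinery, not the paper's implementation.

```python
class FrequentQueryCache:
    """Cache query results, admitting only queries whose normalized form
    matches a mined frequent pattern."""

    def __init__(self, frequent_patterns):
        self.frequent = set(frequent_patterns)
        self.results = {}

    @staticmethod
    def normalize(query):
        return " ".join(query.split())           # placeholder normalization

    def get(self, query, evaluate):
        key = self.normalize(query)
        if key in self.results:
            return self.results[key]             # cache hit
        result = evaluate(query)                 # fall back to the XML engine
        if key in self.frequent:                 # admit frequent patterns only
            self.results[key] = result
        return result

cache = FrequentQueryCache({"//book/title"})
print(cache.get("//book/title", evaluate=lambda q: ["XML in a Nutshell"]))
```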
Research on and application of big data mining is currently a hot issue. This paper briefly introduces the basic ideas of big data research, analyzes the necessity of applying big data to earthquake precursor observation, and probes certain issues and solutions in applying this technology to work in the seismic domain. By doing so, we hope to promote the innovative use of big data in the analysis of earthquake precursor observation data.
Rapid developments in telecommunications, sensor data, financial applications, data stream analysis, and similar fields are increasing the rate of data arrival, and data mining is considered a vital process for handling it. Data analysis consists of different tasks, among which data stream classification faces more challenges than the other commonly used techniques. Even though classification is a continuous process, it requires a design that can adapt the classification model to concept change, that is, shifts of the boundaries between classes. Hence, we design a novel fuzzy classifier, THRFuzzy, to classify newly arriving data streams. Rough set theory, together with a tangential holoentropy function, helps in designing the dynamic classification model. The approach uses kernel fuzzy c-means (FCM) clustering to generate the rules and the tangential holoentropy function to update the membership function. The performance of THRFuzzy is verified on three datasets, namely the skin segmentation, localization, and breast cancer datasets, with accuracy and time as the evaluation metrics, against HRFuzzy and adaptive k-NN classifiers. The experimental results show that THRFuzzy achieves better classification results, delivering maximum accuracy in minimal time compared with the existing classifiers.
Sensors are ubiquitous in the Internet of Things for measuring and collecting data, and analyzing the data derived from sensors is an essential task that can reveal useful latent information beyond the raw data. Since the Internet of Things contains many sorts of sensors, the measurement data collected are multi-type, sometimes containing temporal-series information. Dealing with each sort of data separately would miss useful information. This paper proposes a method to discover the correlation in multi-faceted data, which contains many types of data with temporal information, handling all facets simultaneously. We transform high-dimensional multi-faceted data into lower-dimensional data modeled as a multivariate Gaussian graphical model, then mine the correlation in the multi-faceted data by discovering the structure of that model. We verify the method on a real dataset, and the experiment demonstrates that it correctly finds the correlations among multi-faceted measurement data.
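In a Gaussian graphical model, structure discovery amounts to estimating a sparse precision (inverse covariance) matrix: nonzero off-diagonal entries are the edges, i.e., conditional dependencies between facets. A compact sketch with scikit-learn's graphical lasso, using synthetic data as a stand-in for the dimension-reduced sensor measurements:

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
# Synthetic stand-in: facets 0-2 share a latent signal, facet 3 is independent.
base = rng.normal(size=(500, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(500, 3)),
               rng.normal(size=(500, 1))])

model = GraphicalLassoCV().fit(X)
precision = model.precision_
edges = [(i, j) for i in range(4) for j in range(i + 1, 4)
         if abs(precision[i, j]) > 1e-3]   # nonzero entries = graph edges
print(edges)   # expect edges only among facets 0, 1, 2
```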
A fundamental problem in whole-sequence matching and subsequence matching is the representation of time series. In the last decade, many high-level representations of time series have been proposed for data mining, all involving a trade-off between accuracy and compactness. In this paper the author proposes a novel time series representation called the Grid Minimum Bounding Rectangle (GMBR), based on the minimum bounding rectangle, in which a binary encoding is applied to the minimum bounding rectangle. Experiments have been performed on synthetic as well as real data sequences to evaluate the proposed method, and they demonstrate that 69%-92% of irrelevant sequences are pruned using the proposed method.
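The abstract does not detail the encoding, but the name suggests splitting a sequence's minimum bounding rectangle into a grid and recording the cells the series passes through as a bit pattern, so that sequences whose grids share no cells can be pruned without a full distance computation. A speculative sketch of that idea:

```python
import numpy as np

def gmbr_signature(series, rows=4, cols=4):
    """Speculative GMBR-style signature: divide the series' minimum
    bounding rectangle into a rows x cols grid and set one bit per
    occupied cell."""
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    col = (np.arange(len(series)) * cols) // len(series)
    row = np.minimum(((series - lo) / (hi - lo + 1e-12) * rows).astype(int),
                     rows - 1)
    bits = np.zeros((rows, cols), dtype=bool)
    bits[row, col] = True
    return bits

a = gmbr_signature(np.sin(np.linspace(0, 6, 100)))
b = gmbr_signature(np.linspace(0.0, 1.0, 100))
# Pruning test: disjoint grids cannot come from close sequences, so the
# expensive exact distance can be skipped for such pairs.
print((a & b).any())
```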
Recently, many data anonymization methods have been proposed to protect privacy in data mining applications, but few of them consider the threat posed by a user's prior knowledge of data patterns. To solve this problem, a flexible method is proposed to randomize the dataset, so that a user can hardly obtain the sensitive data even when knowing the data relationships in advance. The method also achieves a high level of accuracy in the mining process, as demonstrated in the experiments.
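The abstract does not describe the randomization itself; a classic baseline that illustrates the trade-off is randomized response, where each sensitive bit is kept with probability p and flipped otherwise, yet the aggregate rate can still be estimated accurately from the perturbed data. This illustrates the general technique, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize(bits, p_keep=0.7):
    """Randomized response: keep each bit with probability p_keep,
    flip it otherwise."""
    flip = rng.random(bits.shape) >= p_keep
    return bits ^ flip

def estimate_true_rate(perturbed, p_keep=0.7):
    """Invert E[observed] = p_keep * t + (1 - p_keep) * (1 - t) for t."""
    return (perturbed.mean() - (1 - p_keep)) / (2 * p_keep - 1)

secret = rng.random(10_000) < 0.3            # true sensitive attribute, 30% ones
noisy = randomize(secret)
print(round(estimate_true_rate(noisy), 3))   # close to 0.3
```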
In this paper, we conduct research on the developmental trend of data journalism against the background of the big data era. Big data is not only a concept but also a description of a state of society: in the big data era, data have become an important social resource and means of production, and the news media are no exception. When data were not yet taken so seriously, the core news resource was the first-hand material a reporter gathered on the scene, a description of facts based on what the reporter could see, smell, and feel, with data playing only a supplementary role. In today's big data era, however, although the scene remains very important, in-depth information formed by mining and analyzing data from many aspects has become more and more important. Our research proposes a novel and meaningful paradigm for these issues.
Traditional clustering algorithms generally suffer from problems such as sensitivity to initialization parameters, difficulty in finding the optimal clustering result, and questions about the validity of the clustering. In this paper, an FSM and a mathematical model of a new clustering algorithm based on swarm intelligence are provided. In this algorithm, the clustering agents move in a three-dimensional space and have the abilities of memory, communication, analysis, judgment, and coordination of information. Experimental results confirm that this algorithm has many merits, such as insensitivity to the order of the data and the capability of dealing with exceptional, high-dimensional, or complicated data. The algorithm can be used in the fields of Web mining, incremental clustering, economic analysis, pattern recognition, document classification, and so on.