Abstract: Mining sequential patterns from large databases is widely recognized as an attractive task in data mining and knowledge discovery. Previous algorithms scan the database many times, which is often prohibitive given the very large size of the databases. In this paper, the authors introduce an effective algorithm for mining sequential patterns from large databases. In the algorithm, the original database is not used at all for counting the support of sequences after the first pass. Instead, a tidlist structure generated in the previous pass is used for this purpose through set intersection operations, which avoids multiple scans of the database.
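The tidlist idea in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names (`build_item_tidlists`, `extend`), the single-item event model, and the choice to store only the earliest end position per sequence are all assumptions made for illustration. The point it shows is that, after one pass builds the per-item lists, the support of a longer pattern is obtained by intersecting sequence ids and checking positions, without touching the database again.

```python
from bisect import bisect_right


def build_item_tidlists(db):
    """Single pass over the database: item -> {sid: sorted positions}."""
    tidlists = {}
    for sid, seq in enumerate(db):
        for pos, item in enumerate(seq):
            tidlists.setdefault(item, {}).setdefault(sid, []).append(pos)
    return tidlists


def extend(pattern_ends, item_tidlist):
    """Join a pattern's {sid: earliest end position} with an item's tidlist.

    Returns {sid: earliest end position} for the pattern extended by the item.
    The support of the extended pattern is the size of the result.
    """
    result = {}
    # Only sequences that contain both the pattern and the item can match.
    for sid in pattern_ends.keys() & item_tidlist.keys():
        positions = item_tidlist[sid]
        # First occurrence of the item strictly after the pattern's end.
        i = bisect_right(positions, pattern_ends[sid])
        if i < len(positions):
            result[sid] = positions[i]
    return result


# Toy database (hypothetical data): each sequence is a list of single items.
db = [list("abcb"), list("acb"), list("ba"), list("abc")]
tid = build_item_tidlists(db)

# A length-1 pattern ends at the first occurrence of its item.
a_ends = {sid: positions[0] for sid, positions in tid["a"].items()}
ab_ends = extend(a_ends, tid["b"])
abc_ends = extend(ab_ends, tid["c"])
support_ab = len(ab_ends)
support_abc = len(abc_ends)
```

Keeping only the earliest end position is enough here because any later occurrence of the next item that follows a later match also follows the earliest one.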
Abstract: Finding correlated sequential patterns in large sequence databases is an essential data mining task: a huge number of sequential patterns are usually mined, but it is hard to find those that are correlated. The data analysis needed also differs with the requirements of real applications. In previous mining approaches, sequential patterns with weak affinity are still found even with a high minimum support. This paper proposes a new framework for mining weighted support affinity patterns, in which an objective measure, sequential ws-confidence, is developed to detect correlated sequential patterns. To prune weak affinity patterns efficiently, the ws-confidence measure is proved to satisfy the anti-monotone and cross weighted support properties, which can be applied to eliminate sequential patterns with dissimilar weighted support levels. Based on the framework, a weighted support affinity pattern mining algorithm (WSMiner) is proposed. A performance study shows that WSMiner is efficient and scalable for mining weighted support affinity patterns.
Funding: Supported by the National Natural Science Foundation of China (70371015) and the Natural Science Foundation of Jiangsu Province (BK2004058).
Abstract: To reduce the computational and space cost of rerunning a mining algorithm for each sequential pattern query, this paper proposes a fast interactive sequential pattern mining algorithm (FISP) based on previously mined sequential patterns and projection databases. Because the current mining run builds on the previously mined sequences, the number of frequent items in the projection databases it constructs is reduced. Furthermore, the algorithm's iterative running time is greatly reduced by using a global threshold. Experimental results show that FISP outperforms PrefixSpan in interactive mining.
Funding: This work was supported by the Malaysian Ministry of Education (SLAI) and Universiti Teknologi Malaysia (UTM).
Abstract: A holistic understanding of wind behaviour over space, time and height is essential for wind energy harvesting applications. This study presents a novel approach for mapping frequent wind profile patterns using multidimensional sequential pattern mining (MDSPM). The approach is illustrated with a 24-year time series of European Centre for Medium-Range Weather Forecasts European Reanalysis-Interim gridded (0.125° × 0.125°) wind data for the Netherlands, sampled every 6 h at six height levels. The wind data were first transformed into two spatio-temporal sequence databases (for speed and direction, respectively). Then, the Linear time Closed Itemset Miner Sequence algorithm was used to extract the multidimensional sequential patterns, which were visualized using a 3D wind rose, a circular histogram and a geographical map. These patterns were further analysed to determine their wind shear coefficients and turbulence intensities, as well as their spatial overlap with areas that currently have wind turbines. The analysis identified four frequent wind profile patterns. One of them is highly suitable for harvesting wind energy at a height of 128 m, and 68.97% of the geographical area covered by this pattern already contains wind turbines. The study shows that the proposed approach can efficiently extract meaningful patterns from complex spatio-temporal datasets.
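Abstract: Traditional sequential pattern mining (SPM) does not consider pattern repetition and ignores the influence of each item's utility (unit price or profit) and of pattern length on user interest. To address this, a top-k one-off high average utility sequential pattern mining (TOUP) algorithm is proposed. TOUP comprises two core steps: average utility calculation and candidate pattern generation. First, the CSP (Calculation Support of Pattern) algorithm, based on the occurrence positions of each item and an item repetition relation array, is proposed to calculate pattern support, which enables fast calculation of the average utility of a pattern. Second, candidate patterns are generated by itemset extension and sequence extension, and a maximum average utility upper bound is proposed, on the basis of which candidate patterns are pruned effectively. Experimental results on five real datasets and one synthetic dataset show that, compared with the TOUP-dfs and HAOP-ms algorithms, TOUP reduces the number of candidate patterns by 38.5% to 99.8% and 0.9% to 77.6%, respectively, and reduces the running time by 33.6% to 97.1% and 57.9% to 97.2%, respectively. TOUP therefore performs better and mines the patterns that interest users more efficiently.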
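Abstract: Sequential pattern mining requires a large amount of computation time and storage space, so reducing computation and storage overhead is the key to the task. PrefixSpan does not generate candidates, and appropriate use of a bitmap data structure avoids repeated scans of the database. On this basis, this paper proposes the BM-PrefixSpan algorithm for sequential pattern mining, and designs and constructs a PFPBM (Prefix of First Position on BitMap) table that records the position at which each item of a sequence first occurs in the bitmap. Experimental results show that BM-PrefixSpan combines the advantages of the PrefixSpan and SPAM algorithms and mines sequential patterns faster and better.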
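The abstract above describes a table of first occurrence positions over a bitmap but does not give its layout. The following Python sketch shows one plausible reading under stated assumptions: each item gets one integer bitmap per sequence (bit `p` set when the item occurs at position `p`), and a first-position table is derived from the lowest set bit. The names `build_bitmaps` and `first_positions` are hypothetical and are not taken from the paper.

```python
def build_bitmaps(db):
    """One scan of the database: item -> {sid: bitmap of positions}.

    Bit p of the bitmap is set when the item occurs at position p
    of sequence sid (assumed representation, using Python integers).
    """
    bitmaps = {}
    for sid, seq in enumerate(db):
        for pos, item in enumerate(seq):
            per_seq = bitmaps.setdefault(item, {})
            per_seq[sid] = per_seq.get(sid, 0) | (1 << pos)
    return bitmaps


def first_positions(bitmaps):
    """Derive item -> {sid: position of the first occurrence}.

    bits & -bits isolates the lowest set bit; its bit_length minus one
    is the index of that bit, i.e. the first occurrence position.
    """
    return {
        item: {sid: (bits & -bits).bit_length() - 1
               for sid, bits in per_seq.items()}
        for item, per_seq in bitmaps.items()
    }


# Toy database (hypothetical data): sequences of single items.
db = [list("abcb"), list("acb"), list("ba")]
bitmaps = build_bitmaps(db)
first = first_positions(bitmaps)
```

With such a table, the projection of a sequence for a one-item prefix can start right after the recorded position instead of rescanning the sequence from the beginning.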
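Abstract: Format parsing of unknown security protocols is a key problem that urgently needs to be solved in current information security technology. Existing methods based on network packet traffic consider only the plaintext in packet payloads and are therefore unsuitable for security protocols that contain large amounts of ciphertext. To address this problem, a new format parsing approach for unknown security protocols (security protocols format parsing approach, SPFPA) is proposed. SPFPA is the first to use sequential pattern mining to extract the keyword sequence features of a protocol hierarchically and sequentially, which offers a new way to parse plaintext formats. On this basis, heuristic rules are given for locating the ciphertext length field of a protocol, and the randomness of ciphertext data is then used to determine the ciphertext field. Experimental results show that, without relying on any host execution features and using only network packet data, the method effectively parses the invariant fields, variable fields, ciphertext length fields and corresponding ciphertext fields of unknown security protocols with high accuracy.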
Abstract: Sequential pattern mining is an important data mining problem with broad applications. However, it is also a challenging problem, since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Recent studies have developed two major classes of sequential pattern mining methods: (1) a candidate generation-and-test approach, represented by (i) GSP, a horizontal format-based sequential pattern mining method, and (ii) SPADE, a vertical format-based method; and (2) a pattern-growth method, represented by PrefixSpan and its further extensions, such as gSpan for mining structured patterns. In this study, we give a systematic introduction to and presentation of the pattern-growth methodology and study its principles and extensions. We first introduce two interesting pattern-growth algorithms, FreeSpan and PrefixSpan, for efficient sequential pattern mining. Then we introduce gSpan for mining structured patterns using the same methodology. Their relative performance on large databases is presented and analyzed. Several extensions of these methods are also discussed in the paper, including mining multi-level, multi-dimensional patterns and mining constraint-based patterns.
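The pattern-growth idea summarized above can be sketched in a few lines of Python. This is a simplified illustration in the spirit of PrefixSpan, not the published algorithm: it assumes each sequence is a list of single items (no itemsets), represents a projected database as `(sid, start)` pointers (often called pseudo-projection), and uses the hypothetical function name `prefixspan`.

```python
def prefixspan(db, min_support):
    """Return (pattern, support) pairs for all frequent sequential patterns.

    db is a list of sequences, each a list of single items.
    """
    results = []

    def grow(prefix, projected):
        # projected holds (sid, start) pairs: the suffix db[sid][start:]
        # is what remains after the first match of the current prefix.
        counts = {}
        for sid, start in projected:
            # Count each item once per sequence.
            for item in set(db[sid][start:]):
                counts[item] = counts.get(item, 0) + 1

        for item in sorted(counts):
            if counts[item] < min_support:
                continue  # infrequent items cannot extend the prefix
            pattern = prefix + [item]
            results.append((pattern, counts[item]))

            # Project on the first occurrence of the item in each suffix.
            new_projected = []
            for sid, start in projected:
                try:
                    idx = db[sid].index(item, start)
                except ValueError:
                    continue  # item absent from this suffix
                new_projected.append((sid, idx + 1))
            grow(pattern, new_projected)

    grow([], [(sid, 0) for sid in range(len(db))])
    return results


# Toy database (hypothetical data).
db = [list("abcb"), list("acb"), list("ba"), list("abc")]
patterns = prefixspan(db, min_support=3)
```

No candidate sequences are generated and tested against the full database: each recursive call only counts items inside the suffixes that already match the current prefix, which is the contrast with the candidate generation-and-test approach that the abstract draws.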