Abstract: A multi-document summarization method based on Latent Semantic Indexing (LSI) is proposed. The method combines several reports on the same issue into a term-by-sentence matrix and applies Singular Value Decomposition (SVD) to reduce the dimension of the matrix and extract features; sentence similarity is then computed in the reduced space. The sentences are clustered by similarity, and the centroid sentence of each cluster is selected. Finally, the selected sentences are ordered to generate the summary. Evaluation results are presented, which show that the proposed method is efficient.
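The matrix, SVD, and similarity steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `lsi_sentence_similarity`, the whitespace tokenization, the raw term counts, and the default rank `k=2` are all assumptions chosen for illustration.

```python
import numpy as np

def lsi_sentence_similarity(sentences, k=2):
    """Return cosine similarities between sentences in a rank-k LSI space.

    Assumptions (not from the abstract): whitespace tokenization,
    raw term counts as weights, and a caller-chosen rank k.
    """
    # Build the term-by-sentence count matrix A (rows: terms, columns: sentences).
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, toks in enumerate(tokenized):
        for w in toks:
            A[index[w], j] += 1.0

    # Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))

    # Keep the k largest singular values; each row is one sentence in the reduced space.
    reduced = (np.diag(s[:k]) @ Vt[:k, :]).T

    # Cosine similarity between the reduced sentence vectors.
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty sentences
    unit = reduced / norms
    return unit @ unit.T

sentences = [
    "flood hits city",
    "city flood damage",
    "team wins match",
    "match team victory",
]
similarity = lsi_sentence_similarity(sentences, k=2)
```

In this toy input the first two sentences share vocabulary with each other but not with the last two, so `similarity[0, 1]` comes out far higher than `similarity[0, 2]`. A clustering step and the selection of one centroid sentence per cluster would follow, as the abstract describes.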
Funding: the National Natural Science Foundation of China (Nos. 61073193 and 61300230); the Key Science and Technology Foundation of Gansu Province (No. 1102FKDA010); the Natural Science Foundation of Gansu Province (No. 1107RJZA188); the Science and Technology Support Program of Gansu Province (No. 1104GKCA037)
Abstract: To improve the accuracy of text categorization and reduce the dimension of the feature space, this paper proposes a two-stage feature selection method based on a novel category correlation degree (CCD) method and latent semantic indexing (LSI). In the first stage, the CCD method selects the most effective features for text classification and is more effective than the traditional feature selection method. Document representation requires a high-dimensional feature space and does not take into account the semantic relations between features, which leads to poor categorization accuracy. In the second stage, LSI is therefore applied to address these problems: it replaces individual terms with statistically derived conceptual indices, which uncovers the important correlations between features and reduces the dimension of the feature space. First, each feature is ranked by its importance for classification using the CCD method. Second, a new semantic space among features is constructed using LSI. The experimental results show that the method effectively reduces the dimension of the text vector and improves the performance of text categorization.
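Abstract: Analyzing the evolutionary coupling of entities in a software system supports analysis activities such as co-change prediction, software supply chain risk identification, code vulnerability tracing, defect prediction, and architecture problem localization. Two code entities are evolutionarily coupled if they tend to change together (co-change) over the software's revision history. Existing evolutionary coupling analysis methods have difficulty accurately detecting co-changes at a "distance", which occur frequently in software maintenance history. To address this problem, an evolutionary coupling analysis method that combines association rule mining, episode mining, and the latent semantic indexing model (association rule, MINEPI and LSI based method, AR-MIM) is proposed to mine co-change relationships at a "distance". The experiments used a dataset collected from 58 Python projects, with 242074 training records and 330660 ground-truth records, and compared AR-MIM with 4 existing baseline methods to validate its effectiveness. The results show that, in the scenario of predicting co-change candidates, AR-MIM outperforms the existing methods in accuracy, recall, and F1 score.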
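To make the idea of co-change at a "distance" more concrete, the sketch below counts entity pairs that change within a sliding window of consecutive commits and derives an association-rule confidence from those counts. It is a simplified illustration, not AR-MIM itself. The abstract does not define "distance", so reading it as "changed in nearby but separate commits" is an assumption, as are the names `windowed_cochange` and `confidence` and the `window` parameter.

```python
from collections import Counter
from itertools import combinations

def windowed_cochange(commits, window=2):
    """Count entity pairs that change within `window` consecutive commits.

    `commits` is a chronologically ordered list of sets of changed entities.
    Returns (pair_counts, entity_counts, number_of_windows).
    """
    pair_counts = Counter()
    entity_counts = Counter()
    n_windows = max(len(commits) - window + 1, 0)
    for start in range(n_windows):
        # Merge all entities changed anywhere inside this window.
        changed = set().union(*commits[start:start + window])
        entity_counts.update(changed)
        pair_counts.update(combinations(sorted(changed), 2))
    return pair_counts, entity_counts, n_windows

def confidence(pair_counts, entity_counts, a, b):
    """Confidence of the rule a -> b: share of windows containing a that also contain b."""
    pair = tuple(sorted((a, b)))
    return pair_counts[pair] / entity_counts[a] if entity_counts[a] else 0.0

# Hypothetical history: a.py and b.py never change in the same commit,
# but they repeatedly change in adjacent commits.
history = [{"a.py"}, {"b.py"}, {"a.py"}, {"b.py"}, {"c.py"}]
pairs, entities, n = windowed_cochange(history, window=2)
print(confidence(pairs, entities, "a.py", "b.py"))  # 1.0
print(confidence(pairs, entities, "b.py", "a.py"))  # 0.75
```

With `window=1` the same history yields no pairs at all, which shows why a purely per-commit analysis would miss this kind of relationship.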
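Abstract: This paper uses latent semantic indexing (LSI) and a genetic algorithm (GA) for text feature extraction. LSI captures semantic relations in the VSM (Vector Space Model), and singular value decomposition (SVD) effectively reduces the dimension of the vector space. However, the text features that remain after this reduction must still be kept at around several hundred dimensions, so a genetic algorithm is applied on top of it to reduce the dimension further. Experimental results show that combining the two methods greatly reduces the dimension of the text vector space and improves classification accuracy.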
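The second stage described here, a genetic algorithm that prunes dimensions of an already reduced space, can be sketched as follows. The abstract gives no details of the encoding, operators, or fitness function, so everything below is an assumption for illustration: a bit-mask chromosome over the dimensions, a nearest-centroid training accuracy as fitness, a small per-dimension penalty, truncation selection, one-point crossover, and bit-flip mutation.

```python
import random

def nearest_centroid_accuracy(X, y, mask):
    """Training accuracy of a nearest-centroid classifier on the selected dimensions."""
    dims = [i for i, bit in enumerate(mask) if bit]
    if not dims:
        return 0.0
    # One centroid per class, computed only over the selected dimensions.
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(r[d] for r in rows) / len(rows) for d in dims]
    correct = 0
    for x, lab in zip(X, y):
        pred = min(
            centroids,
            key=lambda c: sum((x[d] - m) ** 2 for d, m in zip(dims, centroids[c])),
        )
        correct += pred == lab
    return correct / len(X)

def ga_select(X, y, generations=20, pop_size=12, mutation_rate=0.1,
              penalty=0.01, seed=0):
    """Search for a bit mask over dimensions; requires at least 2 dimensions."""
    rng = random.Random(seed)
    n = len(X[0])

    def fitness(mask):
        # Reward accuracy, penalize every dimension that is kept.
        return nearest_centroid_accuracy(X, y, mask) - penalty * sum(mask)

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]      # truncation selection; parents survive unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)       # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation: each bit flips with probability mutation_rate.
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Hypothetical data: rows stand in for documents already projected by LSI.
X = [[0.9, 0.1, 0.5, 0.4], [0.8, 0.3, 0.4, 0.6],
     [0.1, 0.2, 0.5, 0.5], [0.2, 0.1, 0.6, 0.4]]
y = [0, 0, 1, 1]
best_mask = ga_select(X, y)
```

The returned mask marks which dimensions to keep. In a realistic setting the fitness would more plausibly be a held-out or cross-validated accuracy rather than the training accuracy used in this sketch.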