Abstract
Text mining represents documents by their features, but the resulting feature matrix is very high-dimensional, which makes processing computationally expensive, so the text must first undergo dimensionality reduction. Latent semantic analysis (LSA), used as a text clustering method, has distinctive advantages in extracting text information effectively and has been applied in many fields. This paper builds a classification system for Chinese legal case texts and introduces LSA as a second-stage dimensionality reduction of the text vector space: the feature-document matrix produced by LSA replaces the original matrix, which further removes noise and speeds up the classification system. The implementation process and experimental data are given, and the experiments show that the method achieves good results.
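The LSA step described in the abstract can be illustrated with a truncated singular value decomposition. The following is a minimal sketch, not the paper's implementation: the abstract does not give the first-stage feature selection, the term weighting, or the classifier, so the toy term-document matrix, the function names `lsa_reduce` and `fold_in`, and the choice `k=2` are all assumptions made for illustration.

```python
import numpy as np

def lsa_reduce(term_doc, k):
    """Truncated SVD of a (n_terms, n_docs) matrix, keeping k latent dimensions.

    Hypothetical helper: returns the term basis U_k, the singular values s_k,
    the k-dimensional document vectors, and the rank-k approximation of the input.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_vectors = (s_k[:, None] * Vt_k).T      # shape (n_docs, k)
    approx = U_k @ (s_k[:, None] * Vt_k)       # rank-k, "denoised" matrix
    return U_k, s_k, doc_vectors, approx

def fold_in(U_k, term_vector):
    """Project a new document's term vector into the same k-dimensional space."""
    return U_k.T @ term_vector

# Toy term-document counts (assumed data): rows are terms, columns are documents.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [1, 0, 0, 1],
], dtype=float)

U_k, s_k, docs_k, A_k = lsa_reduce(A, k=2)
new_doc = fold_in(U_k, np.array([1.0, 1.0, 0.0, 0.0, 0.0]))

print(docs_k.shape)   # one k-dimensional vector per document
print(new_doc.shape)  # the new document in the same reduced space
```

A downstream classifier would then be trained on `docs_k` instead of the columns of `A`, which is where the speed-up claimed in the abstract would come from; the value of `k` trades off noise removal against information loss and would need to be tuned experimentally.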
Source
Electronic Measurement Technology (《电子测量技术》)
2007, Issue 10, pp. 111-114 (4 pages)
Keywords
text classification
secondary dimension reduction
legal text