A framework for a text classification system is presented, and the high dimensionality of the feature space in text classification is studied. Mutual information is a widely used information-theoretic measure that describes the stochastic dependency between discrete random variables. It is used here as a criterion for reducing the high dimensionality of feature vectors in Web text classification. Feature selection and feature conversion, both linear and non-linear, are performed by maximizing mutual information. Entropy is used and extended to find suitable features in pattern recognition systems. This lays a foundation for text classification mining.
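The abstract does not give its exact selection procedure, so the following is only a minimal sketch of one common way to use mutual information as a feature-selection criterion: score each term by the mutual information between its presence in a document and the class label, then keep the top-scoring terms. The function names `mutual_information` and `select_features`, the set-of-tokens document representation, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(term_present, labels):
    """Estimate I(T; C) in bits from paired observations.

    term_present: list of bools, whether the term occurs in each document.
    labels: list of class labels, one per document.
    """
    n = len(labels)
    joint = Counter(zip(term_present, labels))  # counts of (t, c) pairs
    count_t = Counter(term_present)             # marginal counts of T
    count_c = Counter(labels)                   # marginal counts of C
    mi = 0.0
    for (t, c), count in joint.items():
        p_tc = count / n
        p_t = count_t[t] / n
        p_c = count_c[c] / n
        mi += p_tc * math.log2(p_tc / (p_t * p_c))
    return mi


def select_features(docs, labels, k):
    """Rank vocabulary terms by mutual information with the class; keep k.

    docs: list of sets of tokens (hypothetical representation).
    """
    vocab = sorted({word for doc in docs for word in doc})
    scores = {
        word: mutual_information([word in doc for doc in docs], labels)
        for word in vocab
    }
    return sorted(vocab, key=lambda word: scores[word], reverse=True)[:k]


# Toy corpus (invented for illustration only).
docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"stock", "team"}]
labels = ["sport", "sport", "finance", "finance"]
print(select_features(docs, labels, 2))
```

In this toy corpus, "goal" and "stock" each determine the class completely, while "team" is independent of it, so a ranking by mutual information discards "team" first. This ranking covers only feature selection; the linear and non-linear feature conversions mentioned in the abstract are not sketched here.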
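Topic classification methods for ancient Chinese texts that rely mainly on cataloguing and rule matching suffer from low efficiency, strong dependence on expert knowledge, a single basis for classification, and difficulty in automating topic classification. To address this, this paper combines the content and character features of ancient texts and attempts to derive, from the content itself, topics that meet researchers' needs, promoting the transformation of the digital humanities research paradigm. First, following the way the Eastern Han work Shuowen Jiezi (《说文解字》) analyses characters, and building on a previously annotated corpus of ancient texts, a new four-dimensional feature dataset is constructed: pronunciation (说), original text (文), structure (解), and glyph form (字). Second, a four-dimensional feature vector extraction model (speaking, word, pattern, and font to vector, SWPF2vec) is designed and combined with a pre-trained model to obtain fine-grained feature representations of ancient texts. Third, a topic classification model for ancient texts (dianji-recurrent convolutional neural networks for text classification, DJ-TextRCNN) is built that integrates convolutional neural networks, recurrent neural networks, and multi-head attention. Finally, the four-dimensional semantic features are incorporated to achieve multi-dimensional, deep, fine-grained semantic mining of ancient texts. On the topic classification task, DJ-TextRCNN achieves the best accuracy under every feature dimension and reaches 76.23% accuracy with the four Shuowen Jiezi features, providing an initial realization of accurate topic classification for ancient texts.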
With the development of large-scale text processing, the dimension of the text feature space has grown larger and larger, which creates many difficulties for natural language processing. How to reduce the dimension has become a practical problem in the field. Here we present two clustering methods, concept association and concept abstract, to achieve this goal. The first refers to keyword clustering based on the co-occurrence of ...
In mechanical part design, process planning, and assembly design, we often have to calculate and analyse a dimension chain. Traditionally, a dimension chain is established and calculated manually. With the wide application of computers in mechanical design and manufacture, people began to use a computer to acquire and calculate a dimension chain automatically. In reported work, a dimension chain can be established and calculated automatically. However, the dimension text values of the dimensions composing the chain, together with the upper and lower tolerance values of those dimensions, are entered into the computer manually, which is inefficient and error-prone. To overcome these difficulties, it is important to acquire the noted dimensions automatically, then analyse and calculate the dimension chain, and finally show the results. At present, Autodesk's AutoCAD software is widely used in mechanical design. To automatically acquire noted dimensions and to analyse and calculate a dimension chain in an AutoCAD design drawing, this paper introduces a scheme for automatic dimension acquisition and dimension chain calculation in AutoCAD using ObjectARX, a development tool for AutoCAD. In this paper a dimension chain is expressed by three matrices: the dimension text value matrix, the tolerance upper value matrix, and the tolerance lower value matrix. The developed program can calculate both an assembly dimension chain and a process dimension chain. When the program runs in AutoCAD, the noted dimensions composing a dimension chain are selected in turn with the mouse; the computer then calculates the dimension chain and shows the results in a dialog box. A running example is given in this paper.
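The abstract does not say which calculation method the program applies to its three matrices, so the following is only a sketch of the standard worst-case (extreme value) calculation of a closing link. It assumes each matrix is a single row stored as a Python list, and it adds a hypothetical `direction` vector (+1 for an increasing link, -1 for a decreasing link) that the abstract does not mention. The function name `closing_link` and the sample numbers are likewise invented for illustration; no AutoCAD or ObjectARX calls are involved.

```python
def closing_link(nominal, upper, lower, direction):
    """Worst-case closing link of a linear dimension chain.

    nominal:   dimension text values of the component links
    upper:     upper tolerance deviations of the component links
    lower:     lower tolerance deviations of the component links
    direction: +1 for an increasing link, -1 for a decreasing link
               (hypothetical helper input, not from the source)
    Returns (nominal, upper deviation, lower deviation) of the closing link.
    """
    links = list(zip(direction, nominal, upper, lower))
    # Nominal size: increasing links add, decreasing links subtract.
    nom = sum(d * n for d, n, _, _ in links)
    # Upper deviation: upper of increasing links minus lower of decreasing links.
    up = sum(u if d > 0 else -l for d, _, u, l in links)
    # Lower deviation: lower of increasing links minus upper of decreasing links.
    lo = sum(l if d > 0 else -u for d, _, u, l in links)
    return nom, up, lo


# Invented sample chain: one increasing link and two decreasing links (mm).
nominal = [50.0, 20.0, 25.0]
upper = [0.10, 0.05, 0.00]
lower = [-0.10, 0.00, -0.05]
direction = [+1, -1, -1]
print(closing_link(nominal, upper, lower, direction))
```

For this sample chain the closing link has a nominal size of 5 mm with deviations of about +0.15 mm and -0.15 mm. A statistical (root-sum-square) method would give a tighter band; the worst-case form is used here only because it is the simplest to check by hand.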
To construct a highly efficient text clustering algorithm, the multilevel graph model and the refinement algorithm used in the uncoarsening phase are discussed. The model is applied to text clustering. The performance of the clustering algorithm can be improved by applying the refinement algorithm. The experimental results demonstrate that the multilevel graph text clustering algorithm is feasible. Key words: text clustering; multilevel coarsen graph model; refinement algorithm; high-dimensional clustering. CLC number: TP301. Foundation item: Supported by the National Natural Science Foundation of China (60173051). Biography: CHEN Jian-bin (1970-), male, Associate professor, Ph.D., research direction: data mining.
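The abstract names a multilevel model with coarsening and a refinement step in the uncoarsening phase, but it does not state how vertices are merged. As a rough illustration of a single coarsening level, the sketch below assumes heavy-edge matching, a rule commonly used in multilevel graph partitioning, applied to a document similarity graph. The function name `coarsen`, the adjacency-dictionary representation, and the sample weights are assumptions; the refinement algorithm itself is not sketched.

```python
def coarsen(adj):
    """One coarsening level via heavy-edge matching (assumed rule).

    adj: symmetric adjacency dict {node: {neighbour: weight}}.
    Returns (mapping, coarse_adj), where mapping sends each original
    node to the id of the coarse node that absorbs it.
    """
    mapping = {}
    next_id = 0
    for u in sorted(adj):
        if u in mapping:
            continue  # already merged into an earlier coarse node
        # Unmatched neighbours of u, as (weight, node) pairs.
        candidates = [(w, v) for v, w in adj[u].items()
                      if v not in mapping and v != u]
        mapping[u] = next_id
        if candidates:
            _, v = max(candidates)  # neighbour joined by the heaviest edge
            mapping[v] = next_id
        next_id += 1

    # Build the coarse graph: parallel edges are merged by summing weights,
    # and edges inside a coarse node are dropped.
    coarse = {c: {} for c in range(next_id)}
    for u, neighbours in adj.items():
        for v, w in neighbours.items():
            cu, cv = mapping[u], mapping[v]
            if cu != cv:
                coarse[cu][cv] = coarse[cu].get(cv, 0.0) + w
    return mapping, coarse


# Invented similarity graph over four documents.
adj = {
    "a": {"b": 0.9, "c": 0.1},
    "b": {"a": 0.9, "c": 0.2},
    "c": {"a": 0.1, "b": 0.2, "d": 0.8},
    "d": {"c": 0.8},
}
mapping, coarse = coarsen(adj)
print(mapping, coarse)
```

Here documents "a" and "b" collapse into one coarse vertex and "c" and "d" into another, leaving a single coarse edge whose weight is the sum of the cross edges. In a full multilevel scheme this step would be repeated until the graph is small, the small graph clustered, and the result projected back level by level with refinement at each level.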