Abstract
The rapid growth of the Internet has driven an equally rapid growth of information, most of which is stored as text. Text is characterized by being unstructured or, at best, having very limited structure. Text similarity is both a key topic and a difficult problem in text mining research. Deriving text similarity information from text features is the main research direction of this paper. We use a PHP + MySQL development environment to simulate the text similarity computation. The computation uses two methods based on the vector inner product: cosine similarity and Jaccard similarity. In the experiments, we operate on text features to determine whether texts are similar. In addition, we implement a method that converts each text into a simple set of strings and compares these sets to decide whether the texts are similar.
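The two inner-product measures named in the abstract, together with the string-set comparison, can be sketched as follows. This is only an illustrative sketch in Python, not the authors' PHP + MySQL implementation. The abstract does not give the formulas, so the sketch assumes whitespace-tokenized term-frequency vectors as the text features and the extended (Tanimoto) form of the Jaccard coefficient for vectors. All function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector (assumed feature extraction: lowercase, split on whitespace)."""
    return Counter(text.lower().split())


def dot(a, b):
    """Inner product of two sparse term-frequency vectors."""
    return sum(a[t] * b[t] for t in a.keys() & b.keys())


def cosine_similarity(a, b):
    """Cosine similarity: a.b / (|a| * |b|); 0.0 if either vector is empty."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def extended_jaccard(a, b):
    """Extended (Tanimoto) Jaccard: a.b / (|a|^2 + |b|^2 - a.b); 0.0 if both are empty."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0


def set_jaccard(a, b):
    """Set-based Jaccard on the distinct strings of each text: |A & B| / |A | B|."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


if __name__ == "__main__":
    v1 = term_vector("text mining finds similar text")
    v2 = term_vector("text mining finds patterns")
    print(cosine_similarity(v1, v2))
    print(extended_jaccard(v1, v2))
    print(set_jaccard(v1, v2))
```

The vector measures weight repeated terms, while the set-based variant only asks which strings occur at all. This loosely mirrors the distinction the abstract draws between feature-based comparison and comparison of simple string sets.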
Source
Journal of North China Institute of Science and Technology (《华北科技学院学报》)
2013, Issue 1, pp. 91-95 (5 pages)
Keywords
text mining
text similarity
text features
web content mining
web recommendation of classification
text classification