Abstract
To address the limitations of the vector space model (VSM) in measuring document similarity, this paper proposes a document similarity algorithm based on computing common substrings. The common-substring algorithm is modified to improve its space efficiency. Students' graduation-design theses are stored as XML, and a document object tree is generated through the DOM API provided by Java. The nodes of the tree are visited by depth-first search and compared to find the duplicated text that appears in the thesis documents. Combined with the structural similarity of the documents, this allows document similarity to be computed effectively.
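The pipeline the abstract describes (parse XML with the Java DOM API, walk the tree depth-first, compare node text by common substrings) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name DocSimilaritySketch, the method names, and the overlap score are hypothetical, and the two-row dynamic-programming table is a standard space-saving technique that may differ from the improvement the authors made. The structural-similarity component is omitted.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper class; all names here are illustrative, not from the paper.
class DocSimilaritySketch {

    // Length of the longest common substring of a and b.
    // Only two rows of the DP table are kept, so memory is O(|b|)
    // rather than O(|a| * |b|). (Assumed technique, not the paper's.)
    static int longestCommonSubstring(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                } else {
                    curr[j] = 0;
                }
            }
            int[] tmp = prev; // swap rows; every used cell is overwritten next pass
            prev = curr;
            curr = tmp;
        }
        return best;
    }

    // Depth-first traversal that collects the non-empty text of each text node.
    static void collectText(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }

    // Builds a DOM tree from an XML string using the standard JAXP parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Assumed score: for each text node of A, take the longest substring it
    // shares with any text node of B, then divide the sum by A's text length.
    static double textOverlap(String xmlA, String xmlB) {
        List<String> a = new ArrayList<>();
        List<String> b = new ArrayList<>();
        collectText(parse(xmlA).getDocumentElement(), a);
        collectText(parse(xmlB).getDocumentElement(), b);
        int matched = 0;
        int total = 0;
        for (String s : a) {
            total += s.length();
            int best = 0;
            for (String t : b) {
                best = Math.max(best, longestCommonSubstring(s, t));
            }
            matched += best;
        }
        return total == 0 ? 0.0 : (double) matched / total;
    }
}
```

Calling DocSimilaritySketch.textOverlap(xmlA, xmlB) on two thesis files would give a rough text-overlap ratio between 0 and 1. A fuller version would also weigh how closely the two element trees match, as the abstract indicates.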
Source
《淮海工学院学报(自然科学版)》
CAS
2007, Issue 3, pp. 28-31 (4 pages)
Journal of Huaihai Institute of Technology: Natural Sciences Edition
Funding
Jiangsu Province Modern Educational Technology Research Project (2004-METR-8)