Abstract
XML documents are well-formed and clearly hierarchical, so their structure is easy to manipulate and analyze. Building on this, the paper first converts HTML documents from the Web into XML documents and then uses the DOM tree in Java to analyze each document's hierarchy. Each document is divided into hierarchical text segments, and the traditional vector space model (VSM) algorithm is improved by converting every text segment into a vector of index terms, which yields an N-level VSM algorithm. Experiments show that the improved algorithm achieves better recall and precision than the traditional VSM algorithm.
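The abstract does not give the details of the N-level algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: the input has already been converted from HTML to well-formed XML, each DOM depth is treated as one "level", terms are split with a naive ASCII tokenizer, and the per-level cosine scores are combined with a weighted sum. The class name `LevelVsmSketch`, the method names, and the weights are all hypothetical and are not taken from the paper.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration; not the authors' implementation.
class LevelVsmSketch {

    /** Parses well-formed XML and returns one term-frequency vector per DOM depth. */
    static List<Map<String, Integer>> levelVectors(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Map<String, Integer>> levels = new ArrayList<>();
            collect(doc.getDocumentElement(), 0, levels);
            return levels;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not parseable XML", e);
        }
    }

    // Walks the DOM tree; text directly under an element at depth d counts toward level d.
    private static void collect(Node element, int depth, List<Map<String, Integer>> levels) {
        while (levels.size() <= depth) {
            levels.add(new HashMap<>());
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                // Assumption: naive ASCII tokenization; a real system would use a proper tokenizer.
                for (String term : child.getNodeValue().toLowerCase().split("\\W+")) {
                    if (!term.isEmpty()) {
                        levels.get(depth).merge(term, 1, Integer::sum);
                    }
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect(child, depth + 1, levels);
            }
        }
    }

    /** Cosine similarity between two sparse term-frequency vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Assumption: levels are combined as a weighted sum of per-level cosine scores. */
    static double similarity(List<Map<String, Integer>> doc, Map<String, Integer> query,
                             double[] weights) {
        double score = 0;
        for (int level = 0; level < doc.size() && level < weights.length; level++) {
            score += weights[level] * cosine(doc.get(level), query);
        }
        return score;
    }

    public static void main(String[] args) {
        String xml = "<doc><title>xml retrieval</title><body><p>vector space model</p></body></doc>";
        Map<String, Integer> query = new HashMap<>();
        query.put("xml", 1);
        double[] weights = {0.0, 1.0, 0.5}; // hypothetical weights, one per depth level
        System.out.println(similarity(levelVectors(xml), query, weights));
    }
}
```

In this sketch a query term that appears in a shallow element such as a title can be weighted differently from the same term deep in the body, which is one plausible way a level-aware model could outperform a flat VSM. The actual segmentation and weighting scheme used in the paper may differ.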
Source
《计算机技术与发展》
2006, No. 5, pp. 56-58 (3 pages)
Computer Technology and Development