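XML document clustering plays an important role in many data application domains. The feature-preference-based XML document clustering algorithm performs feature selection on XML documents, represents each document as an n-dimensional feature vector, and then applies the CFP (Clustering with Feature order Preference) algorithm, which assigns weights to features according to feature preference and updates those weights in every clustering iteration. Experimental results show that combining the feature-preference weights of CFP with the level weights used when vectorizing XML documents compensates for the shortcomings of XML document vectorization and improves the accuracy of XML document clustering.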
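To make the vectorization step in the abstract above concrete, the following is a minimal sketch of one way an XML document could be turned into a level-weighted feature vector. The abstract does not give the actual features or weighting formula, so the choice of root-to-node tag paths as features, the `decay` parameter, and the function names `path_features` and `to_vector` are all assumptions made for illustration. The CFP weight update itself is not shown.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_features(xml_text, decay=0.5):
    """Map each root-to-node tag path to a level-weighted count.

    Assumption (not from the abstract): a node at depth d contributes
    decay ** d, so shallow structure counts more than deep structure.
    """
    root = ET.fromstring(xml_text)
    weights = Counter()
    stack = [(root, root.tag, 0)]  # (node, tag path, depth)
    while stack:
        node, path, depth = stack.pop()
        weights[path] += decay ** depth
        for child in node:
            stack.append((child, path + "/" + child.tag, depth + 1))
    return weights


def to_vector(features, vocabulary):
    """Project the path weights onto a fixed, ordered vocabulary of paths."""
    return [features.get(path, 0.0) for path in vocabulary]


doc = "<a><b/><b/><c><d/></c></a>"
features = path_features(doc)
vocabulary = sorted(features)          # ['a', 'a/b', 'a/c', 'a/c/d']
print(to_vector(features, vocabulary))  # [1.0, 1.0, 0.5, 0.25]
```

Vectors built over a shared vocabulary can then be passed to any vector-based clustering method. Under this reading, a feature-preference scheme such as CFP would rescale the dimensions during iteration.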
The eXtensible Markup Language (XML) is a relatively new metalanguage intended to replace HTML, and it has many advantages. Traditional engineering documents take too many forms of expression to be managed conveniently, and they offer no dynamic correlation functions. This paper introduces a new method that uses XML to store and manage engineering documents, giving them a unified format and dynamic correlations.
Traditional information retrieval systems respond to user queries with ranked lists of relevant documents. Because XML (Extensible Markup Language) documents separate content from structure, XML-IR (information retrieval) systems can retrieve only the relevant portions of documents. Users of an XML-IR system can therefore potentially receive highly relevant and precise material. We have developed an XML information retrieval system using MySQL and Sphinx, which we call MEXIR. In our system, XML documents are stored in a single table with a fixed relational schema that is independent of the logical structure of the documents. Each node in an XML document is represented by a label that expresses its position in the XML tree, following the ADXPI scheme. In performance experiments on the INEX collections, our system performed up to four seconds better on average than GPX. In addition, it reduced the size of the data by 82.29% compared with the GPX system.
Most efficient indexes and query techniques over XML (extensible markup language) data are based on a labeling scheme that can quickly determine the ancestor-descendant and parent-child relationships between two nodes. The basic labeling schemes in current use, such as the containment scheme and the prefix scheme, cannot avoid relabeling when XML documents are updated. After analyzing the essence of existing dynamic XML labels such as compact dynamic binary string (CDBS) and vector encoding, this paper gives a common unifying framework for the numeric-based generalized dynamic label, which can be instantiated as a variety of dynamic labels according to different user-defined value comparison methods. This paper also proposes a novel dynamic labeling scheme called the radical sign label. Extensive experiments show that the radical sign label performs well for initialization, insertion, and query operations, and especially for skewed insertion, where its storage cost is lower than that of earlier methods.
Funding: the National Major Projects on Science and Technology (No. 2010ZX01042-002-003-004); the National Basic Research Program (973) of China (No. 2010CB328106); the National Natural Science Foundation of China (No. 61170085); the Program for New Century Excellent Talents in China (No. NCET-10-0388); the Shanghai Leading Academic Discipline Project (No. B412).
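The labeling-scheme abstract refers to the containment and prefix schemes without showing how they decide node relationships. The sketch below illustrates the usual textbook form of both checks. The tuple layouts (`(start, end, level)` for containment labels, a tuple of sibling positions for prefix labels) and all function names are assumptions chosen for illustration, not the paper's definitions. The radical sign label itself is not reproduced here.

```python
def is_ancestor_prefix(a, b):
    """Prefix labels: a is an ancestor of b if a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


def is_parent_prefix(a, b):
    """Prefix labels: a is the parent of b if b extends a by one component."""
    return len(b) == len(a) + 1 and b[:len(a)] == a


def is_ancestor_containment(a, b):
    """Containment labels (start, end, level): a's interval encloses b's."""
    return a[0] < b[0] and b[1] < a[1]


def is_parent_containment(a, b):
    """Containment labels: an enclosing interval exactly one level up."""
    return is_ancestor_containment(a, b) and b[2] == a[2] + 1


# A three-node chain root -> child -> grandchild under each scheme.
root_p, child_p, grandchild_p = (1,), (1, 2), (1, 2, 1)
root_c, child_c, grandchild_c = (1, 10, 0), (2, 5, 1), (3, 4, 2)

print(is_ancestor_prefix(root_p, grandchild_p))       # True
print(is_parent_prefix(root_p, grandchild_p))         # False
print(is_ancestor_containment(root_c, grandchild_c))  # True
print(is_parent_containment(child_c, grandchild_c))   # True
```

With consecutive integers as in this example, there is no free value between `start=3` and `end=4`, or between sibling positions 1 and 2. Inserting a node there forces existing labels to change, which is the relabeling problem that dynamic labeling schemes aim to avoid.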