An Approach to Constructing Vector Space Models from Proximity Data Alone
Cited by: 1
Abstract: In real-world applications, many research objects are abstract and hard to represent as feature vectors, so many mature data mining and machine learning methods cannot be applied directly. It is usually possible, however, to convert them into a proximity data matrix whose entries express some "comparison" relationship between pairs of objects. To address this problem, this study uses multidimensional scaling (MDS) to derive a vector representation of the objects from the proximity data matrix alone, that is, to construct a vector space model. Words from the Chinese Scientific & Technical Vocabulary System are then clustered. The results show that clustering after constructing the vector space model performs clearly better than clustering directly on the proximity data, which supports the feasibility and effectiveness of the approach.
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), CSSCI, PKU Core, 2011, No. 11, pp. 1163-1170 (8 pages)
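Funding: This research was supported by the National Key Technology R&D Program of the 11th Five-Year Plan, "Research and Implementation of the Integration and Services of Knowledge Organization Systems" (2006BAH03803), and by the Institute of Scientific and Technical Information of China key project "Construction and Application of the Chinese Scientific & Technical Vocabulary System: Refinement for the New Energy Vehicle Domain and Domain Extension" (2008KP01-3-1).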
Keywords: multidimensional scaling; proximity data; vector space model; Chinese Scientific & Technical Vocabulary System; clustering analysis
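
The abstract describes turning a proximity matrix into feature vectors with MDS. The record does not say which MDS variant the authors use, so the following is only a minimal sketch that assumes classical (Torgerson) MDS on a symmetric distance matrix. The function name `classical_mds` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Embed objects as vectors from a symmetric distance matrix (classical MDS).

    Illustrative only: the paper's exact MDS variant is not stated in the abstract.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances: B = -1/2 * J * D^2 * J
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # B is symmetric, so eigh applies; it returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip negative eigenvalues, which appear when the proximities are not Euclidean
    lam = np.clip(eigvals[order], 0.0, None)
    # Coordinates are the leading eigenvectors scaled by sqrt(eigenvalue)
    return eigvecs[:, order] * np.sqrt(lam)

# Toy data (hypothetical): four points on a 3-by-4 rectangle
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(dist, n_components=2)
# Pairwise distances between the embedded vectors
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

For exactly Euclidean input such as this toy matrix, the embedded vectors reproduce the original pairwise distances up to rotation and reflection. The resulting rows of `coords` can then be passed to any vector-based clustering method, which is the two-step pipeline the abstract compares against clustering on the proximity data directly.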