A Novel Approach for Co-occurrence Clustering Analysis: Maximal Frequent Itemset Mining

Cited by: 22

Abstract: For the literature of a given field, the more frequently two research objects co-occur, the more likely it is usually assumed that the two are related. This assumption has driven the popularity of co-occurrence analysis methods such as co-word analysis, co-citation analysis, and co-authorship analysis. Traditional co-occurrence analysis typically consists of three stages. However, the first two stages have certain shortcomings, so the resulting co-occurrence clustering may be misleading. To overcome these shortcomings, this paper introduces a new method for co-occurrence clustering analysis from the field of association rule mining: maximal frequent itemset mining. The approach compresses the three stages of traditional co-occurrence analysis into one, which greatly simplifies the process. One of its most appealing characteristics is that it makes full use of all available information, which overcomes the problems of the traditional methods. Experimental results show that, with a properly chosen minimum support threshold, largely satisfactory clustering results can be obtained.
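To make the central idea concrete, the following is a minimal sketch of maximal frequent itemset mining applied to keyword sets. It is not the authors' implementation: the abstract does not specify their algorithm, so a simple level-wise (Apriori-style) search is assumed here, and the function names, the sample documents, and the absolute support count of 2 are all hypothetical. Each document is treated as a transaction of keywords, and an itemset is maximal if it is frequent and no frequent proper superset exists.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search.

    Returns a dict mapping each frequent itemset (frozenset) to its
    support count. min_support is an absolute count (an assumption of
    this sketch, not something stated in the abstract).
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates.
        keys = list(level)
        candidates = list(
            {a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1}
        )
    return frequent


def maximal_frequent_itemsets(transactions, min_support):
    """Keep only frequent itemsets that have no frequent proper superset."""
    frequent = frequent_itemsets(transactions, min_support)
    return {
        s: n
        for s, n in frequent.items()
        if not any(s < other for other in frequent)
    }


# Hypothetical toy corpus: each set holds the keywords of one document.
docs = [
    {"ontology", "semantic web", "thesaurus"},
    {"ontology", "semantic web", "thesaurus"},
    {"ontology", "semantic web"},
    {"clustering", "co-word analysis"},
    {"clustering", "co-word analysis", "thesaurus"},
]

clusters = maximal_frequent_itemsets(docs, min_support=2)
for itemset, support in clusters.items():
    print(sorted(itemset), support)
```

In this toy corpus the keyword "thesaurus" is frequent on its own, but it is absorbed into the larger group with "ontology" and "semantic web", so each maximal itemset can be read directly as one co-occurrence cluster. Raising `min_support` yields fewer and smaller clusters, which is why the choice of threshold matters.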

Authors: Xu Shuo (徐硕), Qiao Xiaodong (乔晓东), Zhu Lijun (朱礼军), Zhang Yunliang (张运良), Xue Chunxiang (薛春香)

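Affiliations: Institute of Scientific and Technical Information of China; School of Economics and Management, Nanjing University of Science and Technology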

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), indexed in CSSCI and the Peking University Core Journals list, 2012, No. 2, pp. 143-150 (8 pages)

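Funding: This research was supported by the National Key Technology R&D Program of the 12th Five-Year Plan, project "Research on Key Technologies for Large-Scale Semantic Computing Oriented to Foreign-Language Scientific and Technical Knowledge Organization Systems" (2011BAH10804); the Institute of Scientific and Technical Information of China pre-research project "Deep Domain Topic Monitoring in Scientific and Technical Literature and Revealing the Laws of Topic Evolution" (YY-201129); the Jiangsu Social Science Fund project "Research on Automatic Indexing of Digital Newspapers" (09TQC011); and the Ministry of Education Humanities and Social Sciences research project "Research on Deep Processing of Electronic Newspaper Content" (09YJC870014).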

Keywords: co-occurrence analysis, co-word analysis, clustering analysis, maximal frequent itemset, hierarchical clustering

Classification number: G254 [Cultural Sciences: Library Science]

