Mining Unstructured Economic Indicators Based on PSP_HDP Topic Model


Abstract: As economic activity data continue to grow, a large number of financial texts have emerged on the Internet, and these texts contain factors that influence economic development. Mining these economic factors effectively from financial texts is the key to using unstructured data in economic research. To address the limitations of manually constructed unstructured economic indicators and the problems that topic models face when mining such indicators, this work draws on existing economic taxonomies, the semantic relations between words, and the representativeness of words for topics. It defines three quantities: the domain membership of a document, the semantic relevance between a word and a topic, and the contribution of a word to a topic. In the CRF (Chinese restaurant franchise) metaphor, these describe, respectively, the dish style of a restaurant, how closely customers agree in their requirements for dishes, and how loyal a customer is to a dish. By combining documents' domain properties, word semantics, and words' presence in topics with the HDP topic model, a new model is proposed: the PSP_HDP topic model (combining documents' domain properties, word semantics and words' presences in topics with HDP). Because PSP_HDP improves the document-topic and topic-word allocation processes, it makes economic topics more distinctive and easier to identify, and so mines economic topics and economic factor words more effectively. Experimental results show that PSP_HDP outperforms the HDP topic model overall on topic diversity, content perplexity, and model complexity, and that it yields more distinctive and more recognizable results in unstructured economic indicator mining and economic factor word extraction.
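The abstract describes the model through the Chinese restaurant metaphor without giving formulas. As a rough illustration only, the sketch below computes the standard Chinese restaurant process seating probabilities, where a customer joins an existing table in proportion to its occupancy and opens a new table in proportion to a concentration parameter `alpha`. The optional `weights` argument is a hypothetical stand-in for a per-table score such as semantic relevance; it is an assumption made for this example and is not the paper's actual allocation rule. The function name and parameters are likewise illustrative.

```python
from typing import List, Optional, Sequence


def seating_probabilities(
    table_counts: Sequence[int],
    alpha: float,
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return seating probabilities for each existing table, then a new table.

    table_counts: number of customers already seated at each table.
    alpha: concentration parameter controlling how often a new table opens.
    weights: hypothetical per-table multipliers (assumption, not from the paper).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if weights is None:
        # With unit weights this is the plain Chinese restaurant process.
        weights = [1.0] * len(table_counts)
    if len(weights) != len(table_counts):
        raise ValueError("weights and table_counts must have the same length")

    # Existing tables score by (weighted) occupancy; the new table scores alpha.
    scores = [count * weight for count, weight in zip(table_counts, weights)]
    scores.append(alpha)

    total = sum(scores)
    return [score / total for score in scores]


if __name__ == "__main__":
    # Two tables with 3 and 1 customers, alpha = 1.
    print(seating_probabilities([3, 1], alpha=1.0))
    # Same tables, but the second table is up-weighted by a hypothetical score.
    print(seating_probabilities([3, 1], alpha=1.0, weights=[1.0, 3.0]))
```

With unit weights the larger table attracts new customers more strongly, which is the rich-get-richer behavior of the unmodified process. A weight above 1 shifts probability toward a table regardless of its size, which is one simple way a relevance score could influence allocation.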

Authors: ZHANG Yi-Tao (张奕韬); WAN Chang-Xuan (万常选); LIU Xi-Ping (刘喜平); JIANG Teng-Jiao (江腾蛟); LIU De-Xi (刘德喜); LIAO Guo-Qiong (廖国琼) (School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330013, China; School of Software, East China Jiaotong University, Nanchang 330013, China; Jiangxi Key Laboratory of Data and Knowledge Engineering (Jiangxi University of Finance and Economics), Nanchang 330013, China)

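Affiliations: School of Information Management, Jiangxi University of Finance and Economics; School of Software, East China Jiaotong University; Jiangxi Key Laboratory of Data and Knowledge Engineering (Jiangxi University of Finance and Economics)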

Source: Journal of Software (软件学报), 2020, Issue 3, pp. 845-865 (21 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.

Funding: National Natural Science Foundation of China (61972184, 61562032, 61662027, 61762042); Natural Science Foundation of Jiangxi Province (20152ACB20003).

Keywords: HDP topic model; economic taxonomy; semantic relevance; unstructured economic indicator; economic factor

Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
