Funding: Supported by the National Natural Science Foundation of China under Grant No. 12201431 and the Young Teacher Foundation of Capital University of Economics and Business under Grant Nos. XRZ2022-070 and 00592254413070.
Abstract: The dissemination of news is a vital topic in management science, social science and data science. With the development of technology, the sample sizes and dimensions of digital news data have increased remarkably. To alleviate the computational burden of big data, this paper proposes a method for handling massive, moderate-dimensional data in linear regression models by combining model averaging and subsampling. The author first draws a subsample from the full data according to certain sampling probabilities and splits the covariates into several groups to construct candidate models. The author then fits each candidate model and calculates model-averaging weights to combine the resulting estimators, all based on this subsample. Additionally, asymptotic optimality in subsampling form is proved, and a way to calculate the optimal subsampling probabilities is provided. Simulations show that the proposed method takes less running time than estimation on the full data and yields more accurate estimates than uniform subsampling. Finally, the author applies the proposed method to analyze and predict the number of shares of news articles and finds that topic, vocabulary and dissemination time are the determinants.
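The pipeline summarized in this abstract (subsample, build candidate models from covariate groups, fit each, then weight them) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's procedure: the sampling probabilities are taken proportional to row norms rather than the optimal probabilities the paper derives, the candidate models are assumed to be nested unions of the covariate groups, and the weights are chosen by a Mallows-type criterion over a coarse grid as a stand-in, since the abstract does not state which criterion is used. All names and the synthetic data are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Synthetic full data (illustrative only).
N, d, r = 20_000, 6, 500
X = rng.normal(size=(N, d))
beta_true = np.array([1.0, 0.5, 0.0, 0.0, -0.8, 0.0])
y = X @ beta_true + rng.normal(size=N)

# Step 1: draw a subsample of size r with replacement.
# Assumption: probabilities proportional to row norms, NOT the paper's optimal ones.
probs = np.linalg.norm(X, axis=1)
probs /= probs.sum()
idx = rng.choice(N, size=r, replace=True, p=probs)

# Inverse-probability weighting, applied by rescaling rows by sqrt(weight).
sw = 1.0 / np.sqrt(N * probs[idx])
Xw, yw = X[idx] * sw[:, None], y[idx] * sw

# Step 2: split covariates into groups; candidate k uses the first k groups
# (nested candidates are an assumption made for this sketch).
groups = [[0, 1], [2, 3], [4, 5]]
candidates = [sum(groups[:k + 1], []) for k in range(len(groups))]

# Step 3: fit each candidate by weighted least squares on the subsample.
preds, sizes = [], []
for cols in candidates:
    coef, *_ = np.linalg.lstsq(Xw[:, cols], yw, rcond=None)
    preds.append(Xw[:, cols] @ coef)
    sizes.append(len(cols))
P = np.column_stack(preds)
sizes = np.array(sizes)

# Step 4: pick weights on the simplex with a Mallows-type criterion (stand-in).
sigma2 = np.sum((yw - P[:, -1]) ** 2) / (r - sizes[-1])

def criterion(w):
    return np.sum((yw - P @ w) ** 2) + 2.0 * sigma2 * (sizes @ w)

grid = np.linspace(0.0, 1.0, 21)
best_w = min(
    (np.array([a, b, max(0.0, 1.0 - a - b)])
     for a, b in itertools.product(grid, grid) if a + b <= 1.0 + 1e-12),
    key=criterion,
)
print("model-averaging weights:", best_w)
```

The grid search keeps the sketch dependency-free; with more candidate models a quadratic-programming solver would be the more natural choice.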
Funding: Wang Haiying's research was partially supported by the National Science Foundation under Grant No. CCF 2105571.
Abstract: Softmax regression, also called multinomial logistic regression, is widely used in various fields for modeling the relationship between covariates and categorical responses with multiple levels. Increasing data volumes bring new challenges for parameter estimation in softmax regression, and optimal subsampling is an effective way to address them. However, optimal subsampling with replacement requires access to all the sampling probabilities simultaneously in order to draw a subsample, and the resulting subsample may contain duplicate observations. In this paper, the authors consider Poisson subsampling for its higher estimation accuracy and its applicability when the data exceed the memory limit. The authors derive the asymptotic properties of the general Poisson subsampling estimator and obtain optimal subsampling probabilities by minimizing the asymptotic variance-covariance matrix under both the A- and L-optimality criteria. Because the optimal subsampling probabilities contain unknown quantities from the full dataset, the authors suggest an approximately optimal Poisson subsampling algorithm that consists of two sampling steps, with the first step serving as a pilot phase. The authors demonstrate the performance of the proposed optimal Poisson subsampling algorithm through numerical simulations and real data examples.
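The contrast drawn here between sampling with replacement and Poisson subsampling can be made concrete with a small sketch. In Poisson subsampling each observation is kept or dropped by its own independent coin flip, so the decision for one row needs only that row's probability and no row can appear twice. The sketch below assumes hypothetical positive scores in place of the optimal probabilities the paper derives, and the function name and sizes are made up for illustration.

```python
import numpy as np

def poisson_subsample(inclusion_probs, rng):
    """Keep observation i independently with probability inclusion_probs[i].

    Returns the kept indices and their inverse-probability weights.
    """
    p = np.clip(np.asarray(inclusion_probs, dtype=float), 0.0, 1.0)
    keep = rng.random(p.shape[0]) < p      # one independent draw per row
    idx = np.flatnonzero(keep)
    weights = 1.0 / p[idx]                 # kept rows always have p > 0
    return idx, weights

rng = np.random.default_rng(0)
n_full, target = 100_000, 1_000

# Hypothetical positive scores standing in for the paper's optimal
# probabilities, rescaled so the expected subsample size is `target`.
scores = rng.random(n_full) + 0.1
probs = np.minimum(1.0, target * scores / scores.sum())

idx, w = poisson_subsample(probs, rng)
print("subsample size:", idx.size)
```

The subsample size is random and only equals the target in expectation. The inverse-probability weights would be passed to a weighted estimator, such as a weighted softmax-regression likelihood, so that the subsample estimate still targets the full-data estimate.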