With the number of social media users ramping up, microblogs are generated and shared at record levels. The high velocity and large volume of short texts bring redundancy and noise, which make it difficult for users and analysts to elicit useful information of interest. In this paper, we study query-focused summarization as a solution to this issue and propose a novel summarization framework that generates personalized online summaries and historical summaries of arbitrary time durations. Our framework can deal with dynamic, perpetual, and large-scale microblogging streams. Specifically, we propose an online microblogging stream clustering algorithm to cluster microblogs and maintain distilled statistics called Microblog Cluster Vectors (MCVs). We then develop a ranking method to extract the most representative sentences relative to the query from the MCVs and generate a query-focused summary of arbitrary time durations. Our experiments on large-scale real microblogs demonstrate the efficiency and effectiveness of our approach.
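The online clustering step in the abstract above can be illustrated with a minimal single-pass sketch. The abstract does not define the contents of a Microblog Cluster Vector, so the `ClusterSummary` class, the cosine-similarity assignment rule, and the `threshold` parameter below are all assumptions chosen for illustration, not the authors' method.

```python
import math
from collections import Counter


class ClusterSummary:
    """Hypothetical stand-in for an MCV: keeps only additive statistics."""

    def __init__(self):
        self.term_sum = Counter()  # summed term frequencies of member posts
        self.count = 0             # number of posts absorbed so far
        self.last_seen = None      # timestamp of the most recent post

    def add(self, terms, timestamp):
        self.term_sum.update(terms)
        self.count += 1
        self.last_seen = timestamp


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency mappings."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_stream(posts, threshold=0.3):
    """Assign each (timestamp, text) post to its most similar cluster,
    or open a new cluster when no similarity reaches the threshold."""
    clusters = []
    for timestamp, text in posts:
        terms = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(terms, cluster.term_sum)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None or best_sim < threshold:
            best = ClusterSummary()
            clusters.append(best)
        best.add(terms, timestamp)
    return clusters
```

Because each summary stores only sums and counts, a post can be absorbed in one pass without revisiting earlier posts, which is the property a stream setting needs.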
The measurement of influence in social networks has received a lot of attention in the data mining community. Influence maximization refers to the process of finding the influential users who maximize information or product adoption. In real settings, the influence of a user in a social network can be modeled by the set of actions (e.g., "like", "share", "retweet", "comment") performed by other users of the network on his/her publications. To the best of our knowledge, all models proposed in the literature treat these actions equally. However, a "like" of a publication clearly signals less influence than a "share" of the same publication. This suggests that each action has its own level of influence (or importance). In this paper, we propose a model called the Social Action-Based Influence Maximization Model (SAIM) for influence maximization in social networks. SAIM does not weight actions equally when measuring the "influence power" of an individual, and it consists of two major steps. In the first step, we compute the influence power of each individual in the social network. This influence power is computed from user actions using PageRank. At the end of this step, we get a weighted social network in which each node is labeled with its influence power. In the second step of SAIM, we compute an optimal set of influential nodes using a new concept named the "influence-BFS tree". Experiments conducted on large-scale real-world and synthetic social networks show that SAIM computes, in acceptable time, a minimal set of influential nodes that allows the maximum spreading of information.
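The first step described in the abstract above, computing an influence power from weighted user actions with PageRank, can be sketched as follows. The abstract gives neither the action weights nor the exact PageRank formulation, so the values in `ACTION_WEIGHTS`, the edge direction (from the acting user to the author), and the uniform handling of users who act on nobody are assumptions for illustration only. The second step (the influence-BFS tree) is not sketched because the abstract does not specify it.

```python
from collections import defaultdict

# Hypothetical weights: the abstract only states that actions differ in importance.
ACTION_WEIGHTS = {"like": 1.0, "comment": 2.0, "retweet": 3.0, "share": 3.0}


def influence_power(actions, damping=0.85, iterations=100):
    """Weighted PageRank over an action graph.

    actions: iterable of (actor, author, action) triples, meaning that
    `actor` performed `action` on a publication by `author`.
    Returns a dict mapping each user to a score; scores sum to 1.
    """
    weights = defaultdict(lambda: defaultdict(float))
    users = set()
    for actor, author, action in actions:
        users.update((actor, author))
        if actor != author:  # ignore actions on one's own publications
            weights[actor][author] += ACTION_WEIGHTS[action]

    n = len(users)
    power = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new = {u: (1.0 - damping) / n for u in users}
        for actor in users:
            targets = weights.get(actor)
            if targets:
                total = sum(targets.values())
                for author, w in targets.items():
                    # An actor passes score to authors in proportion to action weight.
                    new[author] += damping * power[actor] * w / total
            else:
                # Users with no outgoing actions spread their score uniformly.
                for u in users:
                    new[u] += damping * power[actor] / n
        power = new
    return power
```

With these assumed weights, a user whose publication is shared ends up with a higher score than a user whose publication is only liked by the same person.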
Link-based similarity measures play a significant role in many graph-based applications. Consequently, measuring node similarity in a graph is a fundamental problem of graph data mining. Personalized PageRank (PPR) and SimRank (SR) have emerged as the most popular and influential link-based similarity measures. Recently, a novel link-based similarity measure, penetrating rank (P-Rank), which enriches SR, was proposed. In practice, PPR, SR, and P-Rank scores are calculated by iterative methods. As the number of iterations increases, so does the overhead of the calculation. Ideally, similarity would be computed with the minimum number of iterations sufficient to guarantee a desired accuracy. However, the existing upper bounds are too coarse to be useful in general. Therefore, we focus on designing accurate and tight upper bounds for PPR, SR, and P-Rank in this paper. Our upper bounds are based on the following intuition: the smaller the difference between two consecutive iteration steps, the smaller the difference between the theoretical and iterative similarity scores. Furthermore, we demonstrate the effectiveness of our upper bounds in the scenario of top-k similar-node queries, where our upper bounds help accelerate the query. We also run a comprehensive set of experiments on real-world data sets to verify the effectiveness and efficiency of our upper bounds.
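The intuition stated in the abstract above, that a small change between consecutive iterations implies a small distance to the exact scores, can be illustrated for PPR with the standard contraction-mapping bound. This is a textbook a-posteriori bound, not the tighter bound the paper proposes: since the PPR update is a contraction with factor `damping` in the L1 norm, the error after an iteration is at most `damping / (1 - damping)` times the L1 change made by that iteration. The adjacency-dict input format and the rule of returning dangling-node mass to the source are assumptions of this sketch.

```python
def personalized_pagerank(out_links, source, damping=0.85, tol=1e-8, max_iter=1000):
    """Power iteration for PPR that stops once an error bound is met.

    out_links: dict mapping every node to a list of its successors
               (each successor must also be a key of the dict).
    Returns (scores, iterations_used, l1_error_bound).
    """
    nodes = list(out_links)
    scores = {v: (1.0 if v == source else 0.0) for v in nodes}
    bound = float("inf")
    for it in range(1, max_iter + 1):
        new = {v: 0.0 for v in nodes}
        new[source] += 1.0 - damping  # restart mass goes to the source
        for u in nodes:
            successors = out_links[u]
            if successors:
                share = damping * scores[u] / len(successors)
                for v in successors:
                    new[v] += share
            else:
                # Assumed dangling-node rule: return the mass to the source.
                new[source] += damping * scores[u]
        diff = sum(abs(new[v] - scores[v]) for v in nodes)
        scores = new
        # Contraction bound: ||exact - scores||_1 <= damping / (1 - damping) * diff
        bound = damping / (1.0 - damping) * diff
        if bound <= tol:
            return scores, it, bound
    return scores, max_iter, bound
```

The loop stops at the first iteration whose bound falls below `tol`, so the number of iterations adapts to the graph instead of being fixed in advance.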
Funding: This work was supported by the Chongqing Research Program of Basic Research and Frontier Technology (cstc2017jcyjAX0071), the Basic and Advanced Research Projects of CSTC (cstc2019jcyjzdxm0102), the Chongqing Science and Technology Innovation Leading Talent Support Program (CSTCCXLJRC201908), and the Science and Technology Research Program of Chongqing Municipal Education Commission (KJZD-K201900605).