4 articles found
Design and Implementation of a Weibo User Information Collection System Based on the Scrapy Framework
1
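Authors: Zhu Yehang (朱烨行), Zhao Baoying (赵宝莹), Zhang Mingjie (张明杰), Wei Xiaoxiao (魏笑笑), Wei Kun (卫昆). 《现代信息科技》 (Modern Information Technology), 2023, No. 24, pp. 41-44, 48 (5 pages)
To gain a deeper understanding of Sina Weibo users, identify the most influential among them, and discover the current opinion leaders on Sina Weibo, one needs to know how many posts each user has published to date, how many accounts the user follows, and how many followers the user has. To this end, a web crawler based on the Scrapy framework was designed and implemented in Python. Given a Weibo user ID, the crawler fetches that user's current post count, following count, and follower count. Because the crawler must access the Weibo site many times in succession, it uses user agents and IP proxies so that the site does not refuse access; the IP proxy is of the dynamic tunnel-proxy type. Experimental results show that downloading the information of more than seven thousand Weibo users took 6 hours and 22 minutes.
Keywords: Scrapy; web crawler; Weibo; user agent; IP proxy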
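The user-agent and proxy handling described in this abstract can be sketched as a Scrapy-style downloader middleware. This is a minimal illustration, not the authors' code: the class name, the user-agent strings, and the proxy endpoint are all hypothetical placeholders. It relies only on Scrapy's documented conventions that a middleware's `process_request` may set `request.headers` and `request.meta["proxy"]`.

```python
import random

# Hypothetical values for illustration only; none of them come from the paper.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]
TUNNEL_PROXY = "http://user:password@tunnel.example.com:8000"  # placeholder endpoint


class RotatingIdentityMiddleware:
    """Scrapy-style downloader middleware (assumed name).

    Picks a random User-Agent for every request and routes the request
    through a single tunnel-proxy endpoint, which is expected to rotate
    the outgoing IP address on the provider's side.
    """

    def process_request(self, request, spider):
        # Vary the browser identity from request to request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Scrapy's built-in HttpProxyMiddleware reads this meta key.
        request.meta["proxy"] = TUNNEL_PROXY
        # Returning None lets Scrapy continue handling the request.
        return None
```

In a Scrapy project a class like this would be enabled through the `DOWNLOADER_MIDDLEWARES` setting. With a tunnel proxy the crawler keeps one fixed endpoint, so no proxy list has to be maintained in the spider itself.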
A High-Performance Extraction Method for Public Opinion on Internet (Cited by: 3)
2
Authors: LI Yanling, DAI Guanzhong, ZHU Yehang, QIN Sen. Wuhan University Journal of Natural Sciences (CAS), 2007, No. 5, pp. 902-906 (5 pages)
Given the importance of analyzing public opinion on the Internet, the authors propose a high-performance extraction method for public opinion. In this method, a class space model is adopted to describe the relationship between words and categories. A combined feature selection method is used to remove noisy words from the original feature space effectively. The category weight of each word is then calculated by an improved formula that combines word frequency and word distribution. Finally, the class weights of uncategorized documents are obtained from the category weights of their words to realize opinion extraction. Experimental results show that the method has comparatively high classification performance and good stability.
Keywords: opinion extraction; text categorization; class space model; feature selection
Application of Algorithm CARDBK in Document Clustering
3
Authors: ZHU Yehang, ZHANG Mingjie, SHI Feng. Wuhan University Journal of Natural Sciences (CAS, CSCD), 2018, No. 6, pp. 514-524 (11 pages)
In the K-means clustering algorithm, each data point is uniquely placed into one category. The clustering quality is heavily dependent on the initial cluster centroids. Different initializations can yield varied results, and local adjustment cannot save the clustering result from poor local optima. If there is an anomaly in a cluster, it will seriously affect the cluster mean value. The K-means clustering algorithm is only suitable for clusters with convex shapes. We therefore propose a novel clustering algorithm, CARDBK, where "centroid all rank distance (CARD)" means that all centroids are sorted by their distance from one point and "BK" stands for "batch K-means". In CARDBK, one point modifies not only the cluster centroid nearest to it but also multiple cluster centroids adjacent to it, and the degree of influence of a point on a cluster centroid depends on the distances between this point and the other nearer cluster centroids. Experimental results showed that our CARDBK algorithm outperformed other algorithms when tested on a number of different data sets based on the following performance indexes: entropy, purity, F1 value, Rand index, and normalized mutual information (NMI). Our algorithm also proved to be more stable, linearly scalable, and faster.
Keywords: algorithm design and analysis; clustering; document analysis; text processing
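Purity is one of the performance indexes named in this abstract. The sketch below uses the common textbook definition (the share of points that carry the majority class label of their cluster); that the paper uses exactly this definition is an assumption, and the function name is illustrative.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Fraction of points whose class is the majority class of their cluster.

    Textbook definition, assumed here; 1.0 means every cluster is pure.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must not be empty")
    # Count the class labels that occur inside each cluster.
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    # Sum the size of the majority class over all clusters.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)
```

For example, with clusters `[0, 0, 0, 1, 1, 1]` and true classes `["a", "a", "b", "b", "b", "a"]`, each cluster has two points of its majority class, so the purity is 4/6.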
Application of a soft competition learning method in document clustering
4
Authors: Zhu Yehang, Zhang Mingjie. The Journal of China Universities of Posts and Telecommunications (EI, CSCD), 2018, No. 3, pp. 80-91 (12 pages)
In hard competition learning, each point modifies only the one cluster centroid that wins. In soft competition learning, by contrast, each point modifies not only the winning cluster centroid but also many other cluster centroids near this point. A soft competition learning method is proposed. Centroid all rank distance (CARD), CARDx, and centroid all rank distance batch K-means (CARDBK) are three clustering algorithms that adopt the proposed method. In these algorithms, the extent to which one point affects a cluster centroid depends on the distances from this point to the other nearer cluster centroids, rather than just on the rank of the distance from this point to this cluster centroid among the distances from this point to all cluster centroids. In addition, validation experiments are carried out to compare the three soft competition learning algorithms CARD, CARDx, and CARDBK with several hard competition learning algorithms as well as the neural gas (NG) algorithm on five data sets from different sources. Judging from the values of five performance indexes in the clustering results, this kind of soft competition learning method achieves better clustering quality and efficiency, and scales linearly.
Keywords: clustering methods; text processing; document handling; competition learning method
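The difference between hard and soft competition can be sketched for a single online update step. Note the assumption: the soft variant below uses the rank-only weighting of neural gas (the NG baseline named in the abstract), not the distance-based weighting of CARD, whose formula the abstract does not give. The function names and the `lr` and `decay` parameters are illustrative.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_update(centroids, point, lr=0.1):
    """Hard competition: only the nearest (winning) centroid moves."""
    winner = min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], point))
    new = [list(c) for c in centroids]
    new[winner] = [c + lr * (p - c) for c, p in zip(centroids[winner], point)]
    return new


def soft_update(centroids, point, lr=0.1, decay=1.0):
    """Soft competition, neural-gas style: every centroid moves toward the
    point, with a step that shrinks as exp(-rank / decay)."""
    order = sorted(range(len(centroids)), key=lambda i: sq_dist(centroids[i], point))
    new = [list(c) for c in centroids]
    for rank, i in enumerate(order):
        step = lr * math.exp(-rank / decay)  # rank 0 is the winner
        new[i] = [c + step * (p - c) for c, p in zip(centroids[i], point)]
    return new
```

With centroids `[[0.0, 0.0], [10.0, 0.0]]` and the point `[1.0, 0.0]`, `hard_update` leaves the second centroid untouched, while `soft_update` also pulls it slightly toward the point.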