Abstract
To address older adults' attention to and demand for elderly-care information in the information society, this paper uses Python-based web crawler technology to collect news and official documents from the website of the Ministry of Civil Affairs. Based on the characteristics of news on portal websites, the data-crawling process and the training set are optimized, and the AdaBoost algorithm is trained on a given text collection to obtain a filtering model. A feature selection method based on the χ2 statistic criterion is provided, which effectively reduces the feature dimension; the model is then used to filter the collected information and obtain elderly-care information. Finally, the filtering results are analyzed. The experimental analysis shows that the proposed method can effectively filter elderly-care information and, in application, can meet older adults' needs for obtaining such information.
Key words: web crawler; AdaBoost; elderly-care information; government news; information filtering
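The χ2-based feature selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard 2x2 contingency form of the χ2 statistic for a term and a binary class, which may differ from the exact variant used in the paper; the function names `chi_square` and `select_features` and the toy documents are hypothetical and not taken from the paper.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square score of a term for a binary class (assumed standard form).

    a: relevant documents containing the term
    b: irrelevant documents containing the term
    c: relevant documents without the term
    d: irrelevant documents without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat as uninformative.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score.

    docs: list of token lists; labels: list of 0/1 (1 = elderly-care related).
    Ties are broken alphabetically so the result is deterministic.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count document frequency, not raw term frequency.
        (pos_df if y else neg_df).update(set(tokens))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: two relevant and two irrelevant documents.
docs = [
    ["pension", "care"],
    ["pension", "policy"],
    ["disaster", "policy"],
    ["disaster", "relief"],
]
labels = [1, 1, 0, 0]
top_terms = select_features(docs, labels, 2)
```

Terms that appear equally often in both classes (here "policy") score zero and are dropped first, which is how this kind of criterion shrinks the feature space before a classifier such as AdaBoost is trained.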
Source
《计算机与现代化》
2016, No. 12, pp. 102-106, 110 (6 pages in total)
Computer and Modernization