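Mainstream open-source crawler frameworks commonly assess how relevant a page is to a topic domain by combining keyword-based quantification with the vector space model. This combination overlooks the association between page semantics and the specific topic, so the crawled content drifts away from the topic. To provide accurate data for public-opinion analysis in finance and other fields, this paper proposes a crawler and system built on a domain-extended topic library. By extending the topic feature library and fusing the Vector Space Model (VSM) with the Hyperlink-Induced Topic Search (HITS) algorithm, it optimizes the computation of topic-page relevance, and it is evaluated in a simulation of crawling stock-related public-opinion information. The results show that the extended-topic crawler improves crawling accuracy and efficiency and can effectively complete domain-topic crawling tasks.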
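The abstract does not give the fusion formula, so the following is only a minimal sketch of how a VSM content score and a HITS authority score might be blended into one page-relevance value. The linear blend, the weight `alpha`, the raw term-frequency vectors, and the function names are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hits(graph, iterations=50):
    """Return (authority, hub) scores for a dict mapping page -> list of outlinks."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}
        # Hub update: sum the authority scores of the pages linked to.
        new_hub = {n: sum(auth[d] for d in graph.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return auth, hub


def relevance(pages, graph, topic_terms, alpha=0.7):
    """Blend content similarity and link authority.

    alpha is a hypothetical weight; the source does not specify one.
    """
    topic_vec = Counter(topic_terms)
    auth, _ = hits(graph)
    max_auth = max(auth.values(), default=0.0) or 1.0
    scores = {}
    for url, tokens in pages.items():
        content = cosine_similarity(Counter(tokens), topic_vec)
        link = auth.get(url, 0.0) / max_auth
        scores[url] = alpha * content + (1 - alpha) * link
    return scores


# Toy data: token lists per page and a small link graph (both invented).
pages = {
    "a": ["stock", "price", "stock"],
    "b": ["weather", "rain"],
    "c": ["stock", "market"],
}
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = relevance(pages, graph, ["stock", "market", "price"])
```

In this toy graph, page `c` is linked from both other pages and also matches the topic terms, so it ranks highest, while page `b` shares no topic terms and has no inbound links. An extended topic feature library, as the abstract describes, would correspond to enlarging `topic_terms` with related domain vocabulary.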
The Internet now offers numerous sources of useful information. However, these resources are buried in the dynamic Web, so accurately finding user-specific information is very difficult. In this paper we discuss a Semantic Graph Web Search (SGWS) algorithm for topic-specific resource discovery on the Web. The method combines hyperlinks, characteristics of the Web graph, and semantic term weights. We implement the algorithm to find Chinese medical information on the Internet. Our study shows that it achieves better precision than traditional IR (Information Retrieval) methods and traditional search engines.
Key words: HITS; evolution web graph; power law distribution; context analysis
CLC number: TP 391; TP 393
Foundation item: Supported by the National High-Performance Computation Fund (00303)
Biography: Ye Wei-guo (1970-), male, Ph.D. candidate, research direction: Web information mining, network security, artificial intelligence.
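The abstract names the ingredients of SGWS (hyperlinks, Web-graph characteristics, semantic term weights) but not how they are combined. As a loose illustration of one of those ingredients only, the sketch below scores a link by the topic weights of the words around it and keeps a best-first crawl frontier. The scoring rule, the `Frontier` class, and the sample weights are hypothetical and are not taken from the paper.

```python
import heapq


def link_score(context_tokens, term_weights):
    """Average topic weight of the tokens around a link (0.0 if no context)."""
    if not context_tokens:
        return 0.0
    total = sum(term_weights.get(t, 0.0) for t in context_tokens)
    return total / len(context_tokens)


class Frontier:
    """Best-first crawl frontier: the highest-scoring unseen URL pops first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url, score):
        # Ignore URLs that were already queued once.
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


# Invented term weights for a medical topic, for illustration only.
weights = {"acupuncture": 1.0, "herbal": 0.8, "medicine": 0.6}
frontier = Frontier()
frontier.push("u1", link_score(["sports", "news"], weights))
frontier.push("u2", link_score(["herbal", "medicine"], weights))
frontier.push("u3", link_score(["acupuncture", "clinic"], weights))
frontier.push("u2", 9.0)  # duplicate URL, ignored
order = [frontier.pop() for _ in range(len(frontier))]
```

Links whose surrounding text carries more topic weight are crawled first; a fuller system would also fold in graph-based signals such as HITS scores.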