To improve the accuracy of text clustering, fuzzy c-means clustering based on topic concept sub-space (TCS2FCM) is introduced for classifying texts. Five evaluation functions are combined to extract key phrases. Con...To improve the accuracy of text clustering, fuzzy c-means clustering based on topic concept sub-space (TCS2FCM) is introduced for classifying texts. Five evaluation functions are combined to extract key phrases. Concept phrases, as well as the descriptions of final clusters, are presented using WordNet origin from key phrases. Initial centers and membership matrix are the most important factors affecting clustering performance. Orthogonal concept topic sub-spaces are built with the topic concept phrases representing topics of the texts and the initialization of centers and the membership matrix depend on the concept vectors in sub-spaces. The results show that, different from random initialization of traditional fuzzy c-means clustering, the initialization related to text content contributions can improve clustering precision.展开更多
University campus is the most important place for life, study, activity and experience of contemporary college students. It is helpful for students to survive and develop to create the topic space of campus. Taking th...University campus is the most important place for life, study, activity and experience of contemporary college students. It is helpful for students to survive and develop to create the topic space of campus. Taking the topic space of college campuses in Lishui City of Zhejiang Province as an example, the current situations are analyzed through questionnaire survey and field visit. The results show that uni- versity campus space needs a clear topic; the demands are generally large for the topics of exchange and communication, learning and thinking, sports and leisure in all kinds of space; the creation of these types of topic spaces should focus on the peaceful environment, beautiful scenery, privacy of the space and WlFI coverage.展开更多
目前主流开源爬虫框架在分析页面与主题领域关联性上,常采用基于关键词的量化和向量空间模型算法相融合,但融合疏忽了界面语义与特定主题间的关联,导致爬取内容与主题产生偏差。为了给金融等领域的舆情分析提供准确的数据支撑,提出一种...目前主流开源爬虫框架在分析页面与主题领域关联性上,常采用基于关键词的量化和向量空间模型算法相融合,但融合疏忽了界面语义与特定主题间的关联,导致爬取内容与主题产生偏差。为了给金融等领域的舆情分析提供准确的数据支撑,提出一种面向领域扩展主题库的爬虫及系统,通过扩展主题特征库,融合向量空间模型(Vector Space Model,VSM)与超链接主题搜索算法(Hyperlink-Induced Topic Search,HITS),优化了主题页面相关度计算,并针对股票舆情信息爬取进行仿真。结果表明,上述扩展主题型爬虫在爬取准确率和效率等方面有较好地提升,能够有效地完成领域主题信息的爬取任务。展开更多
The purpose of this paper is to introduce and discuss the concept of topical functions on upward sets. We give characterizations of topical functions in terms of upward sets.
为了缓解社交网络热点话题生成的密集图数据导致存储的频繁读取和缓存空间浪费等问题,针对话题产生与消亡的演化更新规律,提出了基于话题热度演化加速度的缓存置换算法(cache replacement algorithm based on topic heat evolution acce...为了缓解社交网络热点话题生成的密集图数据导致存储的频繁读取和缓存空间浪费等问题,针对话题产生与消亡的演化更新规律,提出了基于话题热度演化加速度的缓存置换算法(cache replacement algorithm based on topic heat evolution acceleration,THEA-CR)。该算法首先对社交网络数据进行话题簇的实体划分,识别锚定目标。其次,计算话题热度演化加速度,对热点数据的优先级进行研判;最后设计双队列缓存置换策略,针对话题关注度和访问频率进行缓存空间的置换和更新。在新浪微博数据集中与经典的缓存置换算法进行大量对比实验,验证了所提算法具有较好的可行性与有效性。结果表明提出的THEA-CR算法能够在社交网络密集图数据的不同图查询操作中平均提升约31.4%的缓存命中率,并且缩短了约27.1%的查询响应时间。展开更多
基金The National Natural Science Foundation of China(No60672056)Open Fund of MOE-MS Key Laboratory of Multime-dia Computing and Communication(No06120809)
文摘To improve the accuracy of text clustering, fuzzy c-means clustering based on topic concept sub-space (TCS2FCM) is introduced for classifying texts. Five evaluation functions are combined to extract key phrases. Concept phrases, as well as the descriptions of final clusters, are presented using WordNet origin from key phrases. Initial centers and membership matrix are the most important factors affecting clustering performance. Orthogonal concept topic sub-spaces are built with the topic concept phrases representing topics of the texts and the initialization of centers and the membership matrix depend on the concept vectors in sub-spaces. The results show that, different from random initialization of traditional fuzzy c-means clustering, the initialization related to text content contributions can improve clustering precision.
文摘University campus is the most important place for life, study, activity and experience of contemporary college students. It is helpful for students to survive and develop to create the topic space of campus. Taking the topic space of college campuses in Lishui City of Zhejiang Province as an example, the current situations are analyzed through questionnaire survey and field visit. The results show that uni- versity campus space needs a clear topic; the demands are generally large for the topics of exchange and communication, learning and thinking, sports and leisure in all kinds of space; the creation of these types of topic spaces should focus on the peaceful environment, beautiful scenery, privacy of the space and WlFI coverage.
文摘目前主流开源爬虫框架在分析页面与主题领域关联性上,常采用基于关键词的量化和向量空间模型算法相融合,但融合疏忽了界面语义与特定主题间的关联,导致爬取内容与主题产生偏差。为了给金融等领域的舆情分析提供准确的数据支撑,提出一种面向领域扩展主题库的爬虫及系统,通过扩展主题特征库,融合向量空间模型(Vector Space Model,VSM)与超链接主题搜索算法(Hyperlink-Induced Topic Search,HITS),优化了主题页面相关度计算,并针对股票舆情信息爬取进行仿真。结果表明,上述扩展主题型爬虫在爬取准确率和效率等方面有较好地提升,能够有效地完成领域主题信息的爬取任务。
文摘The purpose of this paper is to introduce and discuss the concept of topical functions on upward sets. We give characterizations of topical functions in terms of upward sets.
文摘为了缓解社交网络热点话题生成的密集图数据导致存储的频繁读取和缓存空间浪费等问题,针对话题产生与消亡的演化更新规律,提出了基于话题热度演化加速度的缓存置换算法(cache replacement algorithm based on topic heat evolution acceleration,THEA-CR)。该算法首先对社交网络数据进行话题簇的实体划分,识别锚定目标。其次,计算话题热度演化加速度,对热点数据的优先级进行研判;最后设计双队列缓存置换策略,针对话题关注度和访问频率进行缓存空间的置换和更新。在新浪微博数据集中与经典的缓存置换算法进行大量对比实验,验证了所提算法具有较好的可行性与有效性。结果表明提出的THEA-CR算法能够在社交网络密集图数据的不同图查询操作中平均提升约31.4%的缓存命中率,并且缩短了约27.1%的查询响应时间。