The hypertext induced topic search (HITS) procedure is analyzed under a semantic relation model, and topic drift in the HITS algorithm is shown to arise because Web pages are projected onto a wrong latent semantic basis. A new concept, generalized similarity, is introduced, and based on it a new topic distillation algorithm, GSTDA (generalized similarity based topic distillation algorithm), is presented to improve the quality of topic distillation. GSTDA is intended not only to avoid topic drift but also to explore topics related to the user query. Experimental results on 10 queries show that GSTDA reduces the topic drift rate by 10% to 58% compared with HITS, and discovers several related topics for queries that have multiple meanings.
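For readers unfamiliar with the baseline, the following is a minimal sketch of standard HITS hub/authority iteration on a toy link graph. It does not implement GSTDA or generalized similarity, which the abstract does not define; the function names, the toy graph, and the iteration count are illustrative assumptions.

```python
import math


def normalize(scores):
    """Scale a score dict to unit L2 norm (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Baseline HITS. `links` maps each page to the pages it links to.

    Returns (hubs, authorities), each a dict of page -> score.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hubs = {page: 1.0 for page in pages}
    auths = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages that link in.
        new_auths = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auths[dst] += hubs[src]

        # Hub update: sum the authority scores of pages linked to.
        new_hubs = {
            page: sum(new_auths[dst] for dst in links.get(page, []))
            for page in pages
        }

        auths = normalize(new_auths)
        hubs = normalize(new_hubs)

    return hubs, auths


# Hypothetical toy graph: three hub pages pointing at two candidate authorities.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "h3": ["a1", "a2"]}
hubs, auths = hits(toy_links)
print(max(auths, key=auths.get))
```

Because the scores depend only on link structure, a densely linked but off-topic cluster in the base set can dominate the result, which is the topic drift the abstract refers to.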
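Mainstream open-source crawler frameworks usually assess how relevant a page is to a topic domain by combining keyword-based quantification with the vector space model. This combination overlooks the association between page semantics and the specific topic, so the crawled content deviates from the topic. To provide accurate data support for public-opinion analysis in finance and other fields, a crawler and system oriented to a domain-extended topic library is proposed. By extending the topic feature library and fusing the Vector Space Model (VSM) with the Hyperlink-Induced Topic Search (HITS) algorithm, the computation of topic page relevance is optimized, and a simulation is carried out on crawling stock public-opinion information. The results show that the extended-topic crawler improves crawl accuracy and efficiency and can effectively complete the crawling of domain topic information.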
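As a rough illustration of what fusing a VSM content score with a link-based score can look like, the sketch below computes a term-frequency cosine similarity between a page and a topic feature set and blends it linearly with an authority score. The abstract does not give the actual fusion formula; the linear blend, the weight `alpha`, and all function names here are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(page_terms, topic_terms):
    """Cosine similarity between two term lists using raw term frequencies."""
    page_vec, topic_vec = Counter(page_terms), Counter(topic_terms)
    shared = page_vec.keys() & topic_vec.keys()
    dot = sum(page_vec[t] * topic_vec[t] for t in shared)
    norm = (
        math.sqrt(sum(v * v for v in page_vec.values()))
        * math.sqrt(sum(v * v for v in topic_vec.values()))
    )
    return dot / norm if norm else 0.0


def combined_relevance(content_score, authority_score, alpha=0.7):
    """Hypothetical fusion: weighted average of content and link scores."""
    return alpha * content_score + (1.0 - alpha) * authority_score


# Hypothetical topic feature set and page tokens for a stock-opinion crawl.
topic = ["stock", "price", "market", "shares"]
page = ["stock", "market", "rally", "stock"]
content = cosine_similarity(page, topic)
print(combined_relevance(content, authority_score=0.4))
```

Extending the topic feature library, as the abstract describes, would correspond to enlarging `topic` so that semantically related pages score higher on the content side.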
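As exchanges between the mainland and the Taiwan region become increasingly close and frequent, strengthening exchange and mutual learning in cross-strait terminology research has become particularly important. The article gives a detailed introduction to, and a comprehensive review of, the administrative structure, historical development, and existing achievements of terminology work in the Taiwan region; the results of cross-strait cooperation in compiling terminology reference works; the "乐词网" online platform for terminology search and resources; and the jointly built "中华语文知识库" and other corpora. Terminology-related research from the Taiwan region in the Web of Science (WOS) Core Collection database is analyzed by topic sampling and visualized with the bibliometric tool VOSviewer, revealing the development trends and hot topics of terminology-related research published by scholars from the Taiwan region in international core journals. The aim is to provide terminology researchers and language enthusiasts on both sides of the strait with materials and avenues for research and study, to support communication and cooperation among cross-strait scholars, to identify directions for future collaborative efforts, and to provide useful reference and support for cross-strait terminology development and the formulation of science and technology development strategies.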
Funding: Supported by the Shaanxi Provincial Educational Department Special-Purpose Technology and Research of China (06JK229)