Purpose:The main objective of this work is to show the potentialities of recently developed approaches for automatic knowledge extraction directly from the universities’websites.The information automatically extracte...Purpose:The main objective of this work is to show the potentialities of recently developed approaches for automatic knowledge extraction directly from the universities’websites.The information automatically extracted can be potentially updated with a frequency higher than once per year,and be safe from manipulations or misinterpretations.Moreover,this approach allows us flexibility in collecting indicators about the efficiency of universities’websites and their effectiveness in disseminating key contents.These new indicators can complement traditional indicators of scientific research(e.g.number of articles and number of citations)and teaching(e.g.number of students and graduates)by introducing further dimensions to allow new insights for“profiling”the analyzed universities.Design/methodology/approach:Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web.This study implements an advanced application of the webometric approach,exploiting all the three categories of web mining:web content mining;web structure mining;web usage mining.The information to compute our indicators has been extracted from the universities’websites by using web scraping and text mining techniques.The scraped information has been stored in a NoSQL DB according to a semistructured form to allow for retrieving information efficiently by text mining techniques.This provides increased flexibility in the design of new indicators,opening the door to new types of analyses.Some data have also been collected by means of batch interrogations of search engines(Bing,www.bing.com)or from a leading provider of Web analytics(SimilarWeb,http://www.similarweb.com).The information extracted from the Web has been combined with the University structural information taken from the European Tertiary Education Register(https://eter.joanneum.at/#/home),a database collecting information on Higher Education Institutions(HEIs)at European level.All the above was used to perform a clusterization of 79 Italian universities based on structural and digital indicators.Findings:The main findings of this study concern the evaluation of the potential in digitalization of universities,in particular by presenting techniques for the automatic extraction of information from the web to build indicators of quality and impact of universities’websites.These indicators can complement traditional indicators and can be used to identify groups of universities with common features using clustering techniques working with the above indicators.Research limitations:The results reported in this study refers to Italian universities only,but the approach could be extended to other university systems abroad.Practical implications:The approach proposed in this study and its illustration on Italian universities show the usefulness of recently introduced automatic data extraction and web scraping approaches and its practical relevance for characterizing and profiling the activities of universities on the basis of their websites.The approach could be applied to other university systems.Originality/value:This work applies for the first time to university websites some recently introduced techniques for automatic knowledge extraction based on web scraping,optical character recognition and nontrivial text mining operations(Bruni&Bianchi,2020).展开更多
As a basic unit of knowledge representation and an important means for information organization, event has drawn growing number of people’s attention, the research of event identification and extraction in natural la...As a basic unit of knowledge representation and an important means for information organization, event has drawn growing number of people’s attention, the research of event identification and extraction in natural language processing field is an important research topic in information extraction area, the recognition and extraction of event trigger word plays a decisive role in event identification and extraction. In this paper, the authors make experiment in Chinese Event Corpus CEC, and present a method of extracting event trigger word automatically that combines extended trigger word table and machine learning. The experiment result shows that the F-score of extracting event trigger word. can reach 71.2% by using this method.展开更多
为精准识别水体信息并实时监测湖泊水体时空特征及其环境特征变化情况,以洞庭湖为例,基于Landsat-8影像数据分别使用改进的归一化差异水体指数(Modified Normalized Difference Water Index,MNDWI)、自动水体提取指数(Automated Water E...为精准识别水体信息并实时监测湖泊水体时空特征及其环境特征变化情况,以洞庭湖为例,基于Landsat-8影像数据分别使用改进的归一化差异水体指数(Modified Normalized Difference Water Index,MNDWI)、自动水体提取指数(Automated Water Extraction Index,AWEI sh)、支持向量机(Support Vector Machine,SVM)、人工神经网络(Artificial Neural Networks,ANNs)、随机森林(Random Forest,RF)等5种方法提取枯、丰水期水体分布信息,通过精度指标评价及影响因素分析,旨在找到提取精度高、鲁棒性强的水体提取方法。结果表明:5种方法中SVM法水体提取总精度最高且泛化能力良好。研究成果可为各方法适用性提供一定参考,并通过定量分析揭示漏提率在提取精度评价指标中的重要性。展开更多
基金Supported by the National Natural Science Foundation of China under Grant Nos.90412010 60573166 (国家自然科学基金)+1 种基金the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant No.2007108 (高等学校博士学科点专项科研基金)the HP University Collaborative Foundation of China under Grant No.HLCFY08-001 (惠普大学合作基金)
基金This work is developed with the support of the H2020 RISIS 2 Project(No.824091)and of the“Sapienza”Research Awards No.RM1161550376E40E of 2016 and RM11916B8853C925 of 2019.This article is a largely extended version of Bianchi et al.(2019)presented at the ISSI 2019 Conference held in Rome,2–5 September 2019.
文摘Purpose:The main objective of this work is to show the potentialities of recently developed approaches for automatic knowledge extraction directly from the universities’websites.The information automatically extracted can be potentially updated with a frequency higher than once per year,and be safe from manipulations or misinterpretations.Moreover,this approach allows us flexibility in collecting indicators about the efficiency of universities’websites and their effectiveness in disseminating key contents.These new indicators can complement traditional indicators of scientific research(e.g.number of articles and number of citations)and teaching(e.g.number of students and graduates)by introducing further dimensions to allow new insights for“profiling”the analyzed universities.Design/methodology/approach:Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web.This study implements an advanced application of the webometric approach,exploiting all the three categories of web mining:web content mining;web structure mining;web usage mining.The information to compute our indicators has been extracted from the universities’websites by using web scraping and text mining techniques.The scraped information has been stored in a NoSQL DB according to a semistructured form to allow for retrieving information efficiently by text mining techniques.This provides increased flexibility in the design of new indicators,opening the door to new types of analyses.Some data have also been collected by means of batch interrogations of search engines(Bing,www.bing.com)or from a leading provider of Web analytics(SimilarWeb,http://www.similarweb.com).The information extracted from the Web has been combined with the University structural information taken from the European Tertiary Education Register(https://eter.joanneum.at/#/home),a database collecting information on Higher Education Institutions(HEIs)at European level.All the above was used to perform a clusterization of 79 Italian universities based on structural and digital indicators.Findings:The main findings of this study concern the evaluation of the potential in digitalization of universities,in particular by presenting techniques for the automatic extraction of information from the web to build indicators of quality and impact of universities’websites.These indicators can complement traditional indicators and can be used to identify groups of universities with common features using clustering techniques working with the above indicators.Research limitations:The results reported in this study refers to Italian universities only,but the approach could be extended to other university systems abroad.Practical implications:The approach proposed in this study and its illustration on Italian universities show the usefulness of recently introduced automatic data extraction and web scraping approaches and its practical relevance for characterizing and profiling the activities of universities on the basis of their websites.The approach could be applied to other university systems.Originality/value:This work applies for the first time to university websites some recently introduced techniques for automatic knowledge extraction based on web scraping,optical character recognition and nontrivial text mining operations(Bruni&Bianchi,2020).
文摘As a basic unit of knowledge representation and an important means for information organization, event has drawn growing number of people’s attention, the research of event identification and extraction in natural language processing field is an important research topic in information extraction area, the recognition and extraction of event trigger word plays a decisive role in event identification and extraction. In this paper, the authors make experiment in Chinese Event Corpus CEC, and present a method of extracting event trigger word automatically that combines extended trigger word table and machine learning. The experiment result shows that the F-score of extracting event trigger word. can reach 71.2% by using this method.