Funding: The paper is supported by the Research Foundation for Outstanding Young Teachers, China University of Geosciences (Wuhan) (Nos. CUGQNL0628, CUGQNL0640), the National High-Tech Research and Development Program (863 Program) (No. 2001AA135170), and the Postdoctoral Foundation of the Shandong Zhaojin Group Co. (No. 20050262120).
Abstract: Satellite remote sensing data are commonly used to analyze the spatial distribution patterns of geological structures and serve as a significant means for identifying alteration zones. Based on Landsat Enhanced Thematic Mapper (ETM+) data, which offer improved spectral resolution (8 bands) and spatial resolution (15 m in the panchromatic band), synthesis processing techniques were presented to accomplish alteration information extraction: data preparation, vegetation indices and band ratios, and expert classifier-based classification. These techniques have been implemented in the MapGIS-RSP software (version 1.0), developed by the Wuhan Zondy Cyber Technology Co., Ltd., China. In applying them to extract alteration information in the Zhaoyuan (招远) gold mines, Shandong (山东) Province, China, several hydrothermally altered zones (including two new sites) were found after satellite imagery interpretation coupled with field surveys. It is concluded that these synthesis processing techniques are useful approaches and are applicable to a wide range of gold-mineralized alteration information extraction tasks.
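To illustrate the band-ratio step mentioned in this abstract, the minimal sketch below computes two ratio images commonly used for alteration mapping from ETM+ reflectance bands (band 5/band 7 for hydroxyl-bearing minerals, band 3/band 1 for iron oxides) plus an NDVI mask to suppress vegetated pixels. The array names, ratio choices, and thresholds are illustrative assumptions, not values taken from the paper or from MapGIS-RSP.

```python
# Minimal band-ratio sketch for alteration mapping from ETM+ reflectance bands.
# Assumes each band is a 2-D numpy array, co-registered and calibrated; the
# ratio choices (5/7, 3/1) and thresholds are illustrative, not from the paper.
import numpy as np

def safe_ratio(a, b, eps=1e-6):
    """Pixel-wise ratio that avoids division by zero."""
    return a / np.maximum(b, eps)

def alteration_mask(b1, b3, b4, b5, b7,
                    hydroxyl_thresh=1.8, iron_thresh=1.6, ndvi_thresh=0.3):
    hydroxyl = safe_ratio(b5, b7)    # clay/hydroxyl-bearing alteration indicator
    iron_oxide = safe_ratio(b3, b1)  # ferric iron staining indicator
    ndvi = safe_ratio(b4 - b3, b4 + b3)  # mask densely vegetated pixels
    candidate = (hydroxyl > hydroxyl_thresh) | (iron_oxide > iron_thresh)
    return candidate & (ndvi < ndvi_thresh)

if __name__ == "__main__":
    shape = (256, 256)
    rng = np.random.default_rng(0)
    bands = {k: rng.uniform(0.05, 0.6, shape) for k in ("b1", "b3", "b4", "b5", "b7")}
    mask = alteration_mask(**bands)
    print("candidate alteration pixels:", int(mask.sum()))
```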
Funding: Supported by the National Natural Science Foundation of China (Grant No. 42090011).
Abstract: Over the past ten years, large amounts of original research data related to Earth system science have been made available at a rapidly increasing rate. This growing stock of data helps researchers understand the human-Earth system across different fields. A substantial amount of this data is published open-access by geoscientists in authoritative journals. If the information stored in this literature is properly extracted, there is significant potential to build a domain knowledge base. However, this potential remains largely unfulfilled in geoscience, with one of the biggest obstacles being the lack of publicly available corpora and baselines. To fill this gap, the Earth Science Data Corpus (ESDC), an academic text corpus of 600 abstracts, was built from the international journal Earth System Science Data (ESSD). To the best of our knowledge, ESDC is the first corpus with the detail needed to provide a professional training dataset for knowledge extraction and the construction of domain-specific knowledge graphs from massive amounts of literature. The production process of ESDC incorporates both the contextual features of spatiotemporal entities and the linguistic characteristics of academic literature. Furthermore, annotation guidelines and procedures tailored for Earth science data were formulated to ensure reliability. ChatGPT with zero- and few-shot prompting, the generative BARTNER model, and the discriminative W2NER model were applied to ESDC to evaluate performance on the named entity recognition task and showed increasing performance metrics, with the highest achieved by BARTNER. Performance metrics for the various entity types output by each model were also assessed. We used the trained BARTNER model to perform inference on a larger unlabeled literature corpus, aiming to automatically extract a broader and richer set of entity information. The extracted entity information was then mapped to and associated with the Earth science data knowledge graph. Around this knowledge graph, this paper validates multiple downstream applications, including hot-topic research analysis, scientometric analysis, and knowledge-enhanced large language model question-answering systems. These applications demonstrate that the ESDC can provide scientists from different disciplines with information on Earth science data, help them better understand and obtain data, and promote further exploration in their respective professional fields.
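As a concrete illustration of the per-entity-type evaluation mentioned above, the sketch below computes precision, recall, and F1 for each entity type from gold and predicted (document, span, type) annotations under exact span matching. The entity-type names and the exact-match criterion are illustrative assumptions rather than the paper's actual evaluation protocol.

```python
# Per-type precision/recall/F1 for NER under exact span matching.
# Annotations are (doc_id, start, end, entity_type) tuples; exact-match scoring
# and the example entity types are assumptions for illustration only.
from collections import defaultdict

def per_type_scores(gold, predicted):
    gold_by_type, pred_by_type = defaultdict(set), defaultdict(set)
    for doc_id, start, end, etype in gold:
        gold_by_type[etype].add((doc_id, start, end))
    for doc_id, start, end, etype in predicted:
        pred_by_type[etype].add((doc_id, start, end))

    scores = {}
    for etype in set(gold_by_type) | set(pred_by_type):
        tp = len(gold_by_type[etype] & pred_by_type[etype])
        fp = len(pred_by_type[etype] - gold_by_type[etype])
        fn = len(gold_by_type[etype] - pred_by_type[etype])
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[etype] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

if __name__ == "__main__":
    gold = [("d1", 0, 12, "Dataset"), ("d1", 30, 38, "Region"), ("d2", 5, 9, "TimeRange")]
    pred = [("d1", 0, 12, "Dataset"), ("d1", 30, 40, "Region")]
    for etype, s in per_type_scores(gold, pred).items():
        print(etype, {k: round(v, 2) for k, v in s.items()})
```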
Funding: This work was developed with the support of the H2020 RISIS 2 Project (No. 824091) and of the "Sapienza" Research Awards No. RM1161550376E40E of 2016 and No. RM11916B8853C925 of 2019. This article is a largely extended version of Bianchi et al. (2019), presented at the ISSI 2019 Conference held in Rome, 2–5 September 2019.
Abstract: Purpose: The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from universities' websites. The information automatically extracted can potentially be updated more frequently than once per year and is safe from manipulation or misinterpretation. Moreover, this approach gives us flexibility in collecting indicators about the efficiency of universities' websites and their effectiveness in disseminating key content. These new indicators can complement traditional indicators of scientific research (e.g., number of articles and number of citations) and teaching (e.g., number of students and graduates) by introducing further dimensions that allow new insights for "profiling" the analyzed universities. Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining, and web usage mining. The information needed to compute our indicators was extracted from the universities' websites using web scraping and text mining techniques. The scraped information was stored in a NoSQL database in semi-structured form so that it can be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data were also collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the web was combined with university structural information taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at the European level. All the above was used to perform a clustering of 79 Italian universities based on structural and digital indicators. Findings: The main findings of this study concern the evaluation of the digitalization potential of universities, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of the quality and impact of universities' websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features by applying clustering techniques to the above indicators. Research limitations: The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad. Practical implications: The approach proposed in this study, and its illustration on Italian universities, shows the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems. Originality/value: This work applies, for the first time to university websites, some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition, and non-trivial text mining operations (Bruni & Bianchi, 2020).
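As a rough sketch of the scraping step described above, the code below fetches a page, strips the markup, and stores the result as a semi-structured record of the kind that could be loaded into a NoSQL store. The URL, field names, and use of requests/BeautifulSoup are illustrative assumptions, not the toolchain reported in the paper.

```python
# Hedged sketch of web-content scraping into a semi-structured record.
# Target URL, record fields, and libraries (requests, BeautifulSoup) are
# assumptions for illustration; the paper's actual pipeline may differ.
import json
from datetime import datetime, timezone

import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> dict:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return {
        "url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "links": [a["href"] for a in soup.find_all("a", href=True)],
        "text": soup.get_text(separator=" ", strip=True),
    }

if __name__ == "__main__":
    record = scrape_page("https://www.example.edu")  # hypothetical university homepage
    # A JSON document like this could be inserted into a NoSQL collection as-is.
    print(json.dumps({k: record[k] for k in ("url", "title")}, indent=2))
```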
Abstract: Web information systems (WIS) are frequently used and indispensable in daily social life. WIS provide information services in many scenarios, such as electronic commerce, communities, and edutainment. Data cleaning plays an essential role in various WIS scenarios to improve the quality of data services. In this paper, we present a review of state-of-the-art methods for data cleaning in WIS. According to the characteristics of data cleaning, we extract the critical elements of WIS, such as interactive objects, application scenarios, and core technologies, to classify the existing works. Then, after elaborating on and analyzing each category, we summarize the descriptions of and challenges for data cleaning methods along sub-elements such as data and user interaction, data quality rules, models, crowdsourcing, and privacy preservation. Finally, we analyze the various types of problems and provide suggestions for future research on data cleaning in WIS from the technological and interactive perspectives.
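To make the notion of a data quality rule concrete, the small sketch below checks a simple functional-dependency-style rule (an assumed rule that one postal code should map to a single city) over a list of records and reports the violating groups. The rule, field names, and sample data are invented for illustration.

```python
# Illustrative data-quality-rule check: flag records violating an assumed
# functional dependency postal_code -> city. Rule and fields are hypothetical.
from collections import defaultdict

def fd_violations(records, lhs="postal_code", rhs="city"):
    groups = defaultdict(set)
    for rec in records:
        groups[rec[lhs]].add(rec[rhs])
    # Any left-hand value mapped to more than one right-hand value breaks the rule.
    return {key: values for key, values in groups.items() if len(values) > 1}

if __name__ == "__main__":
    rows = [
        {"postal_code": "10001", "city": "New York"},
        {"postal_code": "10001", "city": "NYC"},        # inconsistent spelling
        {"postal_code": "94105", "city": "San Francisco"},
    ]
    print(fd_violations(rows))  # e.g. {'10001': {'New York', 'NYC'}}
```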
Funding: Supported by the Plan of Research on Science and Technology and Development in Hebei Province (04213534).
Abstract: The aim is to solve the problem of how to share dispersed and heterogeneous data held inside business information systems and other information sources. On the basis of Web services, this paper adopts the notion of Data as a Service to build a service-oriented data integration architecture. Following this architecture, we develop a data collection system that effectively integrates data from heterogeneous information sources and presents a uniform data view to end users by sharing data across heterogeneous systems and information sources. Finally, this paper gives an example of a composite information collection platform system.
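The sketch below illustrates the general idea of a Data-as-a-Service style integration layer: per-source adapters normalize records into one schema so that callers see a uniform view. The source names, schema, and mapping rules are hypothetical and not taken from the system described in the abstract.

```python
# Hedged sketch of a Data-as-a-Service style uniform view over heterogeneous
# sources. Source names, fields, and mapping rules are invented for illustration.
from typing import Dict, Iterable, List, Protocol

class SourceAdapter(Protocol):
    def fetch(self) -> Iterable[Dict]: ...

class CrmAdapter:
    """Pretend CRM source exposing customer rows with its own field names."""
    def fetch(self):
        yield {"cust_name": "Acme Ltd", "cust_phone": "555-0100"}

class LegacyDbAdapter:
    """Pretend legacy database with a different schema for the same concept."""
    def fetch(self):
        yield {"NAME": "Globex Corp", "TEL": "555-0199"}

def uniform_view(adapters: List[SourceAdapter]) -> List[Dict]:
    """Map every source record into one shared schema: {name, phone, source}."""
    mappings = {
        CrmAdapter: lambda r: {"name": r["cust_name"], "phone": r["cust_phone"]},
        LegacyDbAdapter: lambda r: {"name": r["NAME"], "phone": r["TEL"]},
    }
    rows = []
    for adapter in adapters:
        convert = mappings[type(adapter)]
        for record in adapter.fetch():
            unified = convert(record)
            unified["source"] = type(adapter).__name__
            rows.append(unified)
    return rows

if __name__ == "__main__":
    for row in uniform_view([CrmAdapter(), LegacyDbAdapter()]):
        print(row)
```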
Abstract: The Fourth International Conference on Web Information Systems and Applications (WISA 2007) received 409 submissions and accepted 37 papers for publication in this issue. The papers cover broad research areas, including Web mining and data warehousing, the Deep Web and Web integration, P2P networks, text processing and information retrieval, as well as Web services and Web infrastructure. After briefly introducing the WISA conference, this survey outlines current activities and future trends concerning Web information systems and applications, based on the papers accepted for publication.
Abstract: With the proliferation of applications on the internet, the internet has become a great information source that supplies users with valuable information. But it is hard for users to quickly acquire the right information on the web. This paper presents an intelligent agent for internet applications that retrieves and extracts web information under the user's guidance. The intelligent agent is made up of a retrieval script that identifies web sources, an extraction script based on the document object model that expresses the extraction process, a data translator that exports the extracted information into knowledge bases with frame structures, and a data reasoner that answers users' questions. A GUI tool named Script Writer helps to generate the extraction script visually, and knowledge rule databases help to extract the wanted information and to generate answers to questions.
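As a rough analogue of a DOM-based extraction script feeding a frame-structured knowledge base, the sketch below walks a parsed HTML document, pulls out a few fields by element path, and stores them as slot-value frames. The HTML snippet, slot names, and BeautifulSoup usage are assumptions for illustration only, not the Script Writer tool itself.

```python
# Hedged sketch: DOM-based extraction into frame-like slot/value structures.
# The sample HTML, CSS selectors, and frame slots are hypothetical.
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<html><body>
  <div class="product">
    <h2 class="name">Mechanical Keyboard</h2>
    <span class="price">49.90</span>
    <p class="desc">87-key layout, brown switches.</p>
  </div>
</body></html>
"""

# An "extraction script": frame slot -> CSS selector into the DOM.
EXTRACTION_SCRIPT = {
    "name": ".product .name",
    "price": ".product .price",
    "description": ".product .desc",
}

def extract_frame(html: str, script: dict) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    frame = {"frame_type": "product"}
    for slot, selector in script.items():
        node = soup.select_one(selector)
        frame[slot] = node.get_text(strip=True) if node else None
    return frame

if __name__ == "__main__":
    print(extract_frame(SAMPLE_HTML, EXTRACTION_SCRIPT))
```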
Funding: Supported by the Foundation for Humanities and Social Sciences of the Chinese Ministry of Education (Grant No. 08JC870002).
Abstract: Purpose: The objectives of this study are to explore an effective technique for extracting information from weblogs and to develop an experimental system that extracts as much structured information as possible with this technique. The system will lay a foundation for evaluation, analysis, retrieval, and utilization of the extracted information. Design/methodology/approach: An improved template extraction technique was proposed. Separate templates designed for extracting blog entry titles, posts, and their comments were established, and structured information was extracted online step by step. A dozen data items, such as entry titles, posts and their commenters and comments, the numbers of views, and the numbers of citations, were extracted from eight major Chinese blog websites, including Sina, Sohu, and Bokee. Findings: Results showed that the average accuracy of the experimental extraction system reached 94.6%. Because an online, multi-threaded extraction technique was adopted, the speed of extraction was improved, with an average speed of 15 pages per second excluding network delay. In addition, entries posted using Ajax technology can be extracted successfully. Research limitations: As the templates need to be established in advance, this extraction technique can be effectively applied only to a limited range of blog websites. In addition, the stability of the extraction templates is affected by the source code of the blog pages. Practical implications: This paper has studied and established a blog page extraction system, which can be used to extract structured data, preserve and update the data, and facilitate the collection, study, and utilization of blog resources, especially academic blog resources. Originality/value: This modified template extraction technique outperforms general Web page downloaders and specialized blog page downloaders by providing structured and comprehensive data extraction.
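The sketch below mirrors the idea of per-site extraction templates combined with multi-threaded fetching: each blog site gets its own mapping from data item to selector, and pages are processed concurrently. The site key, selectors, URLs, and thread-pool settings are illustrative assumptions, not the paper's actual templates.

```python
# Hedged sketch of template-based, multi-threaded blog extraction.
# Site templates (selectors) and URLs are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

SITE_TEMPLATES = {
    "exampleblog": {          # hypothetical site key
        "title": "h1.entry-title",
        "post": "div.entry-content",
        "comments": "div.comment-body",
    },
}

def extract_entry(url: str, site: str) -> dict:
    template = SITE_TEMPLATES[site]
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    record = {"url": url}
    for item, selector in template.items():
        nodes = soup.select(selector)
        record[item] = [n.get_text(strip=True) for n in nodes]
    return record

def extract_many(urls, site, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: extract_entry(u, site), urls))

if __name__ == "__main__":
    urls = ["https://blog.example.com/post/1"]  # hypothetical entry URL
    print(extract_many(urls, "exampleblog"))
```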
Funding: Supported by the Knowledge Innovation Program of the Chinese Academy of Sciences and the National High-Tech R&D Program of China (2008BAK49B05).
Abstract: More and more web pages apply AJAX (Asynchronous JavaScript and XML) because of its rich interactivity and incremental communication. It is observed that AJAX content, which cannot be seen by traditional crawlers, is generally well structured and belongs to a specific domain. Extracting the structured data from AJAX content and annotating its semantics are very significant for further applications. In this paper, a structured AJAX data extraction method for the agricultural domain based on an agricultural ontology is proposed. First, Crawljax, an open-source AJAX crawling tool, was extended to explore and retrieve the AJAX content; second, the retrieved content was partitioned into items and then classified with the help of the agricultural ontology. HTML tags and punctuation were used to segment the retrieved content into entity items. Finally, the entity items were clustered, and semantic annotations were assigned to the clustering results according to the agricultural ontology. Experimental evaluation showed that the proposed approach is effective in resource exploration, entity extraction, and semantic annotation.
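To illustrate the segmentation-and-classification step, the sketch below splits retrieved text into candidate entity items using tag boundaries and punctuation and then labels each item by matching it against a small term list standing in for an ontology. The term list, categories, and matching rule are invented for illustration and are far simpler than an ontology-driven classifier.

```python
# Hedged sketch: segment retrieved content into entity items and label them
# against a toy term list standing in for an agricultural ontology.
import re

# Toy "ontology": concept -> keywords. Real ontologies carry far richer structure.
ONTOLOGY_TERMS = {
    "Crop": {"wheat", "rice", "maize"},
    "Disease": {"rust", "blight", "mildew"},
    "Fertilizer": {"urea", "compost", "potash"},
}

def segment_items(html_fragment: str):
    # Drop tags, then split on punctuation that commonly separates list-like items.
    text = re.sub(r"<[^>]+>", "\n", html_fragment)
    items = re.split(r"[\n,;:]+", text)
    return [item.strip() for item in items if item.strip()]

def classify(item: str) -> str:
    words = set(item.lower().split())
    for concept, keywords in ONTOLOGY_TERMS.items():
        if words & keywords:
            return concept
    return "Unknown"

if __name__ == "__main__":
    fragment = "<li>Winter wheat</li><li>Stripe rust warning</li><li>Apply urea in spring</li>"
    for item in segment_items(fragment):
        print(f"{item!r} -> {classify(item)}")
```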
Funding: Supported by the Knowledge Innovation Program of the Chinese Academy of Sciences (No. KZCX1-YW-12-04) and the National High Technology Research and Development Program of China (863 Program) (Nos. 2009AA12Z148, 2007AA092202). Support for this study was provided by the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences (IGSNRR, CAS), and the Institute of Oceanology, CAS.
Abstract: With long-term marine surveys and research, and especially with the development of new marine environment monitoring technologies, prodigious amounts of complex marine environmental data are generated and continue to increase rapidly. Features of these data include massive volume, widespread distribution, multiple sources, heterogeneity, multiple dimensions, and dynamic structure in space and time. The present study recommends an integrative visualization solution for these data, to enhance the visual display of data and data archives and to develop joint use of the data distributed among different organizations or communities. This study also analyzes web services technologies and defines the concept of the marine information grid, then focuses on the spatiotemporal visualization method and proposes a process-oriented spatiotemporal visualization method. We discuss how marine environmental data can be organized based on the spatiotemporal visualization method, and how the organized data are represented for use with web services and stored in a reusable fashion. In addition, we provide an original visualization architecture that is integrative and based on the explored technologies. Finally, we present a prototype system for marine environmental data of the South China Sea, with visualizations of Argo floats, sea surface temperature fields, sea current fields, salinity, in-situ investigation data, and ocean stations. The integrative visualization architecture is illustrated in the prototype system, which highlights the process-oriented spatiotemporal visualization method and demonstrates the benefits of the architecture and the methods described in this study.
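As one way to picture the process-oriented organization described above, the sketch below groups time-stamped marine observations into named processes (time-ordered sequences for a given phenomenon and region) and serializes a process to JSON so it could be returned by a web service. The field names and grouping rule are assumptions made for illustration, not the paper's data model.

```python
# Hedged sketch: organize spatiotemporal observations into process-oriented
# sequences that a web service could serve. Fields and grouping are hypothetical.
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Observation:
    phenomenon: str     # e.g. "sea_surface_temperature"
    region: str         # e.g. "south_china_sea"
    time: str           # ISO 8601 timestamp
    lon: float
    lat: float
    value: float

def build_processes(observations: List[Observation]) -> dict:
    """Group observations by (phenomenon, region) and order each group by time."""
    processes = {}
    for obs in observations:
        key = f"{obs.phenomenon}@{obs.region}"
        processes.setdefault(key, []).append(obs)
    for key in processes:
        processes[key].sort(key=lambda o: o.time)
    return processes

def process_to_json(name: str, steps: List[Observation]) -> str:
    return json.dumps({"process": name, "steps": [asdict(s) for s in steps]}, indent=2)

if __name__ == "__main__":
    obs = [
        Observation("sea_surface_temperature", "south_china_sea", "2010-07-02T00:00:00Z", 114.0, 12.5, 29.1),
        Observation("sea_surface_temperature", "south_china_sea", "2010-07-01T00:00:00Z", 114.0, 12.5, 28.8),
    ]
    for name, steps in build_processes(obs).items():
        print(process_to_json(name, steps))
```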
Abstract: Since web-based GIS processes large volumes of spatial geographic information over the internet, we should try to improve the efficiency of spatial data query processing and transmission. This paper presents two efficient methods for this purpose: a division transmission method and a progressive transmission method. In the division transmission method, a map is divided into several parts, called "tiles", and only the tiles requested by a client are transmitted. In the progressive transmission method, a map is split into several phase views based on the significance of vertices, and the server produces a target object and transmits it progressively when the spatial object is requested by a client. To realize these methods, the "tile division" and "priority order estimation" algorithms and the corresponding data transmission strategies are proposed in this paper. Compared with such traditional methods as "whole-map transmission" and "layer transmission", the web-based GIS data transmission proposed in this paper improves data transmission efficiency by a large margin.
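The sketch below shows one simple way a tile-division scheme can work: the map extent is cut into a fixed grid, and a request for a bounding box returns only the indices of tiles that intersect it, so only those tiles need to be transmitted. The grid size and coordinate handling are illustrative assumptions rather than the algorithm given in the paper.

```python
# Hedged sketch of tile division: return only the tile indices a client's
# requested bounding box intersects. Grid parameters are illustrative.
import math

def tiles_for_bbox(map_extent, tile_size, bbox):
    """map_extent/bbox are (min_x, min_y, max_x, max_y); tiles are tile_size wide/high."""
    min_x, min_y, max_x, max_y = map_extent
    req_min_x, req_min_y, req_max_x, req_max_y = bbox
    # Clamp the request to the map extent.
    req_min_x, req_min_y = max(req_min_x, min_x), max(req_min_y, min_y)
    req_max_x, req_max_y = min(req_max_x, max_x), min(req_max_y, max_y)

    col_start = int((req_min_x - min_x) // tile_size)
    col_end = int(math.ceil((req_max_x - min_x) / tile_size))
    row_start = int((req_min_y - min_y) // tile_size)
    row_end = int(math.ceil((req_max_y - min_y) / tile_size))
    return [(col, row) for col in range(col_start, col_end) for row in range(row_start, row_end)]

if __name__ == "__main__":
    extent = (0.0, 0.0, 1000.0, 1000.0)     # whole map, in map units
    needed = tiles_for_bbox(extent, tile_size=250.0, bbox=(300.0, 100.0, 620.0, 260.0))
    print(needed)  # only the tiles covering the request, not the whole map
```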
Funding: The National Grand Fundamental Research 973 Program of China (G1998030414).
Abstract: In order to use data and information on the Internet, it is necessary to extract data from web pages. An HTT tree model representing HTML pages is presented. Based on the HTT model, a wrapper generation algorithm, AGW, is proposed. The AGW algorithm utilizes a comparing-and-correcting technique to generate the wrapper, exploiting the native characteristics of the HTT tree structure. The AGW algorithm can not only generate the wrapper automatically, but also rebuild the data schema easily and reduce computational complexity.
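The HTT model and the AGW algorithm are not specified in this abstract, so the sketch below shows only the general flavor of tree-based extraction: it computes the tag path of every text node in a page and treats the most frequently repeated path as the record field to extract. The sample HTML and the frequency heuristic are assumptions for illustration, not the AGW algorithm itself.

```python
# Hedged sketch of tree-based extraction: find the most repeated tag path of
# text nodes and extract the texts under it. Heuristic and HTML are illustrative.
from collections import Counter

from bs4 import BeautifulSoup, NavigableString

SAMPLE_HTML = """
<html><body><h1>Listings</h1><ul>
  <li><span class="t">Item A</span></li>
  <li><span class="t">Item B</span></li>
  <li><span class="t">Item C</span></li>
</ul></body></html>
"""

def text_node_paths(soup):
    """Yield (tag_path, text) for every non-empty text node."""
    for node in soup.descendants:
        if isinstance(node, NavigableString) and node.strip():
            parents = [p.name for p in reversed(list(node.parents)) if p.name != "[document]"]
            yield "/".join(parents), node.strip()

def extract_repeated_field(html: str):
    soup = BeautifulSoup(html, "html.parser")
    pairs = list(text_node_paths(soup))
    most_common_path, _ = Counter(path for path, _ in pairs).most_common(1)[0]
    return [text for path, text in pairs if path == most_common_path]

if __name__ == "__main__":
    print(extract_repeated_field(SAMPLE_HTML))  # ['Item A', 'Item B', 'Item C']
```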
Funding: This work is financially supported by the Ministry of Earth Sciences (MoES), Government of India (Grant No. MoES/36/OOIS/Extra/45/2015), URL: https://www.moes.gov.in.
Abstract: The drastic growth of coastal observation sensors results in copious data that provide weather information. The intricacies in sensor-generated big data are heterogeneity and interpretation, driving the need for high-end Information Retrieval (IR) systems. The Semantic Web (SW) can solve this issue by integrating data into a single platform for information exchange and knowledge retrieval. This paper focuses on exploiting an SW-based system to provide interoperability through ontologies by combining data concepts with ontology classes. This paper presents a 4-phase weather data model: data processing, ontology creation, SW processing, and a query engine. The developed Oceanographic Weather Ontology helps to enhance data analysis, discovery, IR, and decision making. In addition, the developed ontology is evaluated against other state-of-the-art ontologies. The proposed ontology's quality improved by 39.28% in terms of completeness, its structural complexity decreased by 45.29%, and Precision and Accuracy improved by 11% and 37.7%, respectively. Ocean data from the Indian meteorological satellite INSAT-3D is a typical example used to test the proposed model. The experimental results show the effectiveness of the proposed data model and its advantages in machine understanding and IR.
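As a small, hedged illustration of mapping data concepts to ontology classes and querying them, the sketch below builds a few RDF triples with rdflib for a made-up weather namespace and runs a SPARQL query over them. The class names, properties, and namespace URI are invented and are not the Oceanographic Weather Ontology itself.

```python
# Hedged sketch: a toy RDF graph with ontology-style classes plus a SPARQL query.
# Namespace, classes, and instance data are invented for illustration (rdflib).
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD

OWO = Namespace("http://example.org/weather#")  # hypothetical ontology namespace

g = Graph()
g.bind("owo", OWO)

# A minimal class hierarchy and one observation instance.
g.add((OWO.Observation, RDF.type, RDFS.Class))
g.add((OWO.SeaSurfaceTemperature, RDFS.subClassOf, OWO.Observation))
g.add((OWO.obs1, RDF.type, OWO.SeaSurfaceTemperature))
g.add((OWO.obs1, OWO.hasValue, Literal(29.4, datatype=XSD.double)))
g.add((OWO.obs1, OWO.observedAt, Literal("2020-06-01T00:00:00Z", datatype=XSD.dateTime)))

query = """
PREFIX owo: <http://example.org/weather#>
SELECT ?obs ?value WHERE {
  ?obs a owo:SeaSurfaceTemperature ;
       owo:hasValue ?value .
}
"""

for row in g.query(query):
    print(row.obs, row.value.toPython())
```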
Abstract: Organizations tend to use information systems (IS) applications that require data to be exchanged between different parties, while data exchange is constrained by information reach and range, which determine an organization's IT platform. To determine the best platform, a comparison between electronic data interchange (EDI) and web services was conducted against certain criteria, and the results were then matched with information reach and range. The main findings show that a web services platform is appropriate when information access is required by anyone, anywhere, regardless of IT base. EDI is appropriate when the range of information access does not exceed the organization's boundaries. When the range of information access exceeds the organization's boundaries but is still limited to certain partners, either web services or EDI can be used, and the organization can choose between the two platforms on the basis of other criteria such as security and cost.
Abstract: In order to extract the boundaries of rural habitation from geographic name data and basic geographic information data, an extraction method based on polygon aggregation is proposed; it can extract the boundaries of three levels of rural habitation: town, administrative village, and natural village. The method first extracts the boundary of a natural village by aggregating the resident polygons, then extracts the boundary of an administrative village by aggregating the boundaries of its natural villages, and finally extracts the boundary of a town by aggregating the boundaries of its administrative villages. The methods for extracting the boundaries of these three levels of rural habitation are described in detail in an experiment with basic geographic information data and geographic name data. Experimental results show that the method can serve as a reference for the boundary extraction of rural habitation.
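One common way to aggregate neighbouring polygons into a single boundary is a buffer-union-debuffer (dilation/erosion) operation; the sketch below applies that idea with shapely to a few toy resident polygons. The buffer distance, coordinates, and the use of shapely are assumptions for illustration and are not the aggregation rules used in the paper.

```python
# Hedged sketch: aggregate nearby resident polygons into one boundary by
# buffering, unioning, and shrinking back (dilation/erosion). Illustrative only.
from shapely.geometry import Polygon
from shapely.ops import unary_union

def aggregate_polygons(polygons, gap=15.0):
    """Merge polygons whose mutual gaps are smaller than roughly 2*gap map units."""
    grown = [p.buffer(gap) for p in polygons]   # dilate each polygon
    merged = unary_union(grown)                 # union the dilated shapes
    return merged.buffer(-gap)                  # erode back to an outer boundary

if __name__ == "__main__":
    residents = [
        Polygon([(0, 0), (20, 0), (20, 20), (0, 20)]),
        Polygon([(30, 0), (50, 0), (50, 20), (30, 20)]),            # 10 units from the first
        Polygon([(200, 200), (220, 200), (220, 220), (200, 220)]),  # far away, stays separate
    ]
    boundary = aggregate_polygons(residents)
    print(boundary.geom_type, round(boundary.area, 1))
```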
Abstract: A vast amount of data (known as big data) may now be collected and stored from a variety of data sources, including event logs, the internet, smartphones, databases, sensors, cloud computing, and Internet of Things (IoT) devices. The term "big data security" refers to all the safeguards and instruments used to protect both the data and the analytics processes against intrusions, theft, and other hostile actions that could endanger or adversely influence them. Beyond being a high-value and desirable target, protecting big data poses particular difficulties. Big data security does not fundamentally differ from conventional data security; its issues are caused by extraneous distinctions rather than fundamental ones. This study meticulously outlines the numerous security difficulties that big data analytics now faces and encourages additional joint research on reducing big data security challenges using the Web Ontology Language (OWL). Although we focus on the security challenges of big data in this paper, we also briefly cover the broader challenges of big data. The proposed classification of big data security, expressed in OWL and built with the Protégé software, has 32 classes and 45 subclasses.
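To show roughly what such an OWL classification looks like outside of Protégé, the sketch below declares a tiny, invented fragment of a big-data-security class hierarchy as OWL/RDF triples with rdflib and counts its classes and subclass links. The class names and namespace are hypothetical; the real classification reported in the paper has 32 classes and 45 subclasses.

```python
# Hedged sketch: a tiny, invented OWL class hierarchy for big data security,
# built as RDF triples with rdflib. Class names are illustrative only.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

SEC = Namespace("http://example.org/bigdata-security#")  # hypothetical namespace
g = Graph()
g.bind("sec", SEC)

hierarchy = {
    SEC.SecurityChallenge: [SEC.DataPrivacy, SEC.AccessControl, SEC.Infrastructure],
    SEC.DataPrivacy: [SEC.Anonymization, SEC.Encryption],
    SEC.Infrastructure: [SEC.DistributedNodeSecurity],
}

for parent, children in hierarchy.items():
    g.add((parent, RDF.type, OWL.Class))
    for child in children:
        g.add((child, RDF.type, OWL.Class))
        g.add((child, RDFS.subClassOf, parent))

classes = set(g.subjects(RDF.type, OWL.Class))
subclass_links = list(g.subject_objects(RDFS.subClassOf))
print(f"{len(classes)} classes, {len(subclass_links)} subclass links")
```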