Abstract: Web information systems (WIS) are widely used and indispensable in daily social life. A WIS provides information services in many scenarios, such as electronic commerce, online communities, and edutainment. Data cleaning plays an essential role in various WIS scenarios by improving the quality of data services. In this paper, we present a review of state-of-the-art methods for data cleaning in WIS. Based on the characteristics of data cleaning, we extract the critical elements of WIS, such as interactive objects, application scenarios, and core technologies, to classify the existing work. Then, after elaborating on and analyzing each category, we summarize the data cleaning methods and their challenges along sub-elements such as data and user interaction, data quality rules, models, crowdsourcing, and privacy preservation. Finally, we analyze the various types of problems and provide suggestions for future research on data cleaning in WIS from both the technological and the interaction perspective.
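As a concrete illustration of the data-quality-rule sub-element mentioned in this abstract, a common cleaning step checks functional dependencies (e.g., a ZIP code should determine its city) and flags violating records. The sketch below is hypothetical (the field names and data are invented for illustration), not the surveyed paper's method:

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Find values violating the functional dependency lhs -> rhs."""
    seen = defaultdict(set)          # lhs value -> set of rhs values observed
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {k: v for k, v in seen.items() if len(v) > 1}

records = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},      # conflicting city for the same ZIP
    {"zip": "60601", "city": "Chicago"},
]
print(fd_violations(records, "zip", "city"))
# → {'10001': {'New York', 'NYC'}} (set element order may vary)
```

A cleaning system would then repair or route such conflicts to users or crowd workers, which is where the interaction and crowdsourcing sub-elements come in.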
Funding: the National Natural Science Foundation of China (60774041)
Abstract: To solve the problem of chaining distributed geographic information Web services (GI Web services), this paper proposes an ontology-based method. With this method, a semantic service description is achieved by semantically annotating the elements of a Web Service Description Language (WSDL) document with concepts from a geographic ontology, so that a common understanding of service semantics is built between the customers and providers of Web services. Based on the decomposition and formalization of customer requirements, the discovery, composition, and execution of GI Web services are explained in detail, and a chain of GI Web services is then built to satisfy the customer's requirement. Finally, an example based on the Web Ontology Language for Services (OWL-S) is provided to test the feasibility of the method.
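One concrete mechanism for annotating WSDL elements with ontology concepts is the W3C SAWSDL `modelReference` attribute; the paper itself uses OWL-S, so the snippet below is a related illustration rather than the authors' exact approach, and the ontology URI is invented:

```python
import xml.etree.ElementTree as ET

SAWSDL = "http://www.w3.org/ns/sawsdl"   # W3C SAWSDL namespace
ET.register_namespace("sawsdl", SAWSDL)

wsdl = ET.fromstring(
    '<definitions xmlns="http://schemas.xmlsoap.org/wsdl/">'
    '<message name="BufferRequest"/>'
    '</definitions>'
)

# Annotate the message element with a geographic-ontology concept
# (the ontology URI is illustrative, not a real vocabulary).
msg = wsdl.find("{http://schemas.xmlsoap.org/wsdl/}message")
msg.set(f"{{{SAWSDL}}}modelReference",
        "http://example.org/geo-ontology#BufferOperationInput")

print(ET.tostring(wsdl, encoding="unicode"))
```

A matchmaker can then compare these concept URIs, rather than plain element names, when discovering and composing services into a chain.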
Abstract: With the proliferation of applications on the Internet, the Internet has become a rich information source that supplies users with valuable information, but it is hard for users to quickly acquire the right information on the Web. This paper presents an intelligent agent for Internet applications that retrieves and extracts Web information under the user's guidance. The intelligent agent is made up of a retrieval script that identifies Web sources, an extraction script based on the Document Object Model that expresses the extraction process, a data translator that exports the extracted information into knowledge bases with frame structures, and a data reasoner that answers users' questions. A GUI tool named Script Writer helps to generate the extraction script visually, and knowledge rule databases help to extract the wanted information and to generate answers to questions.
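The extraction-script idea of walking page markup and pulling out target fields can be sketched with the standard-library HTML parser; the class and sample page below are hypothetical stand-ins for the paper's Script Writer output:

```python
from html.parser import HTMLParser

class TitleLinkExtractor(HTMLParser):
    """Minimal extraction pass: collect link texts with their hrefs."""
    def __init__(self):
        super().__init__()
        self.links, self._href = [], None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href and data.strip():
            self.links.append((data.strip(), self._href))
            self._href = None

page = ('<html><body><a href="/a1">First article</a>'
        '<a href="/a2">Second</a></body></html>')
p = TitleLinkExtractor()
p.feed(page)
print(p.links)   # → [('First article', '/a1'), ('Second', '/a2')]
```

In a full agent, the extracted tuples would be translated into frame-structured knowledge-base entries rather than printed.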
Funding: supported by the National Social Science Fund of China (Grant No. 08CTQ015)
Abstract: This paper takes 998 articles from four Chinese core journals in the field of Library and Information Science, published from 2003 to 2007, as its data source. Pertinent aspects of reference citations, particularly citations of web resources, are selected for focused analysis and discussion. These include the number of web citations, web citations per article, the distribution of domain names among web citations, and certain aspects of the institutional and/or geographical affiliations of the authors. The evolving use of online networked academic information resources in China is the central theme of this study. The discussion is supported by 3 figures, 6 tables, and 18 references.
Funding: supported by the Foundation for Humanities and Social Sciences of the Chinese Ministry of Education (Grant No. 08JC870002)
Abstract: Purpose: The objectives of this study are to explore an effective technique for extracting information from weblogs and to develop an experimental system that extracts as much structured information as possible with this technique. The system will lay a foundation for the evaluation, analysis, retrieval, and utilization of the extracted information. Design/methodology/approach: An improved template extraction technique was proposed. Separate templates for extracting blog entry titles, posts, and their comments were established, and structured information was extracted online step by step. A dozen data items, such as entry titles, posts and their commenters and comments, numbers of views, and numbers of citations, were extracted from eight major Chinese blog websites, including Sina, Sohu, and Bokee. Findings: Results showed that the average accuracy of the experimental extraction system reached 94.6%. Because an online, multi-threaded extraction technique was adopted, extraction speed improved to an average of 15 pages per second, network delay excluded. In addition, entries posted via Ajax can be extracted successfully. Research limitations: As the templates need to be established in advance, this extraction technique can be effectively applied only to a limited range of blog websites. In addition, the stability of the extraction templates is affected by the source code of the blog pages. Practical implications: This paper has designed and built a blog page extraction system that can extract structured data, preserve and update the data, and facilitate the collection, study, and utilization of blog resources, especially academic blog resources. Originality/value: This modified template extraction technique outperforms generic Web page downloaders and specialized blog page downloaders by extracting structured and comprehensive data.
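A per-site extraction template of the kind this abstract describes can be sketched as a mapping from field names to patterns; the markup and field names below are invented for illustration and do not reflect any real blog site's actual HTML:

```python
import re

# A per-site template maps field names to regex patterns
# (the class names in the markup are hypothetical).
BLOG_TEMPLATE = {
    "title": re.compile(r'<h1 class="entry-title">(.*?)</h1>', re.S),
    "post":  re.compile(r'<div class="entry-body">(.*?)</div>', re.S),
    "views": re.compile(r'Views:\s*(\d+)'),
}

def extract(template, html):
    """Apply every field pattern of a template to one blog page."""
    out = {}
    for field, pat in template.items():
        m = pat.search(html)
        out[field] = m.group(1).strip() if m else None
    return out

page = ('<h1 class="entry-title">On template extraction</h1>'
        '<div class="entry-body">Body text here.</div> Views: 42')
print(extract(BLOG_TEMPLATE, page))
# → {'title': 'On template extraction', 'post': 'Body text here.', 'views': '42'}
```

The limitation noted in the abstract falls out directly: whenever a site changes its markup, the corresponding template must be rebuilt by hand.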
Abstract: With the rapid development of electronic information technology, Internet technology and system software development have become increasingly widespread. In public security work in particular, there are growing requirements for standardizing administrative department management systems, improving office efficiency, and supporting decision-making. It is therefore of great practical value to design and implement a comprehensive public security business information system. Based on Java technology, this paper designs and builds a comprehensive information management platform for public security through an analysis of comprehensive public security business; the platform also received good feedback in actual testing, which confirms the feasibility of the system.
Abstract: In recent years, as the necessity and importance of sightseeing information have changed greatly, a platform that provides real-time sightseeing information matched to ever-changing circumstances has become necessary. Additionally, adopting gamification is an effective way to increase users' motivation to use such a system continuously in order to provide them with more information. In the present study, to support users in enjoyably creating efficient and pleasant sightseeing plans, a system that incorporates gamification to increase motivation was developed by combining a Web-based geographic information system (Web-GIS) with a sightseeing planning and sharing system. The system was operated over a period of 2 weeks in Chofu City, Tokyo Metropolis, Japan. Based on the results of a questionnaire survey of 51 users, the operability of the 3 main functions that incorporate gamification-based motivation was rated lower than that of the 2 basic functions, but their usefulness was highly rated. Based on the results of the access log analysis, it was effective to design the system so that the same functions can be used regardless of the type of information terminal. Additionally, it was evident that continuous utilization of the system could increase the number of sightseeing plans created by users.
Funding: supported by the National Natural Science Foundation of China (No. 60131160743)
Abstract: The Web is an extremely dynamic world in which information may be updated every second. A Web information monitoring system fetches information from the Web continuously and finds changes by comparing two versions of the same page. The updating of a specific Web page is modeled as a Poisson process whose rate parameter indicates the change frequency. Since the amount of computing resources is limited, it is necessary to find policies that reduce the overall change-detection time. Different allocation schemes are evaluated experimentally to find out which one is most suitable for the Web information monitoring problem. The experimental data show the runtime characteristics of the overall system performance and its relationship to the total amount of resources.
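The Poisson model and resource allocation described above can be sketched as follows; this is one simple (proportional) scheme among those such a system might compare, with a naive rate estimator that undercounts when a page changes more than once between checks, and the page names and numbers are invented:

```python
def estimate_rate(changes_observed, checks, interval_hours):
    """Naive estimate of the Poisson change rate (changes per hour)
    from how often a page was seen changed across equally spaced checks."""
    return changes_observed / (checks * interval_hours)

def allocate_checks(rates, budget):
    """Split a fixed budget of fetches across pages, proportionally
    to each page's estimated change rate."""
    total = sum(rates.values())
    return {page: round(budget * r / total) for page, r in rates.items()}

rates = {
    "news.example/front": estimate_rate(20, 24, 1.0),  # changed on 20 of 24 hourly checks
    "blog.example/post":  estimate_rate(2, 24, 1.0),
    "docs.example/spec":  estimate_rate(1, 24, 1.0),
}
print(allocate_checks(rates, budget=100))
# → {'news.example/front': 87, 'blog.example/post': 9, 'docs.example/spec': 4}
```

Evaluating such schemes experimentally matters because proportional allocation is not necessarily optimal for minimizing overall change-detection time, which is exactly the question the paper studies.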
Funding: this work was supported by the National Key R&D Program of China (Grant No. 2017YFD0501806) and the Major Program of the Applied Technology Research and Development Plan of Heilongjiang Province (Grant No. GA18B203)
Abstract: The incidence of porcine pasteurellosis in China is so widespread that it is difficult to clearly understand its prevalence and to maintain continuous monitoring. To reduce the immense negative economic impact on the livestock industry, monitoring, early warning, and visual management systems are highly desirable. In this study, a monitoring and early warning system for porcine pasteurellosis was established based on Web Geographical Information System (WebGIS) technology. By providing path analysis, buffer analysis, and hot spot analysis functions, it supports the control of infectious diseases. For early warning of disease, four common interpolation methods were tested, all of which showed that the affected area of porcine pasteurellosis in China is mainly concentrated in the south of the mainland. A cross-validation comparison of the four interpolation methods showed that inverse distance weighting (IDW) was suitable for forecasting the occurrence of porcine pasteurellosis in China. Finally, using C sharp (C#) as the development language together with WebGIS technology, a monitoring and early warning system with a Browser/Server architecture was developed. This is the first monitoring and early warning system for porcine pasteurellosis based on WebGIS. The performance of the WebGIS technology indicates great potential for animal infectious disease applications and provides a foundation for future work.
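IDW interpolation and the cross-validation used to compare methods can be sketched briefly (in Python rather than the paper's C#); the coordinates and incidence values below are purely illustrative, not the study's data:

```python
import math

def idw(x, y, samples, power=2):
    """Inverse distance weighting: estimate the value at (x, y)
    from known samples [(xi, yi, vi), ...]."""
    num = den = 0.0
    for xi, yi, vi in samples:
        d = math.hypot(x - xi, y - yi)
        if d == 0:
            return vi                     # exactly on a sample point
        w = 1.0 / d ** power
        num += w * vi
        den += w
    return num / den

def loo_rmse(samples, power=2):
    """Leave-one-out cross-validation error of IDW on the sample set."""
    errs = []
    for i, (xi, yi, vi) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        errs.append((idw(xi, yi, rest, power) - vi) ** 2)
    return math.sqrt(sum(errs) / len(errs))

# Toy incidence values at four (lon, lat) points — illustrative numbers only.
pts = [(110.0, 25.0, 8.0), (113.0, 23.0, 9.0),
       (116.0, 30.0, 4.0), (120.0, 36.0, 1.0)]
print(idw(114.0, 26.0, pts), loo_rmse(pts))
```

Running the same leave-one-out procedure for each candidate interpolator and comparing the resulting errors is the essence of the cross-validation comparison the study performed.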
Funding: this work was supported by the National Grand Fundamental Research of China (Grant No. G1999032706)
Abstract: In this paper, we first study the distribution characteristics of user behaviors based on log data from a massive Web search engine. Analysis shows that the stochastic distribution of user queries follows a power-law function and exhibits strong similarity, and that users' queries and clicked URLs present dramatic locality, which implies that a query cache and a 'hot click' cache can be employed to improve system performance. Three typical cache replacement policies are then compared: LRU, FIFO, and LFU with attenuation. In addition, the distribution characteristics of Web information are analyzed, demonstrating that the link popularity and replica popularity of a URL have a positive influence on its importance. Finally, the variance between link popularity and user popularity, and between replica popularity and user popularity, is analyzed, yielding important insights that help improve the ranking algorithms of a search engine.
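The kind of policy comparison described above can be reproduced in miniature: generate a power-law (Zipf-like) query stream, then measure hit rates under different replacement policies. The sketch below covers LRU and FIFO only (the paper's attenuated LFU is omitted), with invented parameters:

```python
import random
from collections import OrderedDict, deque

def lru_hits(stream, size):
    cache, hits = OrderedDict(), 0
    for q in stream:
        if q in cache:
            hits += 1
            cache.move_to_end(q)           # mark as most recently used
        else:
            cache[q] = True
            if len(cache) > size:
                cache.popitem(last=False)  # evict least recently used
    return hits

def fifo_hits(stream, size):
    cache, order, hits = set(), deque(), 0
    for q in stream:
        if q in cache:
            hits += 1
        else:
            cache.add(q)
            order.append(q)
            if len(cache) > size:
                cache.remove(order.popleft())  # evict oldest arrival
    return hits

# Zipf-like query stream: query of rank r is drawn with weight 1/r.
random.seed(0)
queries = list(range(1, 1001))
weights = [1.0 / r for r in queries]
stream = random.choices(queries, weights=weights, k=20_000)

n = len(stream)
print(f"LRU hit rate:  {lru_hits(stream, 100) / n:.3f}")
print(f"FIFO hit rate: {fifo_hits(stream, 100) / n:.3f}")
```

Because the query distribution is heavily skewed, even a cache holding 10% of the distinct queries achieves a high hit rate, which is precisely the locality effect the log analysis exploits.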
Abstract: We introduce a new method for visualizing and analyzing the information landscapes of ideas and events posted on public web pages through customized web-search engines and keywords. This research integrates GIScience and web-search engines to track and analyze public web pages and their contents together with the associated spatial relationships. Web pages found by clusters of keywords were mapped to real-world coordinates by geolocating their Internet Protocol addresses. The resulting maps represent web information landscapes consisting of hundreds of web pages retrieved for selected keywords. With a prototype Spatial Web Automatic Reasoning and Mapping System, researchers can visualize the spread of web pages associated with specific keywords, concepts, ideas, or news over time and space. These maps may reveal important spatial relationships and spatial context associated with the selected keywords. This approach may provide a new research direction for geographers studying the diffusion of human thought and ideas. A better understanding of the spatial and temporal dynamics of the 'collective thinking of human beings' over the Internet may help us understand various innovation diffusion processes, human behaviors, and social movements around the world.
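The geolocation step, mapping each search hit's IP address to coordinates and aggregating hits per place, can be sketched as below. A real system would query an IP-geolocation database; the prefix table here is a hypothetical stand-in using reserved documentation addresses:

```python
from collections import Counter

# Hypothetical IP-prefix -> (lat, lon) table; a real system would query
# an IP-geolocation service instead of a hand-built dictionary.
GEO_TABLE = {
    "203.0.113": (35.68, 139.69),    # Tokyo (documentation prefix, illustrative)
    "198.51.100": (40.71, -74.01),   # New York
}

def locate(ip):
    prefix = ip.rsplit(".", 1)[0]
    return GEO_TABLE.get(prefix)

def keyword_landscape(page_ips):
    """Count, per resolved location, how many search-hit pages map there."""
    counts = Counter()
    for ip in page_ips:
        loc = locate(ip)
        if loc is not None:
            counts[loc] += 1
    return counts

hits = ["203.0.113.7", "203.0.113.9", "198.51.100.3", "192.0.2.1"]  # last is unmapped
print(keyword_landscape(hits))
```

Plotting these per-location counts over a base map, keyword by keyword and time slice by time slice, yields the information landscapes the paper visualizes.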
Funding: supported by the National Natural Science Foundation of China under Grant No. 90818001, and the Natural Science Foundation of Shandong Province of China under Grant No. Y2007G24
Abstract: Semantic annotation of Web objects is a key problem in Web information extraction. The Web contains an abundance of useful semi-structured information about real-world objects, and empirical study shows that Web information about objects of the same type across different Web sites exhibits strong two-dimensional sequence characteristics and correlative characteristics. Conditional Random Fields (CRFs) are the state-of-the-art approach, exploiting the sequence characteristics for better labeling. However, given the correlative characteristics between Web object elements, previous CRFs have limitations for semantic annotation of Web objects and cannot efficiently handle long-distance dependencies between Web object elements. To better incorporate long-distance dependencies, this paper, on the one hand, describes long-distance dependencies by correlative edges, which are built by exploiting structured information and the characteristics of records from external databases; on the other hand, it presents two-dimensional Correlative-Chain Conditional Random Fields (2DCC-CRFs) for semantic annotation of Web objects. This approach extends a classic model, two-dimensional Conditional Random Fields (2DCRFs), by adding correlative edges. Experimental results on a large amount of real-world data collected from diverse domains show that the proposed approach can significantly improve the semantic annotation accuracy of Web objects.
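The graph structure being described, a 2D grid of sequence and column edges plus correlative edges added when an external database confirms a shared value, can be illustrated with a toy edge builder. This sketch shows only the graph construction, not CRF inference, and the records and edge-building rule are simplified inventions:

```python
def build_edges(records, reference_values):
    """Build three edge sets over a toy 2D grid of Web object elements:
    row (sequence) edges, column (2D) edges, and correlative edges linking
    elements whose shared value is confirmed by an external database."""
    rows, cols = len(records), len(records[0])
    seq_edges = [((r, c), (r, c + 1)) for r in range(rows) for c in range(cols - 1)]
    col_edges = [((r, c), (r + 1, c)) for r in range(rows - 1) for c in range(cols)]
    corr_edges = []
    for r1 in range(rows):
        for r2 in range(r1 + 1, rows):
            for c in range(cols):
                v1, v2 = records[r1][c], records[r2][c]
                if v1 == v2 and v1 in reference_values:
                    corr_edges.append(((r1, c), (r2, c)))
    return seq_edges, col_edges, corr_edges

# Three product records from different sites; "Acme" is confirmed by an
# external reference database, linking the non-adjacent rows 0 and 2.
recs = [["Acme", "Laptop", "$999"],
        ["Sony", "Phone", "$499"],
        ["Acme", "Tablet", "$299"]]
seq, col, corr = build_edges(recs, reference_values={"Acme"})
print(len(seq), len(col), len(corr))   # → 6 6 1
```

The single correlative edge connects elements two rows apart, which is exactly the kind of long-distance dependency that plain chain or grid edges cannot express.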