Web information systems (WIS) are frequently used and indispensable in daily social life. A WIS provides information services in many scenarios, such as electronic commerce, communities, and edutainment. Data cleaning plays an essential role in various WIS scenarios to improve the quality of data services. In this paper, we present a review of state-of-the-art methods for data cleaning in WIS. According to the characteristics of data cleaning, we extract the critical elements of WIS, such as interactive objects, application scenarios, and core technology, to classify the existing works. Then, after elaborating and analyzing each category, we summarize the descriptions and challenges of data cleaning methods with sub-elements such as data & user interaction, data quality rules, models, crowdsourcing, and privacy preservation. Finally, we analyze various types of problems and provide suggestions for future research on data cleaning in WIS from the technological and interactive perspectives.
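Data quality rules, one of the sub-elements named above, can be illustrated with a minimal sketch: enforcing a functional dependency (here the hypothetical rule zip → city, with invented records) by repairing minority values toward the most frequent one.

```python
# Minimal rule-based cleaning sketch: enforce the functional dependency
# lhs -> rhs by rewriting rhs to the most frequent value seen with each lhs.
# The rule and the records below are illustrative, not from the paper.
from collections import Counter, defaultdict

def repair_fd(records, lhs, rhs):
    """Repair violations of the functional dependency lhs -> rhs."""
    groups = defaultdict(Counter)
    for r in records:
        groups[r[lhs]][r[rhs]] += 1
    repaired = []
    for r in records:
        fixed = dict(r)
        # majority vote among rhs values observed with this lhs value
        fixed[rhs] = groups[r[lhs]].most_common(1)[0][0]
        repaired.append(fixed)
    return repaired

rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newyork"},   # violates zip -> city
]
clean = repair_fd(rows, "zip", "city")
```

Real WIS cleaning systems combine many such rules with user interaction and crowdsourced verification; this only shows the single-rule core.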
To solve the problem of chaining distributed geographic information Web services (GI Web services), this paper provides an ontology-based method. With this method, semantic service description is achieved by semantic annotation of the elements in a Web Service Description Language (WSDL) document with concepts of a geographic ontology, so that a common understanding of service semantics between customers and providers of Web services is built. Based on the decomposition and formalization of customer requirements, the discovery, composition, and execution of GI Web services are explained in detail, and a chain of GI Web services is built and used to satisfy the customer's requirement. Finally, an example based on the Web Ontology Language for Services (OWL-S) is provided to test the feasibility of this method.
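Annotating WSDL elements with ontology concepts can be sketched in SAWSDL style, where a `modelReference` attribute points each element at a concept URI. The WSDL fragment and the ontology URIs below are invented for illustration; the paper's own annotation scheme and geographic ontology may differ.

```python
# Sketch: attach ontology concept URIs to WSDL message parts, SAWSDL-style.
# The WSDL fragment and the example.org ontology URIs are assumptions.
import xml.etree.ElementTree as ET

SAWSDL = "http://www.w3.org/ns/sawsdl"
WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

wsdl = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/">
  <message name="BufferRequest">
    <part name="geometry" type="xsd:string"/>
    <part name="distance" type="xsd:double"/>
  </message>
</definitions>"""

annotations = {  # element name -> geographic-ontology concept (assumed URIs)
    "geometry": "http://example.org/geo-onto#Geometry",
    "distance": "http://example.org/geo-onto#Distance",
}

root = ET.fromstring(wsdl)
for part in root.iter(f"{{{WSDL_NS}}}part"):
    concept = annotations.get(part.get("name"))
    if concept:
        # Clark-notation attribute name -> serialized as sawsdl:modelReference
        part.set(f"{{{SAWSDL}}}modelReference", concept)

annotated = ET.tostring(root, encoding="unicode")
```

A matchmaker can then compare these concept URIs against formalized customer requirements when discovering and composing services.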
With the propagation of applications on the Internet, the Internet has become a great information source that supplies users with valuable information, but it is hard for users to quickly acquire the right information on the web. This paper presents an intelligent agent for Internet applications that retrieves and extracts web information under the user's guidance. The intelligent agent is made up of a retrieval script to identify web sources, an extraction script based on the Document Object Model (DOM) to express the extraction process, a data translator to export the extracted information into knowledge bases with frame structures, and a data reasoner to answer users' questions. A GUI tool named Script Writer helps to generate the extraction script visually, and knowledge rule databases help to extract the wanted information and to generate answers to questions.
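The DOM-driven extraction step can be sketched with the standard library's event-based parser; the agent described above uses its own extraction-script language, so this is only an illustration of pulling a page title and links while walking the markup (the sample page is invented).

```python
# Minimal DOM-style extraction sketch: collect the <title> text and all
# <a href> targets from a page. The sample page below is illustrative.
from html.parser import HTMLParser

class TitleLinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

page = '<html><head><title>News</title></head><body><a href="/a1">x</a></body></html>'
ex = TitleLinkExtractor()
ex.feed(page)
```

A full agent would feed such extracted fields into frame-structured knowledge bases, as the abstract describes.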
Information access builds on the rich data available for information retrieval and has evolved to provide principled approaches and strategies for searching. In building a successful web retrieval search engine model, a number of prospects arise at different levels, where techniques such as Usenet analysis and support vector machines are employed to significant effect. The present investigation explores a number of problems in finding information on the web, identified at their respective levels. The authors examine these issues and prospects by applying different methods, such as web graph analysis, the retrieval and analysis of newsgroup postings, and statistical methods for inferring meaning in text. The proposed model thus assists users in finding the data they need. The study proposes three heuristic models to characterize the balance between the query and feedback information, so that adaptive relevance feedback can be achieved. The authors also discuss the parameter factors that are responsible for efficient searching; these parameters can be taken into account in the future extension or development of search engines.
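The balance between the original query and feedback information is classically captured by Rocchio relevance feedback, which moves the query vector toward relevant documents. This is a generic sketch, not the paper's three heuristic models; the weights and term vectors are illustrative.

```python
# Rocchio relevance feedback sketch (positive feedback only): blend the query
# vector with the centroid of relevant documents. alpha/beta are assumed weights.
def rocchio(query_vec, relevant_docs, alpha=1.0, beta=0.75):
    terms = set(query_vec)
    for d in relevant_docs:
        terms |= set(d)
    n = len(relevant_docs)
    return {
        t: alpha * query_vec.get(t, 0.0)
           + (beta / n) * sum(d.get(t, 0.0) for d in relevant_docs)
        for t in terms
    }

q = {"web": 1.0, "search": 1.0}
feedback = [{"web": 1.0, "engine": 1.0}, {"engine": 1.0}]
q2 = rocchio(q, feedback)   # "engine" enters the query; "web" is reinforced
```

Tuning alpha against beta is exactly the query-versus-feedback balancing act the abstract refers to.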
This paper selects 998 articles as its data sources from four Chinese core journals in the field of Library and Information Science from 2003 to 2007. Some pertinent aspects of reference citations, particularly from web resources, are selected for a focused analysis and discussion. This includes primarily such items as the number of web citations, web citations per article, the distribution of domain names of web citations, and certain aspects of the institutional and/or geographical affiliations of the authors. The evolving situation of utilizing online networked academic information resources in China is the central theme of this study. The paper is augmented by 3 figures, 6 tables, and 18 references.
Purpose: The objectives of this study are to explore an effective technique to extract information from weblogs and to develop an experimental system that extracts as much structured information as possible with this technique. The system will lay a foundation for the evaluation, analysis, retrieval, and utilization of the extracted information.
Design/methodology/approach: An improved template extraction technique was proposed. Separate templates designed for extracting blog entry titles, posts, and their comments were established, and structured information was extracted online step by step. A dozen data items, such as entry titles, posts and their commenters and comments, the numbers of views, and the numbers of citations, were extracted from eight major Chinese blog websites, including Sina, Sohu, and Bokee.
Findings: Results showed that the average accuracy of the experimental extraction system reached 94.6%. Because an online, multi-threaded extraction technique was adopted, extraction speed reached an average of 15 pages per second, not counting network delay. In addition, entries posted via Ajax could be extracted successfully.
Research limitations: As the templates need to be established in advance, this extraction technique can be effectively applied only to a limited range of blog websites. In addition, the stability of the extraction templates is affected by the source code of the blog pages.
Practical implications: This paper has designed and built a blog page extraction system, which can be used to extract structured data, preserve and update the data, and facilitate the collection, study, and utilization of blog resources, especially academic blog resources.
Originality/value: This modified template extraction technique outperforms general Web page downloaders and specialized blog page downloaders in structured and comprehensive data extraction.
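The per-site template idea, including the stated limitation that templates must exist in advance, can be sketched as a table of field patterns keyed by host. The hostname, patterns, and HTML below are invented; real templates would target each blog platform's actual markup.

```python
# Toy per-site template extraction: each covered host has a template mapping
# field names to regexes. Host, patterns, and page content are illustrative.
import re

TEMPLATES = {
    "blog.example.com": {
        "title": r'<h1 class="entry-title">(.*?)</h1>',
        "views": r'Views:\s*(\d+)',
    },
}

def extract(host, html):
    tmpl = TEMPLATES.get(host)
    if tmpl is None:
        return None          # limitation: uncovered sites cannot be extracted
    out = {}
    for field, pattern in tmpl.items():
        m = re.search(pattern, html, re.S)
        out[field] = m.group(1) if m else None
    return out

page = '<h1 class="entry-title">Hello</h1> ... Views: 128'
rec = extract("blog.example.com", page)
```

This also makes the fragility concrete: any change to a site's source code breaks its template until it is rebuilt.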
Due to the rapid development of electronic information technology, Internet technology and system software development have become more and more widespread. In particular, with the development of public security work, there are growing requirements to standardize administrative department management systems, improve office efficiency, and support decision-making. Therefore, it is of great practical value to design and complete a comprehensive public security business information system. Based on Java technology, this paper designs and builds a comprehensive information management platform for public security through an analysis of comprehensive public security business; the platform received good feedback during actual testing, which confirms the feasibility of the system.
In recent years, as there has been a major change in the necessity and importance of sightseeing information, a platform that provides real-time sightseeing information according to ever-changing circumstances is necessary. Additionally, it is effective to adopt gamification to increase users' motivation to continuously utilize such a system in order to provide them with more information. In the present study, in order to support users in creating efficient and pleasant sightseeing plans, a system that incorporates gamification to increase motivation was developed by combining a Web-based geographic information system (Web-GIS) with a sightseeing planning and sharing system. The system was operated over a period of 2 weeks in Chofu City, Tokyo Metropolis, Japan. Based on the results of a questionnaire survey of 51 users, although the operability of the 3 main functions incorporating gamification-based motivation was rated lower than that of the 2 basic functions, their usefulness was highly rated. Based on the results of the access log analysis, it was effective to design the system so that the same functions can be used regardless of the type of information terminal. Additionally, it was evident that continuous utilization of the system could increase the number of sightseeing plans created by the users.
An improved decision tree method for web information retrieval with self-mapping attributes is proposed. The self-mapping tree stores a value of a self-mapping attribute in each internal node, along with information based on the dissimilarity between a pair of mapping sequences. The method selects the self-mappings that exist between data items by exhaustive search based on relation and attribute information. Experimental results confirm that the improved method constructs comprehensible and accurate decision trees. Moreover, an example shows that the self-mapping decision tree is promising for data mining and knowledge discovery.
The web is an extremely dynamic world where information may be updated every second. A web information monitoring system fetches information from the web continuously and finds changes by comparing two versions of the same page. The updating of a specific web page is modeled as a Poisson process whose parameter indicates the change frequency. As the amount of computing resources is limited, it is necessary to find policies for reducing the overall change-detection time. Different allocation schemes are evaluated experimentally to find out which one is the most suitable for the web information monitoring problem. The experimental data show the runtime characteristics of the overall system performance and its relationship to the total amount of resources.
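Under the Poisson model, a page with change rate lam has probability 1 - exp(-lam * dt) of having changed dt time units after its last fetch, which suggests greedy allocation schemes like the one sketched below. The rates, budget, and greedy policy here are illustrative, not the schemes the paper evaluates.

```python
# Poisson change model: p_changed gives the probability a page has changed
# since its last fetch. allocate() is an assumed greedy policy that spends
# each fetch slot on the page most likely to hold an undetected change.
import math

def p_changed(lam, dt):
    return 1.0 - math.exp(-lam * dt)

def allocate(rates, budget):
    last = {p: 0.0 for p in rates}       # time of each page's last fetch
    counts = {p: 0 for p in rates}       # fetches allocated per page
    t = 1.0
    for _ in range(budget):
        page = max(rates, key=lambda p: p_changed(rates[p], t - last[p]))
        counts[page] += 1
        last[page] = t
        t += 1.0
    return counts

# A fast-changing page monopolizes the budget under this greedy rule
slots = allocate({"fast": 2.0, "slow": 0.01}, budget=10)
```

Comparing such schemes experimentally, as the paper does, reveals how detection latency trades off against the total resource budget.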
The incidence of porcine pasteurellosis in China is so widespread that it is difficult to clearly understand its prevalence and to maintain continuous monitoring. In order to reduce the immense negative economic impact on the livestock industry, monitoring, early warning, and visual management systems are highly desirable. In this study, a monitoring and early warning system for porcine pasteurellosis was established based on Web Geographical Information System (WebGIS) technology. By establishing path analysis, buffer analysis, and hot spot analysis functions, it can support the control of infectious diseases. For early warning of the disease, four common interpolation methods were tested, all of which showed that the affected area of porcine pasteurellosis in China was mainly concentrated in the south of the mainland. A cross-validation method was used to compare the four interpolation methods; the cross-validation showed that the inverse distance weighting (IDW) method was suitable for forecasting the occurrence of porcine pasteurellosis in China. Finally, using C sharp (C#) as the development language and WebGIS technology, a monitoring and early warning system based on a Browser/Server structure was developed. This is the first monitoring and early warning system for porcine pasteurellosis based on WebGIS. The performance of the WebGIS technology indicates great potential for animal infectious disease applications and provides a foundation for future work.
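The IDW method selected by the cross-validation can be sketched in a few lines: each sample contributes to the estimate with weight proportional to an inverse power of its distance. The sample points below are invented (the study interpolates disease-occurrence data).

```python
# Minimal inverse distance weighting (IDW) sketch. samples is a list of
# (xi, yi, value) tuples; the points below are illustrative only.
def idw(x, y, samples, power=2):
    """IDW estimate at (x, y) from scattered samples."""
    num = den = 0.0
    for xi, yi, v in samples:
        d2 = (x - xi) ** 2 + (y - yi) ** 2
        if d2 == 0:
            return v                     # query lies exactly on a sample
        w = 1.0 / d2 ** (power / 2)      # weight ~ 1 / distance**power
        num += w * v
        den += w
    return num / den

pts = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
mid = idw(1.0, 0.0, pts)                 # equidistant -> simple average
```

The `power` parameter controls how local the estimate is; cross-validation, as in the study, is the usual way to pick it and to compare IDW against other interpolators.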
In this paper, the distribution characteristics of user behaviors are first studied based on log data from a massive web search engine. Analysis shows that the stochastic distribution of user queries accords with a power-law function and exhibits strong similarity, and that users' queries and clicked URLs present dramatic locality, which implies that a query cache and a 'hot click' cache can be employed to improve system performance. Three typical cache replacement policies are then compared: LRU, FIFO, and LFU with attenuation. In addition, the distribution characteristics of web information are analyzed, which demonstrates that the link popularity and replica popularity of a URL have a positive influence on its importance. Finally, the variance between link popularity and user popularity, and between replica popularity and user popularity, is analyzed, which gives some important insights that help improve the ranking algorithms in a search engine.
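Of the three replacement policies compared, LRU is the simplest to sketch for a query-result cache: an ordered map where a hit moves the entry to the most-recent end and eviction drops the least-recent one. The capacity and queries below are illustrative.

```python
# LRU query-result cache sketch, one of the three policies compared
# (LRU, FIFO, LFU with attenuation). Capacity and keys are illustrative.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # a hit marks key most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used key

cache = LRUCache(2)
cache.put("q1", "r1")
cache.put("q2", "r2")
cache.get("q1")            # q1 becomes most recent
cache.put("q3", "r3")      # capacity exceeded: q2 is evicted
```

Because the query stream is power-law distributed with strong locality, even small caches of this kind capture a large fraction of repeated queries.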
Funding notes (from the original listing):
- Ontology-based chaining of GI Web services: the National Natural Science Foundation of China (Grant No. 60774041).
- Web citation analysis of Chinese Library and Information Science journals: the National Social Science Fund of China (Grant No. 08CTQ015).
- Template-based blog information extraction: the Foundation for Humanities and Social Sciences of the Chinese Ministry of Education (Grant No. 08JC870002).
- Web information monitoring: the National Natural Science Foundation of China (No. 60131160743).
- Porcine pasteurellosis monitoring and early warning system: the National Key R&D Program of China (Grant No. 2017YFD0501806) and the Major Program of Applied Technology Research and Development Plan of Heilongjiang Province (Grant No. GA18B203).
- Search engine user behavior analysis: the National Grand Fundamental Research Program of China (Grant No. G1999032706).