Abstract: Web information systems (WIS) are frequently used and indispensable in daily social life. WIS provide information services in many scenarios, such as electronic commerce, communities, and edutainment. Data cleaning plays an essential role in various WIS scenarios to improve the quality of data services. In this paper, we present a review of the state-of-the-art methods for data cleaning in WIS. According to the characteristics of data cleaning, we extract the critical elements of WIS, such as interactive objects, application scenarios, and core technologies, to classify the existing works. Then, after elaborating on and analyzing each category, we summarize the descriptions and challenges of data cleaning methods in terms of sub-elements such as data and user interaction, data quality rules, models, crowdsourcing, and privacy preservation. Finally, we analyze the various types of problems and provide suggestions for future research on data cleaning in WIS from the technological and interaction perspectives.
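As a tiny illustration of one sub-element named above, a data quality rule, the sketch below checks a functional dependency of the form ZIP -> City over a toy record set. The rule, the field names, and the sample data are assumptions made for this example and are not drawn from the survey itself.

```python
# Minimal sketch of checking a functional dependency as a data quality rule.
# The dependency zip -> city and the sample records are illustrative only.
from collections import defaultdict

RECORDS = [
    {"zip": "100080", "city": "Beijing"},
    {"zip": "200030", "city": "Shanghai"},
    {"zip": "100080", "city": "Shanghai"},   # violates zip -> city
]

def fd_violations(records, lhs, rhs):
    """Return the lhs values whose records disagree on rhs (rule violations)."""
    groups = defaultdict(set)
    for r in records:
        groups[r[lhs]].add(r[rhs])
    return {key: values for key, values in groups.items() if len(values) > 1}

if __name__ == "__main__":
    print(fd_violations(RECORDS, "zip", "city"))  # {'100080': {'Beijing', 'Shanghai'}}
```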
Abstract: With the proliferation of applications on the Internet, the Internet has become a rich information source that supplies users with valuable information, yet it is hard for users to quickly acquire the right information on the web. This paper presents an intelligent agent for Internet applications that retrieves and extracts web information under the user's guidance. The intelligent agent is made up of a retrieval script that identifies web sources, an extraction script based on the Document Object Model (DOM) that expresses the extraction process, a data translator that exports the extracted information into knowledge bases with frame structures, and a data reasoner that answers users' questions. A GUI tool named Script Writer helps to generate the extraction script visually, and knowledge rule databases help to extract the wanted information and to generate answers to questions.
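A minimal sketch of the DOM-based extraction idea described above is shown next: XPath rules fill the slots of a frame-like record for each item node in the page. The HTML snippet, the XPath expressions, and the slot names are illustrative assumptions, not the paper's actual script language or Script Writer output.

```python
# Minimal sketch of a DOM-based extraction script (illustrative only).
# The page, the XPath rules, and the frame-slot names are assumptions.
from lxml import html

PAGE = """
<html><body>
  <div class="item"><h2>Laptop</h2><span class="price">4999</span></div>
  <div class="item"><h2>Phone</h2><span class="price">2999</span></div>
</body></html>
"""

# Each frame slot is filled by an XPath rule evaluated against the DOM tree.
EXTRACTION_RULES = {
    "name":  ".//h2/text()",
    "price": ".//span[@class='price']/text()",
}

def extract(page_source, item_xpath="//div[@class='item']", rules=EXTRACTION_RULES):
    """Apply the XPath rules to every item node and return frame-like records."""
    tree = html.fromstring(page_source)
    frames = []
    for node in tree.xpath(item_xpath):
        values = {slot: node.xpath(rule) for slot, rule in rules.items()}
        # Flatten single-valued slots for readability.
        frames.append({k: v[0] if len(v) == 1 else v for k, v in values.items()})
    return frames

if __name__ == "__main__":
    for record in extract(PAGE):
        print(record)   # e.g. {'name': 'Laptop', 'price': '4999'}
```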
Abstract: Information access, built on the rich data available for information retrieval, has evolved to provide principled approaches and strategies for searching. In building a successful web retrieval search engine model, a number of prospects arise at different levels, where techniques such as Usenet analysis and support vector machines can have a significant impact. The present investigation explores the problems identified at each of these levels in relation to finding information on the web. The authors examine the issues and prospects by applying different methods, such as web graph analysis, the retrieval and analysis of newsgroup postings, and statistical methods for inferring meaning in text. The proposed model thus assists users in finding the information they need. The study proposes three heuristic models to characterize the balance between query and feedback information, enabling adaptive relevance feedback. The authors also discuss the parameter factors responsible for efficient searching; these parameters can be taken into account in the future extension or development of search engines.
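The abstract does not specify the three heuristic models, so the sketch below shows one standard way to balance the original query against feedback information, a Rocchio-style weight update; the alpha and beta coefficients are textbook assumptions, not the paper's parameters.

```python
# Minimal sketch of balancing query terms against relevance-feedback terms.
# Rocchio-style update; alpha/beta are illustrative, not the paper's heuristics.
from collections import Counter

def rocchio_update(query_terms, relevant_docs, alpha=1.0, beta=0.75):
    """Blend the query term weights with the centroid of the relevant documents."""
    query_vec = Counter(query_terms)
    centroid = Counter()
    for doc in relevant_docs:
        centroid.update(doc)
    n = max(len(relevant_docs), 1)

    updated = {}
    for term in set(query_vec) | set(centroid):
        updated[term] = alpha * query_vec[term] + beta * centroid[term] / n
    return updated

if __name__ == "__main__":
    query = ["web", "search"]
    feedback = [["web", "graph", "analysis"], ["search", "engine", "web"]]
    # Larger beta leans on feedback; larger alpha preserves the original query.
    print(rocchio_update(query, feedback))
```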
Funding: Supported by the National Natural Science Foundation of China (No. 60131160743).
Abstract: The web is an extremely dynamic world where information may be updated every second. A web information monitoring system fetches information from the web continuously and detects changes by comparing two versions of the same page. The updating of a specific web page is modeled as a Poisson process whose parameter indicates the change frequency. Because the amount of computing resources is limited, it is necessary to find policies that reduce the overall change-detection time. Different allocation schemes are evaluated experimentally to find out which one is the most suitable for the web information monitoring problem. The experimental data show the runtime characteristics of the overall system performance and its relationship to the total amount of resources.
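A minimal sketch of the Poisson change model follows: the rate is estimated as observed changes divided by observation time (the standard maximum-likelihood estimate), and a fixed fetch budget is split across pages. The proportional allocation policy and the URLs shown are illustrative assumptions, not necessarily among the schemes the paper evaluates.

```python
# Minimal sketch: estimate per-page Poisson change rates and allocate a fixed
# daily fetch budget. The proportional policy is illustrative only.

def estimate_change_rate(num_changes_observed, observation_period_hours):
    """MLE of the Poisson rate lambda (expected changes per hour)."""
    return num_changes_observed / observation_period_hours

def allocate_checks(pages, total_checks_per_day):
    """Split the daily fetch budget across pages, proportional to each
    page's estimated change rate."""
    total_rate = sum(rate for _, rate in pages) or 1.0
    return {
        url: round(total_checks_per_day * rate / total_rate)
        for url, rate in pages
    }

if __name__ == "__main__":
    pages = [
        ("http://example.org/news",  estimate_change_rate(48, 24)),   # ~2 changes/hour
        ("http://example.org/about", estimate_change_rate(1, 240)),   # rarely changes
    ]
    print(allocate_checks(pages, total_checks_per_day=100))
```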
Funding: Supported by the National Social Science Fund of China (Grant No. 08CTQ015).
Abstract: This paper selects 998 articles as its data sources from four Chinese core journals in the field of Library and Information Science published from 2003 to 2007. Pertinent aspects of reference citations, particularly those from web resources, are selected for focused analysis and discussion. These include primarily the number of web citations, the number of web citations per article, the distribution of domain names among the web citations, and certain aspects of the institutional and/or geographical affiliations of the authors. The evolving situation of utilizing online networked academic information resources in China is the central theme of this study. The discussion is augmented by 3 figures, 6 tables, and 18 references.
Abstract: With the rapid development of electronic information technology, Internet technology and system software development have become more and more widespread. In particular, as public security work develops, there are growing requirements for standardizing administrative department management systems, improving office efficiency, and enhancing decision support. It is therefore of great practical value to design and implement a comprehensive public security business information system. Based on Java technology, and through an analysis of comprehensive public security business, this paper designs and builds a comprehensive information management platform for public security; the platform received positive feedback during actual testing, which confirms the feasibility of the system.
基金the National Natural Science Fundation ofChina (60774041)
Abstract: To solve the problem of chaining distributed geographic information Web services (GI Web services), this paper provides an ontology-based method. With this method, a semantic service description is achieved by semantically annotating the elements of a Web Services Description Language (WSDL) document with concepts from a geographic ontology, so that a common understanding of service semantics is built between the customers and providers of Web services. Based on the decomposition and formalization of customer requirements, the discovery, composition, and execution of GI Web services are explained in detail, and a chain of GI Web services is then built and used to fulfill the customer's requirement. Finally, an example based on the Web Ontology Language for Services (OWL-S) is provided to test the feasibility of this method.
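The sketch below illustrates the annotation-and-discovery idea: WSDL operations carry references to geographic-ontology concepts, and a discovery step matches a required concept against those annotations. The WSDL fragment, ontology URIs, and subsumption table are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of semantic annotation of WSDL operations and concept-based
# discovery. The fragment, URIs, and subsumption table are assumptions.
import xml.etree.ElementTree as ET

WSDL_FRAGMENT = """
<definitions xmlns:sawsdl="http://www.w3.org/ns/sawsdl">
  <operation name="getBufferZone"
             sawsdl:modelReference="http://example.org/geo-onto#BufferAnalysis"/>
  <operation name="reprojectLayer"
             sawsdl:modelReference="http://example.org/geo-onto#CoordinateTransformation"/>
</definitions>
"""

# Toy fragment of the geographic ontology: concept -> its superconcepts.
SUPERCONCEPTS = {
    "http://example.org/geo-onto#BufferAnalysis": {"http://example.org/geo-onto#SpatialAnalysis"},
    "http://example.org/geo-onto#CoordinateTransformation": {"http://example.org/geo-onto#DataPreprocessing"},
}

def discover(wsdl_xml, required_concept):
    """Return operations whose annotated concept equals or specializes the requirement."""
    root = ET.fromstring(wsdl_xml)
    matches = []
    for op in root.iter("operation"):
        concept = op.get("{http://www.w3.org/ns/sawsdl}modelReference")
        if concept == required_concept or required_concept in SUPERCONCEPTS.get(concept, set()):
            matches.append(op.get("name"))
    return matches

if __name__ == "__main__":
    print(discover(WSDL_FRAGMENT, "http://example.org/geo-onto#SpatialAnalysis"))
    # -> ['getBufferZone']
```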
Abstract: An improved decision tree method for web information retrieval with self-mapping attributes is proposed. The self-mapping tree holds a value of a self-mapping attribute in each internal node, together with information based on the dissimilarity between a pair of mapping sequences. The method selects the self-mappings that exist between data items by an exhaustive search based on relation and attribute information. Experimental results confirm that the improved method constructs comprehensive and accurate decision trees. Moreover, an example shows that the self-mapping decision tree is promising for data mining and knowledge discovery.
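To make the dissimilarity idea concrete, the sketch below scores candidate split attributes by the edit distance between the value sequences they induce for the two classes. This is loosely inspired by the description above: the distance measure, scoring rule, and sample data are assumptions for this example and are not the paper's self-mapping construction.

```python
# Illustrative sketch: score split attributes by sequence dissimilarity.
# Not the paper's algorithm; measure and data are assumptions.

def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def best_split_attribute(records, labels, attributes):
    """Pick the attribute whose value sequence differs most between classes."""
    scores = {}
    for attr in attributes:
        pos = [r[attr] for r, y in zip(records, labels) if y == 1]
        neg = [r[attr] for r, y in zip(records, labels) if y == 0]
        scores[attr] = edit_distance(pos, neg)
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    records = [{"domain": "edu", "depth": 1}, {"domain": "com", "depth": 1},
               {"domain": "edu", "depth": 2}, {"domain": "com", "depth": 3}]
    labels = [1, 0, 1, 0]
    print(best_split_attribute(records, labels, ["domain", "depth"]))
```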
Funding: Supported by the Foundation for Humanities and Social Sciences of the Chinese Ministry of Education (Grant No. 08JC870002).
Abstract:
Purpose: The objectives of this study are to explore an effective technique for extracting information from weblogs and to develop an experimental system that extracts as much structured information as possible with this technique. The system will lay a foundation for the evaluation, analysis, retrieval, and utilization of the extracted information.
Design/methodology/approach: An improved template extraction technique is proposed. Separate templates designed for extracting blog entry titles, posts, and their comments were established, and structured information was extracted online step by step. A dozen data items, such as the entry titles, posts and their commenters and comments, the numbers of views, and the numbers of citations, were extracted from eight major Chinese blog websites, including Sina, Sohu, and Bokee.
Findings: The results show that the average accuracy of the experimental extraction system reached 94.6%. Because an online, multi-threaded extraction technique was adopted, the extraction speed was improved to an average of 15 pages per second, excluding network delay. In addition, entries posted via Ajax can be extracted successfully.
Research limitations: As the templates need to be established in advance, this extraction technique can be applied effectively only to a limited range of blog websites. In addition, the stability of the extraction templates is affected by the source code of the blog pages.
Practical implications: This paper has studied and established a blog page extraction system, which can be used to extract structured data, preserve and update the data, and facilitate the collection, study, and utilization of blog resources, especially academic blog resources.
Originality/value: This modified template extraction technique outperforms general web page downloaders and specialized blog page downloaders in structured and comprehensive data extraction.
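The sketch below illustrates the general shape of site-specific template extraction: each supported site gets a pre-built template of fixed HTML landmarks with named slots between them, as the templates must be established in advance. The landmark patterns, the site key, and the sample page are illustrative assumptions, not the actual templates built for Sina, Sohu, or Bokee.

```python
# Minimal sketch of template extraction for blog entry pages.
# Templates, slot names, and sample HTML are illustrative assumptions.
import re

# One pre-built template per supported site.
SITE_TEMPLATES = {
    "exampleblog": {
        "title": re.compile(r'<h1 class="entry-title">(.*?)</h1>', re.S),
        "post":  re.compile(r'<div class="entry-body">(.*?)</div>', re.S),
        "views": re.compile(r'<span class="views">(\d+)</span>'),
    },
}

def extract_entry(site, page_html):
    """Fill the template slots for one blog entry page; missing slots stay None."""
    template = SITE_TEMPLATES[site]
    record = {}
    for slot, pattern in template.items():
        match = pattern.search(page_html)
        record[slot] = match.group(1).strip() if match else None
    return record

if __name__ == "__main__":
    page = ('<h1 class="entry-title">On Template Extraction</h1>'
            '<div class="entry-body">Blog post text...</div>'
            '<span class="views">128</span>')
    print(extract_entry("exampleblog", page))
```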