Due to its openness and timeliness,the S&T Web information has become one of the most important resources for strategic intelligence monitoring.However,since S&T Web information is unstructured and lack of sem...Due to its openness and timeliness,the S&T Web information has become one of the most important resources for strategic intelligence monitoring.However,since S&T Web information is unstructured and lack of semantic description,it is a challenge to transfer the unstructured Web information into structured semantic knowledge.To solve this problem,the authors propose a method for structural monitoring of the S&T Web information resources.By using the knowledge extraction technologies,the authors firstly extract the knowledge objects as well as the relationship between objects from the Web resources and convert the free text into calculable structured knowledge units.Based on those extracted structured information,the authors build various kinds of monitoring models to realize research profiling for specific research fields.Based on those ideas,the authors implement the automated Web information monitoring system suitable for research field monitoring.A research profiling experiment also is carried out based on the semantic resources which are converted from the monitored Web data.展开更多
This paper selects 998 articles as its data sources from four Chinese core journals in the field of Library and Information Science from 2003 to 2007.Some pertinent aspects of reference citations particularly from web...This paper selects 998 articles as its data sources from four Chinese core journals in the field of Library and Information Science from 2003 to 2007.Some pertinent aspects of reference citations particularly from web resources are selected for a focused analysis and discussion.This includes primarily such items as the number of web citations,web citations per each article,the distribution of domain names of web citations and also certain aspects about the institutional and/or geographical affiliations of the author.The evolving situation of utilizing online networked academic information resources in China is the central thematic discussion of this study.The writing of this paper is augmented by the explicatory presentation of 3 graphic figures,6 tables and 18 references.展开更多
Purpose: The objectives of this study are to explore an effective technique to extract information from weblogs and develop an experimental system to extract structured information as much as possible with this techni...Purpose: The objectives of this study are to explore an effective technique to extract information from weblogs and develop an experimental system to extract structured information as much as possible with this technique. The system will lay a foundation for evaluation, analysis, retrieval, and utilization of the extracted information.Design/methodology/approach: An improved template extraction technique was proposed.Separate templates designed for extracting blog entry titles, posts and their comments were established, and structured information was extracted online step by step. A dozen of data items, such as the entry titles, posts and their commenters and comments, the numbers of views, and the numbers of citations were extracted from eight major Chinese blog websites,including Sina, Sohu and Bokee.Findings: Results showed that the average accuracy of the experimental extraction system reached 94.6%. Because the online and multi-threading extraction technique was adopted, the speed of extraction was improved with the average speed of 15 pages per second without considering the network delay. In addition, entries posted by Ajax technology can be extracted successfully.Research limitations: As the templates need to be established in advance, this extraction technique can be effectively applied to a limited range of blog websites. In addition, the stability of the extraction templates was affected by the source code of the blog pages.Practical implications: This paper has studied and established a blog page extraction system,which can be used to extract structured data, preserve and update the data, and facilitate the collection, study and utilization of the blog resources, especially academic blog resources.Originality/value: This modified template extraction technique outperforms the Web page downloaders and the specialized blog page downloaders with structured and comprehensive data extraction.展开更多
以往的包装器主要针对仅含有一个数据块的Web页面,而对含有多个信息块的Web页面,简称MIB(Multiple Information Block), Web页面无法处理。该文提出了一个新的抽取规则,结合了基于文档结构的抽取规则和基于特征Pattern匹配的抽取规...以往的包装器主要针对仅含有一个数据块的Web页面,而对含有多个信息块的Web页面,简称MIB(Multiple Information Block), Web页面无法处理。该文提出了一个新的抽取规则,结合了基于文档结构的抽取规则和基于特征Pattern匹配的抽取规则的优点,能够有效地抽取MIB Web页面中的信息。展开更多
基金an outcome of the project "The computing method of subject centrality of texts based on language network"(No.61075047) supported by National Natural Science Foundation of China
文摘Due to its openness and timeliness,the S&T Web information has become one of the most important resources for strategic intelligence monitoring.However,since S&T Web information is unstructured and lack of semantic description,it is a challenge to transfer the unstructured Web information into structured semantic knowledge.To solve this problem,the authors propose a method for structural monitoring of the S&T Web information resources.By using the knowledge extraction technologies,the authors firstly extract the knowledge objects as well as the relationship between objects from the Web resources and convert the free text into calculable structured knowledge units.Based on those extracted structured information,the authors build various kinds of monitoring models to realize research profiling for specific research fields.Based on those ideas,the authors implement the automated Web information monitoring system suitable for research field monitoring.A research profiling experiment also is carried out based on the semantic resources which are converted from the monitored Web data.
基金supported by National Social Science Fund of China(Grant No.08CTQ015)
文摘This paper selects 998 articles as its data sources from four Chinese core journals in the field of Library and Information Science from 2003 to 2007.Some pertinent aspects of reference citations particularly from web resources are selected for a focused analysis and discussion.This includes primarily such items as the number of web citations,web citations per each article,the distribution of domain names of web citations and also certain aspects about the institutional and/or geographical affiliations of the author.The evolving situation of utilizing online networked academic information resources in China is the central thematic discussion of this study.The writing of this paper is augmented by the explicatory presentation of 3 graphic figures,6 tables and 18 references.
基金supported by the Foundation for Humanities and Social Sciences of the Chinese Ministry of Education(Grant No.:08JC870002)
文摘Purpose: The objectives of this study are to explore an effective technique to extract information from weblogs and develop an experimental system to extract structured information as much as possible with this technique. The system will lay a foundation for evaluation, analysis, retrieval, and utilization of the extracted information.Design/methodology/approach: An improved template extraction technique was proposed.Separate templates designed for extracting blog entry titles, posts and their comments were established, and structured information was extracted online step by step. A dozen of data items, such as the entry titles, posts and their commenters and comments, the numbers of views, and the numbers of citations were extracted from eight major Chinese blog websites,including Sina, Sohu and Bokee.Findings: Results showed that the average accuracy of the experimental extraction system reached 94.6%. Because the online and multi-threading extraction technique was adopted, the speed of extraction was improved with the average speed of 15 pages per second without considering the network delay. In addition, entries posted by Ajax technology can be extracted successfully.Research limitations: As the templates need to be established in advance, this extraction technique can be effectively applied to a limited range of blog websites. In addition, the stability of the extraction templates was affected by the source code of the blog pages.Practical implications: This paper has studied and established a blog page extraction system,which can be used to extract structured data, preserve and update the data, and facilitate the collection, study and utilization of the blog resources, especially academic blog resources.Originality/value: This modified template extraction technique outperforms the Web page downloaders and the specialized blog page downloaders with structured and comprehensive data extraction.
文摘以往的包装器主要针对仅含有一个数据块的Web页面,而对含有多个信息块的Web页面,简称MIB(Multiple Information Block), Web页面无法处理。该文提出了一个新的抽取规则,结合了基于文档结构的抽取规则和基于特征Pattern匹配的抽取规则的优点,能够有效地抽取MIB Web页面中的信息。