With the rapid development of the Web, more and more Web databases are available for users to access. At the same time, job seekers often have difficulty first finding the right sources and then querying over them, so an integrated job search system over Web databases has become a Web application in high demand. Based on this consideration, we build a deep Web data integration system that gives users unified access to multiple job Web sites as a job meta-search engine. In this paper, the architecture of the system is given first, and then the key components of the system are introduced.
This paper investigates how to integrate Web data into a multidimensional data warehouse (cube) for comprehensive on-line analytical processing (OLAP) and decision making. An approach for Web-data-based cube construction is proposed, which includes Web data modeling based on MIX (Metadata-based Integration model for data X-change), the design of generic and specific mapping rules, and a transformation algorithm for mapping Web data to a multidimensional array. In addition, the structure and implementation of a prototype Web-data-based cube are discussed.
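The mapping step can be sketched in miniature: assuming a simple flat XML feed (the element and attribute names below are hypothetical), each record becomes one cell of a multidimensional array keyed by its dimension values. The paper's MIX model and mapping-rule design are far richer; this only illustrates the XML-to-array transformation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sales feed standing in for real Web data.
XML = """
<sales>
  <sale region="north" product="laptop" amount="1200"/>
  <sale region="north" product="phone"  amount="300"/>
  <sale region="south" product="laptop" amount="800"/>
</sales>
"""

def xml_to_cube(xml_text, dims=("region", "product"), measure="amount"):
    """Map each XML record to a cube cell keyed by its dimension values,
    summing the measure when several records hit the same cell."""
    cube = defaultdict(float)
    for rec in ET.fromstring(xml_text):
        key = tuple(rec.get(d) for d in dims)
        cube[key] += float(rec.get(measure))
    return dict(cube)

cube = xml_to_cube(XML)
print(cube[("north", "laptop")])  # 1200.0
```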
In this paper, we propose a flexible location-based service (LBS) middleware framework that makes the development and deployment of new location-based applications much easier. Viewing the World Wide Web as a huge source of location-related information, we integrate commonly used Web data extraction techniques into the middleware framework, exposing a unified Web data interface to upper-layer applications to make them more attractive. The framework also addresses common LBS issues, including positioning, location modeling, location-dependent query processing, and privacy and security management.
This paper investigates Web data aggregation issues in multidimensional on-line analytical processing (MOLAP) and presents a rule-driven aggregation approach. The core of the approach is the definition of aggregate rules. To define rules for reading warehouse data and computing aggregates, a rule definition language, the array aggregation language (AAL), is developed. This language treats an array as a function from indexes to values and provides syntax and semantics based on monads. External functions can be called in aggregation rules to specify array reading, writing, and aggregating. Based on these features of AAL, array operations are unified as function operations, which can be easily expressed and automatically evaluated. To implement the aggregation approach, a processor is built for computing aggregates over the base cube and materializing them in the data warehouse, and the component structure and working principle of this aggregation processor are introduced.
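AAL's view of an array as a function from indexes to values can be illustrated with a minimal sketch; the names below are illustrative, not AAL's actual syntax, and the fold stands in for its monadic aggregation semantics.

```python
from functools import reduce

def make_array(data):
    """Wrap a dict of index-tuples -> values as an index function,
    returning 0 for cells that are not populated."""
    return lambda idx: data.get(idx, 0)

def aggregate(array, indexes, op=lambda a, b: a + b, init=0):
    """Fold an operator over the cells selected by `indexes`."""
    return reduce(op, (array(i) for i in indexes), init)

sales = make_array({(0, 0): 5, (0, 1): 7, (1, 0): 3})
# Sum the first row, i.e. aggregate over indexes (0, j):
row_total = aggregate(sales, [(0, 0), (0, 1)])
print(row_total)  # 12
```

Because arrays are just functions here, composing or slicing them is ordinary function composition, which mirrors the paper's point that array operations unify as function operations.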
A vision-based query interface annotation method is used to relate attributes and form elements in form-based Web query interfaces; this method reaches an accuracy of 82%. A user participation method is then used to tune the result: users can answer "yes" or "no" to existing annotations, or manually annotate form elements. Mass feedback is fed back into the annotation algorithm to produce more accurate results. With this approach, query interface annotation can approach perfect accuracy.
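As a rough sketch of how mass yes/no feedback could tune annotations (the paper's actual algorithm is not detailed above, so the majority-vote rule and the names below are assumptions):

```python
from collections import Counter

def apply_feedback(annotation, votes, threshold=0.5):
    """Keep an attribute-to-form-element annotation only if the fraction
    of 'yes' answers meets the threshold; otherwise discard it."""
    counts = Counter(votes)
    total = len(votes)
    return annotation if total and counts["yes"] / total >= threshold else None

print(apply_feedback(("title", "input#q"), ["yes", "yes", "no"]))  # ('title', 'input#q')
print(apply_feedback(("price", "input#q"), ["no", "no", "yes"]))   # None
```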
Data aggregation from various Web sources is very significant for the Web data analysis domain, and the recognition of coherent micro-clusters is one of the most interesting issues in data aggregation. Many algorithms have been proposed for this problem, but their deficiency is that they cannot recognize micro-clusters in the data stream accurately. A semantics-based coherent micro-cluster recognition algorithm for hybrid Web data streams is proposed. First, an objective function for recognizing coherent micro-clusters is formulated; then the semantics-based recognition algorithm for hybrid Web data streams is presented. Finally, the effectiveness and efficiency of the algorithm are verified with extensive experiments on real music data sets from Baidu Inc. and Migu Inc. The experimental results show that the proposed algorithm achieves a better recall rate than both the non-semantic micro-cluster recognition algorithm and the single-source data stream micro-cluster recognition algorithm.
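The paper's objective function is not given above; as a clearly labeled stand-in, the sketch below scores a micro-cluster's semantic coherence as the average pairwise cosine similarity of its items' term vectors, a common proxy for this kind of measure.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def coherence(items):
    """Average pairwise similarity over a candidate micro-cluster."""
    vecs = [Counter(text.split()) for text in items]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)

# A coherent cluster of music-track titles scores higher than a mixed one.
tight = coherence(["love song pop", "love song rock", "love song indie"])
loose = coherence(["love song pop", "stock market news", "weather report today"])
print(tight > loose)  # True
```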
Web data mining is applied to the analysis of distance education. Given the individual differences among learners, a framework for a personalized distance learning system and the concept of personalized service are proposed. By mining the relevant information, a distance education system that integrates intelligence and personalization is built, thereby improving the current state of distance education services.
Web data extraction obtains valuable data from the tremendous information resources of the World Wide Web according to a pre-defined pattern, processing and classifying the data on the Web. A formalization of the Web data extraction procedure is presented, together with a description of the crawling and extraction algorithms. Based on this formalization, an XML-based page structure description language, TIDL, is proposed, covering the object model, the HTML object reference model, and the definition of tags. Finally, a Web data gathering and querying application based on Internet agent technology, the Web Integration Services Kit (WISK), is described.
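Pattern-driven extraction of this kind can be illustrated with a toy example; the HTML fragment and the regex pattern below are hypothetical stand-ins, since TIDL itself is an XML-based description language rather than a regex.

```python
import re

# Hypothetical page fragment; real pages would come from the crawler.
HTML = ('<ul><li class="job">Engineer - Beijing</li>'
        '<li class="job">Analyst - Shanghai</li></ul>')

# Pre-defined pattern: each match yields one structured record.
PATTERN = re.compile(r'<li class="job">(?P<title>[^<-]+) - (?P<city>[^<]+)</li>')

def extract(page):
    """Return one dict per occurrence of the pre-defined pattern."""
    return [m.groupdict() for m in PATTERN.finditer(page)]

print(extract(HTML))
# [{'title': 'Engineer', 'city': 'Beijing'}, {'title': 'Analyst', 'city': 'Shanghai'}]
```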
Extracting and mining social network information from massive Web data is of both theoretical and practical significance. However, one defining feature of this task is large-scale data processing, which remains a great challenge to be addressed. MapReduce is a distributed programming model: by implementing just the two functions map and reduce, distributed tasks can work well. Nevertheless, the model does not directly support processing heterogeneous datasets, which are common on the Web. This article proposes a new framework, Map-Reduce-Merge, that extends the original MapReduce framework with a merge phase to efficiently solve the problems of heterogeneous data processing. Further optimization and improvement work is done based on the features of Web data.
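A toy, single-process sketch of the Map-Reduce-Merge idea (the dataset shapes and the sum-reducer are assumptions for illustration): two heterogeneous datasets each go through their own map/reduce lane, and the merge phase joins the two reduced outputs on a shared key.

```python
from collections import defaultdict

def map_reduce(records, mapper):
    """Group mapper outputs by key, then reduce each group by summing."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return {k: sum(vs) for k, vs in groups.items()}

# Two heterogeneous inputs: dicts of user posts, and (url, uid) link tuples.
users = [{"uid": 1, "posts": 3}, {"uid": 2, "posts": 5}, {"uid": 1, "posts": 2}]
links = [("a.com", 1), ("b.com", 1), ("c.com", 2)]

posts_per_user = map_reduce(users, lambda r: [(r["uid"], r["posts"])])
links_per_user = map_reduce(links, lambda r: [(r[1], 1)])

def merge(left, right):
    """Merge phase: join the two reduced, now-homogeneous outputs on key."""
    return {k: (left[k], right.get(k, 0)) for k in left}

print(merge(posts_per_user, links_per_user))  # {1: (5, 2), 2: (5, 1)}
```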
A vast amount of data (known as big data) may now be collected and stored from a variety of data sources, including event logs, the internet, smartphones, databases, sensors, cloud computing, and Internet of Things (IoT) devices. The term "big data security" refers to all the safeguards and instruments used to protect both the data and the analytics processes against intrusions, theft, and other hostile actions that could endanger or adversely influence them. Beyond being a high-value and desirable target, protecting big data poses particular difficulties. Big data security does not fundamentally differ from conventional data security; its issues arise from extraneous distinctions rather than fundamental ones. This study outlines the numerous security difficulties big data analytics now faces and encourages further joint research on reducing big data security challenges using the Web Ontology Language (OWL). Although we focus on the security challenges of big data in this paper, we also briefly cover the broader challenges of big data. The proposed OWL-based classification of big data security, produced with the Protégé software, has 32 classes and 45 subclasses.
This work leveraged predictive modeling techniques in machine learning (ML) to predict heart disease using a dataset sourced from the Centers for Disease Control and Prevention in the US. The dataset was preprocessed and used to train five machine learning models: random forest, support vector machine, logistic regression, extreme gradient boosting, and light gradient boosting. The goal was to use the best-performing model to develop a web application capable of reliably predicting heart disease from user-provided data. The extreme gradient boosting classifier provided the most reliable results, with precision, recall, and F1-score of 97%, 72%, and 83%, respectively, for Class 0 (no heart disease), and 21% (precision), 81% (recall), and 34% (F1-score) for Class 1 (heart disease). The model was then deployed as a web application.
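The reported F1-scores follow from the reported precision and recall, since F1 is their harmonic mean; a quick consistency check (Class 1 computes to about 33%, close to the reported 34% given that the inputs are themselves rounded):

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.97, 0.72), 2))  # 0.83  (Class 0: no heart disease)
print(round(f1(0.21, 0.81), 2))  # 0.33  (Class 1: heart disease)
```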
Objective: To analyze, using bibliometric methods, the current state, hot topics, and future directions of myopia research over the past decade. Methods: Research and review articles on myopia published between January 1, 2013 and December 31, 2022 were retrieved from the Web of Science Core Collection. VOSviewer was used for co-occurrence analysis of countries, institutions, and authors, and CiteSpace for cluster analysis of keywords and co-cited references. Results: A total of 9745 publications were included, involving 123 countries or regions, 7150 institutions, and 29343 authors. Global publication output on myopia shows an overall upward trend; China published the most papers, while studies from the United States received the most total citations. Keyword analysis indicates that early research hotspots centered on refractive surgery, diagnosis and treatment of complications, genetics, and epidemiological characteristics, whereas in recent years the focus has shifted rapidly toward myopia prevention and control. Cluster analysis of co-cited references revealed several modules, such as #0 school-age children, #1 small-incision lenticule extraction, #2 myopia control, #3 refractive error, and #4 contact lenses. Research frontiers focus mainly on myopia management techniques, the relation between myopia and retinal and choroidal vasculature, and applications of artificial intelligence in myopia. Conclusion: Over the past decade, myopia research has spanned ophthalmology, molecular biology, genetics, optometry, epidemiology, and other disciplines. Future work should further explore the etiology and pathogenesis of myopia, early identification and screening, management techniques, and AI-assisted diagnosis, so as to develop more effective and safer myopia prevention and control strategies.
Data are crucial to the growth of e-commerce in today's world of highly demanding, hyper-personalized consumer experiences, and they are collected using advanced web scraping technologies. However, core data extraction engines fail because they cannot adapt to dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system with convolutional and Long Short-Term Memory (LSTM) networks that uses the You Only Look Once (YOLO) algorithm for automated web page detection and the Tesseract LSTM engine to extract product details, which are detected as images on web pages. This system needs no core data extraction engine and can therefore adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection precision of 97% and a character extraction precision of 99%. In addition, a mean average precision of 74% is obtained on an input dataset of 45 objects or images.
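Detection precision figures such as these are conventionally computed by matching predicted boxes to ground truth via intersection over union (IoU); the abstract does not state its matching threshold, so the sketch below only shows the underlying IoU computation for axis-aligned boxes given as (x1, y1, x2, y2).

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

# Two 10x10 boxes overlapping in a 5x5 region: IoU = 25 / 175.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```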
Funding: Supported by the National Natural Science Foundation of China (60573091, 60273018), the National Basic Research and Development Program of China (2003CB317000), and the Key Project of the Ministry of Education of China (03044).
Funding: The National Natural Science Foundation of China (No. 60573165).
Funding: Supported by the National Natural Science Foundation of China (60573091, 60273018), the National Basic Research and Development Program of China (2003CB317000), the Key Project of the Ministry of Education of China (03044), and the Program for New Century Excellent Talents in University (NCET).
Funding: The National Natural Science Foundation of China (No. 60573165).
Funding: Supported by the National Natural Science Foundation of China (60573091, 60273018).
Funding: Supported by the National High Technology Research and Development Programme of China (No. 2011AA120300, 2011AA120302) and the National Key Technology Support Program of China (No. 2013BAH66F02).
Funding: Contents discussed in this paper are part of a key project, No. 2000-A31-01-04, sponsored by the Ministry of Science and Technology of P.R. China.