Abstract: Data are crucial to the growth of e-commerce in today's world of highly demanding, hyper-personalized consumer experiences, and they are collected using advanced web scraping technologies. However, core data extraction engines fail because they cannot adapt to dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system that uses convolutional and Long Short-Term Memory (LSTM) networks: the You Only Look Once (YOLO) algorithm automates web page detection, and Tesseract LSTM extracts product details, which are detected as images from web pages. This state-of-the-art system does not need a core data extraction engine and can therefore adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection precision of 97% and a character extraction precision of 99%. In addition, a mean average precision of 74% is obtained with an input dataset of 45 objects or images.
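The abstract reports a mean average precision but does not say how it was computed. As a rough illustration only, the sketch below computes non-interpolated average precision for one object class and averages it across classes. The function names, the choice of the non-interpolated variant, and the assumption that each detection has already been matched against ground truth (for example by an overlap threshold) are ours, not the authors'.

```python
from typing import List, Tuple


def average_precision(detections: List[Tuple[float, bool]],
                      num_ground_truth: int) -> float:
    """Non-interpolated average precision for a single class.

    detections: (confidence, is_true_positive) pairs. Matching each
    detection to a ground-truth object is assumed to happen upstream.
    num_ground_truth: number of labelled objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            hits += 1
            # Precision at the rank where this object was recovered.
            precision_sum += hits / rank
    return precision_sum / num_ground_truth


def mean_average_precision(per_class_ap: List[float]) -> float:
    """Unweighted mean of the per-class average precisions."""
    if not per_class_ap:
        return 0.0
    return sum(per_class_ap) / len(per_class_ap)
```

For example, three detections ranked true, false, true against two labelled objects give an average precision of (1/1 + 2/3) / 2. Benchmarks differ in whether they interpolate the precision-recall curve, so figures computed this way are not directly comparable with the 74% reported above.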
Abstract: A large amount of data is present on the web and can be used for purposes such as product recommendation, price comparison, and demand forecasting for a particular product. Websites are designed for human understanding, not for machines. Making the data machine-readable therefore requires techniques to extract data from web pages. Researchers have addressed the problem using two approaches: knowledge engineering and machine learning. State-of-the-art knowledge engineering approaches use the structure of documents, visual cues, clustering of the attributes of data records, and text processing techniques to identify data records on a web page. Machine learning approaches use annotated pages to learn rules, which are then used to extract data from unseen web pages. The structure of web documents is continuously evolving, so new techniques are needed to handle the emerging requirements of web data extraction. In this paper, we present a novel, simple, and efficient technique to extract data from web pages using the visual styles and structure of documents. The proposed technique detects the Rich Data Region (RDR) using the query and correlative words of the query. The RDR is then divided into data records using style similarity. Noisy elements are removed using a Common Tag Sequence (CTS) and formatting entropy. The system is implemented in Java and run on a dataset of real-world working websites. The effectiveness of the results is evaluated using precision, recall, and F-measure and compared with five existing systems. The comparison shows encouraging results.
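The evaluation measures named in the abstract can be sketched as follows. This is a minimal illustration, assuming each extracted data record can be reduced to a hashable identifier and compared against a hand-labelled set; the function name and the set-based matching are our assumptions, and the abstract does not describe how records were actually matched.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f_measure(extracted: Set[Hashable],
                               expected: Set[Hashable]
                               ) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one page.

    extracted: records the system returned.
    expected: records a human annotator marked as correct.
    """
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    denominator = precision + recall
    # The F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure
```

If a system returns four records of which three are among five expected ones, precision is 3/4 and recall is 3/5. A system that keeps noisy elements loses precision, while one that prunes too aggressively loses recall, which is why the two are reported together with their harmonic mean.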
Funding: Supported by the National Natural Science Foundation of China (60573091, 60273018), the National Basic Research and Development Program of China (2003CB317000), the Key Project of the Ministry of Education of China (03044), and the Program for New Century Excellent Talents in University (NCET).
Abstract: In this paper, we propose a flexible location-based service (LBS) middleware framework that makes the development and deployment of new location-based applications much easier. Treating the World Wide Web as a huge data source of location-related information, we integrate commonly used web data extraction techniques into the middleware framework and expose a unified web data interface to upper-layer applications, making them more attractive. The framework also addresses common LBS issues, including positioning, location modeling, location-dependent query processing, privacy, and security management.
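The abstract does not specify the unified web data interface, so the following is only a hypothetical sketch of the idea: several web extractors sit behind one adapter type, and a location-dependent query filters their combined records by distance. Every name here (`Place`, `WebDataSource`, `StaticSource`, `UnifiedWebData`, `nearby`) is invented for illustration, and `StaticSource` stands in for a real extractor.

```python
import math
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Place:
    name: str
    lat: float
    lon: float


class WebDataSource(Protocol):
    """Hypothetical adapter around one web data extractor."""

    def fetch(self) -> List[Place]: ...


class StaticSource:
    """Stand-in for a real extractor; returns canned records."""

    def __init__(self, places: List[Place]) -> None:
        self._places = places

    def fetch(self) -> List[Place]:
        return list(self._places)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a spherical Earth."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


class UnifiedWebData:
    """Single entry point that upper-layer applications would call."""

    def __init__(self, sources: List[WebDataSource]) -> None:
        self._sources = sources

    def nearby(self, lat: float, lon: float, radius_km: float) -> List[Place]:
        """Return places within radius_km of the position, nearest first."""
        found = [place
                 for source in self._sources
                 for place in source.fetch()
                 if haversine_km(lat, lon, place.lat, place.lon) <= radius_km]
        return sorted(found,
                      key=lambda p: haversine_km(lat, lon, p.lat, p.lon))
```

An application would construct `UnifiedWebData` once with whatever sources are available and call `nearby` with the user's position, without knowing which websites the records came from. Positioning, privacy, and security, which the framework also covers, are outside this sketch.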