25 articles found
Recommendation Algorithm Integrating CNN and Attention System in Data Extraction (Cited by: 1)
1
Authors: Yang Li, Fei Yin, Xianghui Hui. Computers, Materials & Continua (SCIE, EI), 2023, No. 5, pp. 4047-4063 (17 pages)
With the rapid development of the Internet globally since the 21st century, the amount of data information has increased exponentially. Data helps improve people’s livelihood and working conditions, as well as learning efficiency. Therefore, data extraction, analysis, and processing have become a hot issue for people from all walks of life. Traditional recommendation algorithms still have some problems, such as inaccuracy, less diversity, and low performance. To solve these problems and improve the accuracy and variety of the recommendation algorithms, the research combines convolutional neural networks (CNN) and the attention model to design a recommendation algorithm based on the neural network framework. Through the text convolutional network, the input layer in the CNN is transformed into two channels: static ones and non-static ones. Meanwhile, the self-attention system focuses on the system so that data can be better processed and the accuracy of feature extraction becomes higher. The recommendation algorithm combines CNN and the attention system and divides the embedding layer into user information feature embedding and data name feature extraction embedding. It obtains data name features through a convolution kernel. Finally, the top pooling layer obtains the length vector. The attention system layer obtains the characteristics of the data type. Experimental results show that the proposed recommendation algorithm that combines CNN and the attention system can perform better in data extraction than the traditional CNN algorithm and other recommendation algorithms that are popular at the present stage. The proposed algorithm shows excellent accuracy and robustness.
Keywords: data extraction; recommendation algorithm; CNN algorithm; attention model
Intelligent and Adaptive Web Data Extraction System Using Convolutional and Long Short-Term Memory Deep Learning Networks (Cited by: 4)
2
Authors: Sudhir Kumar Patnaik, C. Narendra Babu, Mukul Bhave. Big Data Mining and Analytics (EI), 2021, No. 4, pp. 279-297 (19 pages)
Data, which are collected using advanced web scraping technologies, are crucial to the growth of e-commerce in today's world of highly demanding hyper-personalized consumer experiences. However, core data extraction engines fail because they cannot adapt to the dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system with convolutional and Long Short-Term Memory (LSTM) networks to enable automated web page detection using the You only look once (Yolo) algorithm and Tesseract LSTM to extract product details, which are detected as images from web pages. This state-of-the-art system does not need a core data extraction engine, and thus can adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection (precision) and character extraction accuracy (precision) of 97% and 99%, respectively. In addition, a mean average precision of 74%, with an input dataset of 45 objects or images, is obtained.
Keywords: adaptive web scraping; deep learning; Long Short-Term Memory (LSTM); Web data extraction; You only look once (Yolo)
An Efficient Mechanism for Product Data Extraction from E-Commerce Websites
3
Authors: Malik Javed Akhtar, Zahur Ahmad, Rashid Amin, Sultan H. Almotiri, Mohammed A. Al Ghamdi, Hamza Aldabbas. Computers, Materials & Continua (SCIE, EI), 2020, No. 12, pp. 2639-2663 (25 pages)
A large amount of data is present on the web which can be used for useful purposes like product recommendation, price comparison, and demand forecasting for a particular product. Websites are designed for human understanding and not for machines. Therefore, making data machine-readable requires techniques to grab data from web pages. Researchers have addressed the problem using two approaches, i.e., knowledge engineering and machine learning. State-of-the-art knowledge engineering approaches use the structure of documents, visual cues, clustering of attributes of data records, and text processing techniques to identify data records on a web page. Machine learning approaches use annotated pages to learn rules. These rules are used to extract data from unseen web pages. The structure of web documents is continuously evolving. Therefore, new techniques are needed to handle the emerging requirements of web data extraction. In this paper, we have presented a novel, simple, and efficient technique to extract data from web pages using visual styles and structure of documents. The proposed technique detects the Rich Data Region (RDR) using the query and correlative words of the query. The RDR is then divided into data records using style similarity. Noisy elements are removed using a Common Tag Sequence (CTS) and formatting entropy. The system is implemented in Java and runs on a dataset of real-world working websites. The effectiveness of results is evaluated using precision, recall, and F-measure and compared with five existing systems. A comparison of the proposed technique to existing systems has shown encouraging results.
Keywords: Document object model; rich data region; common tag sequence; web data extraction; deep web mining
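The abstract above evaluates extraction quality with precision, recall, and F-measure. As a minimal sketch of how these three figures relate, assuming extracted records can be compared with a gold-standard set by identifier (the function name and the sample identifiers are hypothetical and do not come from the paper):

```python
def precision_recall_f1(extracted, relevant):
    """Compare a set of extracted records against a gold-standard set."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    # Precision: share of extracted records that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard records that were found.
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure (F1): harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical page: 5 true records; the extractor returns 4, of which 3 are correct.
extracted = ["r1", "r2", "r3", "x9"]
relevant = ["r1", "r2", "r3", "r4", "r5"]
p, r, f = precision_recall_f1(extracted, relevant)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Because the F-measure is a harmonic mean, it stays low unless both precision and recall are reasonably high, which is presumably why it is reported alongside the two separate figures.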
Automatic Data Extraction from Websites for Generating Aquatic Product Market Information
4
Authors: 袁红春, 陈莹, 孙越夫. Journal of Donghua University (English Edition) (EI, CAS), 2006, No. 6, pp. 15-19 (5 pages)
The massive web-based information resources have led to an increasing demand for effective automatic retrieval of target information for web applications. This paper introduces a web-based data extraction tool that deploys various algorithms to locate, extract, and filter tabular data from HTML pages and to transform them into new web-based representations. The tool has been applied in an aquaculture web application platform for extracting and generating aquatic product market information. Results show that this tool is very effective in extracting the required data from web pages.
Keywords: web data; table localization algorithm; distance algorithm; data filtering algorithm; data extraction tool
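The locate, extract, and filter pipeline for tabular HTML data described above can be illustrated with Python's standard `html.parser` module. This is only a sketch under stated assumptions: the class `TableExtractor`, the helper `is_number`, and the sample market table are hypothetical, nested tables are not handled, and the paper's own localization, distance, and filtering algorithms are not reproduced here.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by row and table."""

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables; each table is a list of rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def is_number(text):
    """Filtering rule (an assumption): keep rows whose price parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


html = """
<table>
  <tr><th>Product</th><th>Price (yuan/kg)</th></tr>
  <tr><td>Carp</td><td>12.5</td></tr>
  <tr><td>Shrimp</td><td>n/a</td></tr>
</table>
"""
parser = TableExtractor()
parser.feed(html)
header, *rows = parser.tables[0]
clean = [row for row in rows if is_number(row[1])]
print(header, clean)
```

Keeping parsing and filtering as separate steps mirrors the staged design named in the abstract and lets the filtering rule change without touching the parser.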
Semi-structured Data Extraction and Schema Knowledge Mining
5
Authors: 陈恩红, WANG Xufa. High Technology Letters (EI, CAS), 2001, No. 1, pp. 1-5 (5 pages)
A semi-structured data extraction method is proposed to get the useful information embedded in a group of relevant web pages and store it with OEM (Object Exchange Model). Then, a data mining method is adopted to discover the schema knowledge implicit in the semi-structured data. This knowledge can help users understand the information structure on the web more deeply and thoroughly. At the same time, it can also provide a kind of effective schema for the querying of web information.
Keywords: Semi-structured data; SCHEMA; data extraction
Extraction of Mineral Alteration Zone from ETM+ Data in Northwestern Yunnan, China
6
Authors: 赵志芳, 张玉君, 成秋明, 陈建平. Journal of China University of Geosciences (SCIE, CSCD), 2008, No. 4, pp. 416-420 (5 pages)
Alteration is regarded as significant information for mineral exploration. In this study, ETM+ remote sensing data are used for recognizing and extracting alteration zones in northwestern Yunnan (云南), China. The principal component analysis (PCA) of ETM+ bands 1, 4, 5, and 7 was employed for OH^- alteration extraction. The PCA of ETM+ bands 1, 3, 4, and 5 was used for extracting Fe^2+ (Fe^3+) alterations. Interfering factors, such as vegetation, snow, and shadows, were masked. Alteration components were defined in the principal components (PCs) by the contributions of their diagnostic spectral bands. The zones of alteration identified from remote sensing were analyzed in detail along with geological surveys and field verification. The results show that the OH^- alteration is a main indicator of K-feldspar, phyllic, and propylitized alterations. These alterations are closely related to porphyry copper deposits. The Fe^2+ (Fe^3+) alteration indicates pyritization, which is mainly related to hydrothermal or skarn type polymetallic deposits.
Keywords: mineral alteration extraction from ETM+ data; PCA; OH^- alteration; Fe^2+ (Fe^3+) alteration; northwestern Yunnan, China
Big Data Bot with a Special Reference to Bioinformatics
7
Authors: Ahmad M. Al-Omari, Shefa M. Tawalbeh, Yazan H. Akkam, Mohammad Al-Tawalbeh, Shima’a Younis, Abdullah A. Mustafa, Jonathan Arnold. Computers, Materials & Continua (SCIE, EI), 2023, No. 5, pp. 4155-4173 (19 pages)
There are quintillions of data on deoxyribonucleic acid (DNA) and protein in publicly accessible data banks, and that number is expanding at an exponential rate. Many scientific fields, such as bioinformatics and drug discovery, rely on such data; nevertheless, gathering and extracting data from these resources is a tough undertaking. This data should go through several processes, including mining, data processing, analysis, and classification. This study proposes software that extracts data from big data repositories automatically and with the particular ability to repeat data extraction phases as many times as needed without human intervention. This software simulates the extraction of data from web-based (point-and-click) resources or graphical user interfaces that cannot be accessed using command-line tools. The software was evaluated by creating a novel database of 34 parameters for 1360 physicochemical properties of antimicrobial peptides (AMP) sequences (46240 hits) from various MARVIN software panels, which can be later utilized to develop novel AMPs. Furthermore, for machine learning research, the program was validated by extracting 10,000 protein tertiary structures from the Protein Data Bank. As a result, data collection from the web will become faster and less expensive, with no need for manual data extraction. The software is critical as a first step to preparing large datasets for subsequent stages of analysis, such as those using machine and deep-learning applications.
Keywords: BIOINFORMATICS; big data; data extraction; BOT; drug design
MUTS-Based Cooperative Target Stalking for A Multi-USV System
8
Authors: Chengcheng Wang, Yulong Wang, Qing-Long Han, Yunkai Wu. IEEE/CAA Journal of Automatica Sinica (SCIE, EI, CSCD), 2023, No. 7, pp. 1582-1592 (11 pages)
This paper is concerned with cooperative target stalking for a multi-unmanned surface vehicle (multi-USV) system. Based on the multi-agent deep deterministic policy gradient (MADDPG) algorithm, a multi-USV target stalking (MUTS) algorithm is proposed. Firstly, a V-type probabilistic data extraction method is proposed for the first time to overcome shortcomings of the MADDPG algorithm. The advantages of the proposed method are twofold: 1) it can reduce the amount of data and shorten training time; 2) it can filter out more important data in the experience buffer for training. Secondly, in order to avoid collisions of USVs during the stalking process, an action constraint method called Safe DDPG is introduced. Finally, the MUTS algorithm and some existing algorithms are compared in cooperative target stalking scenarios. In order to demonstrate the effectiveness of the proposed MUTS algorithm in stalking tasks, mission operating scenarios and reward functions are well designed in this paper. The proposed MUTS algorithm can help the multi-USV system avoid internal collisions during the mission execution. Moreover, compared with some existing algorithms, the newly proposed one can provide a higher convergence speed and a narrower convergence domain.
Keywords: Cooperative target stalking; improved deep reinforcement learning; multi-unmanned surface vehicle (multi-USV) systems; V-type probabilistic data extraction
A Review on the Recent Trends of Image Steganography for VANET Applications
9
Authors: Arshiya S. Ansari. Computers, Materials & Continua (SCIE, EI), 2024, No. 3, pp. 2865-2892 (28 pages)
Image steganography is a technique of concealing confidential information within an image without dramatically changing its outside look. Vehicular ad hoc networks (VANETs), which enable vehicles to communicate with one another and with roadside infrastructure to enhance safety and traffic flow, provide a range of value-added services and are an essential component of modern smart transportation systems. VANET steganography has been suggested by many authors for secure, reliable message transfer from terminal/hop to terminal/hop and also to secure it from attack for privacy protection. This paper aims to determine whether using steganography can improve data security and secrecy in VANET applications and to analyze effective steganography techniques for incorporating data into images while minimizing visual quality loss. According to simulations in the literature and real-world studies, image steganography proved to be an effective method for secure communication on VANETs, even in difficult network conditions. In this research, we also explore a variety of steganography approaches for vehicular ad hoc network transportation systems, such as vector embedding, statistics, spatial domain (SD), transform domain (TD), distortion, masking, and filtering. This study may help researchers to improve vehicle networks’ ability to communicate securely and pave the way for innovative steganography methods.
Keywords: STEGANOGRAPHY; image steganography; image steganography techniques; information exchange; data embedding and extracting; vehicular ad hoc network (VANET); transportation system
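Among the spatial-domain approaches this review mentions, least-significant-bit (LSB) substitution is a commonly cited textbook example. The sketch below is not taken from the paper; it assumes the cover image is available as a flat sequence of 8-bit pixel values, and the function names and the sample message are hypothetical.

```python
def embed_lsb(pixels: bytes, message: bytes) -> bytes:
    """Hide `message` in the least significant bit of successive pixel bytes."""
    # Unpack the message into bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover image too small for message")
    stego = bytearray(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit of the pixel, then set it to the message bit.
        stego[i] = (stego[i] & 0xFE) | bit
    return bytes(stego)


def extract_lsb(pixels: bytes, n_bytes: int) -> bytes:
    """Read back `n_bytes` of hidden data from the lowest bit of each pixel."""
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for bit_index in range(8):
            value = (value << 1) | (pixels[i * 8 + bit_index] & 1)
        out.append(value)
    return bytes(out)


cover = bytes(range(200, 240)) * 2  # 80 hypothetical grayscale pixel values
secret = b"BRAKE"                   # hypothetical short vehicle message
stego = embed_lsb(cover, secret)
recovered = extract_lsb(stego, len(secret))
max_change = max(abs(a - b) for a, b in zip(cover, stego))
print(recovered, max_change)
```

Each pixel value changes by at most 1, which is why LSB substitution is associated with low visual quality loss; it is also fragile under compression, one reason transform-domain methods are studied as well.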
Ontology-based Knowledge Extraction from Hidden Web (Cited by: 1)
10
Authors: 宋晖, 马范援, 刘晓强. Journal of Donghua University (English Edition) (EI, CAS), 2004, No. 5, pp. 73-78 (6 pages)
The Hidden Web provides a great amount of domain-specific data for constructing knowledge services. Most previous knowledge extraction research ignores the valuable data hidden in Web databases, and related works do not address how to make extracted information available for a knowledge system. This paper describes a novel approach to build a domain-specific knowledge service with the data retrieved from the Hidden Web. An ontology serves to model the domain knowledge. Query forms of different Web sites are translated into a machine-understandable format, defined knowledge concepts, so that they can be accessed automatically. Knowledge data are also extracted from Web pages and organized as ontology-format knowledge. The experiment shows that the algorithm achieves high accuracy and that the system greatly facilitates constructing knowledge services.
Keywords: knowledge service; hidden web; ONTOLOGY; data extraction
Mesh Generation from Dense 3D Scattered Data Using Neural Network (Cited by: 8)
11
Authors: ZHANG Wei, JIANG Xian-feng, CHEN Li-neng, MA Ya-liang. Computer Aided Drafting, Design and Manufacturing, 2004, No. 1, pp. 30-35 (6 pages)
An improved self-organizing feature map (SOFM) neural network is presented to generate rectangular and hexagonal lattices with a normal vector attached to each vertex. After the neural network was trained, the whole scattered data were divided into sub-regions whose classified cores were represented by the weight vectors of neurons at the output layer of the neural network. The weight vectors of the neurons were used to approximate the dense 3-D scattered points, so the dense scattered points could be reduced to a reasonable scale, while the topological features of the whole scattered points were preserved.
Keywords: reverse engineering; mesh generation; neural network; scattered points; data extraction
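The idea that neuron weight vectors can stand in for a dense point cloud can be sketched with a plain self-organizing map. This is the standard textbook update rule, not the improved SOFM of the paper; the grid size, decay schedules, synthetic surface, and all names below are assumptions made for illustration.

```python
import math
import random


def train_som(points, rows, cols, steps=2000, lr0=0.5, sigma0=1.5, seed=0):
    """Fit a rows x cols grid of weight vectors to 3-D points (plain SOM)."""
    rng = random.Random(seed)
    # Start every neuron on a randomly chosen data point.
    weights = {(r, c): list(rng.choice(points))
               for r in range(rows) for c in range(cols)}
    for t in range(steps):
        frac = t / steps
        lr = lr0 * (1.0 - frac)              # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.3  # neighbourhood radius shrinks
        x = rng.choice(points)
        # Best-matching unit: the neuron whose weight vector is closest to x.
        bmu = min(weights, key=lambda k: math.dist(weights[k], x))
        for (r, c), w in weights.items():
            grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
            # Pull the neuron towards the sample, weighted by grid proximity.
            for i in range(3):
                w[i] += lr * h * (x[i] - w[i])
    return weights


# Hypothetical dense scan: 500 points on the surface z = x * y in the unit cube.
rng = random.Random(1)
cloud = []
for _ in range(500):
    px, py = rng.random(), rng.random()
    cloud.append((px, py, px * py))

mesh = train_som(cloud, rows=4, cols=4)
print(len(cloud), "points reduced to", len(mesh), "mesh vertices")
```

The grid coordinates `(r, c)` give the connectivity of a rectangular lattice for free, since neighbouring neurons on the grid become neighbouring vertices of the mesh.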
Complex Network Formation and Analysis of Online Social Media Systems
12
Authors: Hafiz Abid Mahmood Malik. Computer Modeling in Engineering & Sciences (SCIE, EI), 2022, No. 3, pp. 1737-1750 (14 pages)
Discovering and identifying the influential nodes in a complex network has been an important issue. It is a significant factor in controlling the network. Through control of a network, any information can be spread or stopped in a short span of time. Both targets can be achieved, since a network of information can be extended as well as destroyed. So, information spread and community formation have become some of the most crucial issues in the world of SNA (Social Network Analysis). In this work, the complex network of the Twitter social network has been formalized and the results are analyzed. For this purpose, different network metrics have been utilized. A visualization of the network is provided in its original form and then filtered (by different percentages) to eliminate the less impactful nodes and edges for better analysis. This network is analyzed according to different centrality measures, like edge-betweenness, betweenness centrality, closeness centrality, and eigenvector centrality. Influential nodes are detected and their impact on the network is observed. The communities are analyzed in terms of network coverage considering the Minimum Spanning Tree, shortest path distribution, and network diameter. It is found that these are very effective ways to find influential and central nodes in big social networks like Facebook, Instagram, Twitter, LinkedIn, etc.
Keywords: Complex network; data extraction; nodes and edges; network visualization; social media network; main hubs; centrality measures
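Closeness centrality, one of the measures listed above, can be computed with a breadth-first search from every node. The sketch below assumes a small connected, unweighted, undirected graph stored as an adjacency dictionary; the toy network `follows` and its node names are hypothetical and unrelated to the Twitter data used in the paper.

```python
from collections import deque


def closeness_centrality(graph):
    """Closeness of each node in a connected, unweighted, undirected graph."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total = sum(dist.values())
        # (number of other reachable nodes) / (sum of shortest-path lengths)
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


# Hypothetical toy network: one hub with several leaves and a short tail.
follows = {
    "a": ["hub"], "b": ["hub"], "c": ["hub"],
    "hub": ["a", "b", "c", "d"],
    "d": ["hub", "e"], "e": ["d"],
}
scores = closeness_centrality(follows)
most_central = max(scores, key=scores.get)
print(most_central, round(scores[most_central], 3))
```

A node scores high when it is, on average, few hops away from everyone else, which matches the intuition of an influential hub through which information spreads quickly.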
Information Classification and Extraction on Official Web Pages of Organizations
13
Authors: Jinlin Wang, Xing Wang, Hongli Zhang, Binxing Fang, Yuchen Yang, Jianan Liu. Computers, Materials & Continua (SCIE, EI), 2020, No. 9, pp. 2057-2073 (17 pages)
As a real-time and authoritative source, the official Web pages of organizations contain a large amount of information. The diversity of Web content and format makes pre-processing essential to obtain unified attributed data, which is valuable for organizational analysis and mining. Existing research on dealing with multiple Web scenarios and on accuracy performance is insufficient. This paper proposes a method to transform organizational official Web pages into data with attributes. After locating the active blocks in the Web pages, structural and content features are proposed to classify information with a specific model. Extraction methods based on a trigger lexicon and LSTM (Long Short-Term Memory) are proposed, which efficiently process the classified information and extract data that matches the attributes. Finally, an accurate and efficient method to classify and extract information from organizational official Web pages is formed. Experimental results show that our approach improves the performance indicators and exceeds the state of the art on a real data set from organizational official Web pages.
Keywords: Web pre-process; feature classification; data extraction; trigger lexicon; LSTM
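A trigger lexicon can be read as a table of cue words that signal which attribute a text block carries. The sketch below is one plausible reading, not the authors' implementation: the lexicon entries, the `cue: value` line format, and the function name are all assumptions, and the LSTM-based extractor from the abstract is not covered.

```python
import re

# Hypothetical trigger lexicon: attribute name -> cue words that introduce it.
TRIGGERS = {
    "phone": ["Telephone", "Phone", "Tel"],
    "email": ["E-mail", "Email"],
    "address": ["Address"],
}


def extract_attributes(text_blocks, triggers=TRIGGERS):
    """Return the first value found for each attribute, keyed by attribute name."""
    result = {}
    for block in text_blocks:
        for attribute, words in triggers.items():
            cues = "|".join(re.escape(word) for word in words)
            # A cue word, an ASCII or full-width colon, then the value.
            pattern = r"^\s*(?:%s)\s*[:：]\s*(.+?)\s*$" % cues
            match = re.match(pattern, block, flags=re.IGNORECASE)
            if match and attribute not in result:
                result[attribute] = match.group(1)
    return result


blocks = [
    "About us",
    "Tel: +86 10 1234 5678",
    "E-mail: info@example.org",
    "Address: 1 Example Road",
]
info = extract_attributes(blocks)
print(info)
```

A lexicon like this is cheap and transparent for well-marked fields; free-text blocks without a cue word are where a learned model such as the LSTM mentioned above would plausibly take over.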
Web Database Query Interface Annotation Based on User Collaboration
14
Authors: LIU Wei, LIN Can, MENG Xiaofeng. Wuhan University Journal of Natural Sciences (CAS), 2006, No. 5, pp. 1403-1406 (4 pages)
A vision-based query interface annotation method is used to relate attributes and form elements in form-based web query interfaces; this method can reach an accuracy of 82%. A user participation method is then used to tune the result: the user can answer "yes" or "no" for existing annotations, or manually annotate form elements. Mass feedback is added to the annotation algorithm to produce a more accurate result. With this approach, query interface annotation can reach perfect accuracy.
Keywords: Web database; data integration; data extraction
A Framework of Web Data Integrated LBS Middleware
15
Authors: MENG Xiaofeng, YIN Shaoyi, XIAO Zhen. Wuhan University Journal of Natural Sciences (CAS), 2006, No. 5, pp. 1187-1191 (5 pages)
In this paper, we propose a flexible location-based service (LBS) middleware framework to make the development and deployment of new location-based applications much easier. Considering the World Wide Web as a huge data source of location-related information, we integrate the commonly used web data extraction techniques into the middleware framework, exposing a unified web data interface for the upper applications to make them more attractive. Besides, the framework also addresses some common LBS issues, including positioning, location modeling, location-dependent query processing, and privacy and security management.
Keywords: location-based service (LBS); MIDDLEWARE; web data extraction
Creating customized data services from web pages
16
Authors: 季光, Wang Guiling, Han Yanbo. High Technology Letters (EI, CAS), 2013, No. 2, pp. 203-207 (5 pages)
To extract structured data from a web page with customized requirements, a user labels some DOM elements on the page with attribute names. The common features of the labeled elements are utilized to guide the user through the labeling process to minimize user effort, and are also utilized to retrieve attribute values. To turn the attribute values into a structured result, the attribute pattern needs to be induced. For this purpose, a space-optimized suffix tree called an attribute tree is built to transform the document object model (DOM) tree into a simpler form while preserving its useful properties such as attribute sequence order. The pattern is induced bottom-up on the attribute tree, and is further used to build the structured result. Experiments are conducted and show high performance of our approach in terms of precision, recall, and structural correctness.
Keywords: web data extraction; structured data; user labeling; CUSTOMIZATION; data service
Feedback on a shared big dataset for intelligent TBM Part Ⅰ: Feature extraction and machine learning methods (Cited by: 4)
17
Authors: Jian-Bin Li, Zu-Yu Chen, Xu Li, Liu-Jie Jing, Yun-Pei Zhang, Hao-Han Xiao, Shuang-Jing Wang, Wen-Kun Yang, Lei-Jie Wu, Peng-Yu Li, Hai-Bo Li, Min Yao, Li-Tao Fan. Underground Space (SCIE, EI, CSCD), 2023, No. 4, pp. 1-25 (25 pages)
This review summarizes the research outcomes and findings documented in 45 journal papers using a shared tunnel boring machine (TBM) dataset for performance prediction and boring efficiency optimization using machine learning methods. The big dataset was collected during the Yinsong water diversion project construction in China, covering the tunnel excavation of a 20 km section with 199 items of monitoring metrics taken with an interval of one second. The research papers were the result of a call for contributions during a TBM machine learning contest in 2019 and covered a variety of topics related to the intelligent construction of TBM. This review comprises two parts. Part I is concerned with the data processing, feature extraction, and machine learning methods applied by the contributors. The review finds that the data-driven and knowledge-driven approaches in extracting important features applied by various authors are diversified, requiring further studies to achieve commonly accepted criteria. The techniques for cleaning and amending the raw data adopted by the contributors were summarized, indicating some highlights such as the importance of sufficiently high frequency of data acquisition (higher than 1 second), classification and standardization for the data preprocessing process, and the appropriate selection of features in a boring cycle. The review finds that both supervised and unsupervised machine learning methods have been utilized by various researchers. The ensemble and deep learning methods have found wide applications. Part I highlights the important features of the individual methods applied by the contributors, including the structures of the algorithm, selection of hyperparameters, and model validation approaches.
Keywords: Big data; Machine learning method; TBM construction; data extraction; Machine learning contest
From beasts to bytes: Revolutionizing zoological research with artificial intelligence (Cited by: 2)
18
Authors: Yu-Juan Zhang, Zeyu Luo, Yawen Sun, Junhao Liu, Zongqing Chen. Zoological Research (SCIE, CSCD), 2023, No. 6, pp. 1115-1131 (17 pages)
Since the late 2010s, Artificial Intelligence (AI) including machine learning, boosted through deep learning, has boomed as a vital tool to leverage computer vision, natural language processing, and speech recognition in revolutionizing zoological research. This review provides an overview of the primary tasks, core models, datasets, and applications of AI in zoological research, including animal classification, resource conservation, behavior, development, genetics and evolution, breeding and health, disease models, and paleontology. Additionally, we explore the challenges and future directions of integrating AI into this field. Based on numerous case studies, this review outlines various avenues for incorporating AI into zoological research and underscores its potential to enhance our understanding of the intricate relationships that exist within the animal kingdom. As we build a bridge between beast and byte realms, this review serves as a resource for envisioning novel AI applications in zoological research that have not yet been explored.
Keywords: Animal science; data extraction; Classification model; Behavior analysis; Biomolecular sequences analysis
Feature Extraction of Time Series Data Based on CNN-CBAM
19
Authors: Jiaji Qin, Dapeng Lang, Chao Gao. 《国际计算机前沿大会会议论文集》 (EI), 2023, No. 1, pp. 233-245 (13 pages)
Methods for extracting features from time series data using deep learning have been widely studied, but they still suffer from problems of severe loss of feature information across different network layers and parameter redundancy. Therefore, a new time-series data feature extraction model (CNN-CBAM) that integrates convolutional neural networks (CNN) and convolutional attention mechanisms (CBAM) is proposed. First, the parameters of the CNN and BiGRU prediction models are optimized through uniform design methods. Next, the CNN is used to extract features from the time series data, outputting multiple feature maps. These feature maps are then subjected to feature re-extraction by the CBAM attention mechanism at both the spatial and channel levels. Finally, the feature maps are input into the BiGRU model for prediction. Experimental results show that after CNN-CBAM processing, the stability and accuracy of the BiGRU prediction model improved by 77.6% and 76.3%, respectively, outperforming other feature extraction methods. Meanwhile, the training time of the model increased by only 7.1%, demonstrating excellent time efficiency.
Keywords: Uniform Design; CNN; CBAM; Time-series data; Feature extraction
L-Tree Match: A New Data Extraction Model and Algorithm for Huge Text Stream with Noises (Cited by: 4)
20
Authors: 邓绪斌, 朱扬勇. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2005, No. 6, pp. 763-773 (11 pages)
In this paper, a new method, named L-tree match, is presented for extracting data from complex data sources. Firstly, based on the data extraction logic presented in this work, a new data extraction model is constructed in which model components are structurally correlated via a generalized template. Secondly, a database-populating mechanism is built, along with some object-manipulating operations needed for flexible database design, to support data extraction from huge text streams. Thirdly, top-down and bottom-up strategies are combined to design a new extraction algorithm that can extract data from data sources with optional, unordered, nested, and/or noisy components. Lastly, this method is applied to extract accurate data from biological documents amounting to 100 GB for the first online integrated biological data warehouse of China.
Keywords: data extraction; data model; extraction algorithm; regular expression; WRAPPER