Information Classification and Extraction on Official Web Pages of Organizations

Abstract: As a real-time and authoritative source, the official Web pages of organizations contain a large amount of information. Because Web content and formats are diverse, pre-processing is needed to obtain unified, attributed data that is valuable for organizational analysis and mining. Existing research handles multiple Web scenarios poorly and falls short on accuracy. This paper proposes a method to transform organizational official Web pages into data with attributes. After the active blocks in a Web page are located, structural and content features are used to classify the information with a specific model. Extraction methods based on a trigger lexicon and LSTM (Long Short-Term Memory) then process the classified information efficiently and extract data that matches the attributes. Together, these steps form an accurate and efficient method for classifying and extracting information from organizational official Web pages. Experimental results on a real data set from organizational official Web pages show that the approach improves the performance indicators and exceeds the state of the art.
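
As a rough illustration of what trigger-lexicon extraction can look like, the sketch below matches the label in front of a colon against a small lexicon and keeps the text after it as the attribute value. It is a minimal sketch under stated assumptions, not the authors' implementation: the lexicon entries, the attribute names, and the function `extract_attributes` are all hypothetical, and the abstract does not say how the paper builds or applies its lexicon.

```python
# Hypothetical trigger lexicon: attribute name -> trigger words (lowercase).
# These entries are illustrative assumptions, not taken from the paper.
TRIGGER_LEXICON = {
    "phone": ["tel", "phone", "telephone"],
    "email": ["email", "e-mail"],
    "address": ["address", "location"],
}


def extract_attributes(lines, lexicon=TRIGGER_LEXICON):
    """Return {attribute: value} for lines shaped like '<trigger>: <value>'.

    Only the first match per attribute is kept; lines without a colon
    or without a known trigger word are ignored.
    """
    result = {}
    for line in lines:
        # Split on the first colon: the left side is the candidate trigger.
        head, sep, tail = line.partition(":")
        if not sep:
            continue
        key = head.strip().lower()
        for attribute, triggers in lexicon.items():
            if key in triggers and attribute not in result:
                result[attribute] = tail.strip()
    return result


if __name__ == "__main__":
    block = ["Tel: 010-1234", "E-mail: a@b.org", "Welcome to our site"]
    print(extract_attributes(block))
```

An exact-match lexicon like this only covers text with explicit labels; free-form sentences would need a learned model such as the LSTM the abstract mentions.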

Authors: Jinlin Wang, Xing Wang, Hongli Zhang, Binxing Fang, Yuchen Yang, Jianan Liu

Affiliations: School of Computer Science and Technology; China Electronic Equipment System Engineering Company

Source: Computers, Materials & Continua (SCIE, EI), 2020, No. 9, pp. 2057-2073 (17 pages)

Funding: This work was supported by the National Key Research and Development Program of China (Nos. 2016QY03D0501, 2017YFB0803300), the National Natural Science Foundation of China (Nos. 61601146, 61732022), and the Sichuan Science and Technology Program (No. 2019YFSY0049).

Keywords: Web pre-process; feature classification; data extraction; trigger lexicon; LSTM

Classification code: TP3 [Automation and Computer Technology - Computer Science and Technology]
