Funding: Supported by the Korea Institute of Science and Technology Information (KISTI) through the Construction on Science & Technology Content Curation Program (K-20-L01-C01) and by the National Research Foundation of Korea (NRF) under a grant funded by the Korean Government (MSIT) (No. NRF-2018R1C1B5031408).
Abstract: The volume of academic literature, such as academic conference papers and journals, has increased rapidly worldwide, and research on metadata extraction is ongoing. However, high-performing metadata extraction remains challenging because layout formats vary from one journal publisher to another. To accommodate this diversity of layouts, we propose a novel LAyout-aware Metadata Extraction (LAME) framework with three characteristics: the design of an automatic layout analysis, the construction of a large metadata training set, and the implementation of a metadata extractor. In the framework, we designed an automatic layout analysis using PDFMiner. Based on the layout analysis, a large volume of metadata-separated training data, including the title, abstract, author name, author affiliated organization, and keywords, was automatically extracted. Moreover, we constructed a pre-trained model, Layout-MetaBERT, to extract the metadata from academic journals with varying layout formats. In experiments, our metadata extractor exhibited robust performance (Macro-F1, 93.27%) in metadata extraction for unseen journals with different layout formats.
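The reported score is a Macro-F1, which averages the per-class F1 over all metadata classes so that rare classes count as much as frequent ones. The following is a minimal sketch of that metric in plain Python. It is not the authors' evaluation code; the function name `macro_f1` and the label strings are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed

    scores = []
    for label in sorted(set(gold) | set(pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never matches
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for text blocks on a first page (illustrative only).
gold = ["title", "author", "affiliation", "abstract", "keywords", "abstract"]
pred = ["title", "author", "author", "abstract", "keywords", "abstract"]
print(f"Macro-F1: {macro_f1(gold, pred):.4f}")
```

Because every class contributes equally to the average, a single poorly recognized class (here, the hypothetical `affiliation`) lowers the score noticeably even when most blocks are labeled correctly.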