Faced with evolving attacks on recommender systems, many detection features have been hand-engineered and used in supervised or unsupervised detection methods. However, features extracted by human engineering are usually aimed at specific types of attacks; to detect new types of attacks, traditional methods have to re-extract detection features at a high knowledge cost. To address these limitations, a method for the automatic extraction of robust features is proposed, and an AdaBoost-based detection method is then presented. Firstly, to obtain a robust representation with prior knowledge, different corruption rates are calculated for items according to the distribution of their ratings, unlike the uniform corruption rate in the traditional mLDA (marginalized Linear Denoising Autoencoder). Secondly, rating sparsity is used to weight the mapping matrix when extracting the low-dimensional representation. Moreover, a uniform corruption rate is set for the next layer of the mSLDA (marginalized Stacked Linear Denoising Autoencoder) to extract stable and robust user features. Finally, in the robust feature space, an AdaBoost-based detection method is proposed to alleviate the imbalanced classification problem. Experimental results on the Netflix and Amazon review datasets indicate that the proposed method can effectively detect various attacks.
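The closed-form mapping at the heart of a marginalized denoising autoencoder can be sketched in a few lines. The per-item keep-probability scheme below (items with more ratings are corrupted less) is an illustrative assumption, and the mapping follows the standard mDA solution of Chen et al. (2012) generalized to per-feature keep probabilities, not the paper's exact sparsity-weighted variant.

```python
import numpy as np

def per_item_keep_prob(rating_counts, q_min=0.3, q_max=0.9):
    """Hypothetical scheme: frequently rated items keep more of their
    signal (are corrupted less), mimicking rating-distribution-aware rates."""
    c = np.asarray(rating_counts, dtype=float)
    span = max(c.max() - c.min(), 1e-12)
    return q_min + (c - c.min()) / span * (q_max - q_min)

def mlda_mapping(X, q, reg=1e-6):
    """Closed-form mDA mapping W minimizing E||X - W*X_corrupted||^2,
    generalized to a per-feature keep-probability vector q.
    X is d x n (items x users); returns the d x d mapping W."""
    S = X @ X.T
    P = S * q[np.newaxis, :]                 # E[X Xc^T]: scale column j by q_j
    Q = S * np.outer(q, q)                   # E[Xc Xc^T] off-diagonal terms
    np.fill_diagonal(Q, np.diag(S) * q)      # diagonal uses q_i, not q_i^2
    return P @ np.linalg.inv(Q + reg * np.eye(len(q)))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))                 # toy item-by-user rating matrix
q = per_item_keep_prob(rng.integers(1, 100, size=5))
W = mlda_mapping(X, q)
hidden = np.tanh(W @ X)                      # squashed features fed to the next layer
```

Stacking several such layers (with a uniform rate on the deeper layers) yields mSLDA-style features that a downstream AdaBoost classifier can consume.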
Purpose: Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we aim to combine the benefits of the sequence-labeling formulation and pretrained language models to propose an automatic keyphrase extraction model for Chinese scientific research. Design/methodology/approach: We regard AKE from Chinese text as a character-level sequence-labeling task to avoid the segmentation errors of Chinese tokenizers, and initialize our model with the pretrained language model BERT, released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, which contains 100,000 abstracts as the training set, 6,000 abstracts as the development set and 3,094 abstracts as the test set. We use unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF and TextRank, and supervised machine learning methods, including Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory networks (BiLSTM) and BiLSTM-CRF, as baselines. Experiments are designed to compare word-level and character-level sequence-labeling approaches on supervised machine learning models and BERT-based models. Findings: Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence-labeling model based on BERT obtains an F1 score of 59.80%, a 9.64% absolute improvement. Research limitations: We consider only the automatic keyphrase extraction task rather than keyphrase generation, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases. Practical implications: We make our character-level IOB-format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community at: https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction. Originality/value: Through comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task given the general trend toward pretrained language models. Our proposed dataset also provides a unified basis for model evaluation and can promote the development of Chinese automatic keyphrase extraction.
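The character-level sequence-labeling formulation can be illustrated by converting gold keyphrases into per-character IOB tags, so that no Chinese word segmentation is needed before tagging. The helper below is a minimal sketch of that labeling step, not the released dataset code.

```python
def char_iob_tags(text, keyphrases):
    """Label each character B (begin), I (inside), or O (outside) a keyphrase."""
    tags = ["O"] * len(text)
    for kp in sorted(keyphrases, key=len, reverse=True):  # longer phrases first
        start = text.find(kp)
        while start != -1:
            if all(t == "O" for t in tags[start:start + len(kp)]):
                tags[start] = "B"
                for i in range(start + 1, start + len(kp)):
                    tags[i] = "I"
            start = text.find(kp, start + 1)
    return tags

# Toy medical fragment: "糖尿病" (diabetes) is the annotated keyphrase.
print(char_iob_tags("糖尿病患者研究", ["糖尿病"]))  # → ['B', 'I', 'I', 'O', 'O', 'O', 'O']
```

These per-character tags are exactly what a BERT token-classification head is trained to predict in the character-level setup.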
The commercial high-resolution imaging satellite IKONOS, with 1 m spatial resolution, is an important source of information for urban planning and geographical information system (GIS) applications. In this paper, a morphological method is proposed that combines automatic thresholding and morphological operations to extract road centerlines in urban environments. The method is designed to cope with common obstacles to urban road-centerline extraction, such as vehicles, vegetation and buildings. Based on this morphological method, an object extractor is designed to extract road networks from high-resolution remote sensing images. Filters such as line reconstruction and region-filling techniques are applied to connect disconnected road segments and remove small redundant regions. Finally, a thinning algorithm is used to extract the road centerline. Experiments conducted on high-resolution IKONOS and QuickBird images show the efficiency of the proposed method.
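The automatic-thresholding step in such pipelines is commonly Otsu's method, which picks the gray level maximizing the between-class variance of the foreground/background split. The NumPy version below is a generic sketch of that step, not the paper's implementation.

```python
import numpy as np

def otsu_threshold(img, nbins=256):
    """Automatic threshold: maximize between-class variance over all splits."""
    hist, edges = np.histogram(img.ravel(), bins=nbins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    w = hist.astype(float) / hist.sum()
    omega = np.cumsum(w)                 # probability mass of class 0 up to each bin
    mu = np.cumsum(w * centers)          # cumulative mean of class 0
    mu_t = mu[-1]                        # global mean
    denom = omega * (1.0 - omega)
    denom[denom == 0] = np.nan           # avoid division by zero at the extremes
    sigma_b2 = (mu_t * omega - mu) ** 2 / denom
    return centers[np.nanargmax(sigma_b2)]

rng = np.random.default_rng(0)
road = rng.normal(200.0, 5.0, 500)       # bright road-like pixels (synthetic)
background = rng.normal(50.0, 5.0, 500)  # darker vegetation/shadow pixels
t = otsu_threshold(np.concatenate([road, background]))
```

The binary map produced by such a threshold is then cleaned with morphological opening/closing before thinning.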
Purpose: The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from universities' websites. The information extracted automatically can be updated more frequently than once per year, and is safe from manipulation or misinterpretation. Moreover, this approach gives us flexibility in collecting indicators about the efficiency of universities' websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions that allow new insights for "profiling" the analyzed universities. Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining and web usage mining. The information used to compute our indicators has been extracted from the universities' websites using web scraping and text mining techniques. The scraped information has been stored in a NoSQL DB in a semi-structured form to allow information to be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data have also been collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the web has been combined with university structural information taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at the European level. All of the above was used to perform a clustering of 79 Italian universities based on structural and digital indicators. Findings: The main findings of this study concern the evaluation of universities' potential for digitalization, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of the quality and impact of universities' websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features by applying clustering techniques to them. Research limitations: The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad. Practical implications: The approach proposed in this study, and its illustration on Italian universities, shows the usefulness of recently introduced automatic data extraction and web scraping approaches, and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems. Originality/value: This work applies, for the first time, to university websites some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition and nontrivial text mining operations (Bruni & Bianchi, 2020).
The vast availability of information sources has created a need for research on automatic summarization. Current methods operate either by extraction or by abstraction. Extraction methods are interesting because they are robust and independent of the language used. An extractive summary is obtained by selecting sentences from the original source based on their information content. This selection can be automated using a classification function induced by a machine learning algorithm, which classifies sentences into two groups, important or non-important; the important sentences then form the summary. However, the performance of this function directly depends on the training set used to induce it. This paper proposes an original way of optimizing this training set by inserting lexemes obtained from ontological knowledge bases, so that the optimized training set is reinforced by ontological knowledge. An experiment with four machine learning algorithms was conducted to validate this proposition. The improvement achieved is clearly significant for each of these algorithms.
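A minimal sketch of the ontology-enrichment idea: each sentence's word set is expanded with lexemes from a mini-ontology before scoring, so synonymous sentences reinforce each other. The ontology here is hypothetical, and the frequency heuristic is a toy stand-in for the learned classification function.

```python
from collections import Counter

def summarize(sentences, ontology, k=2):
    """Score sentences by document-frequency of their (ontology-expanded)
    words, then keep the top-k in original order: a toy extractive summary."""
    def expand(words):
        out = set(words)
        for w in words:
            out.update(ontology.get(w, ()))   # add ontology lexemes
        return out
    bags = [expand(s.lower().split()) for s in sentences]
    df = Counter(w for bag in bags for w in bag)
    scores = [sum(df[w] for w in bag) / len(bag) for bag in bags]
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(ranked)]

# Hypothetical mini-ontology linking synonyms to a shared lexeme.
onto = {"car": {"vehicle"}, "automobile": {"vehicle"}}
docs = ["The car stopped", "The automobile was fast", "Birds sing at dawn"]
print(summarize(docs, onto, k=2))
```

Without the ontology, "car" and "automobile" would never reinforce each other; the shared lexeme "vehicle" is what lifts both sentences above the unrelated one.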
This paper proposes a new method for the semi-automatic extraction of semantic structures from unlabelled corpora in specific domains. The approach is statistical in nature, and the extracted structures can be used for shallow parsing and semantic labeling. By iteratively extracting new words and clustering words, we obtain an initial semantic lexicon that groups words of the same semantic meaning together as a class. After that, a bootstrapping algorithm is adopted to extract semantic structures. Then the semantic structures are used to extract …
In this paper, we propose a novel automatic object extraction algorithm, named the Template-Guided Live Wire, based on the popular live-wire techniques, and discuss in detail its application to tongue extraction in digital images. Guided by a given template curve that approximates the tongue's shape, our method can extract the tongue without any human intervention. We also discuss in detail how the template guides the live wire, and why our method functions more effectively than other boundary-based segmentation methods, especially the snake algorithm. Experimental results on tongue images are also provided to show our method's better accuracy and robustness compared with the snake algorithm.
Epilepsy is a common neurological disorder that occurs at all ages. It not only brings physical pain to patients, but also places a huge burden on the lives of patients and their families. At present, epilepsy detection is still achieved through the observation of electroencephalography (EEG) by medical staff. However, this process is time-consuming and exhausting, creating a huge workload for medical staff, so realizing the automatic detection of epilepsy is particularly important. This paper introduces, in detail, the overall framework of EEG-based automatic epilepsy identification and the typical methods involved in each step. For the core modules, namely the signal-acquisition analog front end (AFE), feature extraction and classifier selection, the methods are summarized and theoretically explained. Finally, future research directions in the field of automatic epilepsy detection are discussed.
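A typical hand-crafted feature in the feature-extraction module of such pipelines is spectral band power. The sketch below is a generic illustration on a synthetic signal, not a method from any specific system surveyed.

```python
import numpy as np

def band_power(signal, fs, lo, hi):
    """Average spectral power of an EEG segment in the [lo, hi) Hz band,
    a classic hand-crafted feature for seizure detection."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    mask = (freqs >= lo) & (freqs < hi)
    return psd[mask].mean()

fs = 256
t = np.arange(fs * 4) / fs                 # 4-second segment at 256 Hz
eeg = np.sin(2 * np.pi * 10 * t)           # synthetic 10 Hz alpha-band rhythm
alpha = band_power(eeg, fs, 8, 13)         # band containing the oscillation
delta = band_power(eeg, fs, 0.5, 4)        # band without it
```

A feature vector of several such band powers (delta, theta, alpha, beta, gamma) is then fed to the classifier-selection stage.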
In the last two decades, significant research has been conducted on the automated extraction of rock mass discontinuity characteristics from three-dimensional (3D) models. This has produced several methodologies for acquiring discontinuity measurements from 3D models, such as point clouds generated using laser scanning or photogrammetry. However, even with the numerous automated and semi-automated methods presented in the literature, no single method can automatically characterize discontinuities accurately in a minimum of time. In this paper, we critically review the existing methods for extracting discontinuity characteristics such as joint sets and orientations, persistence, joint spacing, roughness and block size from point clouds, digital elevation maps or meshes, and identify the strengths and drawbacks of each method. We found that approaches based on voxels and region growing are superior for extracting joint planes from 3D point clouds. Normal tensor voting with a trace-growth algorithm is a robust method for measuring joint trace length from 3D meshes. Spacing is estimated by calculating the perpendicular distance between joint planes. Several independent roughness indices exist to quantify roughness from 3D surface models, but they still need to be incorporated into automated methodologies. Finally, there is a lack of efficient algorithms for the direct computation of block size from 3D rock mass surface models.
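For planes belonging to the same joint set (sharing a normal vector), the perpendicular-distance computation behind the spacing estimate reduces to a one-liner:

```python
import numpy as np

def joint_spacing(normal, d1, d2):
    """Perpendicular spacing between two parallel joint planes
    n·x = d1 and n·x = d2; the normal need not be unit length."""
    return abs(d2 - d1) / np.linalg.norm(normal)

n = np.array([0.0, 0.0, 2.0])       # shared joint-set normal
print(joint_spacing(n, 4.0, 10.0))  # → 3.0
```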
The field of sentiment analysis (SA) has grown in tandem with social networking platforms for exchanging opinions and ideas. Many people around the world share their views through social media such as Facebook and Twitter. The goal of opinion mining, commonly referred to as sentiment analysis, is to categorise and forecast a target's opinion: text documents or sentences can be classified depending on whether they express a positive or negative perspective on a given topic. Compared to sentiment analysis, text categorization may appear to be a simple process, but a number of challenges have prompted numerous studies in this area. A feature-selection-based classification algorithm combining the firefly algorithm with Lévy flight and a multilayer perceptron (MLP) is proposed as a way to automate sentiment analysis. In this study, online product review analysis is enhanced by integrating classification and feature selection: the firefly (FF) algorithm is used to extract features from online product reviews, and a multilayer perceptron (MLP) is used to classify sentiment. The experiments employ two datasets, and the results are assessed using a variety of criteria. From these tests, it can be concluded that the FFL-MLP algorithm achieves better classification performance for both Canon (98% accuracy) and iPod (99% accuracy) reviews.
Relative radiometric normalization (RRN) minimizes radiometric differences among images caused by inconsistencies in acquisition conditions rather than by changes in the surface. The scale-invariant feature transform (SIFT) can automatically extract control points (CPs) and is commonly used for remote sensing images. However, its results are often inaccurate and sometimes contain incorrect matches, because it generates a small number of false CP pairs with a high false-alarm matching rate. This paper presents a modified method that improves the performance of SIFT CP matching by applying the sum of absolute differences (SAD) in a different manner, for the new generation of near-equatorial-orbit optical satellites and for multi-sensor images. The proposed method, which has a significantly high rate of correct matches, improves CP matching. The data in this study were obtained from RazakSAT, a new near-equatorial satellite system. The proposed method involves six steps: 1) data reduction; 2) applying SIFT to automatically extract CPs; 3) refining CP matching using the SAD algorithm with an empirical threshold; 4) calculating true CP intensity values over all image bands; 5) fitting a linear regression model between the intensity values of CPs located in the reference and sensed image bands; and 6) conducting relative radiometric normalization using the regression transformation functions. Different thresholds (50 and 70) were experimentally tested in this study; following the proposed method, the numbers of false extracted SIFT CP pairs were reduced from 775, 1125, 883, 804, 883 and 681 to 342, 424, 547, 706, 547 and 469 correctly matched pairs, respectively.
The scale-invariant feature transform (SIFT) is well known for automatic control point (CP) extraction from remote sensing images; however, its results are often inaccurate and sometimes contain incorrect matches, because it generates a small number of false CP pairs whose matching has a high false-alarm rate. This paper presents a modified method that improves the performance of SIFT CP matching by applying the sum of absolute differences (SAD) in a different manner, for the new generation of near-equatorial-orbit optical satellites (NEqO) and for multi-sensor images. The proposed method leads to improved CP matching with a significantly higher rate of correct matches. The data in this study were obtained from the RazakSAT satellite, covering the Kuala Lumpur-Pekan area. The proposed method consists of three parts: (1) applying SIFT to extract CPs automatically; (2) refining CP matching with the SAD algorithm and an empirical threshold; and (3) evaluating the refined CPs by comparing the results of the original SIFT with those of the proposed method. The results indicate accurate and precise performance, showing the effectiveness and robustness of the proposed approach.
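The SAD refinement step can be sketched as follows: for each candidate CP pair, compare local patches in the reference and sensed images and reject pairs whose absolute-difference sum exceeds a threshold. The window size and threshold here are illustrative values, not the paper's empirically tuned ones.

```python
import numpy as np

def refine_matches(img_ref, img_sen, pairs, window=5, threshold=200.0):
    """Keep only CP pairs whose local patches agree under SAD."""
    half = window // 2
    kept = []
    for (r0, c0), (r1, c1) in pairs:
        p0 = img_ref[r0 - half:r0 + half + 1, c0 - half:c0 + half + 1]
        p1 = img_sen[r1 - half:r1 + half + 1, c1 - half:c1 + half + 1]
        if p0.shape == p1.shape == (window, window):   # skip border points
            if np.abs(p0.astype(float) - p1.astype(float)).sum() <= threshold:
                kept.append(((r0, c0), (r1, c1)))
    return kept

rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(50, 50))
sen = ref.copy()                                    # same scene...
sen[30:40, 30:40] = rng.integers(0, 256, (10, 10))  # ...with one corrupted region
pairs = [((10, 10), (10, 10)), ((34, 34), (34, 34))]
good = refine_matches(ref, sen, pairs)              # second pair should be rejected
```

In the papers above, this filter runs on the raw SIFT matches before the regression step, removing the high-false-alarm pairs.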
Craters are salient terrain features on planetary surfaces that provide useful information about the relative dating of the geological units of planets. In addition, they are ideal landmarks for spacecraft navigation. Due to low contrast and uneven illumination, however, automatic extraction of craters remains a challenging task. This paper presents a saliency detection method for crater edges and a feature matching algorithm based on edge information. Craters are extracted through saliency edge detection, edge extraction and selection, feature matching of the edges of the same crater, and robust ellipse fitting. In the edge matching algorithm, a crater feature model is proposed by analyzing the relationship between highlight-region edges and shadow-region edges; crater edges are then paired through an effective matching algorithm. Experiments on real planetary images show that the proposed approach is robust to different illuminations and topographies, with a detection rate larger than 90%.
Purpose: This study aims to build an automatic survey generation tool, named CitationAS, based on citation content, represented by the set of citing sentences in the original articles. Design/methodology/approach: Firstly, we apply LDA to analyse the topic distribution of citation content. Secondly, in CitationAS, we use bisecting K-means, Lingo and STC to cluster the retrieved citation content. Then Word2Vec, WordNet and their combination are applied to generate cluster labels. Next, we employ TF-IDF and MMR, also taking sentence location information into account, to extract important sentences, which are used to generate surveys. Finally, we adopt manual evaluation of the generated surveys. Findings: In the experiments, we choose 20 high-frequency phrases as search terms. Results show that Lingo-Word2Vec, STC-WordNet and bisecting K-means-Word2Vec have better clustering effects. On a 5-point evaluation scale, the survey quality scores obtained by the designed methods are close to 3, indicating that the surveys are within acceptable limits. When sentence location information is considered, survey quality improves. Combinations of Lingo, Word2Vec, and TF-IDF or MMR can achieve higher survey quality. Research limitations: The manual evaluation method may have a certain subjectivity. We use a simple linear function to combine Word2Vec and WordNet, which may not bring out their strengths. The generated surveys may not contain newly created knowledge from articles in which it is concentrated in sentences that are not cited. Practical implications: The CitationAS tool can automatically generate a comprehensive, detailed and accurate survey according to the user's search terms. It can also help researchers learn about the state of research in a certain field. Originality/value: The CitationAS tool is practical. It merges cluster labels at the semantic level to improve clustering results, and it considers sentence location information when calculating sentence scores with TF-IDF and MMR.
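The MMR re-ranking used for sentence extraction balances relevance to the search term against redundancy with already-selected sentences. Below is a generic MMR sketch with an illustrative trade-off weight, not CitationAS's tuned configuration.

```python
import numpy as np

def mmr_select(sim_to_query, sim_matrix, k=2, lam=0.5):
    """Maximal Marginal Relevance: greedily pick sentences relevant to the
    query (weight lam) but not redundant with already-picked ones (1 - lam)."""
    n = len(sim_to_query)
    selected, candidates = [], list(range(n))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((sim_matrix[i][j] for j in selected), default=0.0)
            return lam * sim_to_query[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

rel = [0.9, 0.85, 0.3]                    # similarity of each sentence to the query
sim = np.array([[1.0, 0.95, 0.1],         # sentences 0 and 1 are near-duplicates
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(mmr_select(rel, sim, k=2))  # → [0, 2]
```

Note how plain relevance ranking would pick the two near-duplicates [0, 1]; the redundancy penalty is what swaps in sentence 2.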
Automatic classification of blog entries is generally treated as a semi-supervised machine learning task, in which blog entries are automatically assigned to one of a set of pre-defined classes based on features extracted from their textual content. This paper attempts automatic classification of unstructured blog entries through pre-processing steps such as tokenization, stop-word elimination and stemming; statistical techniques for feature-set extraction; feature-set enhancement using semantic resources; and modeling with two alternative machine learning models, the naïve Bayesian model and the artificial neural network (ANN) model. Empirical evaluations indicate that this multi-step classification approach achieves good overall classification accuracy on unstructured blog text datasets with both machine learning alternatives. However, the naïve Bayesian classification model clearly outperforms the ANN-based model when only a smaller feature set is available, which is usually the case when a blog topic is recent and the number of available training examples is restricted.
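The naïve Bayesian side of the comparison can be sketched as a multinomial bag-of-words model with Laplace smoothing. The training snippets below are toy examples, not the paper's blog datasets.

```python
import math
from collections import Counter

class NaiveBayesText:
    """Multinomial naïve Bayes over bag-of-words with Laplace smoothing."""
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc.lower().split())
        self.vocab = set(w for cnt in self.counts.values() for w in cnt)
        return self

    def predict(self, doc):
        def loglik(c):
            total = sum(self.counts[c].values()) + len(self.vocab)  # smoothed denom
            return self.prior[c] + sum(
                math.log((self.counts[c][w] + 1) / total)           # add-one smoothing
                for w in doc.lower().split() if w in self.vocab)
        return max(self.classes, key=loglik)

nb = NaiveBayesText().fit(
    ["great match and goals", "team wins the league",
     "stock prices fall", "market rally today"],
    ["sport", "sport", "finance", "finance"])
print(nb.predict("the team scored goals"))  # → sport
```

The smoothing is what keeps the model usable on small feature sets, which matches the paper's observation about recent blog topics with little training data.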
Web pages contain richer content than pure text, such as hyperlinks, HTML tags and metadata, so web page categorization differs from pure-text categorization. For Internet Chinese news pages, a practical algorithm was proposed for extracting subject concepts from web pages without a thesaurus; after incorporating these category-subject concepts into a knowledge base, web pages were classified by a hybrid algorithm, with an experimental corpus extracted from the Xinhua news site. Experimental results show that categorization performance is improved by using web page features.
Water on the Earth's surface is an essential part of the hydrological cycle. Water resources include surface waters, groundwater, lakes, inland waters, rivers, coastal waters and aquifers. Monitoring lake dynamics is critical to the sustainable management of water resources on Earth, and in the cryosphere lake ice cover is a robust indicator of local climate variability and change. It is therefore timely to review the methods, technologies and satellite sensors employed for the extraction of lakes from satellite imagery. The present review focuses on a comprehensive evaluation of existing methods for extracting lake or water-body features from remotely sensed optical data. We summarize pixel-based, object-based, hybrid, spectral-index-based, and target and spectral matching methods employed to extract lake features in urban and cryospheric environments. To our knowledge, almost all published research on the extraction of surface lakes in cryospheric environments has essentially used satellite remote sensing data and geospatial methods. Satellite sensors of varying spatial, temporal and spectral resolutions have been used to extract and analyze information about surface water. Multispectral remote sensing has been widely utilized in cryospheric studies, employing a variety of electro-optical satellite sensor systems for the characterization and extraction of various cryospheric features, such as glaciers, sea ice, lakes and rivers, the extent of snow and ice, and icebergs. The most common methods for extracting water bodies use single-band threshold methods, spectral index ratio (SIR)-based multiband methods, image segmentation methods, spectral-matching methods, and target detection methods (unsupervised, supervised and hybrid). A synergetic fusion of various remote sensing methods has also been proposed to improve water information extraction accuracies. The methods developed so far are not generic; rather, they are specific to a location, a satellite sensor, or the type of feature to be extracted. Many factors lead to inaccurate lake-feature extraction results in cryospheric regions; for example, mountain shadows, which also appear as dark pixels, are often misclassified as open lakes. Methods that work well for feature extraction or land cover classification in the cryospheric environment are not guaranteed to work in the same manner in the urban environment. Thus, in the coming years, much work is expected on object-based or hybrid approaches involving both pixel- and object-based technology. A more accurate, versatile and robust method needs to be developed that works independently of geographical location (for both urban and cryospheric settings) and of the type of optical sensor.
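A representative SIR-based method is McFeeters' Normalized Difference Water Index (NDWI), which exploits the fact that water reflects strongly in the green band and weakly in the near-infrared. The zero threshold below is the simplest choice; studies typically tune it per scene.

```python
import numpy as np

def ndwi(green, nir):
    """NDWI = (green - nir) / (green + nir); water pixels tend to be > 0."""
    green = green.astype(float)
    nir = nir.astype(float)
    return (green - nir) / np.maximum(green + nir, 1e-12)  # guard zero division

# Toy 2x2 reflectance bands: left column is water, right column is land.
green = np.array([[0.30, 0.05], [0.28, 0.04]])
nir   = np.array([[0.05, 0.40], [0.06, 0.35]])
water_mask = ndwi(green, nir) > 0   # simple fixed-threshold water map
```

This one-band-ratio map is exactly the kind of output that the hybrid and object-based approaches discussed above refine, e.g. to separate true lakes from mountain shadow.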
The volume of academic literature, such as academic conference papers and journals, has increased rapidly worldwide, and research on metadata extraction is ongoing. However, high-performing metadata extraction is still challenging due to the diverse layout formats used by journal publishers. To accommodate the diversity of academic journal layouts, we propose a novel LAyout-aware Metadata Extraction (LAME) framework with three components: design of an automatic layout analysis, construction of a large metadata training set, and implementation of a metadata extractor. In the framework, we designed an automatic layout analysis using PDFMiner. Based on the layout analysis, a large volume of metadata-separated training data, including the title, abstract, author names, author-affiliated organizations, and keywords, was automatically extracted. Moreover, we constructed a pre-trained model, Layout-MetaBERT, to extract metadata from academic journals with varying layout formats. Experiments with our metadata extractor showed robust performance (Macro-F1, 93.27%) in metadata extraction for unseen journals with different layout formats.
Fingerprints are an extraordinary source of identification for people, and fingerprint recognition is one of the oldest forms of biometric identification. However, obtaining a good fingerprint image is not easy, so fingerprint images must be processed before matching. A crucial step in fingerprint minutiae measurement is to obtain minutiae from fingerprint images reliably; yet fingerprint images are not always of perfect quality, as they may be degraded and corrupted by variations in skin and impression conditions. Image enhancement techniques are therefore used before minutiae extraction to obtain a more reliable estimate of minutiae locations. The primary objective of this research is to produce a better, enhanced fingerprint image. We studied the elements underlying a high-performance feature-point detection algorithm, such as image quality, segmentation, image enhancement and feature detection. Commonly used features for assessing fingerprint image quality are Fourier spectrum energy, Sobel filter energy and local orientation. Accurate segmentation of fingerprint ridges from a noisy background is vital, and for efficient enhancement and feature extraction we suppress the noise in the segmented regions. Pre-processing consists of orientation-field estimation, ridge frequency estimation, Sobel filtering and segmentation; the resulting image is then passed to a thinning algorithm and subsequent minutiae extraction. After extracting these minutiae points, the image with its feature points can be used for matching, for identifying offenders, and for other security applications. The procedure of image pre-processing and minutiae extraction is explored, and simulations are performed in the MATLAB environment to assess the performance of the implemented algorithm.
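The minutiae-extraction step after thinning is commonly implemented with the Rutovitz crossing number, which counts 0-to-1 transitions around each skeleton pixel. This is a generic sketch of that standard test, not the authors' MATLAB code.

```python
import numpy as np

def minutiae_type(skel, r, c):
    """Crossing number CN at skeleton pixel (r, c):
    CN = 1 marks a ridge ending, CN = 3 a bifurcation."""
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]   # 8 neighbours in cyclic order
    p = [int(skel[r + dr, c + dc]) for dr, dc in offs]
    cn = sum(abs(p[i] - p[(i + 1) % 8]) for i in range(8)) // 2
    return {1: "ending", 3: "bifurcation"}.get(cn, "none")

skel = np.zeros((5, 5), dtype=int)
skel[2, 0:3] = 1                   # ridge stops at (2, 2): an ending
print(minutiae_type(skel, 2, 2))   # → ending

skel2 = np.zeros((5, 5), dtype=int)
skel2[2, :] = 1                    # horizontal ridge through (2, 2) ...
skel2[0:2, 2] = 1                  # ... plus an upward branch: a bifurcation
print(minutiae_type(skel2, 2, 2))  # → bifurcation
```

Scanning every on-pixel of the thinned image with this test yields the (row, column, type) minutiae list used for matching.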
Funding: Supported by the National Natural Science Foundation of China [Nos. 61772452, 61379116], the Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi [No. 2019L0847], and the Natural Science Foundation of Hebei Province, China [No. F2015203046]
Abstract: Faced with evolving attacks on recommender systems, many detection features have been hand-engineered and used in supervised or unsupervised detection methods. However, hand-engineered detection features are usually aimed at specific types of attack; to detect new types of attack, traditional methods have to re-extract detection features at a high knowledge cost. To address these limitations, a method for the automatic extraction of robust features is proposed, followed by an Adaboost-based detection method. Firstly, to obtain a robust representation with prior knowledge, different corruption rates for items are calculated according to the ratings’ distribution, unlike the uniform corruption rate in traditional mLDA (marginalized Linear Denoising Autoencoder). Secondly, the sparsity of the ratings is used to weight the mapping matrix when extracting the low-dimensional representation. Moreover, a uniform corruption rate is also set for the next layer in mSLDA (marginalized Stacked Linear Denoising Autoencoder) to extract stable and robust user features. Finally, in the robust feature space, an Adaboost-based detection method is proposed to alleviate the imbalanced-classification problem. Experimental results on the Netflix and Amazon review datasets indicate that the proposed method can effectively detect various attacks.
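The per-item corruption idea can be sketched in a few lines. The abstract does not give the exact formula derived from the ratings' distribution, so the linear popularity scaling below (and the names `item_corruption_rates`, `p_min`, `p_max`) is purely an illustrative assumption:

```python
def item_corruption_rates(rating_counts, p_min=0.1, p_max=0.9):
    """Map each item's rating count to a corruption probability.

    Hypothetical scheme: sparsely rated items are corrupted less often,
    densely rated items more often. The paper derives its rates from the
    ratings' distribution; this linear scaling is an assumption.
    """
    lo, hi = min(rating_counts.values()), max(rating_counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts match
    return {item: p_min + (p_max - p_min) * (c - lo) / span
            for item, c in rating_counts.items()}
```

A denoising autoencoder would then drop each item's rating with its own rate instead of a single global probability.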
Funding: This work is supported by the project “Research on Methods and Technologies of Scientific Researcher Entity Linking and Subject Indexing” (Grant No. G190091) from the National Science Library, Chinese Academy of Sciences, and the project “Design and Research on a Next Generation of Open Knowledge Services System and Key Technologies” (2019XM55).
Abstract: Purpose: Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we aim to combine the benefits of the sequence-labeling formulation and pretrained language models to propose an automatic keyphrase extraction model for Chinese scientific research. Design/methodology/approach: We treat AKE from Chinese text as a character-level sequence-labeling task to avoid the segmentation errors of Chinese tokenizers, and initialize our model with the pretrained language model BERT, released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, which contains 100,000 abstracts as the training set, 6,000 abstracts as the development set and 3,094 abstracts as the test set. We use unsupervised keyphrase extraction methods including term frequency (TF), TF-IDF and TextRank, and supervised machine learning methods including Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory networks (BiLSTM), and BiLSTM-CRF as baselines. Experiments are designed to compare word-level and character-level sequence-labeling approaches on supervised machine learning models and BERT-based models. Findings: Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence-labeling model based on BERT obtains an F1 score of 59.80%, a 9.64% absolute improvement. Research limitations: We only consider the automatic keyphrase extraction task rather than keyphrase generation, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases. Practical implications: We make our character-level IOB-format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community, which is available
at: https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction. Originality/value: By designing comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task under the general trend of pretrained language models. Our proposed dataset also provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
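The character-level IOB formulation can be sketched with a small helper that converts a text and its gold keyphrases into per-character tags. The function name and the greedy first-match handling of overlaps are illustrative assumptions, not the dataset's actual annotation tooling:

```python
def char_iob_tags(text, keyphrases):
    """Character-level IOB labels for AKE as sequence labeling.

    Each character of an occurring keyphrase gets B (first char) or
    I (remaining chars); every other character gets O. Overlapping
    spans are resolved greedily by first match (an assumption).
    """
    tags = ["O"] * len(text)
    for kp in keyphrases:
        start = text.find(kp)
        while start != -1:
            # only label a span whose characters are still untagged
            if all(t == "O" for t in tags[start:start + len(kp)]):
                tags[start] = "B"
                for i in range(start + 1, start + len(kp)):
                    tags[i] = "I"
            start = text.find(kp, start + 1)
    return tags
```

Working on characters this way sidesteps Chinese tokenizer errors entirely, which is the point of the character-level formulation.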
Abstract: The commercial high-resolution imaging satellite IKONOS, with 1 m spatial resolution, is an important data source for urban planning and geographical information system (GIS) applications. In this paper, a morphological method is proposed that combines automatic thresholding and morphological operations to extract road centerlines in urban environments, addressing the obstacles that vehicles, vegetation, buildings, etc. pose to centerline extraction. Based on this morphological method, an object extractor is designed to extract road networks from high-resolution remote sensing images. Filters such as line reconstruction and region-filling techniques are applied to connect disconnected road segments and remove small redundant regions. Finally, a thinning algorithm is used to extract the road centerline. Experiments conducted on high-resolution IKONOS and QuickBird images show the efficiency of the proposed method.
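The thinning step that reduces a road mask to its centerline can be illustrated with the classic Zhang-Suen algorithm; the abstract does not name which thinning variant the authors use, so this pure-Python sketch is an assumption:

```python
def zhang_suen_thin(img):
    """Skeletonize a binary image (list of 0/1 rows) with Zhang-Suen
    thinning. Pixels are peeled from the boundary in two alternating
    subiterations until only a one-pixel-wide skeleton remains."""
    h, w = len(img), len(img[0])
    img = [row[:] for row in img]  # work on a copy

    def neighbors(y, x):
        # P2..P9, clockwise starting from the pixel directly above
        return [img[y-1][x], img[y-1][x+1], img[y][x+1], img[y+1][x+1],
                img[y+1][x], img[y+1][x-1], img[y][x-1], img[y-1][x-1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_zero = []
            for y in range(1, h - 1):
                for x in range(1, w - 1):
                    if not img[y][x]:
                        continue
                    n = neighbors(y, x)
                    b = sum(n)  # number of foreground neighbours
                    # 0->1 transitions around the clockwise cycle
                    a = sum(n[i] == 0 and n[(i + 1) % 8] == 1 for i in range(8))
                    if step == 0:
                        cond = n[0]*n[2]*n[4] == 0 and n[2]*n[4]*n[6] == 0
                    else:
                        cond = n[0]*n[2]*n[6] == 0 and n[0]*n[4]*n[6] == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_zero.append((y, x))
            for y, x in to_zero:  # simultaneous removal
                img[y][x] = 0
            changed = changed or bool(to_zero)
    return img
```

Applied to a thick binary road mask, the surviving pixels approximate the centerline that the object extractor reports.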
Funding: This work was developed with the support of the H2020 RISIS 2 Project (No. 824091) and of the “Sapienza” Research Awards No. RM1161550376E40E of 2016 and RM11916B8853C925 of 2019. This article is a largely extended version of Bianchi et al. (2019) presented at the ISSI 2019 Conference held in Rome, 2–5 September 2019.
Abstract: Purpose: The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from universities’ websites. The information automatically extracted can potentially be updated more frequently than once per year, and is safe from manipulation or misinterpretation. Moreover, this approach gives us flexibility in collecting indicators about the efficiency of universities’ websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions that allow new insights for “profiling” the analyzed universities. Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining, and web usage mining. The information needed to compute our indicators has been extracted from the universities’ websites using web scraping and text mining techniques. The scraped information has been stored in a NoSQL DB in a semistructured form so that it can be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data have also been collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the web has been combined with university structural information taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at the European level. All the above was used to
perform a clusterization of 79 Italian universities based on structural and digital indicators. Findings: The main findings of this study concern the evaluation of the digitalization potential of universities, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of the quality and impact of universities’ websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features by applying clustering techniques to them. Research limitations: The results reported in this study refer to Italian universities only, but the approach could be extended to university systems abroad. Practical implications: The approach proposed in this study, and its illustration on Italian universities, shows the usefulness of recently introduced automatic data extraction and web scraping approaches, and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems. Originality/value: This work applies to university websites, for the first time, some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition and nontrivial text mining operations (Bruni & Bianchi, 2020).
Abstract: The vast availability of information sources has created a need for research on automatic summarization. Current methods perform either extraction or abstraction. Extraction methods are interesting because they are robust and independent of the language used. An extractive summary is obtained by selecting sentences from the original source based on their information content. This selection can be automated using a classification function induced by a machine learning algorithm, which classifies sentences into two groups, important or non-important; the important sentences then form the summary. However, the efficiency of this function depends directly on the training set used to induce it. This paper proposes an original way of optimizing the training set by inserting lexemes obtained from ontological knowledge bases, so that the optimized training set is reinforced by ontological knowledge. An experiment with four machine learning algorithms was conducted to validate this proposition. The improvement achieved is clearly significant for each of these algorithms.
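A classification function of this kind needs numeric sentence features. The abstract does not enumerate the feature set, so the three features below (relative position, normalized length, keyword coverage) are common illustrative choices, not the authors' actual features or their ontology-derived lexemes:

```python
def sentence_features(sentences, keywords):
    """Toy feature vectors for an important/non-important sentence
    classifier: [relative position, normalized length, keyword coverage].
    `keywords` must be a non-empty set of lowercase terms."""
    feats = []
    n = len(sentences)
    max_len = max(len(s.split()) for s in sentences)
    for i, s in enumerate(sentences):
        words = set(s.lower().split())
        feats.append([
            i / (n - 1) if n > 1 else 0.0,          # position in document
            len(s.split()) / max_len,                # relative length
            len(words & keywords) / len(keywords),   # keyword coverage
        ])
    return feats
```

Any of the four learners mentioned above could be trained on vectors like these, with the binary important/non-important label as the target.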
Abstract: This paper proposes a new method for the semi-automatic extraction of semantic structures from unlabelled corpora in specific domains. The approach is statistical in nature, and the extracted structures can be used for shallow parsing and semantic labeling. By iteratively extracting new words and clustering words, we obtain an initial semantic lexicon that groups words of the same semantic meaning together as a class. After that, a bootstrapping algorithm is adopted to extract semantic structures. Then the semantic structures are used to extract
Abstract: In this paper, we propose a novel automatic object extraction algorithm, named the Template Guided Live Wire, based on the popular live-wire techniques, and discuss in detail its application to tongue extraction in digital images. Guided by a template curve that approximates the tongue’s shape, our method can extract the tongue without any human intervention. We also discuss in detail how the template guides the live wire, and why our method functions more effectively than other boundary-based segmentation methods, especially the snake algorithm. Experimental results on tongue images are provided to show that our method is more accurate and robust than the snake algorithm.
Funding: Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences, Grant No. XDA0330000 and Grant No. XDB44000000.
Abstract: Epilepsy is a common neurological disorder that occurs at all ages. It not only brings physical pain to patients, but also places a huge burden on the lives of patients and their families. At present, epilepsy detection is still achieved through the observation of electroencephalography (EEG) by medical staff. However, this process is time-consuming and laborious, creating a huge workload for medical staff, so automatic detection of epilepsy is particularly important. This paper introduces, in detail, the overall framework of EEG-based automatic epilepsy identification and the typical methods involved in each step. For the core modules, namely the signal-acquisition analog front end (AFE), feature extraction, and classifier selection, the methods are summarized and their theory explained. Finally, future research directions in the field of automatic epilepsy detection are discussed.
Funding: Funded by the U.S. National Institute for Occupational Safety and Health (NIOSH) under Contract No. 75D30119C06044.
Abstract: In the last two decades, significant research has been conducted on the automated extraction of rock mass discontinuity characteristics from three-dimensional (3D) models. This has produced several methodologies for acquiring discontinuity measurements from 3D models, such as point clouds generated using laser scanning or photogrammetry. However, even with the numerous automated and semiautomated methods presented in the literature, no single method can automatically characterize discontinuities accurately in a minimum of time. In this paper, we critically review the existing methods proposed in the literature for the extraction of discontinuity characteristics such as joint sets and orientations, persistence, joint spacing, roughness and block size using point clouds, digital elevation maps, or meshes. As a result of this review, we identify the strengths and drawbacks of each method used for extracting those characteristics. We found that approaches based on voxels and region growing are superior in extracting joint planes from 3D point clouds. Normal tensor voting with a trace-growth algorithm is a robust method for measuring joint trace length from 3D meshes. Spacing is estimated by calculating the perpendicular distance between joint planes. Several independent roughness indices exist to quantify roughness from 3D surface models, but they have yet to be incorporated into automated methodologies. Finally, there is a lack of efficient algorithms for the direct computation of block size from 3D rock mass surface models.
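The spacing estimate mentioned above reduces to a one-line computation once two parallel joint planes are written in the form n·x = d:

```python
import math

def joint_spacing(normal, d1, d2):
    """Spacing between two parallel joint planes n.x = d1 and n.x = d2:
    the perpendicular distance |d1 - d2| / ||n||."""
    return abs(d1 - d2) / math.sqrt(sum(c * c for c in normal))
```

In practice the plane parameters would come from planes fitted to the extracted joint-set point clusters; here they are given directly.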
Abstract: The field of sentiment analysis (SA) has grown with the aid of social networking platforms for exchanging opinions and ideas. Many people share their views and ideas around the world through social media like Facebook and Twitter. The goal of opinion mining, commonly referred to as sentiment analysis, is to categorise and forecast a target’s opinion: text documents or sentences can be classified depending on whether they express a positive or negative perspective on a given topic. Compared to sentiment analysis, text categorization may appear to be a simple process, but a number of challenges have prompted numerous studies in this area. A feature-selection-based classification algorithm combining the firefly-with-levy and multilayer perceptron (MLP) techniques has been proposed as a way to automate sentiment analysis. In this study, online product review analysis is enhanced by integrating classification and feature selection: the firefly (FF) algorithm is used to extract features from online product reviews, and a multilayer perceptron is used to classify sentiment. The experiment employs two datasets, and the results are assessed using a variety of criteria. On the basis of these tests, we conclude that the FFL-MLP algorithm has the better classification performance for Canon (98% accuracy) and iPod (99% accuracy).
Abstract: Relative radiometric normalization (RRN) minimizes radiometric differences among images caused by inconsistencies in acquisition conditions rather than by changes on the surface. The scale-invariant feature transform (SIFT) can automatically extract control points (CPs) and is commonly used for remote sensing images. However, its results are often inaccurate and sometimes contain incorrect matches caused by the generation of a small number of false CP pairs with a high false-alarm rate. This paper presents a modified method that improves SIFT CP matching by applying the sum of absolute differences (SAD) in a different manner, for the new generation of near-equatorial-orbit optical satellites and for multi-sensor images. The proposed method, which achieves a significantly higher rate of correct matches, involves six steps: 1) data reduction; 2) applying SIFT to automatically extract CPs; 3) refining CP matching using the SAD algorithm with an empirical threshold; 4) calculating true CP intensity values over all image bands; 5) fitting a linear regression model between the intensity values of CPs located in the reference and sensed image bands; and 6) conducting relative radiometric normalization using the regression transformation functions. The data in this study were obtained from RazakSAT, a new near-equatorial satellite system. Two thresholds (50 and 70) were tested experimentally; following the proposed method reduced the falsely extracted SIFT CPs from 775, 1125, 883, 804, 883 and 681 false pairs to 342, 424, 547, 706, 547, and 469 correctly matched pairs, respectively.
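Steps 5 and 6 fit a per-band linear model between CP intensities and apply it as the normalization transfer function. A minimal least-squares sketch (the function name and argument layout are assumptions):

```python
def rrn_fit(ref_vals, sensed_vals):
    """Least-squares gain/offset mapping sensed-image CP intensities to
    the reference image: ref ~ gain * sensed + offset. Fitted per band
    from the matched control-point intensities."""
    n = len(ref_vals)
    mx = sum(sensed_vals) / n
    my = sum(ref_vals) / n
    sxx = sum((x - mx) ** 2 for x in sensed_vals)
    sxy = sum((x - mx) * (y - my) for x, y in zip(sensed_vals, ref_vals))
    gain = sxy / sxx
    offset = my - gain * mx
    return gain, offset
```

Normalizing a band is then just `gain * pixel + offset` applied to every pixel of the sensed image.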
Abstract: The ability of the scale-invariant feature transform (SIFT) to automatically extract control points (CPs) is very well known for remote sensing images; however, its results are often inaccurate and sometimes contain incorrect matches arising from a small number of false CP pairs, whose matching has a high false-alarm rate. This paper presents a modification that improves the performance of SIFT CP matching by applying the sum of absolute differences (SAD) in a different manner, for the new generation of near-equatorial-orbit optical satellites (NEqO) and multi-sensor images. The proposed method leads to improved CP matching with a significantly higher rate of correct matches. The data in this study were obtained from the RazakSAT satellite, covering the Kuala Lumpur-Pekan area. The proposed method consists of three parts: (1) applying SIFT to extract CPs automatically; (2) refining CP matching with the SAD algorithm and an empirical threshold; and (3) evaluating the refined-CP scenario by comparing the results of the original SIFT with those of the proposed method. The results indicate accurate and precise performance, showing the effectiveness and robustness of the proposed approach.
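The SAD refinement step can be sketched as follows. Candidate CP pairs whose local windows differ too much are discarded as false matches; the window size and threshold here are placeholders for the paper's empirical values:

```python
def refine_matches_sad(ref, sensed, pairs, win=1, threshold=10):
    """Filter candidate CP pairs by the sum of absolute differences
    between small windows around each point. `ref` and `sensed` are 2-D
    lists of intensities; `pairs` holds ((y1, x1), (y2, x2)) tuples.
    Pairs whose SAD exceeds the (empirical) threshold are discarded."""
    def patch(img, y, x):
        return [img[y + dy][x + dx]
                for dy in range(-win, win + 1)
                for dx in range(-win, win + 1)]
    kept = []
    for (y1, x1), (y2, x2) in pairs:
        sad = sum(abs(a - b) for a, b in zip(patch(ref, y1, x1),
                                             patch(sensed, y2, x2)))
        if sad <= threshold:
            kept.append(((y1, x1), (y2, x2)))
    return kept
```

In the paper's pipeline this filter sits between the raw SIFT matching and the evaluation step, removing the false pairs that inflate the false-alarm rate.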
Funding: Supported by the National Natural Science Foundation of China (No. 61210012)
Abstract: Craters are salient terrain features on planetary surfaces and provide useful information for the relative dating of a planet's geological units. In addition, they are ideal landmarks for spacecraft navigation. Due to low contrast and uneven illumination, however, the automatic extraction of craters remains a challenging task. This paper presents a saliency detection method for crater edges and a feature matching algorithm based on edge information. Craters are extracted through saliency edge detection, edge extraction and selection, feature matching of the edges of the same crater, and robust ellipse fitting. In the edge-matching algorithm, a crater feature model is proposed by analyzing the relationship between the edges of the highlight region and those of the shadow region, and crater edges are then paired through an effective matching algorithm. Experiments on real planetary images show that the proposed approach is robust to different illumination and topographies, with a detection rate larger than 90%.
Funding: Supported by Major Projects of the National Social Science Fund (No. 17ZDA291), the Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (No. MJUKF201704), and the Qing Lan Project
Abstract: Purpose: This study aims to build an automatic survey generation tool, named CitationAS, based on citation content as represented by the set of citing sentences in the original articles. Design/methodology/approach: Firstly, we apply LDA to analyse the topic distribution of citation content. Secondly, in CitationAS, we use bisecting K-means, Lingo and STC to cluster retrieved citation content. Then Word2Vec, WordNet and their combination are applied to generate cluster labels. Next, we employ TF-IDF and MMR, also taking sentence location information into account, to extract important sentences, which are used to generate surveys. Finally, we adopt manual evaluation for the generated surveys. Findings: In the experiments, we choose 20 high-frequency phrases as search terms. Results show that Lingo-Word2Vec, STC-WordNet and bisecting K-means-Word2Vec have better clustering effects. On a 5-point evaluation scale, survey quality scores obtained by the designed methods are close to 3, indicating the surveys are within acceptable limits. When sentence location information is considered, survey quality improves. Combinations of Lingo, Word2Vec, and TF-IDF or MMR achieve higher survey quality. Research limitations: The manual evaluation method may have a certain subjectivity. We use a simple linear function to combine Word2Vec and WordNet, which may not bring out their strengths. The generated surveys may not contain some newly created knowledge from articles whose contributions are concentrated in sentences that are never cited. Practical implications: The CitationAS tool can automatically generate a comprehensive, detailed and accurate survey according to the user’s search terms. It can also help researchers learn about the state of research in a certain field. Originality/value: The CitationAS tool is practical. It merges cluster labels at the semantic level to improve clustering results, and considers sentence location information when calculating sentence scores with TF-IDF and MMR.
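The MMR sentence-extraction step balances relevance against redundancy. A minimal sketch, assuming the similarities have already been computed (the paper's actual similarity measure and λ value are not given in the abstract):

```python
def mmr_select(sim_to_query, sim_matrix, k, lam=0.7):
    """Maximal Marginal Relevance: greedily pick k sentence indices,
    trading relevance to the query (sim_to_query[i]) against redundancy
    with already-selected sentences (sim_matrix[i][j])."""
    selected = []
    candidates = list(range(len(sim_to_query)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((sim_matrix[i][j] for j in selected), default=0.0)
            return lam * sim_to_query[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With λ near 1 the selection is purely relevance-driven; lowering λ pushes the survey toward covering distinct citation clusters rather than repeating the single best one.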
Abstract: Automatic classification of blog entries is generally treated as a semi-supervised machine learning task, in which blog entries are automatically assigned to one of a set of pre-defined classes based on features extracted from their textual content. This paper attempts automatic classification of unstructured blog entries through pre-processing steps such as tokenization, stop-word elimination and stemming; statistical techniques for feature-set extraction and feature-set enhancement using semantic resources; and modeling using two alternative machine learning models, the naïve Bayesian model and the artificial neural network (ANN) model. Empirical evaluations indicate that this multi-step classification approach yields good overall classification accuracy on unstructured blog text datasets with both machine learning alternatives. However, the naïve Bayesian classification model clearly outperforms the ANN-based classification model when only a smaller feature set is available, which is usually the case when a blog topic is recent and the number of available training datasets is restricted.
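The first of the two models can be sketched as a minimal multinomial naïve Bayes classifier with add-one smoothing. Tokenization here is plain whitespace splitting, on the assumption that stop-word removal and stemming have been applied upstream as the paper describes:

```python
import math
from collections import Counter

class NaiveBayesText:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels))
                      for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc.split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict(self, doc):
        def loglik(c):
            # smoothed per-word log-probabilities; unseen words are skipped
            total = sum(self.counts[c].values()) + len(self.vocab)
            return self.prior[c] + sum(
                math.log((self.counts[c][w] + 1) / total)
                for w in doc.split() if w in self.vocab)
        return max(self.classes, key=loglik)
```

Its robustness with small feature sets comes from the strong independence assumption: each word contributes evidence on its own, so few parameters need estimating.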
Funding: The National Natural Science Foundation of China (No. 60082003)
Abstract: Web pages have richer content than pure text, including hyperlinks, HTML tags, metadata, etc., so web page categorization differs from pure-text categorization. For Internet Chinese news pages, a practical algorithm is proposed for extracting subject concepts from a web page without a thesaurus; after incorporating these category-subject concepts into a knowledge base, web pages are classified by a hybrid algorithm, with an experimental corpus extracted from Xinhua Net. Experimental results show that categorization performance is improved by using web page features.
Abstract: Water on the Earth’s surface is an essential part of the hydrological cycle. Water resources include surface waters, groundwater, lakes, inland waters, rivers, coastal waters, and aquifers. Monitoring lake dynamics is critical for the sustainable management of water resources on Earth, and in the cryosphere, lake ice cover is a robust indicator of local climate variability and change. It is therefore necessary to review recent methods, technologies, and satellite sensors employed for the extraction of lakes from satellite imagery. The present review focuses on a comprehensive evaluation of existing methods for extracting lake and water-body features from remotely sensed optical data. We summarize the pixel-based, object-based, hybrid, spectral-index-based, and target- and spectral-matching methods employed in extracting lake features in urban and cryospheric environments. To our knowledge, almost all published research on the extraction of surface lakes in cryospheric environments has essentially used satellite remote sensing data and geospatial methods. Satellite sensors of varying spatial, temporal and spectral resolutions have been used to extract and analyze information about surface water. Multispectral remote sensing has been widely utilized in cryospheric studies, employing a variety of electro-optical satellite sensor systems for the characterization and extraction of cryospheric features such as glaciers, sea ice, lakes and rivers, the extent of snow and ice, and icebergs. The most common methods for extracting water bodies use single-band threshold methods, spectral index ratio (SIR)-based multiband methods, image segmentation methods, spectral-matching methods, and target detection methods (unsupervised, supervised and hybrid). A synergetic fusion of various remote sensing methods is also proposed to improve the accuracy of water-information extraction.
The methods developed so far are not generic; rather, they are specific to a location, a satellite sensor, or the type of feature to be extracted. Many factors lead to inaccurate lake-feature extraction in cryospheric regions; for example, mountain shadow, which also appears as dark pixels, is often misclassified as open lake. Methods that work well for feature extraction or land-cover classification in the cryospheric environment are not guaranteed to work in the same manner in the urban environment. Thus, in the coming years, it is expected that much of the work will follow an object-based approach or a hybrid approach involving both pixel- and object-based techniques. A more accurate, versatile and robust method needs to be developed that would work independently of geographical location (both urban and cryospheric) and of the type of optical sensor.
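The single-index family of methods surveyed above can be illustrated with NDWI, the normalized difference water index computed from the green and near-infrared bands. The zero threshold is a common but not universal choice, and reflectances are assumed to be positive:

```python
def ndwi_mask(green, nir, threshold=0.0):
    """Spectral-index water extraction: NDWI = (G - NIR) / (G + NIR),
    thresholded to a binary lake mask. `green` and `nir` are 2-D lists
    of (positive) band reflectances."""
    mask = []
    for g_row, n_row in zip(green, nir):
        mask.append([1 if (g - n) / (g + n) > threshold else 0
                     for g, n in zip(g_row, n_row)])
    return mask
```

Water reflects strongly in the green band and absorbs in the near-infrared, so open water scores high NDWI; the mountain-shadow failure mode noted above arises because shadowed terrain can mimic that spectral contrast.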
Funding: Supported by the Korea Institute of Science and Technology Information (KISTI) through the Construction on Science & Technology Content Curation Program (K-20-L01-C01) and the National Research Foundation of Korea (NRF) under a grant funded by the Korean Government (MSIT) (No. NRF-2018R1C1B5031408).
Abstract: The volume of academic literature, such as conference papers and journals, has increased rapidly worldwide, and research on metadata extraction is ongoing. However, high-performing metadata extraction is still challenging due to the diverse layout formats of different journal publishers. To accommodate this diversity, we propose a novel LAyout-aware Metadata Extraction (LAME) framework with three characteristics: the design of an automatic layout analysis, the construction of a large metadata training set, and the implementation of a metadata extractor. In the framework, we designed an automatic layout analysis using PDFMiner. Based on the layout analysis, a large volume of metadata-separated training data, including the title, abstract, author names, author-affiliated organizations, and keywords, was automatically extracted. Moreover, we constructed a pre-trained model, Layout-MetaBERT, to extract metadata from academic journals with varying layout formats. Experimental results with our metadata extractor show robust performance (Macro-F1, 93.27%) in metadata extraction for unseen journals with different layout formats.
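To make the layout-analysis idea concrete, the toy rule below maps a text block (with a normalized y-position from the page top and a font size) to a metadata field. The thresholds and field names are invented for illustration; the point of LAME is precisely that it learns this mapping from a large training set instead of hard-coding rules like these:

```python
def classify_block(block):
    """Toy layout rule: assign a metadata field to a text block.
    `block` is a dict with 'font_size', 'y' (0 = page top, 1 = bottom),
    and 'text'. All thresholds are illustrative assumptions."""
    if block["font_size"] >= 16 and block["y"] < 0.2:
        return "title"          # large type near the top of page 1
    if block["y"] < 0.35:
        return "authors"        # smaller type still in the header area
    if block["text"].lower().startswith("abstract"):
        return "abstract"
    if block["text"].lower().startswith(("keywords", "index terms")):
        return "keywords"
    return "body"
```

A learned model replaces these brittle thresholds with features extracted from many publisher layouts, which is what lets it generalize to unseen journals.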
Abstract: Fingerprints are an excellent means of identifying individuals, and fingerprint recognition is one of the oldest forms of biometric identification. However, obtaining a good fingerprint image is not easy, so the image must be processed before matching. A crucial step in fingerprint analysis is to reliably extract minutiae from the fingerprint images, yet fingerprint images are rarely of perfect quality: they may be degraded and corrupted by variations in skin and impression conditions. Image enhancement techniques are therefore applied before minutiae extraction to obtain more reliable estimates of minutiae locations. The primary objective of this research work is to produce a superior, enhanced fingerprint image. We study the factors involved in building a high-performance feature-point detection algorithm, such as image quality, segmentation, image enhancement, and feature detection. Commonly used measures for assessing fingerprint image quality are Fourier spectrum energy, Sobel filter energy, and local orientation. Accurate segmentation of fingerprint ridges from a broad background is vital, and for efficient enhancement and feature extraction we suppress the noise in the segmented regions. Pre-processing consists of orientation field estimation, ridge frequency estimation, Sobel filtering, and segmentation; the resulting image is then passed to a thinning algorithm and subsequent minutiae extraction. The extracted minutiae points can then be used for matching, for identifying offenders, and for other security applications. The procedure of image pre-processing and minutiae extraction is explored, and simulations are performed in the MATLAB environment to assess the performance of the implemented algorithm.
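The minutiae-extraction step that follows thinning is commonly implemented with the crossing-number rule: on a one-pixel-wide skeleton, a ridge ending has crossing number 1 and a bifurcation has crossing number 3. The abstract's simulations are in MATLAB; this Python sketch of the standard rule is for illustration only:

```python
def minutiae(skel):
    """Crossing-number minutiae detection on a thinned binary skeleton
    (2-D list of 0/1). Returns (ridge endings, bifurcations), where a
    pixel with crossing number 1 is an ending and 3 is a bifurcation."""
    endings, bifurcations = [], []
    for y in range(1, len(skel) - 1):
        for x in range(1, len(skel[0]) - 1):
            if not skel[y][x]:
                continue
            # 8 neighbours in clockwise order starting from north
            n = [skel[y-1][x], skel[y-1][x+1], skel[y][x+1], skel[y+1][x+1],
                 skel[y+1][x], skel[y+1][x-1], skel[y][x-1], skel[y-1][x-1]]
            # half the number of 0/1 transitions around the cycle
            cn = sum(abs(n[i] - n[(i + 1) % 8]) for i in range(8)) // 2
            if cn == 1:
                endings.append((y, x))
            elif cn == 3:
                bifurcations.append((y, x))
    return endings, bifurcations
```

The coordinates (plus the local ridge orientation) form the minutiae template that the matching stage compares between prints.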