Abstract: Fine-grained recognition of ships in remote sensing images is crucial to safeguarding maritime rights and interests and maintaining national security. With the emergence of massive high-resolution multi-modality images, using multi-modality images for fine-grained recognition has become a promising technology. Fine-grained recognition of multi-modality images imposes higher requirements on the dataset samples. The key problem is how to extract and fuse the complementary features of multi-modality images to obtain more discriminative fused features. The attention mechanism helps the model pinpoint the key information in the image, resulting in a significant improvement in the model’s performance. In this paper, a dataset for fine-grained recognition of ships based on visible and near-infrared multi-modality remote sensing images is first proposed, named Dataset for Multimodal Fine-grained Recognition of Ships (DMFGRS). It includes 1,635 pairs of visible and near-infrared remote sensing images divided into 20 categories, collated from digital orthophoto models provided by commercial remote sensing satellites. DMFGRS provides two types of annotation format files, as well as segmentation mask images corresponding to the ship targets. Then, a Multimodal Information Cross-Enhancement Network (MICE-Net), which fuses features of visible and near-infrared remote sensing images, is proposed. In the network, a dual-branch feature extraction and fusion module is designed to obtain more expressive features. The Feature Cross Enhancement Module (FCEM) achieves fusion enhancement of the two modal features by making channel attention and spatial attention work cross-functionally on the feature map. A benchmark is established by evaluating state-of-the-art object recognition algorithms on DMFGRS. In experiments on DMFGRS, MICE-Net reached a precision, recall, mAP0.5, and mAP0.5:0.95 of 87%, 77.1%, 83.8%, and 63.9%, respectively. Extensive experiments demonstrate that the proposed MICE-Net achieves superior performance on DMFGRS. Built on the lightweight YOLO network, the model has excellent generalizability and thus has good potential for application in real-life scenarios.
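The cross-functional use of channel and spatial attention described in this abstract can be illustrated with a minimal sketch. It is not the paper's FCEM: the abstract does not give the module's layers, so the function names (`channel_attention`, `spatial_attention`, `cross_enhance`, `fuse`), the parameter-free pooling-plus-sigmoid attention, and the additive fusion are all assumptions made for illustration. Feature maps are plain nested lists shaped [channels][height][width].

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """One weight per channel: sigmoid of that channel's global average (assumed form)."""
    return [
        sigmoid(sum(sum(row) for row in ch) / (len(ch) * len(ch[0])))
        for ch in feat
    ]


def spatial_attention(feat):
    """One weight per pixel: sigmoid of the mean over channels (assumed form)."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [
        [sigmoid(sum(feat[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]


def cross_enhance(target, source):
    """Reweight `target` with attention maps computed from the OTHER modality."""
    ca = channel_attention(source)
    sa = spatial_attention(source)
    return [
        [
            [target[k][i][j] * ca[k] * sa[i][j] for j in range(len(target[0][0]))]
            for i in range(len(target[0]))
        ]
        for k in range(len(target))
    ]


def fuse(visible, nir):
    """Cross-enhance both branches, then fuse by element-wise addition (assumption)."""
    vis_enh = cross_enhance(visible, nir)
    nir_enh = cross_enhance(nir, visible)
    return [
        [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
        for c1, c2 in zip(vis_enh, nir_enh)
    ]
```

A learned module would normally insert trainable layers before each sigmoid; they are omitted here so that the cross-modal wiring stays visible.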
Funding: Supported by the research team of Xi’an Traffic Engineering Institute and the Young and Middle-aged Fund Project of Xi’an Traffic Engineering Institute (2022KY-02).
Abstract: Mining more discriminative temporal features to enrich temporal context representation is considered the key to fine-grained action recognition. Previous action recognition methods utilize a fixed spatiotemporal window to learn local video representations. However, these methods fail to capture complex motion patterns due to their limited receptive field. To solve these problems, this paper proposes a lightweight Temporal Pyramid Excitation (TPE) module to capture short-, medium-, and long-term temporal context. In this method, the Temporal Pyramid (TP) module effectively expands the temporal receptive field of the network by using multi-temporal kernel decomposition without significantly increasing the computational cost. In addition, the Multi Excitation module emphasizes temporal importance to enhance temporal feature representation learning. TPE can be integrated into ResNet50 to build a compact video learning framework, TPENet. Extensive validation experiments on several challenging benchmark datasets (Something-Something V1, Something-Something V2, UCF-101, and HMDB51) demonstrate that our method achieves a preferable balance between computation and accuracy.
Funding: The support of this research by the Hubei Provincial Natural Science Foundation (2022CFB449) and the Science Research Foundation of the Education Department of Hubei Province (B2020061) is gratefully acknowledged.
Abstract: The task of food image recognition, a nuanced subset of fine-grained image recognition, grapples with substantial intra-class variation and minimal inter-class differences. These challenges are compounded by the irregular and multi-scale nature of food images. Addressing these complexities, our study introduces an advanced model that leverages multiple attention mechanisms and multi-stage local fusion, grounded in the ConvNeXt architecture. Our model employs hybrid attention (HA) mechanisms to pinpoint critical discriminative regions within images, substantially mitigating the influence of background noise. Furthermore, it introduces a multi-stage local fusion (MSLF) module, fostering long-distance dependencies between feature maps at varying stages. This approach facilitates the assimilation of complementary features across scales, significantly bolstering the model’s capacity for feature extraction. In addition, we constructed a dataset named Roushi60, which consists of 60 different categories of common meat dishes. Empirical evaluation on the ETH Food-101, ChineseFoodNet, and Roushi60 datasets reveals that our model achieves recognition accuracies of 91.12%, 82.86%, and 92.50%, respectively. These figures not only mark an improvement of 1.04%, 3.42%, and 1.36% over the foundational ConvNeXt network but also surpass the performance of most contemporary food image recognition methods. Such advancements underscore the efficacy of our proposed model in navigating the intricate landscape of food image recognition, setting a new benchmark for the field.
Funding: This work is supported by the National Natural Science Foundation of China (61806013, 61876010, 62176009, and 61906005); the General Project of the Science and Technology Plan of Beijing Municipal Education Commission (KM202110005028); the Beijing Municipal Education Commission Project (KZ201910005008); the Project of the Interdisciplinary Research Institute of Beijing University of Technology (2021020101); and the International Research Cooperation Seed Fund of Beijing University of Technology (2021A01).
Abstract: The fine-grained ship image recognition task aims to identify various classes of ships. However, small inter-class differences, large intra-class differences, and a lack of training samples make the task difficult. Therefore, to enhance the accuracy of fine-grained ship image recognition, we design a fine-grained ship image recognition network based on a bilinear convolutional neural network (BCNN) with Inception and additive margin Softmax (AM-Softmax). This network improves the BCNN in two aspects. Firstly, introducing Inception branches into the BCNN enhances its ability to extract comprehensive features from ships. Secondly, by adding margin values to the decision boundary, the AM-Softmax function can better enlarge the inter-class differences and reduce the intra-class differences. In addition, as there are few publicly available datasets for fine-grained ship image recognition, we construct a Ship-43 dataset containing 47,300 ship images belonging to 43 categories. Experimental results on the constructed Ship-43 dataset demonstrate that our method can effectively improve the accuracy of ship image recognition, which is 4.08% higher than that of the BCNN model. Moreover, comparison results on three other public fine-grained datasets (Cub, Cars, and Aircraft) further validate the effectiveness of the proposed method.
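The additive-margin idea mentioned in this abstract can be sketched for a single sample as follows. This uses the commonly cited AM-Softmax formulation, in which a margin `m` is subtracted from the target-class cosine before scaling by `s`. The function name and the default values of `s` and `m` are illustrative assumptions, not settings reported in the abstract.

```python
import math


def am_softmax_loss(cosines, target, s=30.0, m=0.35):
    """Additive-margin softmax loss for one sample.

    cosines: cosine similarity between the feature and each class weight.
    target:  index of the ground-truth class.
    s, m:    scale and margin (illustrative defaults, not from the abstract).
    """
    # Subtract the margin only from the target-class cosine, then scale.
    logits = [
        s * (c - m) if j == target else s * c
        for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]
```

With `m=0` this reduces to ordinary scaled softmax cross-entropy. A positive margin makes the loss larger for the same cosines, so the target cosine has to exceed the others by at least the margin before the loss becomes small.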
Funding: Supported by the National Natural Science Foundation of China (No. 62272231); the Natural Science Foundation of Jiangsu Province of China (No. BK 20210340); the National Key R&D Program of China (No. 2021YFA1001100); the Fundamental Research Funds for the Central Universities, China (No. NJ2022028); and the CAAI-Huawei MindSpore Open Fund, China.
Abstract: Localizing discriminative object parts (e.g., bird head) is crucial for fine-grained classification tasks, especially for the more challenging fine-grained few-shot scenario. Previous work always relies on the learned object parts in a unified manner, attending to the same object parts (even with common attention weights) for different few-shot episodic tasks. In this paper, we propose to adaptively capture the task-specific object parts that require attention for each few-shot task, since the parts that can distinguish different tasks are naturally different. Specifically, for a few-shot task, after obtaining part-level deep features, we learn a task-specific part-based dictionary for both aligning and reweighting part features in an episode. Then, part-level categorical prototypes are generated based on the part features of the support data and are later used to classify query data by calculating distances. To retain the discriminative ability of the part-level representations (i.e., part features and part prototypes), we design an optimal transport solution that also utilizes query data in a transductive way to optimize the aforementioned distance calculation for the final predictions. Extensive experiments on five fine-grained benchmarks show the superiority of our method, especially for the 1-shot setting, gaining 0.12%, 8.56%, and 5.87% improvements over state-of-the-art methods on CUB, Stanford Dogs, and Stanford Cars, respectively.
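The prototype-and-distance step described in this abstract can be sketched in its simplest generic form: a class prototype is the mean of the support features, and a query is assigned to the nearest prototype. The part-based dictionary, the reweighting, and the optimal transport refinement are deliberately left out. The function names and the squared Euclidean distance are assumptions for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(support):
    """support: dict mapping class label -> list of support feature vectors."""
    return {label: mean_vector(feats) for label, feats in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query feature."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))
```

In a part-level setting, the same computation would be repeated for each part and the per-part distances combined, which is where a learned weighting could enter.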
Abstract: In this paper, we propose a locally enhanced PCANet neural network for fine-grained classification of vehicles. The proposed method adopts the unsupervised PCANet network, which has fewer layers and simpler parameters than the majority of state-of-the-art machine learning methods. It simplifies calculation steps and manual labeling, and enables vehicle types to be recognized without time-consuming training. Experimental results show that, compared with traditional pattern recognition methods and multi-layer CNN methods, the proposed method achieves an optimal balance in terms of varying scales of sample libraries, angle deviations, and training speed. They also indicate that introducing appropriate local features whose scales differ from that of the general feature is very instrumental in improving the recognition rate. The 7-angle in 180° (12-angle in 360°) classification modeling scheme is shown to be an effective approach that can solve the problem of reduced recognition rate caused by angle deviations and improve recognition accuracy in practice.
Funding: Supported by the National Key Program of China (Grant No. 2018YFC1900800-5), the National Natural Science Foundation of China (Grant Nos. 61890930-5 and 61622301), and the Beijing University Outstanding Young Scientist Program (Grant No. BJJWZYJH0120191000-5020).
Abstract: Model recognition of second-hand mobile phones is considered an essential process for improving the efficiency of phone recycling. However, due to the diversity of mobile phone appearances, accurate recognition is difficult to achieve. To solve this problem, a mobile phone recognition method based on a bilinear convolutional neural network (B-CNN) is proposed in this paper. First, a feature extraction model based on B-CNN is designed to adaptively extract local features from the images of second-hand mobile phones. Second, a joint loss function, constructed from center distance and softmax, is developed to reduce the interclass feature distance during the training process. Third, a parameter downscaling method, derived from the kernel discriminant analysis algorithm, is introduced to eliminate redundant features in B-CNN. Finally, the experimental results demonstrate that the B-CNN method can achieve higher accuracy than some existing methods.
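The bilinear feature extraction at the core of a B-CNN can be sketched as below. It follows the standard bilinear pooling recipe: an outer product of two feature streams at each location, averaged over locations, then a signed square root and L2 normalization. The function name and the list-based input layout are assumptions. The joint loss and the parameter downscaling described in the abstract are not covered.

```python
import math


def bilinear_pool(feats_a, feats_b):
    """Bilinear pooling of two feature streams.

    feats_a, feats_b: lists with one feature vector per spatial location
    (same number of locations; channel counts may differ).
    Returns a flattened, normalized vector of length Ca * Cb.
    """
    n = len(feats_a)
    ca, cb = len(feats_a[0]), len(feats_b[0])
    pooled = [0.0] * (ca * cb)
    # Average the outer product of the two streams over all locations.
    for fa, fb in zip(feats_a, feats_b):
        for i, x in enumerate(fa):
            for j, y in enumerate(fb):
                pooled[i * cb + j] += x * y / n
    # Signed square root, then L2 normalization.
    signed = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    norm = math.sqrt(sum(v * v for v in signed)) or 1.0
    return [v / norm for v in signed]
```

The output length grows as the product of the two channel counts, which is one reason a dimensionality reduction step is often paired with bilinear features.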
Abstract: In this study, an in-depth analysis of the types, characteristics, and formation mechanisms of microlaminae and microscopic laminae was conducted in order to precisely examine the link or intersection of stratigraphy and petrology. This study was essentially a sedimentary examination of the minute layers, from macro to micro scale, between laminae and pore structure, as well as the types of structures and sedimentation. The results have fundamental disciplinary significance, as well as practical value for the basic theories and exploration applications of unconventional oil and gas geology. The quantitative data were obtained using the following: field macroscopic observations; measurements; intensive sampling processes; XRD mineral content analysis; scanning electron microscopy; high-power polarizing microscope observations; and micro-scale measurements. The quantitative parameters, such as laminae thicknesses, laminae properties, organic matter laminae, and laminae spatial distributions, were unified within a framework, and the correlations among them were established for the purpose of forming a fine-grained deposition micro-laminae evaluation system. The results established a basis for the classification of micro-laminae, and divided the micro-laminae into four categories and 20 sub-categories according to the development thicknesses, material compositions, organic matter content levels, and spatial distributions of the micro-laminae. The classification scheme of the micro-laminae was divided into two categories and 12 sub-categories. Then, in accordance with the comprehensive characteristics of spatial morphology, the micro-laminae were further divided into the following categories: continuous horizontal laminae; near horizontal laminae; slow wavy laminae; wavy laminae; discontinuous laminae; and lenticular laminae. According to the structural properties of the laminae development, the micro-laminae were divided into the following categories: single laminae structures; laminated laminae structures; interlaminar structures; multiple mixed laminae structures; cyclic laminae structures; and progressive laminae structures. The research results are considered applicable to the scientific evaluation of reservoir spaces related to unconventional oil and gas resources.
Funding: Supported in part by the National Natural Science Foundation of China (Nos. 62132002, 61825101 and 62202010), the Key-Area Research and Development Program of Guangdong Province, China (No. 2021B0101400002), and the China Postdoctoral Science Foundation (No. 2022M710212).
Abstract: Fine-grained visual parsing, including fine-grained part segmentation and fine-grained object recognition, has attracted considerable critical attention due to its importance in many real-world applications, e.g., agriculture, remote sensing, and space technologies. Predominant research efforts tackle these fine-grained sub-tasks following different paradigms, while the inherent relations between these tasks are neglected. Moreover, given that most of the research remains fragmented, we conduct an in-depth study of the advanced work from a new perspective of learning the part relationship. From this perspective, we first consolidate recent research and benchmark syntheses with new taxonomies. Based on this consolidation, we revisit the universal challenges in fine-grained part segmentation and recognition tasks and propose new solutions based on part relationship learning for these important challenges. Furthermore, we conclude with several promising lines of future research in fine-grained visual parsing.
Funding: Supported by the National Key R&D Program of China (2019YFC1521102), the National Natural Science Foundation of China (61932003), and the Beijing Science and Technology Plan (Z221100007722004).
Abstract: Hierarchical multi-granularity image classification is a challenging task that aims to tag each given image with multiple granularity labels simultaneously. Existing methods tend to overlook that different image regions contribute differently to label prediction at different granularities, and also insufficiently consider relationships between the hierarchical multi-granularity labels. We introduce a sequence-to-sequence mechanism to overcome these two problems and propose a multi-granularity sequence generation (MGSG) approach for the hierarchical multi-granularity image classification task. Specifically, we introduce a transformer architecture to encode the image into visual representation sequences. Next, we traverse the taxonomic tree, organize the multi-granularity labels into sequences, vectorize them, and add positional information. The proposed multi-granularity sequence generation method builds a decoder that takes visual representation sequences and semantic label embeddings as inputs, and outputs the predicted multi-granularity label sequence. The decoder models dependencies and correlations between multi-granularity labels through a masked multi-head self-attention mechanism, and relates visual information to the semantic label information through a cross-modality attention mechanism. In this way, the proposed method preserves the relationships between labels at different granularity levels and takes into account the influence of different image regions on labels with different granularities. Evaluations on six public benchmarks qualitatively and quantitatively demonstrate the advantages of the proposed method. Our project is available at https://github.com/liuxindazz/mgs.
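The label-preparation step in this abstract (traverse the taxonomic tree, organize labels into sequences, vectorize them, attach positions) can be sketched as follows. The nested-dict tree layout, the function names, and the integer vocabulary with explicit position indices are assumptions for illustration; the project repository may encode labels differently.

```python
def label_sequences(tree, prefix=()):
    """Collect coarse-to-fine label paths from a nested-dict taxonomy.

    tree: dict mapping label -> subtree dict; a leaf has an empty dict.
    """
    sequences = []
    for label, children in tree.items():
        path = prefix + (label,)
        if children:
            sequences.extend(label_sequences(children, path))
        else:
            sequences.append(list(path))
    return sequences


def vectorize(sequences):
    """Map labels to integer ids and pair each id with its position in the sequence."""
    vocab = {}
    for seq in sequences:
        for label in seq:
            vocab.setdefault(label, len(vocab))
    encoded = [[(vocab[label], pos) for pos, label in enumerate(seq)] for seq in sequences]
    return vocab, encoded
```

For a hypothetical taxonomy such as `{"bird": {"gull": {"herring gull": {}}}}`, each sequence runs from the coarsest label to the finest. That ordering is what would let a decoder condition finer labels on coarser ones.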