Journal Articles
489 articles found
1. A Video Captioning Method by Semantic Topic-Guided Generation
Authors: Ou Ye, Xinli Wei, Zhenhua Yu, Yan Fu, Ying Yang. Computers, Materials & Continua (SCIE, EI), 2024, No. 1, pp. 1071-1093 (23 pages).
In video captioning methods based on an encoder-decoder, limited visual features are extracted by an encoder and a natural sentence describing the video content is generated by a decoder. However, this kind of method depends on a single video input source and few visual labels, and there is a problem of semantic alignment between video contents and the generated natural sentences, which makes it unsuitable for accurately comprehending and describing video contents. To address this issue, this paper proposes a video captioning method with semantic topic-guided generation. First, a 3D convolutional neural network is utilized to extract the spatiotemporal features of videos during encoding. Then, the semantic topics of video data are extracted using the visual labels retrieved from similar video data. In decoding, a decoder is constructed by combining a novel Enhance-TopK sampling algorithm with a Generative Pre-trained Transformer-2 deep neural network, which decreases the influence of "deviation" in the semantic mapping process between videos and texts by jointly decoding a baseline and the semantic topics of video contents. During this process, the designed Enhance-TopK sampling algorithm alleviates the long-tail problem by dynamically adjusting the probability distribution of the predicted words. Finally, experiments are conducted on two public datasets, Microsoft Research Video Description and Microsoft Research-Video to Text. The experimental results demonstrate that the proposed method outperforms several state-of-the-art approaches. Specifically, the performance indicators Bilingual Evaluation Understudy, Metric for Evaluation of Translation with Explicit Ordering, Recall-Oriented Understudy for Gisting Evaluation-longest common subsequence, and Consensus-based Image Description Evaluation are improved by 1.2%, 0.1%, 0.3%, and 2.4% on the Microsoft Research Video Description dataset, and by 0.1%, 1.0%, 0.1%, and 2.8% on the Microsoft Research-Video to Text dataset, respectively, compared with existing video captioning methods. As a result, the proposed method can generate video captions that are more closely aligned with human natural language expression habits.
Keywords: video captioning, encoder-decoder, semantic topic, joint decoding, Enhance-TopK sampling
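The abstract does not give the Enhance-TopK update rule, so the sketch below shows only conventional top-k sampling over a GPT-2-style next-token distribution, with a hypothetical `reweight` hook standing in for the dynamic probability adjustment described above; the function and variable names are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def topk_sample(logits, k=10, reweight=None):
    """Sample the next token from the top-k of a next-token distribution.

    `reweight` is a hypothetical hook standing in for the paper's dynamic
    adjustment of the predicted-word distribution (Enhance-TopK); by default
    this is plain top-k sampling.
    """
    probs = F.softmax(logits, dim=-1)      # (vocab_size,)
    top_p, top_idx = probs.topk(k)         # keep the k most likely words
    if reweight is not None:
        top_p = reweight(top_p)            # e.g., boost tail words to ease the long-tail problem
    top_p = top_p / top_p.sum()            # renormalise over the kept words
    choice = torch.multinomial(top_p, 1)
    return top_idx[choice]

# toy usage with a random "vocabulary" of 100 words
next_token = topk_sample(torch.randn(100), k=5)
```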
2. Trends in Event Understanding and Caption Generation/Reconstruction in Dense Video: A Review
Authors: Ekanayake Mudiyanselage Chulabhaya Lankanatha Ekanayake, Abubakar Sulaiman Gezawa, Yunqi Lei. Computers, Materials & Continua (SCIE, EI), 2024, No. 3, pp. 2941-2965 (25 pages).
Video description generates natural language sentences that describe the subject, verb, and objects of the targeted video. Video description has been used to help visually impaired people understand content, and it also plays an essential role in developing human-robot interaction. Dense video description is more difficult than simple video captioning because of object interactions and event overlapping. Deep learning is changing the shape of computer vision (CV) and natural language processing (NLP) technologies, and there are hundreds of deep learning models, datasets, and evaluations that can close the gaps in current research. This article fills this gap by evaluating state-of-the-art approaches, focusing especially on deep learning and machine learning for video captioning in dense environments. Classic techniques from existing machine learning are reviewed, and deep learning models and benchmark datasets are detailed with their respective domains. The paper reviews various evaluation metrics, including Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit Ordering (METEOR), Word Mover's Distance (WMD), and Recall-Oriented Understudy for Gisting Evaluation (ROUGE), with their pros and cons. Finally, the article lists future directions and proposes work on context enhancement using key scene extraction with object detection in a particular frame, in particular how to improve the context of video descriptions by analyzing key-frame detection through morphological image analysis. Additionally, the paper discusses a novel approach involving sentence reconstruction and context improvement through key-frame object detection, which incorporates the fusion of large language models for refining results. The ultimate results arise from enhancing the generated text of the proposed model by improving the predicted text and isolating objects using various keyframes. These keyframes identify dense events occurring in the video sequence.
Keywords: video description, video to text, video caption, sentence reconstruction
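As a concrete reference for the caption metrics surveyed above, the snippet below computes a smoothed sentence-level BLEU score with NLTK; the reference and candidate captions are made-up examples, not from any of the reviewed datasets.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a", "man", "is", "riding", "a", "horse", "on", "a", "beach"]
candidate = ["a", "man", "rides", "a", "horse", "on", "the", "beach"]

# BLEU-4 with smoothing, as is common for short image/video captions
bleu4 = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```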
3. VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning (Cited by: 1)
Authors: WEI Tingting, YUAN Weilin, LUO Junren, ZHANG Wanpeng, LU Lina. Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 2023, No. 1, pp. 9-18 (10 pages).
In the field of satellite imagery, remote sensing image captioning (RSIC) is a hot topic that faces the challenges of overfitting and difficult image-text alignment. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC to jointly represent vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English contents. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and abundant bilingual descriptions for remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a Chinese pre-trained language model. Experiments are carried out with various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm is more descriptive and informative than existing algorithms in producing captions.
Keywords: remote sensing image captioning (RSIC), vision-language representation, remote sensing image caption dataset, attention mechanism
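The VLCA architecture is not spelled out in the abstract; the sketch below only illustrates the general idea of cross-modal attention, letting caption-side word embeddings attend over image region features with PyTorch's nn.MultiheadAttention. The dimensions, class name, and residual layout are assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text tokens (queries) attend over image region features (keys/values)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, region_feats):
        # text_tokens:  (batch, n_words, dim)   caption-side embeddings
        # region_feats: (batch, n_regions, dim) visual features of image regions
        attended, _ = self.attn(query=text_tokens, key=region_feats, value=region_feats)
        return self.norm(text_tokens + attended)   # residual + norm, Transformer-style

out = CrossModalAttention()(torch.randn(2, 12, 512), torch.randn(2, 36, 512))
```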
4. Natural Language Processing with Optimal Deep Learning-Enabled Intelligent Image Captioning System
Authors: Radwa Marzouk, Eatedal Alabdulkreem, Mohamed K. Nour, Mesfer Al Duhayyim, Mahmoud Othman, Abu Sarwar Zamani, Ishfaq Yaseen, Abdelwahed Motwakel. Computers, Materials & Continua (SCIE, EI), 2023, No. 2, pp. 4435-4451 (17 pages).
The recent developments in Multimedia Internet of Things (MIoT) devices, empowered with Natural Language Processing (NLP) models, seem to be a promising future for smart devices. NLP plays an important role in industrial models such as speech understanding, emotion detection, home automation, and so on. If an image needs to be captioned, then the objects in that image, their actions and connections, and any salient feature that remains under-projected or missing from the image should be identified. The aim of the image captioning process is to generate a caption for the image; the image should then be provided with one of the most significant and detailed descriptions that is syntactically as well as semantically correct. In this scenario, a computer vision model is used to identify the objects and NLP approaches are followed to describe the image. The current study develops a Natural Language Processing with Optimal Deep Learning Enabled Intelligent Image Captioning System (NLPODL-IICS). The aim of the presented NLPODL-IICS model is to produce a proper description for the input image. To attain this, the proposed NLPODL-IICS follows two stages, encoding and decoding. Initially, at the encoding side, the proposed NLPODL-IICS model makes use of Hunger Games Search (HGS) with a Neural Search Architecture Network (NASNet) model, which represents the input data appropriately by inserting it into a predefined-length vector. During the decoding phase, a Chimp Optimization Algorithm (COA) with a deeper Long Short-Term Memory (LSTM) approach is followed to concatenate the description sentences produced by the method. The application of the HGS and COA algorithms helps accomplish proper parameter tuning for the NASNet and LSTM models, respectively. The proposed NLPODL-IICS model was experimentally validated with the help of two benchmark datasets. A widespread comparative analysis confirmed the superior performance of the NLPODL-IICS model over other models.
Keywords: natural language processing, information retrieval, image captioning, deep learning, metaheuristics
5. Fine-Grained Features for Image Captioning
Authors: Mengyue Shao, Jie Feng, Jie Wu, Haixiang Zhang, Yayu Zheng. Computers, Materials & Continua (SCIE, EI), 2023, No. 6, pp. 4697-4712 (16 pages).
Image captioning involves two different major modalities (image and sentence) and converts a given image into language that adheres to visual semantics. Almost all methods first extract image features to reduce the difficulty of visual semantic embedding and then use a caption model to generate fluent sentences. A Convolutional Neural Network (CNN) is often used to extract image features in image captioning, and the use of object detection networks to extract region features has achieved great success. However, the region features retrieved by this method are object-level and do not attend to fine-grained details because of the detection model's limitations. We offer an approach that generates captions more properly by fusing fine-grained features and region features. First, we extract fine-grained features using a panoramic segmentation algorithm. Second, we suggest two fusion methods and contrast their fusion outcomes. An X-Linear Attention Network (X-LAN) serves as the foundation for both fusion methods. According to experimental findings on the COCO dataset, the two-branch fusion approach is superior. It is important to note that on the COCO Karpathy test split, CIDEr is increased up to 134.3% in comparison to the baseline, highlighting the potency and viability of our method.
Keywords: image captioning, region features, fine-grained features, fusion
6. Traffic Scene Captioning with Multi-Stage Feature Enhancement
Authors: Dehai Zhang, Yu Ma, Qing Liu, Haoxing Wang, Anquan Ren, Jiashu Liang. Computers, Materials & Continua (SCIE, EI), 2023, No. 9, pp. 2901-2920 (20 pages).
Traffic scene captioning technology automatically generates one or more sentences to describe the content of traffic scenes by analyzing the input traffic scene images, ensuring road safety while providing an important decision-making function for sustainable transportation. In order to provide a comprehensive and reasonable description of complex traffic scenes, a traffic scene semantic captioning model with multi-stage feature enhancement is proposed in this paper. In general, the model follows an encoder-decoder structure. First, multi-level granularity visual features are used for feature enhancement during the encoding process, which enables the model to learn more detailed content in the traffic scene image. Second, a scene knowledge graph is applied to the decoding process, and the semantic features provided by the scene knowledge graph are used to enhance the features learned by the decoder again, so that the model can learn the attributes of objects in the traffic scene and the relationships between objects to generate more reasonable captions. This paper reports extensive experiments on the challenging MS-COCO dataset, evaluated by five standard automatic evaluation metrics. The results show that the proposed model improves significantly on all metrics compared with state-of-the-art methods, especially achieving a score of 129.0 on the CIDEr-D evaluation metric, which indicates that the proposed model can effectively provide a more reasonable and comprehensive description of the traffic scene.
Keywords: traffic scene captioning, sustainable transportation, feature enhancement, encoder-decoder structure, multi-level granularity, scene knowledge graph
7. A Sentence Retrieval Generation Network Guided Video Captioning
Authors: Ou Ye, Mimi Wang, Zhenhua Yu, Yan Fu, Shun Yi, Jun Deng. Computers, Materials & Continua (SCIE, EI), 2023, No. 6, pp. 5675-5696 (22 pages).
Currently, video captioning models based on an encoder-decoder mainly rely on a single video input source. The contents of video captioning are limited, since few studies have employed external corpus information to guide the generation of video captions, which is not conducive to the accurate description and understanding of video content. To address this issue, a novel video captioning method guided by a sentence retrieval generation network (ED-SRG) is proposed in this paper. First, a ResNeXt network model, an efficient convolutional network for online video understanding (ECO) model, and a long short-term memory (LSTM) network model are integrated to construct an encoder-decoder, which is utilized to extract the 2D features, 3D features, and object features of video data, respectively. These features are decoded to generate textual sentences that conform to the video content for sentence retrieval. Then, a sentence-transformer network model is employed to retrieve sentences in an external corpus that are semantically similar to the above textual sentences, and candidate sentences are screened out through similarity measurement. Finally, a novel GPT-2 network model is constructed based on the GPT-2 network structure. The model introduces a designed random selector to randomly select predicted words with a high probability in the corpus, which is used to guide and generate textual sentences that are more in line with human natural language expressions. The proposed method is compared with several existing works by experiments. The results show that the indicators BLEU-4, CIDEr, ROUGE_L, and METEOR are improved by 3.1%, 1.3%, 0.3%, and 1.5% on the public dataset MSVD, and by 1.3%, 0.5%, 0.2%, and 1.9% on the public dataset MSR-VTT, respectively. It can be seen that the proposed method can generate video captions with richer semantics than several state-of-the-art approaches.
Keywords: video captioning, encoder-decoder, sentence retrieval, external corpus, RS, GPT-2 network model
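A minimal sketch of the retrieval step described above, assuming the sentence-transformers library and a generic pre-trained checkpoint; the corpus sentences and the similarity threshold are illustrative only and are not taken from the paper.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")     # assumed generic model, not the paper's

query = "a man is playing a guitar on stage"         # sentence decoded from the video
corpus = ["someone plays an acoustic guitar",
          "a dog runs across a field",
          "a musician performs a song for the crowd"]

q_emb = model.encode(query, convert_to_tensor=True)
c_emb = model.encode(corpus, convert_to_tensor=True)

scores = util.cos_sim(q_emb, c_emb)[0]               # cosine similarity to each corpus sentence
candidates = [s for s, sc in zip(corpus, scores) if sc > 0.4]  # screen candidates by similarity
```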
8. Enhanced Image Captioning Using Features Concatenation and Efficient Pre-Trained Word Embedding
Authors: Samar Elbedwehy, T. Medhat, Taher Hamza, Mohammed F. Alrahmawy. Computer Systems Science & Engineering (SCIE, EI), 2023, No. 9, pp. 3637-3652 (16 pages).
One of the issues in computer vision is the automatic development of descriptions for images, known as image captioning, and deep learning techniques have made significant progress in this area. The typical architecture of image captioning systems consists mainly of an image feature extractor subsystem followed by a caption generation lingual subsystem. This paper aims to find optimized models for these two subsystems. For the image feature extraction subsystem, the research tested eight different concatenations of pairs of vision models to find the most expressive extracted feature vector of the image. For the caption generation lingual subsystem, the paper tested three different pre-trained language embedding models: GloVe (Global Vectors for Word Representation), BERT (Bidirectional Encoder Representations from Transformers), and TaCL (Token-aware Contrastive Learning), to select the most accurate pre-trained language embedding model. Our experiments showed that an image captioning system that uses a concatenation of the two Transformer-based models SWIN (Shifted Window) and PVT (Pyramid Vision Transformer) as the image feature extractor, combined with the TaCL language embedding model, gives the best result among the tested combinations.
Keywords: image captioning, word embedding, concatenation, Transformer
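The sketch below shows the kind of feature concatenation described above, running an image through two vision backbones and joining the pooled feature vectors; the timm model names, feature sizes, and the use of randomly initialized weights are assumptions for illustration, not the paper's exact SWIN/PVT variants.

```python
import torch
import timm

# Two vision backbones standing in for the SWIN/PVT pair; the exact timm model
# names and feature sizes are assumptions, not taken from the paper.
swin = timm.create_model("swin_tiny_patch4_window7_224", pretrained=False, num_classes=0)
pvt  = timm.create_model("pvt_v2_b0", pretrained=False, num_classes=0)

image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    fused = torch.cat([swin(image), pvt(image)], dim=-1)  # concatenated image feature vector
print(fused.shape)  # this vector would feed the caption-generation lingual subsystem
```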
9. Red Deer Optimization with Artificial Intelligence Enabled Image Captioning System for Visually Impaired People
Authors: Anwer Mustafa Hilal, Fadwa Alrowais, Fahd N. Al-Wesabi, Radwa Marzouk. Computer Systems Science & Engineering (SCIE, EI), 2023, No. 8, pp. 1929-1945 (17 pages).
The problem of producing a natural language description of an image to describe its visual content has gained attention in natural language processing (NLP) and computer vision (CV). It is driven by applications such as image retrieval or indexing, virtual assistants, image understanding, and support of visually impaired people (VIP). Though VIP use other senses, such as touch and hearing, to recognize objects and events, their quality of life is lower than the standard level. Automatic image captioning generates captions that can be read aloud to VIP, helping them realize what is happening around them. This article introduces a Red Deer Optimization with Artificial Intelligence Enabled Image Captioning System (RDOAI-ICS) for visually impaired people. The presented RDOAI-ICS technique aids in generating image captions for VIP. The RDOAI-ICS technique utilizes a neural architecture search network (NASNet) model to produce image representations and uses a radial basis function neural network (RBFNN) method to generate textual descriptions. To enhance the performance of the RDOAI-ICS method, parameter optimization takes place using the RDO algorithm for NASNet and the butterfly optimization algorithm (BOA) for the RBFNN model, showing the novelty of the work. The experimental evaluation of the RDOAI-ICS method was carried out on a benchmark dataset. The outcomes show the enhancements of the RDOAI-ICS method over other recent image captioning approaches.
Keywords: machine learning, image captioning, visually impaired people, parameter tuning, artificial intelligence, metaheuristics
10. PCATNet: Position-Class Awareness Transformer for Image Captioning
Authors: Ziwei Tang, Yaohua Yi, Changhui Yu, Aiguo Yin. Computers, Materials & Continua (SCIE, EI), 2023, No. 6, pp. 6007-6022 (16 pages).
Existing image captioning models usually build the relation between visual information and words to generate captions, which lacks spatial information and object classes. To address this issue, we propose a novel Position-Class Awareness Transformer (PCAT) network which can serve as a bridge between visual features and captions by embedding spatial information and awareness of object classes. In our proposal, we construct the PCAT network by proposing a novel Grid Mapping Position Encoding (GMPE) method and refining the encoder-decoder framework. First, GMPE maps the regions of objects to grids, calculates the relative distance among objects, and quantizes it; meanwhile, we also improve the self-attention to adapt to GMPE. Then, we propose a Classes Semantic Quantization strategy to extract semantic information from the object classes, which is employed to facilitate embedding features and refining the encoder-decoder framework. To capture the interaction between multi-modal features, we propose Object Classes Awareness (OCA) to refine the encoder and decoder, namely OCAE and OCAD, respectively. Finally, we apply GMPE, OCAE, and OCAD in various combinations to complete the entire PCAT. We utilize the MSCOCO dataset to evaluate the performance of our method. The results demonstrate that PCAT outperforms other competitive methods.
Keywords: image captioning, relative position encoding, object classes awareness
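The abstract only names the steps of Grid Mapping Position Encoding (map object regions to grids, compute relative distances, quantize); the sketch below is one plausible reading of those steps, not the paper's definition, and every name and constant in it is an assumption.

```python
import torch

def grid_relative_positions(boxes, grid=8, num_bins=16):
    """Map object boxes onto a grid x grid layout and return quantized pairwise
    distance ids between objects (a rough, assumed reading of GMPE)."""
    # boxes: (n, 4) as (x1, y1, x2, y2), normalised to [0, 1]
    centers = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2], dim=-1)
    cells = (centers * grid).long().clamp(0, grid - 1)    # grid cell of each object
    diff = cells[:, None, :] - cells[None, :, :]           # pairwise cell offsets
    dist = diff.abs().sum(-1).float()                      # Manhattan distance on the grid
    max_dist = 2 * (grid - 1)
    return (dist / max_dist * (num_bins - 1)).long()       # quantized relative-distance ids

boxes = torch.tensor([[0.10, 0.10, 0.30, 0.40],
                      [0.55, 0.20, 0.90, 0.80]])
ids = grid_relative_positions(boxes)   # could index a learned relative-position embedding
```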
11. Oppositional Harris Hawks Optimization with Deep Learning-Based Image Captioning
Authors: V. R. Kavitha, K. Nimala, A. Beno, K. C. Ramya, Seifedine Kadry, Byeong-Gwon Kang, Yunyoung Nam. Computer Systems Science & Engineering (SCIE, EI), 2023, No. 1, pp. 579-593 (15 pages).
Image captioning is an emergent research topic in the domain of artificial intelligence (AI). It utilizes an integration of computer vision (CV) and natural language processing (NLP) to generate image descriptions. It finds use in several application areas, such as recommendation in editing applications and utilization in virtual assistants. The development of NLP and deep learning (DL) models helps build a bridge between visual details and textual semantics. In this view, this paper introduces an Oppositional Harris Hawks Optimization with Deep Learning based Image Captioning (OHHO-DLIC) technique. The OHHO-DLIC technique involves the design of distinct levels of pre-processing. Moreover, feature extraction of the images is carried out using the EfficientNet model. Furthermore, image captioning is performed by a bidirectional long short-term memory (BiLSTM) model comprising an encoder and a decoder. Finally, an oppositional Harris Hawks optimization (OHHO) based hyperparameter tuning process is performed to effectively adjust the hyperparameters of the EfficientNet and BiLSTM models. The experimental analysis of the OHHO-DLIC technique is carried out on the Flickr8k dataset, and a comprehensive comparative analysis highlighted its better performance over recent approaches.
Keywords: image captioning, natural language processing, artificial intelligence, machine learning, deep learning
12. MOOCDR-VSI: A Dynamic MOOC Resource Recommendation Model Incorporating Video Subtitle Information
Authors: 吴水秀, 罗贤增, 钟茂生, 吴如萍, 罗玮. 《计算机研究与发展》 (EI, CSCD, PKU Core), 2024, No. 2, pp. 470-480 (11 pages).
Learners facing a vast sea of online course resources often suffer from "information overload" and "information disorientation", so recommending MOOC resources that match a learner's knowledge preferences and learning needs, based on the learner's study records, is becoming increasingly important. Existing MOOC recommendation methods do not fully exploit the implicit information contained in MOOC videos, tend to create a "cocoon effect", and have difficulty capturing learners' dynamically changing learning needs and interests. To address these problems, a dynamic MOOC recommendation model incorporating video subtitle information, MOOCDR-VSI, is proposed. The model uses BERT as the encoder and incorporates a multi-head attention mechanism to deeply mine the semantic information of MOOC video subtitle text; a network based on the LSTM architecture dynamically captures the learner's knowledge preference state as it changes over the course of learning; an attention mechanism is introduced to mine the individual and common information among MOOC videos; finally, combining the learner's knowledge preference state, the Top-N MOOC videos with the highest recall probability are recommended. Experiments on MOOCCube, a dataset collected in real learning scenarios, analyze the performance of MOOCDR-VSI. The results show that the proposed model improves on the current best methods by 2.35%, 2.79%, 0.69%, 2.2%, and 3.32% on the HR@5, HR@10, NDCG@5, NDCG@10, and NDCG@20 evaluation metrics, respectively.
Keywords: MOOC recommendation, BERT, multi-head attention mechanism, subtitle information, long short-term memory
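A minimal sketch of the two encoding stages the abstract describes (BERT over video subtitle text, an LSTM over the learner's watch history), assuming the Hugging Face transformers library and the bert-base-chinese checkpoint; the toy subtitle strings and layer sizes are illustrative, and this is not the MOOCDR-VSI implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-chinese")

def encode_subtitle(text):
    """One embedding per MOOC video, taken from BERT's [CLS] token over its subtitles."""
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        return bert(**inputs).last_hidden_state[:, 0]           # (1, 768)

# LSTM over the sequence of watched videos to track the learner's changing preference
history = torch.cat([encode_subtitle(t) for t in ["微积分 导数 定义", "神经网络 反向传播"]]).unsqueeze(0)
preference, _ = nn.LSTM(768, 256, batch_first=True)(history)    # (1, seq_len, 256)
user_state = preference[:, -1]    # latest knowledge-preference state, used to score candidate videos
```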
13. A Diverse Image Caption Generation Method Based on a Mixed Global and Sequential Variational Transformer
Authors: 刘兵, 李穗, 刘明明, 刘浩. 《电子学报》 (EI, CAS, CSCD, PKU Core), 2024, No. 4, pp. 1305-1314 (10 pages).
Diverse image caption generation has become a research hotspot in the image captioning field. However, existing methods ignore the dependency between global and sequential latent vectors, which severely limits improvements in captioning performance. To address this problem, this paper proposes a diverse image caption generation framework based on a mixed variational Transformer. Specifically, a mixed global and sequential conditional variational autoencoder model is first constructed to represent the dependency between global and sequential latent vectors. Second, the variational evidence lower bound of the mixed model is derived by maximizing the conditional likelihood, solving the design of the objective function for diverse image captioning. Finally, the Transformer and the mixed variational autoencoder model are seamlessly fused, and joint optimization improves the generalization performance of diverse image captioning. Experimental results on the MSCOCO dataset show that, compared with the current best baseline methods, when randomly generating 20 and 100 caption sentences, the diversity metric m-BLEU (mutual overlap BiLingual Evaluation Understudy) improves by 4.2% and 4.7%, respectively, while the accuracy metric CIDEr (Consensus-based Image Description Evaluation) improves by 4.4% and 15.2%, respectively.
Keywords: image understanding, image captioning, variational autoencoding, latent embedding, multimodal learning, generative model
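The abstract describes maximizing a variational evidence lower bound over a global latent and per-word sequential latents. The loss sketch below writes that kind of objective in generic PyTorch form; the Gaussian posteriors, the standard-normal prior, and all shapes and weights are assumptions made for brevity, not the paper's derivation.

```python
import torch

def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ), summed over latent dimensions.
    A standard-normal prior is assumed here purely to keep the sketch short."""
    return 0.5 * torch.sum(torch.exp(logvar) + mu ** 2 - 1.0 - logvar, dim=-1)

def caption_elbo(log_likelihood, mu_g, logvar_g, mu_s, logvar_s, beta=1.0):
    # log_likelihood: (batch,)       log p(caption | image, z_global, z_1..z_T)
    # mu_g/logvar_g:  (batch, d)     posterior of the global latent
    # mu_s/logvar_s:  (batch, T, d)  posteriors of the per-word sequential latents
    kl = gaussian_kl(mu_g, logvar_g) + gaussian_kl(mu_s, logvar_s).sum(dim=1)
    return (log_likelihood - beta * kl).mean()   # maximise this lower bound during training
```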
14. An Image Captioning Method Fusing Visual and Semantic Information over Multiple Time Steps
Authors: 陈善学, 王程. 《数据采集与处理》 (CSCD, PKU Core), 2024, No. 4, pp. 922-932 (11 pages).
Traditional image captioning methods use only the visual and semantic information of the current time step to generate the predicted word, without considering the visual and semantic information of past time steps, so the information output by the model is rather limited along the time dimension and the generated descriptions lack accuracy. To address this problem, an image captioning method fusing visual and semantic information over multiple time steps is proposed; it effectively fuses the visual and semantic information of past time steps and designs a gating mechanism to dynamically select and exploit the two kinds of information. Experiments on the MSCOCO dataset show that the method generates more accurate descriptions, and compared with the current mainstream image captioning methods, its performance improves considerably on all evaluation metrics.
Keywords: image captioning, visual information, semantic information, time dimension, gating mechanism
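A minimal sketch of the kind of gating mechanism described above, blending the current step's feature with a past-step feature; the class name, dimension, and linear-gate formulation are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalGate(nn.Module):
    """Gated blend of current-step and past-step features (visual or semantic)."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, current, past):
        g = torch.sigmoid(self.gate(torch.cat([current, past], dim=-1)))
        return g * current + (1 - g) * past    # the gate decides how much history to keep

fused = TemporalGate()(torch.randn(4, 512), torch.randn(4, 512))
```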
15. A Super-Frame Cutting Method for Interpreting the Visual Content of Long Videos
Authors: 魏英姿, 刘王杰. 《北京工业大学学报》 (CAS, CSCD, PKU Core), 2024, No. 7, pp. 805-813 (9 pages).
Existing encoder-decoder-based video description methods describe videos poorly when the video is long or scene changes are frequent, because their visual feature extraction or key-segment capture capability is insufficient. To address this, a video captioning method based on super-frame cutting of long videos is proposed. First, a super-frame extraction algorithm is proposed that computes the time ratio of key video segments to satisfy the browsing-time limit and shorten video retrieval time. Then, a two-layer filtering model is constructed to adaptively extract super-frames, filter redundant keyframes, and perform multi-scene semantic description. The retained keyframes are embedded with their surrounding frames, and a deep network model with small convolution kernels and pooling sampling fields is used to obtain more video features, overcoming the difficulty that classic video captioning methods cannot be applied directly to long videos. Finally, a long short-term memory model replaces the recurrent neural network for decoding and generating video captions, giving segment-by-segment interpretations of the video content. Tests on YouTube dataset videos, synthetic videos, and long surveillance videos, evaluated with several machine translation metrics, all show improvements to varying degrees. The experimental results show that the method obtains better segment descriptions when facing challenges such as frequent scene changes and long videos.
Keywords: super-frame cutting, time ratio, multi-scene semantics, visual features, long short-term memory model, video captioning
16. A Local Attribute Attention Network for Ethnic Clothing Image Caption Generation
Authors: 张绪辉, 刘骊, 付晓东, 刘利军, 彭玮. 《计算机辅助设计与图形学学报》 (EI, CSCD, PKU Core), 2024, No. 3, pp. 399-412 (14 pages).
Ethnic clothing images have complex attribute information, high inter-class similarity, and weak correlation between semantic attributes and visual information, which leads to inaccurate image caption generation. To address this, a local attribute attention network for ethnic clothing image caption generation is proposed. First, an ethnic clothing image captioning dataset containing 55 categories and 30,000 images (about 3,600 MB) is built. Then, 208 local key attribute words of ethnic clothing and 30,089 text entries are defined; a local attribute learning module performs visual feature extraction and text information embedding, and multiple-instance learning is used to obtain local attributes. Finally, based on a two-layer long short-term memory network, an attention-aware module covering semantic, visual, and gated attention is defined to fuse the local attributes, attribute-based visual features, and text encoding, and the ethnic clothing image caption is obtained through optimization. Experimental results on the constructed ethnic clothing captioning dataset show that the proposed network can generate image captions containing key attributes such as ethnic category and clothing style, improving on existing methods by 1.4% on the accuracy metric BLEU and 2.2% on the semantic-richness metric CIDEr.
Keywords: ethnic clothing images, image caption generation, text information embedding, local attribute learning, attention awareness
17. A Multi-Band Image Caption Generation Method Based on Feature Fusion
Authors: 贺姗, 蔺素珍, 王彦博, 李大威. 《计算机工程》 (CAS, CSCD, PKU Core), 2024, No. 6, pp. 236-244 (9 pages).
Existing image caption generation methods generally describe night scenes, occluded targets, and blurry images poorly. To address this, a multi-band detection image caption generation method based on feature fusion is proposed, introducing infrared detection imaging into the image captioning field. First, multi-layer convolutional neural networks (CNN) extract features from visible and infrared images separately. Then, based on the complementarity of the different detection bands, a spatial attention module built around multi-head attention is designed to fuse the target band features. Next, a channel attention mechanism aggregates spatial-domain information to guide the generation of different types of words. Finally, on the basis of traditional additive attention, an attention enhancement module is constructed to compute the correlation weights between the attention result map and the query vector, eliminating interference from irrelevant variables and thereby generating image captions. Multiple experiments on a visible-infrared image captioning dataset show that the method effectively fuses the semantic features of the two bands, reaching 58.3% on BLEU-4 and 136.1% on CIDEr, significantly improving captioning accuracy, and can be applied to complex-scene tasks such as security surveillance and military reconnaissance.
Keywords: image captioning, image fusion, multi-band images, self-attention mechanism, combined attention
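The channel attention step above (aggregating spatial-domain information to re-weight channels of the fused visible/infrared features) is sketched below in a squeeze-and-excitation style; this is a generic stand-in under assumed channel counts, not the paper's module.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention: pool spatial information,
    then re-weight the channels of the fused visible/infrared feature map."""
    def __init__(self, channels=512, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, feats):                  # feats: (batch, C, H, W)
        w = self.fc(feats.mean(dim=(2, 3)))    # aggregate the spatial domain -> (batch, C)
        return feats * w[:, :, None, None]     # channel-wise re-weighting

out = ChannelAttention()(torch.randn(2, 512, 7, 7))
```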
18. Image Caption Generation Based on Global and Sequential Variational Autoencoding
Authors: 刘明明, 刘浩, 王栋, 张海燕. 《计算机应用研究》 (CSCD, PKU Core), 2024, No. 7, pp. 2215-2220 (6 pages).
Transformer-based image caption generation methods usually learn a deterministic mapping from image space to text space so as to improve the performance of predicting the "average" caption, which leads the model to generate common words and repeated phrases, the so-called mode collapse problem. To this end, conditional variational autoencoding is combined with Transformer-based image caption generation: sentence-level and word-level diverse caption generation models are built using the variational evidence lower bound of the conditional likelihood, and global and sequential latent embedding learning is introduced to strengthen the model's latent representation ability. Quantitative and qualitative experiments on the MSCOCO benchmark show that both models achieve one-to-many mapping from image space to text space. Compared with the latest method COS-CVAE (diverse image captioning with context-object split latent spaces), when randomly generating 20 captions, the accuracy metric CIDEr and the diversity metric Div-2 improve by 1.3 and 33%, respectively, and when randomly generating 100 captions, CIDEr and Div-2 improve by 11.4 and 14%, respectively. The proposed method fits the true caption distribution better and strikes a better balance between diversity and accuracy.
Keywords: image caption generation, diverse captioning, variational Transformer, latent embedding
19. A Scene-Graph-Aware Cross-Modal Image Captioning Model
Authors: 朱志平, 杨燕, 王杰. 《计算机应用》 (CSCD, PKU Core), 2024, No. 1, pp. 58-64 (7 pages).
To address the forgetting and insufficient use of image text information in image captioning methods, a scene-graph-aware cross-modal interaction network (SGC-Net) is proposed. First, the scene graph is used as the visual feature of the image and a graph convolutional network (GCN) performs feature fusion, so that the visual and textual features of the image lie in the same feature space. Second, the text sequence generated by the model is saved and the corresponding position information is added as the textual feature of the image, solving the loss of textual features caused by a single-layer long short-term memory (LSTM) network. Finally, a self-attention mechanism extracts the important image and text information and fuses them, solving the over-reliance on image information and the insufficient use of text information. Experimental results on the Flickr30K and MSCOCO (MicroSoft Common Objects in COntext) datasets show that, compared with Sub-GC, SGC-Net improves the BLEU-1 (BiLingual Evaluation Understudy with 1-gram), BLEU-4 (BiLingual Evaluation Understudy with 4-grams), METEOR (Metric for Evaluation of Translation with Explicit ORdering), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), and SPICE (Semantic Propositional Image Caption Evaluation) metrics by 1.1, 0.9, 0.3, 0.7, and 0.4 and by 0.3, 0.1, 0.3, 0.5, and 0.6, respectively. The method used by SGC-Net can effectively improve the model's captioning performance and the fluency of the generated descriptions.
Keywords: image captioning, scene graph, attention mechanism, long short-term memory network, feature fusion
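A minimal sketch of the GCN fusion step described above, using a single dense graph-convolution layer over scene-graph node features rather than a graph library; the dimensions, toy adjacency, and mean-aggregation rule are assumptions, not the SGC-Net implementation.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One dense GCN layer over scene-graph node features (objects and relations)."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes, adj):
        # nodes: (n, dim) node features; adj: (n, n) adjacency matrix with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        return torch.relu(self.proj(adj @ nodes / deg))   # mean-aggregate neighbours, then project

nodes = torch.randn(6, 512)          # e.g., object and relation nodes of a scene graph
adj = torch.eye(6)                   # toy graph: self-loops only
fused_nodes = GraphConv()(nodes, adj)
```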
20. A Dense Video Captioning Method Based on Event Maximal Margin
Authors: 陈劭武, 胡慧君, 刘茂福. 《中国科技论文》 (CAS), 2024, No. 2, pp. 169-177 (9 pages).
Set-prediction-based dense video captioning methods lack explicit inter-event feature interaction and are not trained on inter-event differences, so they repeatedly predict events or generate near-identical sentences. To address this, a dense video captioning method based on event maximal margin (EMM-DVC) is proposed. The event margin is a score comprising the feature similarity between events, the distance between the events' temporal positions in the video, and the diversity of the generated descriptions. By maximizing the event margin, EMM-DVC keeps similar predictions far apart while keeping predictions close to the actual events. In addition, EMM-DVC introduces an event margin distance loss that enlarges the event margin distance and guides the model to attend to different events. Experiments on the ActivityNet Captions dataset show that EMM-DVC generates more diverse descriptions than comparable dense video captioning models, and that it reaches the best level on multiple metrics compared with mainstream dense video captioning models.
Keywords: dense video captioning, multi-task learning, end-to-end model, set prediction
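The abstract does not give the event margin formula, so the sketch below is only one generic contrastive reading of the idea: penalise pairs of predicted events whose features are too similar so that the set-prediction head describes distinct events. The margin value, feature shape, and normalisation are assumptions.

```python
import torch
import torch.nn.functional as F

def event_margin_loss(event_feats, margin=0.5):
    """Push the pairwise similarity of different predicted events below a margin
    (an assumed, simplified stand-in for the event maximal margin objective)."""
    sim = F.cosine_similarity(event_feats[:, None, :], event_feats[None, :, :], dim=-1)
    off_diag = sim - torch.eye(sim.size(0), device=sim.device)   # ignore self-similarity
    return F.relu(off_diag - margin).sum() / max(sim.size(0) * (sim.size(0) - 1), 1)

loss = event_margin_loss(torch.randn(4, 256))   # 4 predicted events with 256-d features
```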