Funding: Funded by the National Key Research and Development Program of China (No. 2022YFC3302103).
Abstract: High-resolution video transmission requires a substantial amount of bandwidth. In this paper, we present a video processing method that integrates region of interest (ROI) identification with super-resolution enhancement. The method first detects ROIs within video sequences and then applies super-resolution techniques to those areas, preserving visual quality while reducing the amount of data transmitted. To validate and benchmark our approach, we have curated a new gaming dataset tailored to evaluating ROI-based super-resolution in practical applications. The proposed model architecture builds on the transformer network framework and is guided by a carefully designed multi-task loss function, which allows ROI identification and resolution enhancement to be learned and executed concurrently. This unified deep learning model achieves strong super-resolution performance on our custom dataset. The research is relevant to low-bitrate video streaming: by selectively enhancing the resolution of critical regions, our solution enables high-quality video delivery under constrained bandwidth. Empirical results demonstrate a 15% reduction in transmission bandwidth compared with traditional super-resolution-based compression methods, without any perceptible decline in visual quality. This work thus contributes to video compression and enhancement technologies, offering an effective strategy for improving digital media delivery efficiency and user experience, especially in bandwidth-limited environments. The integration of ROI identification and super-resolution also opens avenues for future research and development in adaptive and intelligent video communication systems.
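The abstract does not give the form of the multi-task loss. As a minimal sketch of how an ROI identification term and a super-resolution term could be combined into one objective, the Python below sums a binary cross-entropy on the ROI mask and an ROI-weighted mean squared error on the pixels. The function names, the weights `lambda_roi` and `lambda_sr`, and the `background_weight` parameter are all assumptions made for illustration, not the authors' formulation.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between ROI probabilities and a 0/1 mask."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def roi_weighted_mse(sr, hr, mask, background_weight=0.1):
    """Weighted MSE: ROI pixels count fully, background pixels count less.

    background_weight is a hypothetical knob, not taken from the paper.
    """
    num = 0.0
    den = 0.0
    for s, h, m in zip(sr, hr, mask):
        w = 1.0 if m else background_weight
        num += w * (s - h) ** 2
        den += w
    return num / den


def multi_task_loss(roi_pred, roi_mask, sr, hr, lambda_roi=1.0, lambda_sr=1.0):
    """Assumed joint objective: weighted sum of the ROI and SR terms."""
    return (lambda_roi * bce(roi_pred, roi_mask)
            + lambda_sr * roi_weighted_mse(sr, hr, roi_mask))
```

Here pixels are flattened into plain lists to keep the sketch dependency-free; in practice the same sum of terms would be computed on tensors inside a deep learning framework.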
Funding: Funded by the National Key Research and Development Program of China (No. 2022YFC3302103).
Abstract: The emergence of new media in various fields has continuously strengthened the social aspect of social media. Netizens tend to express emotions in social interactions, and many people use satire, metaphor, and other techniques to express negative emotions, so it is necessary to detect sarcasm in social comment data. For sarcasm, the more reference data modalities are used, the better the experimental effect. This paper studies sarcasm detection based on fused image-text data. To make effective use of the features of each modality, a feature reconstruction output algorithm is proposed. The algorithm is based on the attention mechanism: it learns the low-rank features of the other modality through cross-modal attention and reconstructs the feature vectors of the corresponding modality through weighted averaging. When only the image modality of the dataset is used, the reconstruction output model performs well on the preprocessed data, with an accuracy of 87.6%. When only the text modality of the dataset is used, the reconstruction output model gives the best result, with an accuracy of 85.2%. To improve feature fusion between modalities for effective classification, a weight adaptive learning algorithm is used. This algorithm combines a neural network with an attention mechanism to calculate the attention weight of each modality, achieving adaptive weight learning with an accuracy of 87.9%. Extensive experiments on a benchmark dataset demonstrate the superiority of the proposed model.
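The weight adaptive fusion step can be illustrated with a small sketch. The Python below assumes that each modality already has a scalar relevance score (in the paper this would come from the neural network with attention, which is not specified here), normalizes the scores with a softmax, and forms the fused feature as the weighted sum of the modality features. The names `softmax` and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, scores):
    """Fuse per-modality feature vectors with softmax attention weights.

    features: one equal-length vector per modality (e.g. image, text).
    scores:   one relevance score per modality (assumed given here).
    Returns (weights, fused_vector).
    """
    weights = softmax(scores)
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features))
             for i in range(dim)]
    return weights, fused
```

With equal scores the fused vector is simply the average of the modality features; a higher score for one modality shifts the fused vector toward that modality.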
Funding: Supported by the National Natural Science Foundation of China (No. 61801440), the High-quality and Cutting-edge Disciplines Construction Project for Universities in Beijing (Internet Information, Communication University of China), the State Key Laboratory of Media Convergence and Communication (Communication University of China), and the Fundamental Research Funds for the Central Universities (CUC2019B069).
Abstract: Many deep learning methods have achieved impressive results in image sentiment analysis. However, existing methods usually pursue high accuracy while ignoring the effect on model training efficiency; in large-scale sentiment analysis tasks, high accuracy often requires long experiment times. To address this weakness, a method is proposed that greatly improves experimental efficiency with only small fluctuations in model accuracy. Singular value decomposition (SVD) is used to find the sparse features of the image, which are sparse vectors with strong discriminative power that effectively reduce redundant information. The authors also propose the Fast Dictionary Learning (FDL) algorithm, which combines a neural network with sparse representation. The method is based on K-Singular Value Decomposition (K-SVD) and, through iteration, effectively reduces computation time and greatly improves training efficiency with only small fluctuations in accuracy. The effectiveness of the proposed method is evaluated on the FER2013 dataset. Adding singular value decomposition increased accuracy on the test set by 0.53% and shortened the total experiment time by 8.2%; Fast Dictionary Learning shortened the total experiment time by 36.3%.
Abstract: Understanding an image goes beyond recognizing and locating the objects in it; the relationships between objects are also very important. Most previous methods have focused on recognizing local predictions of the relationships, but real-world image relationships are often determined by the surrounding objects and other contextual information. In this work, we use this insight to propose a novel framework for visual relationship detection. The core of the framework is a relationship inference network, a recurrent structure designed to combine the global contextual information of the objects to infer the relationships in the image. Experimental results on Stanford VRD and Visual Genome demonstrate that the proposed method performs well in both efficiency and accuracy. Finally, we demonstrate the value of visual relationships on two computer vision tasks: image retrieval and scene graph generation.
Funding: Supported by the National Key Research and Development Program of China (No. 2019YFB1405900).
Abstract: Image super-resolution (SR) is an important technique for improving the resolution and quality of images. With the great progress of deep learning, image super-resolution has recently achieved remarkable improvements. This work presents a brief, systematic survey of recent advances in deep learning-based single image super-resolution methods. Existing SR studies are roughly grouped into ten major categories. Other important issues are also introduced, such as publicly available benchmark datasets and performance evaluation metrics. Finally, the survey concludes by highlighting four future trends.
Abstract: Scene text recognition (STR) is the task of recognizing character sequences in natural scenes. Although STR methods have developed greatly, existing methods still cannot recognize text of arbitrary shape, such as the heavily curved or rotated text common in daily life. Irregular scene text has a complex layout in two-dimensional space, which challenges the methods used to recognize scene text in the past. Recently, some recognizers rectify irregular text into a regular text image with an approximately 1D layout, or convert the 2D image feature map into a one-dimensional feature sequence. Although these methods have achieved good performance, their robustness and accuracy are limited by the loss of spatial information in the two-dimensional to one-dimensional transformation. In this paper, we propose a framework that directly converts irregular text with a two-dimensional layout into a character sequence, using a relationship attention module to capture the correlations within the feature map. Extensive experiments on multiple common benchmarks show that our method effectively recognizes both regular and irregular scene text and is more accurate than previous methods.
Funding: This work was supported in part by the audio-visual new media laboratory operation and maintenance of the Academy of Broadcasting Science (Grant No. 200304), and in part by the National Key Research and Development Program of China (Grant No. 2019YFB1406201).
Abstract: Referring expression comprehension is the task of locating the image region described by a natural language expression, which refers to the properties of the region or its relationships with other regions. Most previous work handles this problem by selecting the most relevant regions from a set of candidate regions; these methods are inefficient when the set contains many candidates. Inspired by the recent success of deep learning methods in image captioning, in this paper we propose a framework that understands referring expressions through multiple steps of reasoning. We present a model for referring expression comprehension that selects the most relevant region directly from the image. The core of our model is a recurrent attention network, which can be seen as an extension of the Memory Network. The proposed model is capable of improving its results through multiple computational hops. We evaluate the proposed model on two referring expression datasets: Visual Genome and Flickr30k Entities. The experimental results demonstrate that the proposed model outperforms previous state-of-the-art methods in both accuracy and efficiency. We also conduct an ablation experiment, which shows that the performance of the model does not keep improving as the number of attention layers increases.
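To make the idea of multiple computational hops concrete, here is a minimal Python sketch of multi-hop attention in the style of an end-to-end memory network. It is an illustration under stated assumptions, not the authors' recurrent attention network: region features play the role of memory slots, the expression embedding is the query, dot products give the attention scores, and the query is updated by adding the attended read vector after each hop. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """One attention hop: dot-product scores, softmax, weighted read."""
    scores = [sum(q * v for q, v in zip(query, slot)) for slot in memory]
    weights = softmax(scores)
    dim = len(query)
    read = [sum(w * slot[i] for w, slot in zip(weights, memory))
            for i in range(dim)]
    return weights, read


def multi_hop(query, memory, hops=2):
    """Run several hops; each hop adds the read vector to the query.

    Returns the final query and the attention weights of every hop.
    """
    history = []
    for _ in range(hops):
        weights, read = attend(query, memory)
        query = [q + r for q, r in zip(query, read)]
        history.append(weights)
    return query, history
```

In this toy form, a region that already matches the query tends to receive more attention on later hops, because the updated query moves toward it. The region with the highest final weight would be taken as the prediction.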
Funding: Supported by the National Natural Science Foundation of China (No. 61402023), the Beijing Technology and Business University Youth Fund (No. QNJJ2014-23), and the Beijing Natural Science Foundation (No. 4162019).
Abstract: Video synopsis is an effective and innovative way to produce short video abstractions of huge video archives while keeping the dynamic characteristics of the activities in the original video. Abnormal activity, as the critical event, is always the main concern in the video surveillance context. However, in traditional video synopsis, all normal and abnormal activities are condensed together equally, which can make the synopsis video confusing and worthless. In addition, traditional video synopsis methods neglect redundancy in the content domain. To solve these issues, a novel video synopsis method is proposed based on abnormal activity detection and key observation selection. In the proposed algorithm, activities are classified as normal or abnormal based on the sparse reconstruction cost from an atomically learned activity dictionary. Key observation selection using the minimum description length principle is then conducted to eliminate content redundancy in normal activity. Experiments conducted on publicly available datasets demonstrate that the proposed approach can effectively generate satisfying synopsis videos.
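The classification by sparse reconstruction cost can be sketched as follows. The abstract does not say how the sparse code is computed, so this Python example swaps in greedy matching pursuit over unit-norm dictionary atoms and scores an activity feature by half the squared residual plus an L1 penalty on the coefficients. The solver choice, the parameters `n_atoms` and `lam`, and the threshold are assumptions for illustration only.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_reconstruction_cost(x, dictionary, n_atoms=2, lam=0.1):
    """Cost of explaining x with a few atoms of a learned dictionary.

    Uses greedy matching pursuit (an assumed solver) and expects
    unit-norm atoms. Returns 0.5 * ||residual||^2 + lam * ||coeffs||_1.
    """
    residual = list(x)
    coeffs = [0.0] * len(dictionary)
    for _ in range(n_atoms):
        corr = [dot(residual, atom) for atom in dictionary]
        k = max(range(len(dictionary)), key=lambda i: abs(corr[i]))
        coeffs[k] += corr[k]
        residual = [r - corr[k] * a for r, a in zip(residual, dictionary[k])]
    return 0.5 * dot(residual, residual) + lam * sum(abs(c) for c in coeffs)


def is_abnormal(x, dictionary, threshold, **kwargs):
    """Flag an activity whose reconstruction cost exceeds the threshold."""
    return sparse_reconstruction_cost(x, dictionary, **kwargs) > threshold
```

An activity that the dictionary of normal behavior can represent well gets a low cost, while one lying outside the span of the atoms keeps a large residual and is flagged as abnormal.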
Funding: Supported by the National Natural Science Foundation of China (No. 61801046).
Abstract: Sparse code multiple access-based uplink grant-free transmission (SCMA-UGFT) has been proposed to realize ultra-reliable and low-latency communication (URLLC) in the fifth generation (5G) system. Without the process of resource request and grant, users may collide in the same resource. To compensate for the potential decline in user performance, resource scheduling becomes a difficult issue in the SCMA-UGFT system. This article proposes a duplicated transmission-based resource scheduling (DTBRS) scheme for the SCMA-UGFT system in the URLLC scenario. Unlike existing schemes, the proposed DTBRS scheme allocates more than one shared basic transmission unit (BTU) to a user equipment (UE) for the initial transmission, realizing duplicated transmission and guaranteeing transmission reliability. In addition, one or two exclusive BTUs are assigned to a UE for retransmission to avoid re-collision. Finally, each packet is given a lifetime that limits the transmission latency to meet the URLLC latency requirement. Simulation demonstrates that the DTBRS scheme achieves better performance than the existing state-of-the-art scheme in terms of the average packet drop rate.
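The interplay of duplicated transmission and packet lifetime can be illustrated with a toy model. The Python below is only a sketch under simplifying assumptions: a packet is delivered in the first slot in which at least one of its copies is decoded, it is dropped once its lifetime (counted in slots) has passed, and, for the closed-form helper, each copy is assumed to be lost independently with the same probability. None of these modeling choices or names come from the paper.

```python
def deliver(attempts, lifetime):
    """Return the 1-based slot of delivery, or None if the packet is dropped.

    attempts: one list of booleans per slot, one entry per transmitted copy
              (True means that copy was decoded).
    lifetime: maximum number of slots the packet may use (assumed unit).
    """
    for slot, copies in enumerate(attempts, start=1):
        if slot > lifetime:
            break  # lifetime expired before a copy got through
        if any(copies):
            return slot
    return None


def all_copies_lost(p_loss, copies):
    """Probability that every copy is lost, assuming independent losses."""
    return p_loss ** copies
```

For example, a packet whose two initial copies both collide but whose retransmission on an exclusive BTU succeeds is delivered in slot 2, provided its lifetime is at least 2 slots. Under the independence assumption, sending two copies instead of one turns a per-copy loss probability p into p squared for the initial transmission.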