Light field (LF) imaging has attracted attention because of its ability to solve computer vision problems. In this paper, we briefly review the research progress in computer vision in recent years. Among the factors that affect the development of computer vision, the richness and accuracy of visual information acquisition are decisive. LF imaging technology has made great contributions to computer vision because it uses cameras or microlens arrays to record the position and direction of light rays, acquiring complete three-dimensional (3D) scene information. LF imaging technology improves the accuracy of depth estimation, image segmentation, blending, fusion, and 3D reconstruction. LF has also been innovatively applied to iris and face recognition, identification of materials and fake pedestrians, acquisition of epipolar plane images, shape recovery, and LF microscopy. Here, we further summarize the existing problems and the development trends of LF imaging in computer vision, including the establishment and evaluation of LF datasets, applications under high dynamic range (HDR) conditions, LF image enhancement, virtual reality, 3D display, 3D movies, military optical camouflage technology, image recognition at the micro-scale, HDR-based image processing methods, and the optimal relationship between spatial resolution and four-dimensional (4D) LF information acquisition. LF imaging has achieved great success in various studies. Over the past 25 years, more than 180 publications have reported the capability of LF imaging in solving computer vision problems. We summarize these reports to make it easier for researchers to find detailed methods for specific solutions.
AIM: To further improve the endoscopic detection of intestinal mucosa alterations due to celiac disease (CD). METHODS: We assessed a hybrid approach based on the integration of expert knowledge into the computer-based classification pipeline. A total of 2835 endoscopic images from the duodenum were recorded in 290 children using the modified immersion technique (MIT). These children underwent routine upper endoscopy for suspected CD or non-celiac upper abdominal symptoms between August 2008 and December 2014. Blinded to the clinical data and biopsy results, three medical experts visually classified each image as normal mucosa (Marsh-0) or villous atrophy (Marsh-3). The experts' decisions were further integrated into state-of-the-art texture recognition systems. Using the biopsy results as the reference standard, the classification accuracies of this hybrid approach were compared to the experts' diagnoses in 27 different settings. RESULTS: Compared to the experts' diagnoses, in 24 of 27 classification settings (consisting of three imaging modalities, three endoscopists and three classification approaches), the best overall classification accuracies were obtained with the new hybrid approach. In 17 of 24 classification settings, the improvements achieved with the hybrid approach were statistically significant (P < 0.05). Using the hybrid approach, classification accuracies between 94% and 100% were obtained. Whereas the improvements were only moderate in the case of the most experienced expert, the results of the less experienced expert could be improved significantly in 17 out of 18 classification settings. Furthermore, the lowest classification accuracy, based on the combination of one database and one specific expert, could be improved from 80% to 95% (P < 0.001). CONCLUSION: The overall classification performance of medical experts, especially less experienced experts, can be boosted significantly by integrating expert knowledge into computer-aided diagnosis systems.
This paper tackles the high computational/space complexity associated with multi-head self-attention (MHSA) in vanilla vision transformers. To this end, we propose hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches as commonly done, and each patch is viewed as a token. Then, the proposed H-MHSA learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies among the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of hierarchical-attention-based transformer networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.
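The local-then-global scheme described in the H-MHSA abstract can be illustrated with a minimal NumPy sketch. This is not the HAT-Net implementation: it assumes a single head, no learned projections, a 1D token sequence, average pooling as the merge step, and a simple sum as the aggregation. The names `hierarchical_attention` and `score_counts` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Plain scaled dot-product attention (batched over leading axes)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def hierarchical_attention(tokens, window):
    """tokens: (n, d) array, n divisible by window (assumption of this sketch)."""
    n, d = tokens.shape
    # Local step: attention only among the tokens inside each window.
    groups = tokens.reshape(n // window, window, d)
    local = attention(groups, groups, groups)            # (n/window, window, d)
    # Merge step: average each window into one coarse token (assumed pooling).
    merged = local.mean(axis=1)                          # (n/window, d)
    # Global step: attention among the small number of merged tokens.
    coarse = attention(merged, merged, merged)           # (n/window, d)
    # Aggregate: broadcast each coarse token back to its window and add.
    return local.reshape(n, d) + np.repeat(coarse, window, axis=0)


def score_counts(n, window):
    """Number of attention scores: (full attention, hierarchical attention)."""
    return n * n, n * window + (n // window) ** 2
```

For 16 tokens and a window of 4, `score_counts(16, 4)` compares the 256 scores of full attention with the 80 scores computed by the two-level scheme, which is the source of the reduced load the abstract mentions.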
Audio-visual wake word spotting is a challenging multi-modal task that exploits visual information from lip motion patterns to supplement acoustic speech and improve overall detection performance. However, most audio-visual wake word spotting models are only suitable for simple single-speaker scenarios and have high computational complexity. Further development is hindered by complex multi-person scenarios and computational limitations in mobile environments. In this paper, a novel audio-visual model is proposed for on-device multi-person wake word spotting. Firstly, an attention-based audio-visual voice activity detection module is presented, which generates an attention score matrix of audio and visual representations to derive the active speaker representation. Secondly, knowledge distillation is introduced to transfer knowledge from the large model to the on-device model to control the size of our model. Moreover, a new audio-visual dataset, PKU-KWS, is collected for sentence-level multi-person wake word spotting. Experimental results on the PKU-KWS dataset show that this approach outperforms the previous state-of-the-art methods.
Monocular 6D pose estimation is a fundamental task in computer vision and robotics. In recent years, 2D-3D correspondence-based methods have achieved improved performance in multiview and depth data-based scenes. However, for monocular 6D pose estimation, these methods are affected by the prediction quality of the 2D-3D correspondences and the robustness of the perspective-n-point (PnP) algorithm, and a gap remains between the achieved and the expected estimation performance. To obtain a more effective feature representation, edge enhancement is proposed to increase the shape information of the object, based on an analysis of the influence of inaccurate 2D-3D matching on 6D pose regression and a comparison of the effectiveness of intermediate representations. Furthermore, although the transformation matrix from 3D model points to 2D pixel points is composed of rotation and translation matrices, the two variables are essentially different, and the same network cannot be used for both in the regression process. Therefore, to improve the effectiveness of the PnP algorithm, this paper designs a dual-branch PnP network to predict rotation and translation information. Finally, the proposed method is verified on the public LM, LM-O and YCB-Video datasets. The ADD(-S) values of the proposed method are 94.2 and 62.84 on the LM and LM-O datasets, respectively. The AUC of ADD(-S) on YCB-Video is 81.1. These experimental results show that the performance of the proposed method is superior to that of similar methods.
End-to-end text spotting is a vital computer vision task that aims to integrate scene text detection and recognition into a unified framework. Typical methods rely heavily on region-of-interest (RoI) operations to extract local features and on complex post-processing steps to produce final predictions. To address these limitations, we propose TextFormer, a query-based end-to-end text spotter with a transformer architecture. Specifically, using a query embedding per text instance, TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multitask modeling. It allows for mutual training and optimization of the classification, segmentation and recognition branches, resulting in deeper feature sharing without sacrificing flexibility or simplicity. Additionally, we design an adaptive global aggregation (AGG) module to transfer global features into sequential features for reading arbitrarily shaped texts, which overcomes the suboptimization problem of RoI operations. Furthermore, potential corpus information is utilized from weak annotations to full labels through mixed supervision, further improving text detection and end-to-end text spotting results. Extensive experiments on various bilingual (i.e., English and Chinese) benchmarks demonstrate the superiority of our method. In particular, on the TDA-ReCTS dataset, TextFormer surpasses the state-of-the-art method in terms of 1-NED by 13.2%.
Most polyp segmentation methods use convolutional neural networks (CNNs) as their backbone, leading to two key issues when exchanging information between the encoder and decoder: (1) taking into account the differences in contribution between different-level features, and (2) designing an effective mechanism for fusing these features. Unlike existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the influence of image acquisition and the elusive properties of polyps, we introduce three standard modules: a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). Among these, the CFM is used to collect the semantic and location information of polyps from high-level features; the CIM is applied to capture polyp information disguised in low-level features; and the SAM extends the pixel features of the polyp area with high-level semantic position information to the entire polyp area, thereby effectively fusing cross-level features. The proposed model, named Polyp-PVT, effectively suppresses noise in the features and significantly improves their expressive capabilities. Extensive experiments on five widely adopted datasets show that the proposed model is more robust to various challenging situations (e.g., appearance changes, small objects, and rotation) than existing representative methods. The proposed model is available at https://github.com/DengPingFan/Polyp-PVT.
High-intensity focused ultrasound (HIFU) therapy is an effective method for the clinical treatment of tumors. To explore the bio-heat conduction mechanism in multi-layer media heated by a concave spherical transducer, the temperature field induced by this kind of transducer in multi-layer media is simulated by solving the Pennes equation with the finite difference method, and the influence of the initial sound pressure, the absorption coefficient, the thickness of the different layers of biological tissue, and the thermal conductivity on the sound focus and temperature distribution is analyzed. The results show that the temperature in the focal area rises faster as the initial sound pressure and thermal conductivity increase. The smaller the absorption coefficient, the greater the ultrasound intensity in the focal area and the larger the focal area. When the thicknesses of the tissue layers change, the focus position shifts only slightly, but the sound intensity in the focal area changes markedly. The temperature in the focal area rises quickly until it reaches a threshold and then stays within the threshold range.
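As a rough illustration of solving the Pennes equation by finite differences, the sketch below integrates the 1D form rho*c*dT/dt = k*d2T/dx2 - w_b*c_b*(T - T_a) + Q with an explicit scheme. It is a simplification under stated assumptions, not the simulation from the abstract: a single homogeneous layer, one spatial dimension, a prescribed heat source `q` instead of an acoustic field, and illustrative tissue parameters. The function name `pennes_1d` is hypothetical.

```python
import numpy as np


def pennes_1d(q, dx, dt, steps, k=0.5, rho=1050.0, c=3600.0,
              w_b=0.5, c_b=3600.0, t_a=37.0):
    """Explicit finite-difference integration of the 1D Pennes equation.

    q   : heat source per unit volume, W/m^3 (array over the grid)
    k   : thermal conductivity, W/(m K)        (illustrative value)
    w_b : blood perfusion rate, kg/(m^3 s)     (illustrative value)
    t_a : arterial and initial tissue temperature, deg C
    """
    alpha = k / (rho * c)
    # Stability condition of the explicit diffusion scheme.
    if alpha * dt / dx ** 2 > 0.5:
        raise ValueError("time step too large for the explicit scheme")
    temp = np.full(q.shape, t_a, dtype=float)
    for _ in range(steps):
        lap = np.zeros_like(temp)
        # Second-order central difference for the interior points.
        lap[1:-1] = (temp[2:] - 2.0 * temp[1:-1] + temp[:-2]) / dx ** 2
        rate = (k * lap - w_b * c_b * (temp - t_a) + q) / (rho * c)
        temp = temp + dt * rate
        # Boundaries held at arterial temperature (assumed boundary condition).
        temp[0] = temp[-1] = t_a
    return temp
```

A Gaussian `q` centered on the grid mimics a focal heat deposit; the multi-layer case would need position-dependent `k`, `rho`, `c` and absorption, which this sketch leaves out.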
We present a new method for automatically forecasting the occurrence of solar flares based on photospheric magnetic measurements. The method is a cascading combination of an ordinal logistic regression model and a support vector machine classifier. The predictive variables are three photospheric magnetic parameters, i.e., the total unsigned magnetic flux, the length of the strong-gradient magnetic polarity inversion line, and the total magnetic energy dissipation. The output is true or false for the occurrence of a certain level of flares within 24 hours. Experimental results, from a sample of 230 active regions between 1996 and 2005, show the accuracies of a 24-hour flare forecast to be 0.86, 0.72, 0.65 and 0.84, respectively, for the four different levels. Comparison shows an improvement in the accuracy of X-class flare forecasting.
Focusing on data imbalance and intraclass variation, an improved pedestrian detection method with a cascade of complex peer AdaBoost classifiers is proposed. The series of AdaBoost classifiers is learned greedily, along with negative example mining. The complexity of the classifiers in the cascade is not limited, so more negative examples are used for training. Furthermore, the cascade becomes an ensemble of strong peer classifiers, which handles intraclass variation. To locally train the AdaBoost classifiers with a high detection rate, a refining strategy is used to discard the hardest negative training examples rather than decreasing their thresholds. Using the aggregate channel feature (ACF), the method achieves miss rates of 35% and 14% on the Caltech pedestrian benchmark and the Inria pedestrian dataset, respectively, which are lower than those of a cascade of increasingly complex AdaBoost classifiers, i.e., 44% and 17%, respectively. Using deep features extracted by the region proposal network (RPN), the method achieves a miss rate of 10.06% on the Caltech pedestrian benchmark, which is also lower than the 10.53% of the increasingly complex cascade. This study shows that the proposed method can use more negative examples to train the pedestrian detector. It outperforms the existing cascade of increasingly complex classifiers.
We investigate the evolution of cooperation with evolutionary public goods games based on finite populations, where four pure strategies are considered: cooperators, defectors, punishers, and loners who are unwilling to participate. By adopting approximate best response dynamics, we show that the magnitude of rationality not only quantitatively explains the experimental results in [Nature (London) 425 (2003) 390], but also heavily influences the evolution of cooperation. In contrast to previous results for infinite populations, which yield two equilibria, we show that there exists only one special equilibrium of cooperation. In addition, we show that the loner's payoff plays an active role in maintaining cooperation: a relatively high value of bounded rationality sustains cooperation, but only for low and moderate values of the loner's payoff. This indicates that both rationality and the loner's payoff influence cooperation. Finally, we highlight the important result that the introduction of voluntary participation and punishment greatly facilitates cooperation.
After reviewing three different definitions of the mode field diameter of single-mode fibers, we discuss in detail the methods for calculating coupling efficiency under lateral offset, longitudinal separation and wavelength variation, the effects they produce, and the influence of splicing defects. The regularities of these effects are studied using the first-order derivative of the coupling efficiency formula, and a simplified formula for coupling efficiency under slight misalignment is presented as a function of the wavelength λ, in good agreement with the theoretical model. The simplified formula provides a new and simple approach to evaluating the wavelength-dependent coupling efficiency of single-mode fibers. Theoretical analyses and numerical calculations show that, when those defects exist, the wavelength has additional effects on the coupling loss: an increase in wavelength raises the coupling efficiency for lateral offset or longitudinal separation but lowers it for angular misalignment or mode field mismatch. The wavelength degrades the coupling efficiency distinctly when λ ≥ 2.5 μm, whereas it affects the coupling only slightly in the range λ ≤ 2 μm.
Arecanut disease identification is a challenging problem in the field of image processing. In this work, we present a new combination of multi-gradient-direction and deep convolutional neural networks for identifying arecanut diseases, namely rot, split and rot-split. Due to the effect of the disease, vital details in the images may be lost. To enhance the fine details in images affected by disease, we explore multi-Sobel directional masks for convolving with the input image, which results in enhanced images. The proposed method extracts the arecanut as foreground from the enhanced images using Otsu thresholding. Features are then extracted from the foreground information for disease identification using the ResNet architecture. The advantage of the proposed approach is that it distinguishes diseased images from healthy arecanut images. Experimental results on a dataset of four classes (healthy, rot, split and rot-split) show that the proposed model is superior in terms of classification rate.
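The enhancement and foreground-extraction steps described in the arecanut abstract can be sketched in NumPy as follows. The abstract does not give the exact masks or how the directional responses are combined, so this sketch assumes four 3x3 Sobel-style masks, takes the maximum absolute response, and adds it to the input; `filter3x3`, `enhance` and `otsu_threshold` are hypothetical names. The Otsu step is the standard between-class-variance criterion.

```python
import numpy as np

# Four directional Sobel-style masks (assumed set, not taken from the paper).
SOBEL_MASKS = [
    np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]),    # responds to vertical edges
    np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]]),    # responds to horizontal edges
    np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]]),    # one diagonal direction
    np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]]),    # the other diagonal direction
]


def filter3x3(img, mask):
    """Correlate a 2D image with a 3x3 mask using edge-replicated padding.

    Correlation differs from convolution only by a flip of the mask, which
    does not matter here because only absolute responses are used.
    """
    h, w = img.shape
    padded = np.pad(img.astype(float), 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += mask[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def enhance(img):
    """Add the strongest directional edge response to the image (assumed rule)."""
    responses = [np.abs(filter3x3(img, m)) for m in SOBEL_MASKS]
    edges = np.max(responses, axis=0)
    return np.clip(img + edges, 0, 255).astype(np.uint8)


def otsu_threshold(img):
    """Return the gray level that maximizes the between-class variance."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                     # probability of class 0
    mu = np.cumsum(prob * np.arange(256))       # cumulative mean of class 0
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b = np.nan_to_num(sigma_b, nan=0.0, posinf=0.0, neginf=0.0)
    return int(np.argmax(sigma_b))
```

With a `uint8` grayscale image `img`, a foreground mask could then be taken as `enhance(img) > otsu_threshold(enhance(img))`, assuming the object is brighter than the background.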
Recognizing actions according to video features is an important problem in a wide range of applications. In this paper, we propose a temporal scale-invariant deep learning framework for action recognition, which is robust to changes in action speed. Specifically, a video is first split into several sub-action clips, and a keyframe is selected from each sub-action clip. The spatial and motion features of the keyframe are extracted separately by two Convolutional Neural Networks (CNNs) and combined in a convolutional fusion layer to learn the relationship between the features. Then, Long Short-Term Memory (LSTM) networks are applied to the fused features to model long-term temporal clues. Finally, the action prediction scores of the LSTM network are combined by linear weighted summation. Extensive experiments are conducted on two popular and challenging benchmarks, namely UCF-101 and HMDB51 Human Actions. On both benchmarks, our framework achieves results superior to state-of-the-art methods, with 93.7% on UCF-101 and 69.5% on HMDB51.
Achieving good recognition results for license plates is challenging due to multiple adverse factors. For instance, in Malaysia, private vehicles (e.g., cars) have plate numbers on a dark background, while public vehicles (taxis/cabs) have plate numbers on a white background. To reduce the complexity of the problem, we propose to classify these two types of images so that an appropriate method can be chosen to achieve better results. In this work, we explore the combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks, namely BLSTM (Bi-Directional Long Short-Term Memory), for recognition. The CNN is used for feature extraction because of its high discriminative ability, while the BLSTM is able to extract context information based on past information. For classification, we propose Dense Cluster based Voting (DCV), which separates foreground and background for successful classification of private and public plates. Experimental results on live data provided by MIMOS, which is funded by the Malaysian Government, and on the standard UCSD dataset show that the proposed classification outperforms the existing methods. In addition, the recognition results show that recognition performance improves significantly after classification compared with before classification.
Convex hull algorithms have been extensively studied in the literature, principally because of their wide range of applications in different areas. This article presents an efficient algorithm to construct an approximate convex hull from a set of n points in the plane in O(n+k) time, where k is the approximation error control parameter. The proposed algorithm is suitable for applications that trade some accuracy for reduced computation time, such as animation and interaction in computer graphics, where rapid, real-time rendering is indispensable.
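The abstract does not describe the algorithm itself. As a hedged illustration of how a parameter k can trade accuracy for speed, the sketch below uses the classic strip-based idea (in the style of Bentley, Faust and Preparata): split the x-range into k vertical strips, keep only the lowest and highest point of each strip, and build the hull of those candidates. This is a swapped-in technique, not necessarily the article's method, and because it sorts the candidates it runs in O(n + k log k) rather than O(n + k). The names `approximate_hull` and `convex_hull` are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Exact hull by Andrew's monotone chain, counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def approximate_hull(points, k):
    """Approximate hull from at most 2k + 2 candidate points."""
    xs = [p[0] for p in points]
    x_min, x_max = min(xs), max(xs)
    if x_min == x_max:
        return convex_hull(points)
    width = (x_max - x_min) / k
    lowest = [None] * k
    highest = [None] * k
    for p in points:
        # Strip index; the right-most points fall into the last strip.
        i = min(int((p[0] - x_min) / width), k - 1)
        if lowest[i] is None or p[1] < lowest[i][1]:
            lowest[i] = p
        if highest[i] is None or p[1] > highest[i][1]:
            highest[i] = p
    candidates = [p for p in lowest + highest if p is not None]
    # Always keep extreme-x points so the hull spans the full x-range.
    candidates += [min(points), max(points)]
    return convex_hull(candidates)
```

A larger k keeps more candidates and brings the result closer to the exact hull; a smaller k discards more points in the single linear pass.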
The contamination considered in this paper is a defect on the surface of ice cream bars that poses a serious safety threat, so it is essential to detect this defect before the product reaches the market. A method for detecting contamination defects on the ice cream bar surface is proposed, based on a fuzzy rule and an absolute neighborhood feature. Firstly, the ice cream bar surface is divided into several sub-regions via the defined adjacent gray level clustering method. Then, the candidate contamination regions are extracted from the sub-regions via the defined fuzzy rule. Finally, the real contamination regions are recognized via the relationship between the absolute neighborhood gray feature and a default threshold. The algorithm was tested on the self-built image database SUT-D. The results show that the accuracy of the proposed method is 97.32 percent, at least 2.68 percent higher than that of other typical algorithms. This indicates that the proposed method is superior and of practical value.
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture to replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. Besides, we designed masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multimodal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
Driver behavior modeling is becoming increasingly important in the study of traffic safety and the development of cognitive vehicles. This paper develops an algorithm for handling reliability in both digital driving and conventional driving. The problems of digital driving error classification, digital driving error probability quantification and digital driving reliability simulation are addressed using a comparative research method. Simulation results show that the driving reliability analysis discussed here is capable of identifying digital driving behavior characteristics and supporting the safety assessment of intelligent transportation systems.
Multimodality image registration and fusion are essential steps in building 3-D models from remote sensing data. In this paper, we present a neural network technique for the registration and fusion of multimodality remote sensing data for the reconstruction of 3-D models of terrain regions. A feedforward neural network is used to fuse the intensity data sets with the spatial data set after learning its geometry. Results on real data are presented. Human performance is assessed on several perceptual tests in order to evaluate the fusion results.
Funding (light field imaging review): Project supported by the National Natural Science Foundation of China (Nos. 61906133, 62020106004, and 92048301).
Funding (celiac disease endoscopy study): Supported by the Austrian Science Fund (FWF), No. KLI 429-B13 to Vécsei A.
Funding (HAT-Net): Supported by the A*STAR Career Development Fund, Singapore (No. C233312006).
Funding: Supported by the National Key R&D Program of China (No. 2020AAA0108904) and the Science and Technology Plan of Shenzhen (No. JCYJ20200109140410340).
Abstract: Audio-visual wake word spotting is a challenging multi-modal task that exploits visual information of lip motion patterns to supplement acoustic speech to improve overall detection performance. However, most audio-visual wake word spotting models are only suitable for simple single-speaker scenarios and require high computational complexity. Further development is hindered by complex multi-person scenarios and computational limitations in mobile environments. In this paper, a novel audio-visual model is proposed for on-device multi-person wake word spotting. Firstly, an attention-based audio-visual voice activity detection module is presented, which generates an attention score matrix of audio and visual representations to derive the active speaker representation. Secondly, the knowledge distillation method is introduced to transfer knowledge from the large model to the on-device model to control the size of our model. Moreover, a new audio-visual dataset, PKU-KWS, is collected for sentence-level multi-person wake word spotting. Experimental results on the PKU-KWS dataset show that this approach outperforms the previous state-of-the-art methods.
Funding: This work was supported by the National Natural Science Foundation of China (Nos. 61871196 and 62001176), the Natural Science Foundation of Fujian Province of China (Nos. 2019J01082 and 2020J01085), and the Promotion Program for Young and Middle-aged Teachers in Science and Technology Research of Huaqiao University (ZQN-YX601).
Abstract: Monocular 6D pose estimation is a functional task in the field of computer vision and robotics. In recent years, 2D-3D correspondence-based methods have achieved improved performance in multiview and depth data-based scenes. However, for monocular 6D pose estimation, these methods are affected by the prediction results of the 2D-3D correspondences and the robustness of the perspective-n-point (PnP) algorithm, so a gap remains between the achieved and the expected estimation performance. To obtain a more effective feature representation, edge enhancement is proposed to increase the shape information of the object by analyzing the influence of inaccurate 2D-3D matching on 6D pose regression and comparing the effectiveness of the intermediate representation. Furthermore, although the transformation matrix from 3D model points to 2D pixel points is composed of rotation and translation matrices, the two variables are essentially different, and the same network cannot be used for both in the regression process. Therefore, to improve the effectiveness of the PnP algorithm, this paper designs a dual-branch PnP network to predict rotation and translation information. Finally, the proposed method is verified on the public LM, LM-O and YCB-Video datasets. The ADD(-S) values of the proposed method are 94.2 and 62.84 on the LM and LM-O datasets, respectively. The AUC of the ADD(-S) value on YCB-Video is 81.1. These experimental results show that the performance of the proposed method is superior to that of similar methods.
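For readers unfamiliar with the ADD(-S) scores reported above, the following is a minimal sketch of how the two underlying distances are commonly computed. ADD averages the distance between corresponding model points under the ground-truth and predicted poses, and ADD-S, used for symmetric objects, averages the distance to the closest predicted point instead. The function names are hypothetical, and the accuracy threshold used in the paper is not stated in the abstract.

```python
import numpy as np


def transform(points, rotation, translation):
    # Apply a rigid transform to an (n, 3) array of model points.
    return points @ rotation.T + translation


def add_metric(points, r_gt, t_gt, r_pred, t_pred):
    # ADD: mean distance between corresponding transformed points.
    gt = transform(points, r_gt, t_gt)
    pred = transform(points, r_pred, t_pred)
    return float(np.linalg.norm(gt - pred, axis=1).mean())


def add_s_metric(points, r_gt, t_gt, r_pred, t_pred):
    # ADD-S: mean distance from each ground-truth point to the
    # nearest predicted point (brute force, O(n^2) memory).
    gt = transform(points, r_gt, t_gt)
    pred = transform(points, r_pred, t_pred)
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())
```

A pose is then typically counted as correct when the distance falls below a fraction of the object diameter; that fraction is an evaluation choice and is left out of the sketch.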
Funding: Supported by the National Natural Science Foundation of China (No. 61902027).
Abstract: End-to-end text spotting is a vital computer vision task that aims to integrate scene text detection and recognition into a unified framework. Typical methods heavily rely on region-of-interest (RoI) operations to extract local features and complex post-processing steps to produce final predictions. To address these limitations, we propose TextFormer, a query-based end-to-end text spotter with a transformer architecture. Specifically, using query embedding per text instance, TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multitask modeling. It allows for mutual training and optimization of classification, segmentation and recognition branches, resulting in deeper feature sharing without sacrificing flexibility or simplicity. Additionally, we design an adaptive global aggregation (AGG) module to transfer global features into sequential features for reading arbitrarily shaped texts, which overcomes the suboptimization problem of RoI operations. Furthermore, potential corpus information is utilized from weak annotations to full labels through mixed supervision, further improving text detection and end-to-end text spotting results. Extensive experiments on various bilingual (i.e., English and Chinese) benchmarks demonstrate the superiority of our method. In particular, on the TDA-ReCTS dataset, TextFormer surpasses the state-of-the-art method in terms of 1-NED by 13.2%.
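The 1-NED score mentioned above is usually understood as one minus the normalized edit distance, averaged over all text instances. The sketch below assumes the common definition in which each Levenshtein distance is divided by the length of the longer string; the exact protocol used for TDA-ReCTS is not given in the abstract, and the function names are hypothetical.

```python
def edit_distance(a, b):
    # Levenshtein distance with a rolling one-row table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def one_minus_ned(predictions, targets):
    # Assumed definition: 1 - mean(D(pred, target) / max(len(pred), len(target))).
    if not predictions or len(predictions) != len(targets):
        raise ValueError("need equally sized, non-empty lists")
    total = 0.0
    for pred, target in zip(predictions, targets):
        longest = max(len(pred), len(target))
        if longest:  # two empty strings contribute zero distance
            total += edit_distance(pred, target) / longest
    return 1.0 - total / len(predictions)
```

Because the distance works on Python strings, it applies equally to English and Chinese transcriptions, one character per edit operation.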
Abstract: Most polyp segmentation methods use convolutional neural networks (CNNs) as their backbone, leading to two key issues when exchanging information between the encoder and decoder: (1) taking into account the differences in contribution between different-level features, and (2) designing an effective mechanism for fusing these features. Unlike existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the image acquisition influence and elusive properties of polyps, we introduce three standard modules, including a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). Among these, the CFM is used to collect the semantic and location information of polyps from high-level features; the CIM is applied to capture polyp information disguised in low-level features; and the SAM extends the pixel features of the polyp area with high-level semantic position information to the entire polyp area, thereby effectively fusing cross-level features. The proposed model, named Polyp-PVT, effectively suppresses noise in the features and significantly improves their expressive capabilities. Extensive experiments on five widely adopted datasets show that the proposed model is more robust to various challenging situations (e.g., appearance changes, small objects, and rotation) than existing representative methods. The proposed model is available at https://github.com/DengPingFan/Polyp-PVT.
Funding: Project (11174077) supported by the National Natural Science Foundation of China; Project (11JJ3079) supported by the Hunan Provincial Natural Science Foundation of China; Projects (12C0237, 11C0844) supported by the Science Research Program of Education Department of Hunan Province, China.
Abstract: High intensity focused ultrasound (HIFU) therapy is an effective method for the clinical treatment of tumors. To explore the bio-heat conduction mechanism in multi-layer media under a concave spherical transducer, the temperature field induced by this kind of transducer in multi-layer media is simulated by solving the Pennes equation with the finite difference method, and the influences of the initial sound pressure, the absorption coefficient, the thickness of the different layers of biological tissue, and the thermal conductivity on the sound focus and temperature distribution are analyzed. The results show that the temperature in the focus area increases faster as the initial sound pressure and thermal conductivity increase. The smaller the absorption coefficient, the greater the ultrasound intensity in the focus area and the larger the focus area. When the thicknesses of the tissue layers change, the focus position changes slightly, but the sound intensity in the focus area changes markedly. The temperature in the focus area rises quickly until it reaches a threshold and then remains within the threshold range.
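As an illustration of solving the Pennes bio-heat equation with finite differences, the following minimal sketch advances a 1D explicit scheme. It is far simpler than the multi-layer, concave-transducer setting of the paper: it assumes a single homogeneous tissue layer, a prescribed heat source `q` in W/m^3 rather than a computed acoustic field, boundaries held at arterial temperature, and typical soft-tissue parameter values chosen only for illustration. The function name is hypothetical.

```python
import numpy as np


def simulate_pennes_1d(q, dx, dt, steps, rho=1050.0, c=3600.0, k=0.5,
                       w_b=0.5, c_b=3600.0, t_a=37.0):
    """Explicit finite-difference solution of the 1D Pennes equation

        rho * c * dT/dt = k * d2T/dx2 - w_b * c_b * (T - t_a) + q

    q: heat source per grid node (W/m^3); all parameter defaults are
    illustrative assumptions, not values from the paper.
    """
    alpha = k / (rho * c)
    if dt > dx * dx / (2.0 * alpha):
        raise ValueError("time step too large for the explicit scheme")
    temp = np.full(q.shape, t_a, dtype=float)
    for _ in range(steps):
        # Second spatial derivative at interior nodes.
        lap = (temp[2:] - 2.0 * temp[1:-1] + temp[:-2]) / dx ** 2
        # Heat removed by blood perfusion.
        perfusion = w_b * c_b * (temp[1:-1] - t_a)
        temp[1:-1] += dt * (k * lap - perfusion + q[1:-1]) / (rho * c)
        # End nodes stay at t_a (Dirichlet boundary condition).
    return temp
```

A Gaussian-shaped `q` centered in the domain gives a rough stand-in for a focal heating zone; a layered model would replace the scalar parameters with per-node arrays.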
Funding: Supported by NSF under grants ATM-0716950 and ATM-0745744 and by NASA under grant NNXO-8 AQ90G.
Abstract: We present a new method for automatically forecasting the occurrence of solar flares based on photospheric magnetic measurements. The method is a cascading combination of an ordinal logistic regression model and a support vector machine classifier. The predictive variables are three photospheric magnetic parameters, i.e., the total unsigned magnetic flux, length of the strong-gradient magnetic polarity inversion line, and total magnetic energy dissipation. The output is true or false for the occurrence of a certain level of flares within 24 hours. Experimental results, from a sample of 230 active regions between 1996 and 2005, show the accuracies of a 24-hour flare forecast to be 0.86, 0.72, 0.65 and 0.84, respectively, for the four different levels. Comparison shows an improvement in the accuracy of X-class flare forecasting.
Funding: Project (2018AAA0102102) supported by the National Science and Technology Major Project, China; Project (2017WK2074) supported by the Planned Science and Technology Project of Hunan Province, China; Project (B18059) supported by the National 111 Project, China; Project (61702559) supported by the National Natural Science Foundation of China.
Abstract: Focusing on data imbalance and intraclass variation, an improved pedestrian detection method with a cascade of complex peer AdaBoost classifiers is proposed. The series of AdaBoost classifiers are learned greedily, along with negative example mining. The complexity of classifiers in the cascade is not limited, so more negative examples are used for training. Furthermore, the cascade becomes an ensemble of strong peer classifiers, which treats intraclass variation. To locally train the AdaBoost classifiers with a high detection rate, a refining strategy is used to discard the hardest negative training examples rather than decreasing their thresholds. Using the aggregate channel feature (ACF), the method achieves miss rates of 35% and 14% on the Caltech pedestrian benchmark and Inria pedestrian dataset, respectively, which are lower than those of increasingly complex AdaBoost classifiers, i.e., 44% and 17%, respectively. Using deep features extracted by the region proposal network (RPN), the method achieves a miss rate of 10.06% on the Caltech pedestrian benchmark, which is also lower than the 10.53% from the increasingly complex cascade. This study shows that the proposed method can use more negative examples to train the pedestrian detector. It outperforms the existing cascade of increasingly complex classifiers.
Funding: Supported by the National Nature Science Foundation under Grant No. 60904063 and the Tianjin Municipal Natural Science Foundation under Grant Nos. 11JCYBJC06600, 11ZCKF6X00900, 11ZCKFGX00900.
Abstract: We investigate the evolution of cooperation with evolutionary public goods games based on finite populations, where four pure strategies are considered: cooperators, defectors, punishers, and loners who are unwilling to participate. By adopting approximate best response dynamics, we show that the magnitude of rationality not only quantitatively explains the experimental results in [Nature (London) 425 (2003) 390], but also heavily influences the evolution of cooperation. Compared with previous results for infinite populations, which yield two equilibria, we show that only one special cooperative equilibrium exists. In addition, we find that the loner's payoff plays an active role in the maintenance of cooperation, which is sustained only for low and moderate values of the loner's payoff together with a relatively high value of bounded rationality. This indicates that both rationality and the loner's payoff influence cooperation. Finally, we highlight the important result that the introduction of voluntary participation and punishment greatly facilitates cooperation.
Funding: Projects (51005074, 91123035) supported by the National Natural Science Foundation of China; Project (201021200077) supported by the Frontier Research Program of Central South University, China.
Abstract: After reviewing three different definitions of the mode field diameter of single-mode fibers, the methods for calculating coupling efficiency associated with lateral offset, longitudinal separation and wavelength, the effects they produce, and the influences of splicing defects were discussed in detail. The regularities of the effects were studied according to the first-order derivative of the coupling efficiency formula, and a simplified formula for calculating coupling efficiency with respect to wavelength, λ, was presented for the case of slight misalignment; it is in good agreement with the theoretical model. The simplified formula provides a new but simple approach to evaluating the wavelength-dependent coupling efficiency of single-mode fibers. Theoretical analyses and numerical calculations show that, when those defects exist, the wavelength produces additional effects on the coupling loss: an increase in wavelength raises the coupling efficiency for lateral offset or longitudinal separation, whereas it lowers the coupling efficiency for angular misalignment or mode field mismatch. The wavelength degrades the coupling efficiency distinctly when λ≥2.5 μm, whereas it affects the coupling only slightly in the range λ≤2 μm.
Abstract: Arecanut disease identification is a challenging problem in the field of image processing. In this work, we present a new combination of multi-gradient-direction and deep convolutional neural networks for arecanut disease identification, namely, rot, split and rot-split. Due to the effect of the disease, there are chances of losing vital details in the images. To enhance the fine details in the images affected by diseases, we explore multi-Sobel directional masks for convolving with the input image, which results in enhanced images. The proposed method extracts the arecanut as foreground from the enhanced images using Otsu thresholding. Further, features are extracted from the foreground information for disease identification by exploring the ResNet architecture. The advantage of the proposed approach is that it distinguishes the diseased images from the healthy arecanut images. Experimental results on a dataset of four classes (healthy, rot, split and rot-split) show that the proposed model is superior in terms of classification rate.
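The enhancement and foreground-extraction steps described above can be sketched with plain NumPy. The abstract does not say which Sobel directions are used or how the directional responses are combined, so this sketch assumes four 3x3 masks (horizontal, vertical and the two diagonals) and takes the per-pixel maximum absolute response. Otsu thresholding is implemented in its standard between-class-variance form. All function names are hypothetical.

```python
import numpy as np

# Assumed set of directional Sobel masks: two axis-aligned, two diagonal.
SOBEL_MASKS = [
    [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],
    [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],
    [[0, 1, 2], [-1, 0, 1], [-2, -1, 0]],
    [[-2, -1, 0], [-1, 0, 1], [0, 1, 2]],
]


def filter3x3(image, mask):
    # Cross-correlation with a 3x3 mask, edge-replicated borders.
    padded = np.pad(image.astype(float), 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += mask[dy][dx] * padded[dy:dy + h, dx:dx + w]
    return out


def multi_sobel_response(image):
    # Strongest absolute response over all directional masks.
    return np.max([np.abs(filter3x3(image, m)) for m in SOBEL_MASKS], axis=0)


def otsu_threshold(gray):
    # gray: uint8 image. Returns the threshold maximizing between-class variance.
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # weight of the class <= t
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean up to t
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0                # empty class: no separation
    return int(np.argmax(sigma_b))
```

Pixels with `gray > otsu_threshold(gray)` would then form the foreground mask, under the assumption that the arecanut is brighter than the background; the resulting crop is what a classifier such as ResNet would receive.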
Funding: Supported in part by the National High Technology Research and Development Program of China (863 Program) (2015AA016306), the National Nature Science Foundation of China (61231015), the Technology Research Program of Ministry of Public Security (2016JSYJA12), the Shenzhen Basic Research Projects (JCYJ20150422150029090), and the Applied Basic Research Program of Wuhan City (2016010101010025).
Abstract: Recognizing actions according to video features is an important problem in a wide scope of applications. In this paper, we propose a temporal scale-invariant deep learning framework for action recognition, which is robust to changes in action speed. Specifically, a video is first split into several sub-action clips, and a keyframe is selected from each sub-action clip. The spatial and motion features of the keyframe are extracted separately by two Convolutional Neural Networks (CNN) and combined in the convolutional fusion layer for learning the relationship between the features. Then, Long Short Term Memory (LSTM) networks are applied to the fused features to formulate long-term temporal clues. Finally, the action prediction scores of the LSTM network are combined by linear weighted summation. Extensive experiments are conducted on two popular and challenging benchmarks, namely, the UCF-101 and the HMDB51 Human Actions. On both benchmarks, our framework achieves results superior to the state-of-the-art methods, reaching 93.7% on UCF-101 and 69.5% on HMDB51.
Funding: This research work was supported by the Faculty of Computer Science and Information Technology, University of Malaya, under a special allocation of Post Graduate Funding for the RP036B-15AET project. The work described in this paper was also supported by the Natural Science Foundation of China under grant no. 61672273 and the Science Foundation for Distinguished Young Scholars of Jiangsu under grant no. BK20160021.
Abstract: Achieving good recognition results for license plates is challenging due to multiple adverse factors. For instance, in Malaysia, private vehicles (e.g., cars) have numbers on a dark background, while public vehicles (taxis/cabs) have numbers on a white background. To reduce the complexity of the problem, we propose to classify these two types of images so that an appropriate method can be chosen to achieve better results. In this work, we explore the combination of Convolutional Neural Networks (CNN) and Recurrent Neural Networks, namely BLSTM (Bi-Directional Long Short Term Memory), for recognition. The CNN is used for feature extraction because of its high discriminative ability, while the BLSTM can extract context information based on past information. For classification, we propose Dense Cluster based Voting (DCV), which separates foreground and background for successful classification of private and public vehicle plates. Experimental results on live data provided by MIMOS, which is funded by the Malaysian Government, and on the standard UCSD dataset show that the proposed classification outperforms existing methods. In addition, the recognition performance improves significantly after classification compared to before classification.
Abstract: Convex hull algorithms have been extensively studied in the literature, principally because of their wide range of applications in different areas. This article presents an efficient algorithm to construct an approximate convex hull from a set of n points in the plane in O(n+k) time, where k is the approximation error control parameter. The proposed algorithm is suitable for applications that prefer reduced computation time at the cost of some accuracy, such as animation and interaction in computer graphics, where rapid and real-time graphics rendering is indispensable.
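The abstract does not describe the algorithm itself, so the sketch below illustrates the general idea with a classic strip-based approximation rather than the article's method: the x-range is cut into k vertical strips, only the lowest and highest point of each strip (plus the leftmost and rightmost points) are kept, and an exact hull is built over those at most 2k+2 candidates. For simplicity the candidates are sorted, which makes this sketch O(n + k log k); exploiting the strip order would remove the sort. Function names are hypothetical, and points are assumed to be (x, y) tuples.

```python
def cross(o, a, b):
    # Z-component of (a - o) x (b - o); positive for a left turn.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    # Andrew's monotone chain; returns vertices in counterclockwise order.
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def approximate_hull(points, k):
    # Keep only per-strip extremes, then build the hull of the candidates.
    xs = [p[0] for p in points]
    x_min, x_max = min(xs), max(xs)
    if x_min == x_max:  # all points share one x: hull is a segment
        return sorted({min(points), max(points)})
    width = (x_max - x_min) / k
    lowest = [None] * k
    highest = [None] * k
    for p in points:  # single O(n) pass over the input
        i = min(int((p[0] - x_min) / width), k - 1)
        if lowest[i] is None or p[1] < lowest[i][1]:
            lowest[i] = p
        if highest[i] is None or p[1] > highest[i][1]:
            highest[i] = p
    candidates = [p for p in lowest + highest if p is not None]
    candidates += [min(points), max(points)]  # leftmost and rightmost points
    return convex_hull(candidates)
```

A larger k means more strips and more retained candidates, so the result moves closer to the exact hull at the price of more work in the final hull step.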
Abstract: The contamination considered in this paper is a defect on the surface of an ice cream bar, which is a serious safety threat, so it is essential to detect this defect before the product is launched on the market. A method for detecting contamination defects on the ice cream bar surface is proposed, which is based on a fuzzy rule and an absolute neighborhood feature. Firstly, the ice cream bar surface is divided into several sub-regions via the defined adjacent gray level clustering method. Then the candidate contamination regions are extracted from the sub-regions via the defined fuzzy rule. Finally, the real contamination regions are recognized via the relationship between the absolute neighborhood gray feature and a default threshold. The algorithm was tested on the self-built image database SUT-D. The results show that the accuracy of the proposed method is 97.32 percent, which is at least 2.68 percent higher than that of other typical algorithms. This indicates the superiority of the proposed method, which is of practical value.
Abstract: We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture for replacing the bidirectional encoder representations from Transformers (BERT) in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. Besides, we designed masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multimodal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
Funding: Sponsored by the National Natural Science Foundation of China (50878023) and the Scientific Research Foundation for the Returned Overseas Chinese Scholars.
Abstract: Driver behavior modeling is becoming increasingly important in the study of traffic safety and the development of cognitive vehicles. An algorithm for dealing with reliability for both digital driving and conventional driving has been developed in this paper. Problems of digital driving error classification, digital driving error probability quantification and digital driving reliability simulation have been addressed using a comparative research method. Simulation results show that the driving reliability analysis discussed here is capable of identifying digital driving behavior characteristics and achieving safety assessment of intelligent transportation systems.
Abstract: Multimodality image registration and fusion are essential steps in building 3-D models from remote sensing data. We present in this paper a neural network technique for the registration and fusion of multimodality remote sensing data for the reconstruction of 3-D models of terrain regions. A feedforward neural network is used to fuse the intensity data sets with the spatial data set after learning its geometry. Results on real data are presented. Human performance is assessed on several perceptual tests in order to evaluate the fusion results.