Abstract: Point cloud compression is critical for deploying 3D representations of the physical world in applications such as immersive 3D telepresence, autonomous driving, and cultural heritage preservation. However, point cloud data are distributed irregularly and discontinuously in the spatial and temporal domains, where redundant unoccupied voxels and weak correlations in 3D space make efficient compression a challenging problem. In this paper, we propose a spatio-temporal context-guided algorithm for lossless point cloud geometry compression. The proposed scheme starts by dividing the point cloud into sliced layers of unit thickness along the longest axis. Then, it introduces a prediction method that applies when both intra-frame and inter-frame point clouds are available, determining correspondences between adjacent layers and estimating the shortest path using the travelling salesman algorithm. Finally, the small prediction residuals are efficiently compressed with optimal context-guided and adaptive fast-mode arithmetic coding techniques. Experiments show that the proposed method can effectively achieve low-bit-rate lossless compression of point cloud geometric information and is suitable for 3D point cloud compression across various types of scenes.
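The first step of the scheme, slicing along the longest axis, can be illustrated with a minimal Python sketch. It assumes the cloud is already voxelized to integer coordinates; the function name `slice_into_layers` and the tie-breaking rule are illustrative choices, not taken from the paper.

```python
from collections import defaultdict

def slice_into_layers(points):
    """Group integer voxel coordinates into unit-thickness layers
    along the axis with the largest bounding-box extent."""
    if not points:
        return None, {}
    # Bounding-box extent along x, y, z.
    extents = [max(p[a] for p in points) - min(p[a] for p in points)
               for a in range(3)]
    # Assumption: ties resolve to the lowest axis index.
    axis = extents.index(max(extents))
    layers = defaultdict(list)
    for p in points:
        # The coordinate on the longest axis is the layer index;
        # the remaining two coordinates form a 2D point in that layer.
        layers[p[axis]].append(tuple(c for i, c in enumerate(p) if i != axis))
    return axis, dict(sorted(layers.items()))

cloud = [(0, 0, 0), (1, 0, 5), (1, 1, 5), (0, 1, 9)]
axis, layers = slice_into_layers(cloud)
print(axis, layers)
```

Each layer is then a 2D point set, which is the form in which correspondences between adjacent layers could be searched.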
Funding: Supported in part by the Fundamental Research Funds for the Central Universities (WK2350000002).
Abstract: In this paper, we present a CNN-based approach for real-time 3D hand pose estimation from depth sequences. Prior discriminative approaches have achieved remarkable success but face two main challenges. Firstly, the methods are fully supervised and hence require large amounts of annotated training data to extract the dynamic information from a hand representation. Secondly, they depend on unreliable hand detectors, either based on strong assumptions or weak, which often fail in situations such as complex environments and multiple hands. In contrast to these methods, this paper presents an approach that can be considered semi-supervised: it performs predictive coding of image sequences of hand poses in order to capture latent features underlying a given image without supervision. The hand is modelled using a novel latent tree dependency model (LDTM), which transforms internal joint locations into an explicit representation. The modelled hand topology is then integrated with the pose estimator using a data-dependent method to jointly learn latent variables of the posterior pose appearance and the pose configuration, respectively. Finally, an unsupervised error term, which is part of the recurrent architecture, ensures smooth estimates of the final pose. Experiments on three challenging public datasets, ICVL, MSRA, and NYU, demonstrate that the performance of the proposed method is comparable to or better than state-of-the-art approaches.
Abstract: In recent years, the accuracy of speech recognition (SR) has been one of the most active areas of research. Although SR systems work reasonably well in quiet conditions, they still suffer severe performance degradation in noisy conditions or over distorted channels. More robust feature extraction methods are needed to gain better performance in adverse conditions. This paper investigates the performance of conventional and new hybrid speech feature extraction algorithms, namely Mel Frequency Cepstrum Coefficient (MFCC), Linear Prediction Coding Coefficient (LPCC), perceptual linear prediction (PLP), and RASTA-PLP, in noisy conditions using a multivariate Hidden Markov Model (HMM) classifier. The behavior of the proposed system is evaluated using the TIDIGIT human voice corpus, recorded from 208 different adult speakers, in both the training and testing processes. The theoretical basis for the speech processing and classifier procedures is presented, and the recognition results are reported in terms of word recognition rate.
Funding: Financial support from the National Natural Science Foundation of China (Nos. 50674083 and 51074162).
Abstract: To address the blindness and inefficiency in determining the meso-level mechanical parameters of particle flow code (PFC) models, we first designed and numerically carried out orthogonal tests on rock samples to investigate the correlations between the macro- and meso-level mechanical parameters of rock-like bonded granular materials. Then, based on artificial intelligence techniques, intelligent prediction systems for nine meso-level mechanical parameters of PFC models were obtained by creating, training, and testing prediction models with the data set obtained from the orthogonal tests. Lastly, the prediction systems were used to predict the meso-level mechanical parameters of one kind of sandy mudstone, and the macroscopic properties of the rock were obtained by numerical tests based on the predicted results. The maximum relative error between the numerical test results and the real rock properties is 3.28%, which satisfies the precision requirement in engineering. This shows that the paper provides a fast and accurate method for determining the meso-level mechanical parameters of PFC models.
Funding: Supported by the National Natural Science Foundation of China (No. 61036004) and the Tianjin Research Program of Application Foundation and Advanced Technology (No. 13JCQNJC00600).
Abstract: In this paper, a CMOS image sensor (CIS) is proposed that can accomplish both the decorrelation and entropy coding steps of image compression directly on the focal plane. The design is based on predictive coding for image decorrelation. The predictions are performed in the analog domain by 2×2 pixel units. Both the prediction residuals and the original pixel values are quantized and encoded in parallel. Since the residuals have a peaked distribution around zero, the output codewords can be replaced by the valid part of the residuals' binary mode. The compressed bit stream is accessible directly at the output of the CIS without extra processing. Simulation results show that the proposed approach achieves a compression rate of 2.2 and a PSNR of 51 on different test images.
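The idea of decorrelating by prediction within 2×2 pixel units can be sketched digitally in Python. The abstract does not specify the predictor, so this sketch assumes (hypothetically) that the top-left pixel of each unit serves as the reference and the other three pixels are stored as residuals against it; quantization and entropy coding are omitted, and even image dimensions are assumed.

```python
def encode_blocks(img):
    """Per 2x2 unit: keep the top-left pixel, store the other three
    as residuals against it (assumed predictor, not the paper's)."""
    h, w = len(img), len(img[0])
    blocks = []
    for r in range(0, h, 2):
        for c in range(0, w, 2):
            ref = img[r][c]
            residuals = (img[r][c + 1] - ref,
                         img[r + 1][c] - ref,
                         img[r + 1][c + 1] - ref)
            blocks.append((ref, residuals))
    return blocks

def decode_blocks(blocks, h, w):
    """Invert encode_blocks by adding each residual back to its reference."""
    img = [[0] * w for _ in range(h)]
    it = iter(blocks)
    for r in range(0, h, 2):
        for c in range(0, w, 2):
            ref, (d1, d2, d3) = next(it)
            img[r][c] = ref
            img[r][c + 1] = ref + d1
            img[r + 1][c] = ref + d2
            img[r + 1][c + 1] = ref + d3
    return img

image = [[100, 101, 50, 50],
         [99, 100, 52, 49]]
blocks = encode_blocks(image)
restored = decode_blocks(blocks, 2, 4)
print(blocks)
```

For smooth image regions the residuals cluster near zero, which is the property that lets short codewords represent them.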
Funding: Supported by the National High-Technology Research and Development Program (2005AA122210) and the National Outstanding Youth Foundation (60325104).
Abstract: A Web voice browser based on improved synchronous linear predictive coding (ISLPC), a Text-to-Speech (TTS) algorithm, and Internet applications is proposed. The paper analyzes the features of a TTS system with ISLPC speech synthesis and discusses the design and implementation of the ISLPC TTS-based Web voice browser. The browser integrates Web technology, Chinese information processing, artificial intelligence, and the key technology of Chinese ISLPC speech synthesis. It is a visual and audible Web browser that can improve information precision for network users. The evaluation results show that the ISLPC-based TTS model performs better than other browsers in voice quality and in its capability of identifying Chinese characters.
Funding: Supported by the Visvesvaraya Ph.D. Scheme for Electronics and IT students launched by the Ministry of Electronics and Information Technology (MeiTY), Government of India, under Grant No. PhD-MLA/4(95)/2015-2016.
Abstract: In this paper, we present a comparison of Khasi speech representations using four different spectral features, along with a novel extension towards the development of Khasi speech corpora. The four features are linear predictive coding (LPC), linear prediction cepstrum coefficient (LPCC), perceptual linear prediction (PLP), and Mel frequency cepstral coefficient (MFCC). Ten hours of speech data were used for training and three hours for testing. For each spectral feature, different hidden Markov model (HMM) based recognizers were built, with variations in the number of HMM states and different Gaussian mixture models (GMMs). The performance was evaluated using the word error rate (WER). The experimental results show that MFCC provides a better representation of Khasi speech than the other three spectral features.
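The word error rate used for evaluation is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. A minimal Python sketch of that standard definition follows; the function name and whitespace tokenization are illustrative assumptions, not details from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)

wer = word_error_rate("a b c d", "a x c")
print(wer)
```

In the example, one substitution and one deletion against four reference words give a WER of 0.5.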
Abstract: This paper presents an approach to hiding secret speech information in a code excited linear prediction (CELP)-based speech coding scheme by adopting an analysis-by-synthesis (ABS)-based algorithm for speech information hiding and extraction, for the purpose of secure speech communication. The secret speech is coded with 2.4 Kb/s mixed excitation linear prediction (MELP) and embedded in CELP-type public speech. The ABS algorithm adopts the speech synthesizer in the speech coder. Speech embedding and coding are synchronous, i.e., the public and secret speech information data are fused. An experiment embedding 2.4 Kb/s MELP secret speech in G.728-coded public speech transmitted via the public switched telephone network (PSTN) shows that the proposed approach satisfies the requirements of information hiding, meets the speech quality constraints of secure communication, and achieves a high average hiding capacity of 3.2 Kb/s with excellent speech quality while making speaker recognition difficult.
Abstract: The Wake-Up-Word Speech Recognition (WUW-SR) task is computationally very demanding, particularly the feature extraction stage, whose output is decoded with the corresponding Hidden Markov Models (HMMs) in the back-end stage of the WUW-SR. The state-of-the-art WUW-SR system is based on three different sets of features: Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding Coefficients (LPC), and Enhanced Mel-Frequency Cepstral Coefficients (ENH_MFCC). In "Front-End of Wake-Up-Word Speech Recognition System Design on FPGA" [1], we presented an experimental FPGA design and implementation of a novel architecture for a real-time spectrogram extraction processor that generates MFCC, LPC, and ENH_MFCC spectrograms simultaneously. This paper presents the details of converting the three sets of spectrograms, 1) MFCC, 2) LPC, and 3) ENH_MFCC, to their equivalent features. In the WUW-SR system, the recognizer's front-end is located at the terminal, which is typically connected over a data network to the remote back-end recognizer (e.g., a server). The WUW-SR is shown in Figure 1. The three sets of speech features are extracted at the front-end. These extracted features are then compressed and transmitted to the server via a dedicated channel, where they are subsequently decoded.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 12047503, 11747601, and 12247104) and the National Innovation Institute of Defense Technology (Grant No. 22TQ0904ZT01025).
Abstract: Lateral predictive coding is a recurrent neural network that creates energy-efficient internal representations by exploiting statistical regularity in sensory inputs. Here, we analytically investigate the trade-off between information robustness and energy in a linear model of lateral predictive coding and numerically minimize a free energy quantity. We observe several phase transitions in the synaptic weight matrix, particularly a continuous transition that breaks reciprocity and permutation symmetry and builds cyclic dominance, and a discontinuous transition associated with the sudden emergence of a tight balance between excitatory and inhibitory interactions. The optimal network follows an ideal gas law over an extended temperature range and saturates the efficiency upper bound of energy use. These results provide theoretical insights into the emergence and evolution of complex internal models in predictive processing systems.
Funding: Supported by the National Key Research and Development Program of China (2017YFB1300302), the National Natural Science Foundation of China (81925020 and 61976152), and the Young Elite Scientist Sponsorship Program of the China Association for Science and Technology (2018QNRC001).
Abstract: The brain function of prediction is fundamental for human beings to shape perceptions efficiently and successively. Through decades of effort, a valuable brain activation map has been obtained for prediction. However, much less is known from traditional neuropsychological paradigms about how the brain manages the prediction process over time. Here, we implemented an innovative paradigm for timing prediction to precisely study the temporal dynamics of neural oscillations. In an experiment with 45 participants, expectation suppression was found in the overall electroencephalographic activity, consistent with previous hemodynamic studies. Notably, we found that N1 was positively associated with predictability, while N2 showed the reverse relation to predictability. Furthermore, the matching prediction condition had a profile similar to that of no timing prediction, both showing an almost saturated N1 and an absence of N2. The results indicate that the N1 process showed a 'sharpening' effect for predictable inputs, while the N2 process showed a 'dampening' effect. Therefore, these two paradoxical neural effects of prediction, which have provoked wide confusion in accounting for expectation suppression, actually co-exist in the procedure of timing prediction but work in separate time windows. These findings strongly support a recently proposed opposing process theory.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 11975295 and 12047503) and the Chinese Academy of Sciences (Grant Nos. QYZDJ-SSW-SYS018 and XDPD15).
Abstract: Predictive coding is a promising theoretical framework in neuroscience for understanding information transmission and perception. It posits that the brain perceives the external world through internal models and updates these models under the guidance of prediction errors. Previous studies on predictive coding emphasized top-down feedback interactions in hierarchical multilayered networks but largely ignored lateral recurrent interactions. In this work, we perform analytical and numerical investigations of the effects of single-layer lateral interactions. We consider a simple predictive response dynamics and run it on the MNIST dataset of hand-written digits. We find that learning will generally break the interaction symmetry between peer neurons, and that high input correlation between two neurons does not necessarily bring strong direct interactions between them. The optimized network responds to familiar input signals much faster than to novel or random inputs, and it significantly reduces the correlations between the output states of pairs of neurons.
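A single-layer lateral predictive response can be sketched in Python as follows. The abstract does not state the equations, so this assumes a common linear form in which each unit's state relaxes toward its input minus the weighted prediction from its peers, dx_i/dt = -x_i + s_i - Σ_{j≠i} W_ij x_j; the weights, step size, and function name `relax` are hypothetical.

```python
def relax(signal, weights, dt=0.1, steps=500):
    """Euler-integrate dx/dt = -x + s - W x (no self-interaction)
    from x = 0 and return the final state."""
    n = len(signal)
    x = [0.0] * n
    for _ in range(steps):
        # Each unit's lateral prediction is the weighted sum of its peers' states.
        pred = [sum(weights[i][j] * x[j] for j in range(n) if j != i)
                for i in range(n)]
        # The state moves toward the prediction error s - pred.
        x = [x[i] + dt * (-x[i] + signal[i] - pred[i]) for i in range(n)]
    return x

s = [1.0, 1.0]
# Deliberately asymmetric weights: unit 0 is predicted more strongly by unit 1.
W = [[0.0, 0.5],
     [0.2, 0.0]]
x = relax(s, W)
print(x)
```

At the fixed point the state equals the prediction error, x = s - W x, so identical inputs yield different outputs once the weights are asymmetric; under this assumed form, the number of steps needed to converge is one simple way to measure response time.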
Funding: Project (No. 2009CB320903) supported by the National Basic Research Program (973) of China.
Abstract: Many important developments in video compression technologies have occurred during the past two decades. The hybrid coding scheme that combines the block-based discrete cosine transform with motion compensation has been employed by most available video coding standards, notably the ITU-T H.26x and ISO/IEC MPEG-x families and the video part of the China audio video coding standard (AVS). The objective of this paper is to review the development of the four basic building blocks of the hybrid coding scheme, namely predictive coding, transform coding, quantization, and entropy coding, and to give theoretical analyses and summaries of the technological advancements. We further analyze the development trends and perspectives of video compression, highlighting problems and research directions.