Abstract: Lip-reading technologies have progressed rapidly following the breakthrough of deep learning. They play a vital role in many applications, such as human-machine communication and security. In this paper, we develop an effective lip-reading recognition model for Arabic visual speech recognition using deep learning algorithms. The Arabic visual dataset that we collected contains 2400 records of Arabic digits and 960 records of Arabic phrases from 24 native speakers. The primary goal is to provide a high-performance model by enhancing the preprocessing phase. First, we extract keyframes from the dataset. Second, we produce Concatenated Frame Images (CFIs), each of which represents an utterance sequence in a single image. Finally, VGG-19 is employed for visual feature extraction in the proposed model. We examined different numbers of keyframes (10, 15, and 20) to compare two variants of the proposed model: (1) the VGG-19 base model and (2) the VGG-19 base model with batch normalization. The results show that the second variant achieves greater accuracy on the test dataset: 94% for digit recognition, 97% for phrase recognition, and 93% for combined digit and phrase recognition. Our proposed model therefore outperforms previous models based on CFI input.
Funding: The authors would like to thank the Deanship of Scientific Research at Umm Al-Qura University for supporting this work through Grant Code (22UQU4400271DSR01).
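As a minimal illustration of the CFI preprocessing step described in the abstract above, the NumPy sketch below samples evenly spaced keyframes from a video clip and tiles them into one Concatenated Frame Image. The function names, the evenly-spaced sampling rule, and the 5-column grid layout are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def select_keyframes(frames, n_keyframes):
    """Pick n evenly spaced frames from a clip of shape (T, H, W, C)."""
    idx = np.linspace(0, len(frames) - 1, n_keyframes).astype(int)
    return frames[idx]

def make_cfi(frames, n_keyframes=10, grid_cols=5):
    """Tile the selected keyframes into one Concatenated Frame Image (CFI)."""
    keys = select_keyframes(frames, n_keyframes)
    n, h, w, c = keys.shape
    grid_rows = int(np.ceil(n / grid_cols))
    canvas = np.zeros((grid_rows * h, grid_cols * w, c), dtype=keys.dtype)
    for i, frame in enumerate(keys):
        row, col = divmod(i, grid_cols)
        canvas[row * h:(row + 1) * h, col * w:(col + 1) * w] = frame
    return canvas

# A synthetic 75-frame clip of 64x64 RGB mouth crops:
clip = np.random.randint(0, 255, size=(75, 64, 64, 3), dtype=np.uint8)
cfi = make_cfi(clip, n_keyframes=10, grid_cols=5)
print(cfi.shape)  # (128, 320, 3): a 2x5 grid of 64x64 frames
```

The resulting single image can then be fed to an ordinary 2-D CNN such as VGG-19, which is the point of the CFI representation: the temporal sequence becomes a spatial layout.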
Abstract: Classifying the visual features in images in order to retrieve a specific image is a significant problem in computer vision, especially when dealing with historical images with faded colors. Accordingly, many efforts have sought to automate the classification operation and retrieve similar images accurately. To reach this goal, we developed a VGG19 deep convolutional neural network to extract visual features from images automatically. The distances among the extracted feature vectors are then measured, and a similarity score is generated using a Siamese deep neural network. The Siamese model was first built and trained from scratch, but it did not produce high evaluation metrics, so we rebuilt it on top of the pre-trained VGG19 model to obtain better results. Afterward, three different distance metrics, each combined with a Sigmoid activation function, were tested to find the most accurate method for measuring the similarities among the retrieved images; the highest evaluation scores were obtained with the cosine distance metric. Moreover, the code was run on a Graphics Processing Unit (GPU) rather than a Central Processing Unit (CPU), which further optimized execution by speeding up both training and retrieval. After extensive experimentation, we reached a satisfactory solution, recording F-scores of 0.98 for classification and 0.99 for retrieval.
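The distance-plus-sigmoid scoring step described above can be sketched in plain NumPy. This is a simplified stand-in, not the paper's trained network: in the actual model the feature vectors come from a VGG19 backbone and the sigmoid head has learned weights, whereas here the features are arbitrary vectors and the head is a fixed `sigmoid(-distance)` mapping assumed for illustration.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 0 for parallel vectors, 1 for orthogonal ones."""
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - cos_sim

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def similarity_score(feat_a, feat_b):
    """Map cosine distance through a sigmoid to a (0, 1) similarity score.

    Smaller distance -> larger score. A fixed head is used here; the
    paper's Siamese model would learn this mapping during training.
    """
    return sigmoid(-cosine_distance(feat_a, feat_b))

# Illustrative feature vectors (stand-ins for VGG19 embeddings):
anchor = np.array([0.9, 0.1, 0.0])
close = np.array([0.8, 0.2, 0.0])
far = np.array([0.0, 0.1, 0.9])
print(similarity_score(anchor, close) > similarity_score(anchor, far))  # True
```

Ranking retrieval candidates by this score reproduces the overall shape of the pipeline: extract features once, then compare the query's vector against every stored vector and return the top matches.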