The demand for image retrieval with text manipulation exists in many fields, such as e-commerce and Internet search. Deep metric learning methods are used by most researchers to calculate the similarity between the qu...The demand for image retrieval with text manipulation exists in many fields, such as e-commerce and Internet search. Deep metric learning methods are used by most researchers to calculate the similarity between the query and the candidate image by fusing the global feature of the query image and the text feature. However, the text usually corresponds to the local feature of the query image rather than the global feature. Therefore, in this paper, we propose a framework of image retrieval with text manipulation by local feature modification(LFM-IR) which can focus on the related image regions and attributes and perform modification. A spatial attention module and a channel attention module are designed to realize the semantic mapping between image and text. We achieve excellent performance on three benchmark datasets, namely Color-Shape-Size(CSS), Massachusetts Institute of Technology(MIT) States and Fashion200K(+8.3%, +0.7% and +4.6% in R@1).展开更多
The volume of information being created, generated and stored is huge. Without adequate knowledge of Information Retrieval (IR) methods, the retrieval process for information would be cumbersome and frustrating. Studi...The volume of information being created, generated and stored is huge. Without adequate knowledge of Information Retrieval (IR) methods, the retrieval process for information would be cumbersome and frustrating. Studies have further revealed that IR methods are essential in information centres (for example, Digital Library environment) for storage and retrieval of information. Therefore, with more than one billion people accessing the Internet, and millions of queries being issued on a daily basis, modern Web search engines are facing a problem of daunting scale. The main problem associated with the existing search engines is how to avoid irrelevant information retrieval and to retrieve the relevant ones. In this study, the existing system of library retrieval was studied. Problems associated with them were analyzed in order to address this problem. The concept of existing information retrieval models was studied, and the knowledge gained was used to design a digital library information retrieval system. It was successfully implemented using a real life data. The need for a continuous evaluation of the IR methods for effective and efficient full text retrieval system was recommended.展开更多
To overcome the problem that the confusion between texts limits the precision in text re- trieval, a new text retrieval algorithm that decrease confusion (DCTR) is proposed. The algorithm constructs the searching te...To overcome the problem that the confusion between texts limits the precision in text re- trieval, a new text retrieval algorithm that decrease confusion (DCTR) is proposed. The algorithm constructs the searching template to represent the user' s searching intention through positive and negative training. By using the prior probabilities in the template, the supported probability and anti- supported probability of each text in the text library can be estimated for discrimination. The search- ing result can be ranked according to similarities between retrieved texts and the template. The com- plexity of DCTR is close to term frequency and mversed document frequency (TF-IDF). Its distin- guishing ability to confusable texts could be advanced and the performance of the result would be im- proved with increasing of training times.展开更多
跨模态图像-文本检索是一项在给定一种模态(如文本)的查询条件下检索另一种模态(如图像)的任务.该任务的关键问题在于如何准确地测量图文两种模态之间的相似性,在减少视觉和语言这两种异构模态之间的视觉语义差异中起着至关重要的作用....跨模态图像-文本检索是一项在给定一种模态(如文本)的查询条件下检索另一种模态(如图像)的任务.该任务的关键问题在于如何准确地测量图文两种模态之间的相似性,在减少视觉和语言这两种异构模态之间的视觉语义差异中起着至关重要的作用.传统的检索范式依靠深度学习提取图像和文本的特征表示,并将其映射到一个公共表示空间中进行匹配.然而,这种方法更多地依赖数据表面的相关关系,无法挖掘数据背后真实的因果关系,在高层语义信息的表示和可解释性方面面临着挑战.为此,在深度学习的基础上引入因果推断和嵌入共识知识,提出嵌入共识知识的因果图文检索方法.具体而言,将因果干预引入视觉特征提取模块,通过因果关系替换相关关系学习常识因果视觉特征,并与原始视觉特征进行连接得到最终的视觉特征表示.为解决本方法文本特征表示不足的问题,采用更强大的文本特征提取模型BERT(Bidirectional encoder representations from transformers,双向编码器表示),并且嵌入两种模态数据之间共享的共识知识对图文特征进行共识级的表示学习.在MS-COCO数据集以及MS-COCO到Flickr30k上的跨数据集实验,证明了本文方法可以在双向图文检索任务上实现召回率和平均召回率的一致性改进.展开更多
The developed system for eye and face detection using Convolutional Neural Networks(CNN)models,followed by eye classification and voice-based assistance,has shown promising potential in enhancing accessibility for ind...The developed system for eye and face detection using Convolutional Neural Networks(CNN)models,followed by eye classification and voice-based assistance,has shown promising potential in enhancing accessibility for individuals with visual impairments.The modular approach implemented in this research allows for a seamless flow of information and assistance between the different components of the system.This research significantly contributes to the field of accessibility technology by integrating computer vision,natural language processing,and voice technologies.By leveraging these advancements,the developed system offers a practical and efficient solution for assisting blind individuals.The modular design ensures flexibility,scalability,and ease of integration with existing assistive technologies.However,it is important to acknowledge that further research and improvements are necessary to enhance the system’s accuracy and usability.Fine-tuning the CNN models and expanding the training dataset can improve eye and face detection as well as eye classification capabilities.Additionally,incorporating real-time responses through sophisticated natural language understanding techniques and expanding the knowledge base of ChatGPT can enhance the system’s ability to provide comprehensive and accurate responses.Overall,this research paves the way for the development of more advanced and robust systems for assisting visually impaired individuals.By leveraging cutting-edge technologies and integrating them into amodular framework,this research contributes to creating a more inclusive and accessible society for individuals with visual impairments.Future work can focus on refining the system,addressing its limitations,and conducting user studies to evaluate its effectiveness and impact in real-world scenarios.展开更多
基金Foundation items:Shanghai Sailing Program,China (No. 21YF1401300)Shanghai Science and Technology Innovation Action Plan,China (No.19511101802)Fundamental Research Funds for the Central Universities,China (No.2232021D-25)。
文摘The demand for image retrieval with text manipulation exists in many fields, such as e-commerce and Internet search. Deep metric learning methods are used by most researchers to calculate the similarity between the query and the candidate image by fusing the global feature of the query image and the text feature. However, the text usually corresponds to the local feature of the query image rather than the global feature. Therefore, in this paper, we propose a framework of image retrieval with text manipulation by local feature modification(LFM-IR) which can focus on the related image regions and attributes and perform modification. A spatial attention module and a channel attention module are designed to realize the semantic mapping between image and text. We achieve excellent performance on three benchmark datasets, namely Color-Shape-Size(CSS), Massachusetts Institute of Technology(MIT) States and Fashion200K(+8.3%, +0.7% and +4.6% in R@1).
文摘The volume of information being created, generated and stored is huge. Without adequate knowledge of Information Retrieval (IR) methods, the retrieval process for information would be cumbersome and frustrating. Studies have further revealed that IR methods are essential in information centres (for example, Digital Library environment) for storage and retrieval of information. Therefore, with more than one billion people accessing the Internet, and millions of queries being issued on a daily basis, modern Web search engines are facing a problem of daunting scale. The main problem associated with the existing search engines is how to avoid irrelevant information retrieval and to retrieve the relevant ones. In this study, the existing system of library retrieval was studied. Problems associated with them were analyzed in order to address this problem. The concept of existing information retrieval models was studied, and the knowledge gained was used to design a digital library information retrieval system. It was successfully implemented using a real life data. The need for a continuous evaluation of the IR methods for effective and efficient full text retrieval system was recommended.
文摘To overcome the problem that the confusion between texts limits the precision in text re- trieval, a new text retrieval algorithm that decrease confusion (DCTR) is proposed. The algorithm constructs the searching template to represent the user' s searching intention through positive and negative training. By using the prior probabilities in the template, the supported probability and anti- supported probability of each text in the text library can be estimated for discrimination. The search- ing result can be ranked according to similarities between retrieved texts and the template. The com- plexity of DCTR is close to term frequency and mversed document frequency (TF-IDF). Its distin- guishing ability to confusable texts could be advanced and the performance of the result would be im- proved with increasing of training times.
文摘跨模态图像-文本检索是一项在给定一种模态(如文本)的查询条件下检索另一种模态(如图像)的任务.该任务的关键问题在于如何准确地测量图文两种模态之间的相似性,在减少视觉和语言这两种异构模态之间的视觉语义差异中起着至关重要的作用.传统的检索范式依靠深度学习提取图像和文本的特征表示,并将其映射到一个公共表示空间中进行匹配.然而,这种方法更多地依赖数据表面的相关关系,无法挖掘数据背后真实的因果关系,在高层语义信息的表示和可解释性方面面临着挑战.为此,在深度学习的基础上引入因果推断和嵌入共识知识,提出嵌入共识知识的因果图文检索方法.具体而言,将因果干预引入视觉特征提取模块,通过因果关系替换相关关系学习常识因果视觉特征,并与原始视觉特征进行连接得到最终的视觉特征表示.为解决本方法文本特征表示不足的问题,采用更强大的文本特征提取模型BERT(Bidirectional encoder representations from transformers,双向编码器表示),并且嵌入两种模态数据之间共享的共识知识对图文特征进行共识级的表示学习.在MS-COCO数据集以及MS-COCO到Flickr30k上的跨数据集实验,证明了本文方法可以在双向图文检索任务上实现召回率和平均召回率的一致性改进.
文摘The developed system for eye and face detection using Convolutional Neural Networks(CNN)models,followed by eye classification and voice-based assistance,has shown promising potential in enhancing accessibility for individuals with visual impairments.The modular approach implemented in this research allows for a seamless flow of information and assistance between the different components of the system.This research significantly contributes to the field of accessibility technology by integrating computer vision,natural language processing,and voice technologies.By leveraging these advancements,the developed system offers a practical and efficient solution for assisting blind individuals.The modular design ensures flexibility,scalability,and ease of integration with existing assistive technologies.However,it is important to acknowledge that further research and improvements are necessary to enhance the system’s accuracy and usability.Fine-tuning the CNN models and expanding the training dataset can improve eye and face detection as well as eye classification capabilities.Additionally,incorporating real-time responses through sophisticated natural language understanding techniques and expanding the knowledge base of ChatGPT can enhance the system’s ability to provide comprehensive and accurate responses.Overall,this research paves the way for the development of more advanced and robust systems for assisting visually impaired individuals.By leveraging cutting-edge technologies and integrating them into amodular framework,this research contributes to creating a more inclusive and accessible society for individuals with visual impairments.Future work can focus on refining the system,addressing its limitations,and conducting user studies to evaluate its effectiveness and impact in real-world scenarios.