Abstract: This study introduces the Orbit Weighting Scheme (OWS), a novel approach aimed at enhancing the precision and efficiency of Vector Space information retrieval (IR) models, which have traditionally relied on weighting schemes like tf-idf and BM25. These conventional methods often struggle with accurately capturing document relevance, leading to inefficiencies in both retrieval performance and index size management. OWS proposes a dynamic weighting mechanism that evaluates the significance of terms based on their orbital position within the vector space, emphasizing term relationships and distribution patterns overlooked by existing models. Our research focuses on evaluating OWS’s impact on model accuracy using Information Retrieval metrics like Recall, Precision, Interpolated Average Precision (IAP), and Mean Average Precision (MAP). Additionally, we assess OWS’s effectiveness in reducing the inverted index size, crucial for model efficiency. We compare OWS-based retrieval models against others using different schemes, including tf-idf variations and BM25Delta. Results reveal OWS’s superiority, achieving a 54% Recall and 81% MAP, and a notable 38% reduction in the inverted index size. This highlights OWS’s potential in optimizing retrieval processes and underscores the need for further research in this underrepresented area to fully leverage OWS’s capabilities in information retrieval methodologies.
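The evaluation metrics named in this abstract (Precision, Recall, and Mean Average Precision) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names and the toy document identifiers are assumptions made for illustration, the sketch uses the common non-interpolated definition of average precision, and it assumes each ranked list contains no duplicate documents.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall(ranked: Sequence[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall of a retrieved list."""
    hits = sum(1 for doc in ranked if doc in relevant)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision for one query."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document's rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision values."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical ranking for a single query; d1 and d3 are the relevant documents.
    ranked = ["d1", "d2", "d3", "d4"]
    relevant = {"d1", "d3"}
    print(precision_recall(ranked, relevant))
    print(average_precision(ranked, relevant))
```

Average precision rewards rankings that place relevant documents early, which is why it is usually reported alongside the set-based Precision and Recall figures.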
Abstract: A hybrid model based on the combination of keywords and concepts is put forward. The hybrid model is built on the vector space model and a probabilistic reasoning network. It not only exploits the advantages of keyword retrieval and concept retrieval but also compensates for their shortcomings. Its parameters can be adjusted for different usage scenarios to obtain the best information retrieval results, as confirmed by our experiments.
Funding: Supported by the Funds of Heilongjiang Outstanding Young Teacher (1151G037).
Abstract: The major problem with most current approaches to information models is that individual words provide unreliable evidence about the content of texts. When the document is short, e.g. only the abstract is available, the word-use variability problem has a substantial impact on Information Retrieval (IR) performance. To solve this problem, a new technique for short document retrieval named the Reference Document Model (RDM) is put forward in this letter. RDM obtains the statistical semantics of the query and the document by pseudo feedback, for both the query and the document, from reference documents. The contributions of this model are threefold: (1) pseudo feedback for both the query and the document; (2) building the query model and the document model from reference documents; (3) flexible indexing units, which can be any linguistic elements such as documents, paragraphs, sentences, n-grams, terms or characters. For short document retrieval, RDM achieves significant improvements over the classical probabilistic models on the task of ad hoc retrieval on Text REtrieval Conference (TREC) test sets. Results also show that the shorter the document, the better the RDM performance.
Funding: The High Technology Research and Development Program of China (No. 2006AA01Z150); the National Natural Science Foundation of China (No. 60435020).
Abstract: To eliminate the mismatch between the words of relevant documents and the user's query, and its serious negative effects on the performance of information retrieval, a query expansion method based on a new term co-occurrence representation is put forward by analyzing the process of producing a query. The expansion terms are selected according to their correlation with the whole query. At the same time, the position information between terms is considered. Experimental results on the Text REtrieval Conference (TREC) data collection show that the proposed method consistently improves on the language modeling method without expansion by 5% to 19%. Compared with the popular query expansion approach, pseudo feedback, the precision of the proposed method is competitive.
Funding: The National Natural Science Foundation of China (No. 60473004); the Science and Research Foundation Program of Henan University of Science and Technology (No. 2004ZY041); the Natural and Science Foundation Program of the Education Department of Henan Province (No. 200410464004).
Abstract: A language model for information retrieval is built by using a query language model to generate queries and a document language model to generate documents. The documents are ranked according to the relative entropies of the estimated document language models with respect to the estimated query language model. Two popular and relatively efficient smoothing methods, the Jelinek-Mercer method and the absolute discounting method, are used to smooth the document language model during its estimation. A combined model composed of the feedback document language model and the collection language model is used to estimate the query model. A performance comparison between the new retrieval method and the existing method with feedback is made, and the retrieval performance of the proposed method with the two different smoothing techniques is evaluated on three Text Retrieval Conference (TREC) data sets. Experimental results show that the method is effective and performs better than the basic language modeling approach; moreover, the method using the Jelinek-Mercer technique performs better than that using the absolute discounting technique, and the performance is sensitive to the smoothing parameters.
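The ranking idea in this abstract, Jelinek-Mercer smoothing of a document language model followed by ranking on relative entropy against a query model, can be sketched in Python as follows. This is a simplified illustration and not the paper's method: the query model here is a plain maximum-likelihood estimate from the query terms (the paper combines a feedback document model with the collection model), and the function names, the default `lam=0.5`, and the toy documents are all assumptions.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence


def jelinek_mercer(doc_tokens: Sequence[str],
                   collection_tokens: Sequence[str],
                   lam: float = 0.5) -> Callable[[str], float]:
    """Return p(term | doc) interpolated with the collection model."""
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    doc_len = len(doc_tokens)
    coll_len = len(collection_tokens)

    def prob(term: str) -> float:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts[term] / coll_len if coll_len else 0.0
        # lam is the weight given to the collection (background) model.
        return (1.0 - lam) * p_doc + lam * p_coll

    return prob


def kl_divergence(query_model: Dict[str, float],
                  doc_prob: Callable[[str], float]) -> float:
    """Relative entropy KL(query || document) over the query's terms."""
    total = 0.0
    for term, q in query_model.items():
        if q <= 0.0:
            continue
        p = doc_prob(term)
        if p <= 0.0:
            # The term is unseen even in the collection, so the divergence is unbounded.
            return math.inf
        total += q * math.log(q / p)
    return total


def rank(query_tokens: Sequence[str],
         docs: Dict[str, List[str]],
         lam: float = 0.5) -> List[str]:
    """Order document ids by increasing divergence from the query model."""
    collection = [t for tokens in docs.values() for t in tokens]
    counts = Counter(query_tokens)
    query_model = {t: c / len(query_tokens) for t, c in counts.items()}
    scores = {
        doc_id: kl_divergence(query_model, jelinek_mercer(tokens, collection, lam))
        for doc_id, tokens in docs.items()
    }
    return sorted(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical two-document collection.
    docs = {
        "a": ["apple", "banana", "apple"],
        "b": ["cherry", "banana", "cherry"],
    }
    print(rank(["apple"], docs))
```

A smaller divergence means the smoothed document model is closer to the query model, so documents are sorted in ascending order of the score. Varying `lam` is a simple way to observe the sensitivity to smoothing parameters that the abstract reports.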
Abstract: During a two-day strategic workshop in February 2018, 22 information retrieval researchers met to discuss the future challenges and opportunities within the field. The outcome is a list of potential research directions, project ideas, and challenges. This report describes the major conclusions we reached during the workshop. A key result is that we need to open our minds and embrace a broader IR field by rethinking the definitions of information, retrieval, user, system, and evaluation of IR. By providing detailed discussions on these topics, this report is expected to inspire IR researchers in both academia and industry, and help the future growth of the IR research community.
Abstract: The paper provides an ontology-based semantic vector retrieval model for desktop documents. Compared with the traditional vector space model, the semantic model uses semantic and ontology technology to solve several problems that the traditional model could not overcome, such as the shortcomings of weight computation based on statistical methods, the expression of semantic relations between different keywords, the description of document semantic vectors, and similarity calculation. Finally, the experimental results show that the new model significantly improves retrieval in both recall and precision.
Funding: This work is supported, in part, by the National Natural Science Foundation of China under grant numbers U1536206, U1405254, 61772283, 61602253, 61672294, 61502242; in part, by the Jiangsu Basic Research Programs-Natural Science Foundation under grant numbers BK20150925 and BK20151530; in part, by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD) fund; in part, by the Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET) fund, China.
Abstract: Traditional information hiding methods embed the secret information by modifying the carrier, which inevitably leaves traces of modification on the carrier. As a result, it is hard to resist detection by steganalysis algorithms. To address this problem, the concept of coverless information hiding was proposed. Coverless information hiding can effectively resist steganalysis algorithms, since it uses unmodified natural stego-carriers to represent and convey confidential information. However, the state-of-the-art method has a low hidden capacity, which makes it less appealing. Because the pixel values of different regions of the molecular structure images of material (MSIM) are usually different, this paper proposes a novel coverless information hiding method based on MSIM, which uses the average value of a sub-image’s pixels to represent the secret information, according to the mapping between pixel value intervals and secret information. In addition, we employ a pseudo-random label sequence to determine the position of sub-images and thereby improve the security of the method. The histogram of the Bag of Words (BOW) model is used to determine the number of sub-images in the image that convey secret information. Moreover, to improve retrieval efficiency, we built a multi-level inverted index structure. Furthermore, the proposed method can also be used for other natural images. Experimental results and analysis show that, compared with the state of the art, our method performs better in anti-steganalysis, security and capacity.
Abstract: The use of agent technology in dynamic environments is growing rapidly, and bringing the benefits of the Intelligent Information Agent technique to massive open online courses (MOOCs) is important for several reasons, including the rapid growth of MOOC environments and their focus on static rather than updated information. One of the main problems in such environments is updating the information to match the needs of the student interacting at each moment. Using such technology can ensure more flexible information, less wasted time, and hence higher gains in learning. This paper presents an Intelligent Topic-Based Information Agent that offers updated knowledge, including various types of resources, to students. Using the dominant meaning method, the agent searches the Internet, controls the metadata coming from the Internet, and filters and presents it in categorized content lists. Two experiments were conducted on the Intelligent Topic-Based Information Agent: one measures the improvement in retrieval effectiveness and the other measures the impact of the agent on learning. The experimental results indicate that our query expansion methodology yields a considerable improvement in retrieval effectiveness in all categories of the Google Web Search API. In addition, there is a positive impact on the performance of the learning session.
Funding: Supported by the National Natural Science Foundation of China (Nos. 61872370 and 61832017), Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098), Beijing Academy of Artificial Intelligence (BAAI), the Outstanding Innovative Talents Cultivation Funded Programs 2021 of Renmin University of China, and Intelligent Social Governance Platform, Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China.
Abstract: Web search provides a promising way for people to obtain information and has been extensively studied. With the surge of deep learning and large-scale pre-training techniques, various neural information retrieval models have been proposed, and they have demonstrated their power to improve search (especially ranking) quality. All these existing search methods follow a common paradigm, i.e., index-retrieve-rerank, where they first build an index of all documents based on document terms (i.e., sparse inverted index) or representation vectors (i.e., dense vector index), then retrieve and rerank retrieved documents based on the similarity between the query and documents via ranking models. In this paper, we explore a new paradigm of information retrieval without an explicit index but only with a pre-trained model. Instead, all of the knowledge of the documents is encoded into model parameters, which can be regarded as a differentiable indexer and optimized in an end-to-end manner. Specifically, we propose a pre-trained model-based information retrieval (IR) system called DynamicRetriever, which directly returns document identifiers for a given query. Under such a framework, we implement two variants to explore how to train the model from scratch and how to combine the advantages of dense retrieval models. Compared with existing search methods, the model-based IR system parameterizes the traditional static index with a pre-training model, which converts the document semantic mapping into a dynamic and updatable process. Extensive experiments conducted on the public search benchmark Microsoft machine reading comprehension (MS MARCO) verify the effectiveness and potential of our proposed new paradigm for information retrieval.
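Abstract: Information retrieval (IR) is the process of organizing and processing information through specific techniques and methods to meet users' information needs. In recent years, dense retrieval methods based on pre-trained models have achieved great success; however, these methods use only the vector representations of texts and words to compute the relevance between query and document, ignoring semantic information at the phrase level. To address this problem, an IR method named MSIR (Multi-Scale IR) is proposed. The method improves IR performance by fusing semantic information at multiple granularities in the query and the document. First, semantic units at three granularities (word, phrase, and text) are constructed for the query and the document; second, a pre-trained model encodes these three kinds of semantic units separately to obtain their semantic representations; finally, the semantic representations are used to compute the relevance between query and document. Comparative experiments were conducted on three classic datasets of different sizes: Corvid-19, TREC2019, and Robust04. Compared with ColBERT (a ranking model based on Contextualized late interaction over BERT (Bidirectional Encoder Representation from Transformers)), MSIR achieves improvements of about 8% on the P@10, P@20, NDCG@10, and NDCG@20 metrics on the Robust04 dataset, and also achieves some improvement on the Corvid-19 and TREC2019 datasets. The experimental results show that MSIR can successfully fuse multiple semantic granularities and improve retrieval accuracy.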
Abstract: A new method for the fuzzy evaluation of user relevance based on cloud models is proposed. All factors of a personalized information retrieval system are taken into account in this method. Applied to a personalized information retrieval (PIR) system, the method can efficiently judge multi-valued relevance, such as quite relevant, comparatively relevant, commonly relevant, basically relevant and completely non-relevant, realize a transformation between qualitative concepts and quantities, and improve the accuracy of relevance judgements in the PIR system. Experimental data show that the method is practical and valid, and that its evaluation results are more accurate and closer to the facts.
Abstract: This paper proposes a novel text representation and matching scheme for Chinese text retrieval. At present, the indexing methods of Chinese retrieval systems are either character-based or word-based. Character-based indexing methods, such as bi-gram or tri-gram indexing, suffer from high false drops due to mismatches between queries and documents. On the other hand, it is difficult to efficiently identify all the proper nouns, terminology of different domains, and phrases in word-based indexing systems. The new indexing method uses both the proximity and the mutual information of word pairs to represent the text content so as to overcome the high false drop, new word and phrase problems that exist in character-based and word-based systems. The evaluation results indicate that the average query precision of proximity-based indexing is 5.2% higher than the best results of TREC-5.
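Abstract: Information querying in HBIM (Historic Building Information Modeling) databases faces three problems: first, there is no universal rule for judging the similarity between buildings; second, the historical and cultural information contained in the buildings themselves is not considered; third, query text is mostly keyword-based, making it difficult to retrieve information not covered by the keywords. To address these problems, a multimodal retrieval method for historic buildings is proposed, with which users can input images or natural-language text to retrieve buildings matching the input features, ranked and presented as a list. For image-based building retrieval, the "dino_vit16" model is used to extract image features, and the proposed image-to-building retrieval method reaches a retrieval precision of 90.08%. For text-based building retrieval, the association between images and text is established with the CLIP (Contrastive Language-Image Pre-training) model; the weights of image-text similarity and text similarity were studied, and m=0.6, n=0.4 was selected as the best weight configuration. Experiments show that the proposed text-to-building retrieval algorithm performs best on queries containing a certain appearance feature and worst on queries describing a certain function or architectural style; when a query contains four or more mixed features and can describe the basic appearance of a building, buildings meeting the conditions can be retrieved accurately.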