This study introduces the Orbit Weighting Scheme(OWS),a novel approach aimed at enhancing the precision and efficiency of Vector Space information retrieval(IR)models,which have traditionally relied on weighting schem...This study introduces the Orbit Weighting Scheme(OWS),a novel approach aimed at enhancing the precision and efficiency of Vector Space information retrieval(IR)models,which have traditionally relied on weighting schemes like tf-idf and BM25.These conventional methods often struggle with accurately capturing document relevance,leading to inefficiencies in both retrieval performance and index size management.OWS proposes a dynamic weighting mechanism that evaluates the significance of terms based on their orbital position within the vector space,emphasizing term relationships and distribution patterns overlooked by existing models.Our research focuses on evaluating OWS’s impact on model accuracy using Information Retrieval metrics like Recall,Precision,InterpolatedAverage Precision(IAP),andMeanAverage Precision(MAP).Additionally,we assessOWS’s effectiveness in reducing the inverted index size,crucial for model efficiency.We compare OWS-based retrieval models against others using different schemes,including tf-idf variations and BM25Delta.Results reveal OWS’s superiority,achieving a 54%Recall and 81%MAP,and a notable 38%reduction in the inverted index size.This highlights OWS’s potential in optimizing retrieval processes and underscores the need for further research in this underrepresented area to fully leverage OWS’s capabilities in information retrieval methodologies.展开更多
信息检索(IR)是一种通过特定的技术和方法组织、处理信息,以满足用户的信息需求的过程。近年来,基于预训练模型的稠密检索方法取得了巨大的成功;然而,这些方法只利用了文本和词语的向量表征计算查询与文档相关度,忽略了它们短语层面间...信息检索(IR)是一种通过特定的技术和方法组织、处理信息,以满足用户的信息需求的过程。近年来,基于预训练模型的稠密检索方法取得了巨大的成功;然而,这些方法只利用了文本和词语的向量表征计算查询与文档相关度,忽略了它们短语层面间的语义信息。针对该问题,提出一种名为MSIR(Multi-Scale IR)的IR方法。所提方法通过融合查询与文档中多种不同粒度的语义信息提高IR性能。首先,构建查询和文档中词语、短语和文本这3个粒度的语义单元;其次,利用预训练模型对这3个语义单元分别进行编码获得它们的语义表征;最后,利用语义表征计算查询和文档相关度。在Corvid-19、TREC2019和Robust04这3个不同大小的经典数据集上进行了对比实验。与ColBERT(ranking model based on Contextualized late interaction over BERT(Bidirectional Encoder Representation from Transformers))相比,MSIR在Robust04数据集的P@10、P@20、NDCG@10和NDCG@20指标上均实现了约8%的提升,同时在Corvid-19和TREC2019数据集上也取得了一定的改进。实验结果表明,MSIR能够成功融合多种语义粒度,提升检索精度。展开更多
随着智能电网建设的全面展开,产生了大量与设备缺陷相关的电力设备缺陷文本,蕴含着故障类型、故障原因及设备消缺方法等关键信息,是电力领域的研究热点。但缺陷文本存在着体量大、多源异构和内容杂乱冗余的问题,目前缺乏对其进行高效整...随着智能电网建设的全面展开,产生了大量与设备缺陷相关的电力设备缺陷文本,蕴含着故障类型、故障原因及设备消缺方法等关键信息,是电力领域的研究热点。但缺陷文本存在着体量大、多源异构和内容杂乱冗余的问题,目前缺乏对其进行高效整合利用的方法。针对以上问题,该文基于BERT(bidirectional encoder representation from transformers)模型对命名实体抽取技术展开研究。一方面,增加了双向长短期记忆(bi-directional long short-term memory,Bi-LSTM)层进一步提取文本语义信息;另一方面,采用条件随机场(conditional random field,CRF)替换了BERT的输出层,克服了预测标签的局部最优问题。最后融合以上2种策略提出了改进BERT算法,即将BERT与双向长短记忆网络和条件随机场相结合,实现了缺陷文本的命名实体抽取。实验结果表明,改进BERT算法在7类实体上均取得了较高的F1值(精确率和召回率的加权调和平均值)。与BERT相比,实体抽取的总体精确率和召回率分别提升了0.94%和0.95%。展开更多
文摘This study introduces the Orbit Weighting Scheme(OWS),a novel approach aimed at enhancing the precision and efficiency of Vector Space information retrieval(IR)models,which have traditionally relied on weighting schemes like tf-idf and BM25.These conventional methods often struggle with accurately capturing document relevance,leading to inefficiencies in both retrieval performance and index size management.OWS proposes a dynamic weighting mechanism that evaluates the significance of terms based on their orbital position within the vector space,emphasizing term relationships and distribution patterns overlooked by existing models.Our research focuses on evaluating OWS’s impact on model accuracy using Information Retrieval metrics like Recall,Precision,InterpolatedAverage Precision(IAP),andMeanAverage Precision(MAP).Additionally,we assessOWS’s effectiveness in reducing the inverted index size,crucial for model efficiency.We compare OWS-based retrieval models against others using different schemes,including tf-idf variations and BM25Delta.Results reveal OWS’s superiority,achieving a 54%Recall and 81%MAP,and a notable 38%reduction in the inverted index size.This highlights OWS’s potential in optimizing retrieval processes and underscores the need for further research in this underrepresented area to fully leverage OWS’s capabilities in information retrieval methodologies.
文摘信息检索(IR)是一种通过特定的技术和方法组织、处理信息,以满足用户的信息需求的过程。近年来,基于预训练模型的稠密检索方法取得了巨大的成功;然而,这些方法只利用了文本和词语的向量表征计算查询与文档相关度,忽略了它们短语层面间的语义信息。针对该问题,提出一种名为MSIR(Multi-Scale IR)的IR方法。所提方法通过融合查询与文档中多种不同粒度的语义信息提高IR性能。首先,构建查询和文档中词语、短语和文本这3个粒度的语义单元;其次,利用预训练模型对这3个语义单元分别进行编码获得它们的语义表征;最后,利用语义表征计算查询和文档相关度。在Corvid-19、TREC2019和Robust04这3个不同大小的经典数据集上进行了对比实验。与ColBERT(ranking model based on Contextualized late interaction over BERT(Bidirectional Encoder Representation from Transformers))相比,MSIR在Robust04数据集的P@10、P@20、NDCG@10和NDCG@20指标上均实现了约8%的提升,同时在Corvid-19和TREC2019数据集上也取得了一定的改进。实验结果表明,MSIR能够成功融合多种语义粒度,提升检索精度。
文摘随着智能电网建设的全面展开,产生了大量与设备缺陷相关的电力设备缺陷文本,蕴含着故障类型、故障原因及设备消缺方法等关键信息,是电力领域的研究热点。但缺陷文本存在着体量大、多源异构和内容杂乱冗余的问题,目前缺乏对其进行高效整合利用的方法。针对以上问题,该文基于BERT(bidirectional encoder representation from transformers)模型对命名实体抽取技术展开研究。一方面,增加了双向长短期记忆(bi-directional long short-term memory,Bi-LSTM)层进一步提取文本语义信息;另一方面,采用条件随机场(conditional random field,CRF)替换了BERT的输出层,克服了预测标签的局部最优问题。最后融合以上2种策略提出了改进BERT算法,即将BERT与双向长短记忆网络和条件随机场相结合,实现了缺陷文本的命名实体抽取。实验结果表明,改进BERT算法在7类实体上均取得了较高的F1值(精确率和召回率的加权调和平均值)。与BERT相比,实体抽取的总体精确率和召回率分别提升了0.94%和0.95%。