Abstract
Determining how important each index term is within a document is a central problem in information retrieval modeling. Most retrieval models built on a bag-of-words document representation assume that terms are independent and compute term importance as a function of TF and IDF, ignoring the relationships between terms. This paper adopts a graph-of-word document representation to capture the dependencies between terms and proposes TI-IDF, a novel graph-based retrieval model built on term importance. From the graph-of-word we obtain the term co-occurrence matrix and the term-to-term transition probability matrix of a document, then use a Markov chain computation to determine the importance of each term in the document (Term Importance, TI), which replaces the traditional term frequency (TF) at indexing time. The model is more robust, and we compare it with traditional retrieval models on public international datasets. Experimental results show that the proposed model consistently outperforms BM25 and, in most cases, also outperforms extensions of BM25, TW-IDF, and other models.
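The pipeline summarized in the abstract (graph-of-word, co-occurrence matrix, transition matrix, Markov chain, TI in place of TF) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption: the sliding-window size, the row normalization, the PageRank-style damping factor, the smoothed IDF, and all function names are hypothetical choices for illustration rather than the authors' method.

```python
import math


def cooccurrence_matrix(tokens, window=3):
    """Symmetric co-occurrence counts between distinct terms.

    Two terms co-occur when they appear within `window` consecutive
    tokens (the window size is an assumption, not from the paper).
    """
    vocab = sorted(set(tokens))
    index = {term: i for i, term in enumerate(vocab)}
    n = len(vocab)
    counts = [[0.0] * n for _ in range(n)]
    for i, term in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            a, b = index[term], index[tokens[j]]
            if a != b:
                counts[a][b] += 1.0
                counts[b][a] += 1.0
    return vocab, counts


def transition_matrix(counts):
    """Row-normalize co-occurrence counts into transition probabilities."""
    n = len(counts)
    rows = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Isolated term: jump uniformly (assumed handling).
            rows.append([1.0 / n] * n)
        else:
            rows.append([c / total for c in row])
    return rows


def term_importance(tokens, window=3, damping=0.85, tol=1e-10, max_iter=200):
    """Stationary distribution of a damped Markov chain over the term graph.

    The damping factor is a PageRank-style assumption that guarantees
    convergence; the paper may use a different chain.
    """
    vocab, counts = cooccurrence_matrix(tokens, window)
    n = len(vocab)
    if n == 0:
        return {}
    P = transition_matrix(counts)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [
            (1.0 - damping) / n
            + damping * sum(pi[i] * P[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(x - y) for x, y in zip(new, pi))
        pi = new
        if delta < tol:
            break
    return dict(zip(vocab, pi))


def ti_idf_score(query_terms, doc_tokens, doc_freq, num_docs):
    """Score a document by summing TI * IDF over matching query terms.

    Uses a simple smoothed IDF, log((N + 1) / df), as an assumption.
    """
    ti = term_importance(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = doc_freq.get(term, 0)
        if term in ti and df > 0:
            score += ti[term] * math.log((num_docs + 1) / df)
    return score
```

For example, `term_importance("a b a c a d".split())` returns a probability distribution over the four terms in which `a`, the most connected term, receives the largest weight. Unlike raw TF, that weight depends on which terms a word co-occurs with, not only on how often it appears.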
Authors
王明文
洪欢
江爱文
左家莉
WANG Mingwen, HONG Huan, JIANG Aiwen, ZUO Jiali (School of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022, China)
Source
《中文信息学报》
CSCD
Peking University Core Journal
2016, Issue 4, pp. 134-141 (8 pages)
Journal of Chinese Information Processing
Funding
National Natural Science Foundation of China (61272212, 61462043, 61462045)
Natural Science Foundation of Jiangxi Province (20122BAB211032, 2015BAB217014)