A New Retrieval Model Based on TextTiling for Document Similarity Search (cited by: 2)

Abstract: Document similarity search finds documents similar to a given query document and returns them to the user as a ranked list. It is widely used in text and web systems such as digital libraries and search engines. Traditional retrieval models, including Okapi's BM25 model and SMART's vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because it measures the similarity between two documents well. In this paper, the performance of these models is compared quantitatively through experiments. Because the Cosine measure cannot reflect the structural similarity between documents, a new retrieval model based on TextTiling is proposed. The proposed model takes the subtopic structure of documents into account. It first splits each document into text segments with TextTiling and calculates the similarities between pairs of text segments drawn from the two documents. It then combines these pairwise similarities with an optimal matching method to obtain the overall similarity between the documents. Experimental results show that: 1) the popular retrieval models (Okapi's BM25 model and SMART's vector space model with length normalization) do not perform well for document similarity search; 2) the proposed TextTiling-based model is effective and outperforms the other models, including the Cosine measure; 3) the methods chosen for the three components of the proposed model are appropriate.
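The three steps named in the abstract (segmentation, pairwise segment similarity, optimal matching) can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions: blank-line paragraph splitting stands in for TextTiling, segment similarity is the Cosine measure over raw term frequencies, the optimal matching is found by brute force over one-to-one assignments, and the total is normalized by the larger segment count. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import permutations


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased alphabetic tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine measure between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_segments(doc):
    """Assumed stand-in for TextTiling: blank-line-separated paragraphs."""
    return [p for p in re.split(r"\n\s*\n", doc) if p.strip()]


def optimal_matching_similarity(doc_a, doc_b):
    """Best one-to-one pairing of segments, normalized by the larger count."""
    segs_a = [term_vector(s) for s in split_segments(doc_a)]
    segs_b = [term_vector(s) for s in split_segments(doc_b)]
    if not segs_a or not segs_b:
        return 0.0
    # Keep segs_a as the shorter list so each of its segments gets a distinct partner.
    if len(segs_a) > len(segs_b):
        segs_a, segs_b = segs_b, segs_a
    sim = [[cosine(a, b) for b in segs_b] for a in segs_a]
    # Brute force over injective assignments; only viable for a handful of segments.
    best = max(
        sum(sim[i][j] for i, j in enumerate(assignment))
        for assignment in permutations(range(len(segs_b)), len(segs_a))
    )
    return best / len(segs_b)


doc_a = "cats chase mice\n\nstocks fell sharply"
doc_b = "stocks fell sharply\n\ncats chase mice"    # same subtopics, reordered
doc_c = "cats fell sharply\n\nstocks chase mice"    # same words, different subtopics

whole_doc_ac = cosine(term_vector(doc_a), term_vector(doc_c))
segment_ab = optimal_matching_similarity(doc_a, doc_b)
segment_ac = optimal_matching_similarity(doc_a, doc_c)
```

In this toy input, `doc_a` and `doc_c` contain exactly the same words, so the whole-document Cosine measure is 1, while the segment-based score is about 0.67 because no pairing of segments lines up the subtopics. Reordering the segments, as in `doc_b`, leaves the segment-based score at 1. For more than a few segments, the brute-force search would need to be replaced by a polynomial-time assignment algorithm.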

Authors: Wan Xiaojun (万小军), Peng Yuxin (彭宇新)

Affiliation: National Key Laboratory of Text Processing Technology

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE, EI, CSCD; 2005, Issue 4, pp. 552-558 (7 pages)

Keywords: Web; computer networks; file search; file recovery

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]
