Extraction method of organization full names and abbreviations based on Web page and word segmentation

Original title: 基于网页信息和分词的中文机构名全称和简称提取方法 (cited by: 3)
Abstract: Search engines have so far relied on manually added mappings between full names and abbreviations, which leads to missing abbreviations and low recall in search results. To address this, the paper proposes a method for automatically acquiring the full names and abbreviations of organizations. The method fetches the source code of an organization's website homepage from its domain name, extracts the organization's full name from that source, extracts candidate abbreviations using a set of contextual feature words for organization names, and finally computes the similarity between each candidate and the full name to determine the final abbreviations. In experiments on 1,287 organization websites, full-name extraction accuracy reached 93.9%, and abbreviation recall and precision reached 85.3% and 90.8%, respectively. The experiments indicate that the method performs well.
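The pipeline in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the extraction rules or the similarity formula, so the `<title>`-based full-name heuristic, the ordered-subsequence score, the `threshold` value, and all function names below are assumptions made for illustration. Candidate abbreviations are passed in directly rather than mined with contextual feature words.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_full_name(html):
    """Assumed heuristic: the full name is the first segment of the page title."""
    parser = TitleParser()
    parser.feed(html)
    return re.split(r"[-_|]", parser.title.strip())[0].strip()


def is_subsequence(short, full):
    """True if the characters of `short` appear in `full` in the same order."""
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def similarity(candidate, full_name):
    """Assumed score: 0.0 unless the candidate is a shorter ordered subsequence;
    1.0 if it also starts with the same character as the full name, else 0.5."""
    if not candidate or len(candidate) >= len(full_name):
        return 0.0
    if not is_subsequence(candidate, full_name):
        return 0.0
    return 1.0 if candidate[0] == full_name[0] else 0.5


def pick_abbreviations(full_name, candidates, threshold=0.5):
    """Keep the candidates whose score reaches the (assumed) threshold."""
    return [c for c in candidates if similarity(c, full_name) >= threshold]


html = "<html><head><title>北京大学 - 首页</title></head><body></body></html>"
full_name = extract_full_name(html)
abbreviations = pick_abbreviations(full_name, ["北大", "清华", "北京大学"])
```

In this sketch the full name itself and any candidate whose characters do not occur in order in the full name are rejected. A real system would need a richer similarity measure and the contextual feature word set described in the paper.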
Source: Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journals list, 2017, No. 4, pp. 972-976 (5 pages)
Funding: National Natural Science Foundation of China (61375039, 61272433)
Keywords: extraction of organization abbreviations; extraction of organization full names; Web page analysis; abbreviation similarity calculation