Abstract
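To meet the growing demand for patent retrieval, the National High Technology Research and Development Program of China (863 Program) launched the research and development of a generic chemical structure database system. Such a system involves two key technologies: (1) the computer representation of generic chemical structures, and (2) retrieval algorithms for generic chemical structures. This paper focuses on the computer representation of generic chemical structures. The generic chemical structures found in original chemical patent documents are expressed in natural language that follows certain conventions. To store and retrieve this information in a computer system, generic chemical structures expressed in natural language must be converted into an unambiguous formal language that a computer can accept. This process is called the indexing of generic chemical structures. The fragment-based formal languages generally used internationally for indexing generic chemical structures were developed in the 1970s and 1980s; they are far removed from the graphical natural language used by chemists, so indexing is slow and costly. This paper introduces a graphical formal language for indexing generic chemical structures, developed on the basis of the drawing functions of ISIS/Draw. Its main features are that it is close to the graphical natural language chemists use every day and that its rules are simple and easy to master, which improves indexing efficiency and reduces the cost of implementing a generic chemical structure database system.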
The State Intellectual Property Office of the P.R.C. (SIPO) receives a huge number of chemical patent applications each year. Academic institutions and enterprises have to search a large number of chemical patents in order to protect their own intellectual property and to make use of known technology. In 2004, the National High Technology Research and Development Program of China initiated the generic chemical structure database project as a solution to the challenges of chemical patent processing. The two core technologies of this project are: (1) computer representation of generic chemical structures, and (2) retrieval algorithms for generic chemical structures. This article presents new protocols for representing generic chemical structures in a computer system. A generic chemical structure in a chemical patent is described in natural language, which is not well defined. Such natural language has to be formalized before it can be stored, exchanged, and searched in a database system. The formalized language is called a formal language, and indexing is the process of translating a chemical patent from natural language into a formal language. A number of formal languages for generic chemical structures have been reported in past years, most of them based on the concept of chemical structure fragmentation. The main disadvantages of these languages are that (1) their syntax is too complicated to learn, and (2) their rules differ too much from natural chemical language and are hard to understand. These problems make the chemical patent indexing process very costly. In this paper, we propose a novel formal language for representing generic chemical structures. It is close to natural chemical language, and its syntax rules are concise and easy to learn. As a result, the new formal language has been well received in the chemical patent indexing process at SIPO.
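As a rough illustration of what "computer representation of a generic chemical structure" can mean, the sketch below models a generic (Markush) structure as a fixed core with named variable positions, each carrying a list of alternatives. This is not the graphic formal language proposed in the paper and is not tied to ISIS/Draw; the class `GenericStructure`, its fields, and the text-label notation for the core are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Iterator, Tuple


@dataclass
class GenericStructure:
    """Hypothetical toy model: a core template plus variable groups.

    core      -- text template with named placeholders, e.g. "{R1}-C6H4-{R2}"
    variables -- placeholder name -> tuple of allowed substituent labels
    """
    core: str
    variables: Dict[str, Tuple[str, ...]]

    def expand(self) -> Iterator[str]:
        """Yield every specific structure label covered by the generic one."""
        names = sorted(self.variables)
        # Cartesian product over the alternatives of each variable position.
        for combo in product(*(self.variables[name] for name in names)):
            yield self.core.format(**dict(zip(names, combo)))


if __name__ == "__main__":
    # Assumed example: a disubstituted benzene with two variable positions.
    gs = GenericStructure(
        core="{R1}-C6H4-{R2}",
        variables={"R1": ("CH3", "C2H5"), "R2": ("OH", "NH2", "Cl")},
    )
    for label in gs.expand():
        print(label)
```

Two alternatives at R1 and three at R2 give six specific members here. Real patent claims can cover far larger or even unbounded sets, which is one reason a practical system stores and searches the generic form rather than enumerating it.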
Source
《情报学报》
CSSCI
Peking University Core Journals (北大核心)
2007, No. 2, pp. 253-259 (7 pages)
Journal of the China Society for Scientific and Technical Information
Funding
This project was supported by National High Technology Research and Development funding (No. 2003AA223603).
Keywords
generic chemical structure
Markush structure
indexing
graphic formal language
computer retrieval
generic chemical structure, Markush structure, indexing, graphic formal language, Markush database