Abstract
An information measure associated with a topic or semantics is defined for keywords. A topic-based corpus is first collected and a latent semantic vector space model of the corpus is built; the information measure of each keyword is then defined through this model. With this measure, the amount of topic information contained in any document can be computed, and the membership degree of the document with respect to the topic can be defined. A threshold on the membership degree then determines whether a document belongs to the topic class. Experiments show that an information measure associated with a topic or semantics can overcome the shortcomings of word-matching search and achieve semantic-matching search.
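The abstract gives the pipeline but not its formulas, so the following is only a minimal sketch of one way the steps could fit together, not the authors' definitions. It assumes a rank-1 latent semantic model (the dominant left singular vector of the term-document matrix, found by power iteration), takes each keyword's information measure to be its entry in that vector, and takes a document's membership degree to be the cosine between its term counts and that vector. All function names and the threshold value 0.5 are hypothetical.

```python
import math
from collections import Counter


def build_term_doc(corpus):
    """Build a term-by-document count matrix from tokenized documents."""
    vocab = sorted({term for doc in corpus for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = [[0.0] * len(corpus) for _ in vocab]
    for j, doc in enumerate(corpus):
        for term, count in Counter(doc).items():
            matrix[index[term]][j] = float(count)
    return vocab, matrix


def dominant_term_vector(matrix, iters=200):
    """Power iteration on A A^T: approximates the leading left singular vector."""
    n_terms, n_docs = len(matrix), len(matrix[0])
    u = [1.0] * n_terms
    for _ in range(iters):
        # v = A^T u, then u = A v, followed by normalization to unit length.
        v = [sum(matrix[i][j] * u[i] for i in range(n_terms)) for j in range(n_docs)]
        u = [sum(matrix[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    return u


def keyword_weights(corpus):
    """Assumed keyword information measure: entry in the dominant latent direction."""
    vocab, matrix = build_term_doc(corpus)
    return dict(zip(vocab, dominant_term_vector(matrix)))


def membership(doc, weights):
    """Assumed membership degree: cosine between document counts and the weights."""
    counts = Counter(doc)
    dot = sum(c * weights.get(term, 0.0) for term, c in counts.items())
    norm_doc = math.sqrt(sum(c * c for c in counts.values()))
    norm_w = math.sqrt(sum(w * w for w in weights.values()))
    if norm_doc == 0.0 or norm_w == 0.0:
        return 0.0
    return dot / (norm_doc * norm_w)


def belongs_to_topic(doc, weights, threshold=0.5):
    """Classify a document by comparing its membership degree with a threshold."""
    return membership(doc, weights) >= threshold
```

Because out-of-vocabulary terms add to the document norm but not to the dot product, a document that shares only a few topic words with the corpus scores lower than one made up mostly of topic words, which is the kind of behavior the abstract contrasts with plain word matching.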
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal (北大核心)
2009, No. 9, pp. 2450-2453, 2467 (5 pages)
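Funding
Key Science and Technology Project of the Science and Technology Commission of Shanghai Municipality (055115001)
Undergraduate Innovation Project of Shanghai University of Engineering Science (cx082100)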
Keywords
latent semantics
information measurement
metric distribution
membership degree