Abstract
This paper presents a new method for identifying specific information carried in multi-carrier data streams. The method matches feature words and their Pinyin forms and, building on statistical natural language theory, uses automatic inductive learning to acquire part-of-speech transfer values from a training corpus as system knowledge. An effective knowledge approximation strategy then judges the relation between each feature word in a real data stream and its context, yielding an evaluation value for that feature word in the real text. This value measures how closely the feature words occurring in the data stream follow the context collocation rules learned for the same feature words from the corpus. If the evaluation value of the whole data stream exceeds a threshold, the data stream is blocked. Experimental results show that an experimental system built on this method achieves good results in identifying and monitoring harmful information in multi-carrier data streams.
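The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the sketch assumes that a "transfer value" is the relative frequency of the part-of-speech tag immediately before or after a feature word in the training corpus, and that the stream score is the sum of the averaged left and right values over all feature-word occurrences. The `PINYIN` table, the function names, and the tag labels are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy Pinyin table; a real system would use a full lexicon.
PINYIN = {"赌博": "dubo", "堵博": "dubo"}


def key(word):
    """Map a word to its Pinyin so that homophone spellings share one key."""
    return PINYIN.get(word, word)


def context(tokens, i):
    """Return the POS tags immediately before and after position i."""
    prev_pos = tokens[i - 1][1] if i > 0 else "<s>"
    next_pos = tokens[i + 1][1] if i + 1 < len(tokens) else "</s>"
    return prev_pos, next_pos


def learn_transfer_values(corpus, feature_keys):
    """Assumed definition: relative frequency of each neighboring POS tag."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        for i, (word, _pos) in enumerate(sentence):
            k = key(word)
            if k in feature_keys:
                prev_pos, next_pos = context(sentence, i)
                counts[(k, "prev")][prev_pos] += 1
                counts[(k, "next")][next_pos] += 1
    return {
        slot: {tag: n / sum(tags.values()) for tag, n in tags.items()}
        for slot, tags in counts.items()
    }


def evaluate(tokens, feature_keys, values):
    """Assumed score: sum of averaged left/right transfer values per hit."""
    score = 0.0
    for i, (word, _pos) in enumerate(tokens):
        k = key(word)
        if k in feature_keys:
            prev_pos, next_pos = context(tokens, i)
            left = values.get((k, "prev"), {}).get(prev_pos, 0.0)
            right = values.get((k, "next"), {}).get(next_pos, 0.0)
            score += 0.5 * (left + right)
    return score


def should_block(tokens, feature_keys, values, threshold):
    """Block the stream when its evaluation value exceeds the threshold."""
    return evaluate(tokens, feature_keys, values) > threshold
```

Each sentence and each stream is a list of `(word, pos_tag)` pairs, so segmentation and tagging are assumed to happen upstream. Keying feature words by Pinyin means a homophone misspelling in the stream is scored against the same learned context statistics as the original spelling.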
Source
《软件学报》
EI
CSCD
PKU Core (Peking University Core Journals)
2003, No. 9, pp. 1538-1543 (6 pages)
Journal of Software
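Funding
National High Technology Research and Development Program of China (863 Program)
Keywords
Information identification
Knowledge approximation
Part-of-speech transfer
Inductive learning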
Calculations
Evaluation
Information retrieval
Knowledge engineering
Statistics
Telecommunication networks
Text processing
Word processing