Abstract
To address the difficulty of extracting product specification information in B2B vertical search engines, a rule-string extraction method based on the double-array trie is proposed. Specification parameters in B2B systems appear as strings of the form "parameter name: parameter value"; the method builds rule strings from this pattern and generates a double-array trie from the rule database. To improve storage efficiency, subtrees with the most branch nodes are processed first during construction. All rule strings are obtained in a single scan of the search text, and constraint conditions added to the rules filter candidate strings, which improves extraction accuracy. Experimental results show that the method reduces the algorithmic complexity of traditional rule-string lookup: the time complexity of finding rule strings is O(n).
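The following is a minimal sketch of the idea described in the abstract, not the authors' implementation. It packs a set of parameter names into `base`/`check` arrays, placing the node with the most children first, and then scans a text for "name:value" strings. The names `DoubleArrayTrie`, `longest_match`, and `extract_specs`, the separator and stop characters, and the sample parameter names are all assumptions made for illustration. The scan restarts matching at every position, so it runs in O(n·m) for a longest name of length m rather than the O(n) reported in the paper.

```python
class DoubleArrayTrie:
    """Illustrative double-array trie: child = base[parent] + code(ch),
    and the transition is valid only if check[child] == parent."""

    def __init__(self, keys):
        # Assumption: a small static alphabet, coded 1..k in sorted order.
        chars = sorted({ch for key in keys for ch in key})
        self.code = {ch: i + 1 for i, ch in enumerate(chars)}
        self.base = [0, 0]     # index 0 is unused, index 1 is the root
        self.check = [0, -1]   # 0 marks a free slot; -1 marks the root
        self.terminal = set()  # states at which a key ends
        self._build(keys)

    def _grow(self, size):
        extra = size - len(self.base)
        if extra > 0:
            self.base.extend([0] * extra)
            self.check.extend([0] * extra)

    def _find_base(self, codes):
        # Smallest b >= 1 such that every slot b + c is still free.
        b = 1
        while True:
            self._grow(b + max(codes) + 1)
            if all(self.check[b + c] == 0 for c in codes):
                return b
            b += 1

    def _build(self, keys):
        # Build a nested-dict trie first; the key "" marks the end of a name.
        root = {}
        for key in keys:
            node = root
            for ch in key:
                node = node.setdefault(ch, {})
            node[""] = {}
        pending = [(1, root)]
        while pending:
            # Heuristic from the abstract: handle the widest node first.
            pending.sort(key=lambda item: sum(1 for ch in item[1] if ch != ""))
            state, node = pending.pop()
            if "" in node:
                self.terminal.add(state)
            children = {ch: sub for ch, sub in node.items() if ch != ""}
            if not children:
                continue
            b = self._find_base([self.code[ch] for ch in children])
            self.base[state] = b
            for ch, sub in children.items():
                child = b + self.code[ch]
                self.check[child] = state
                pending.append((child, sub))

    def longest_match(self, text, start):
        """Return the end index of the longest key starting at `start`, or -1."""
        state, end = 1, -1
        for i in range(start, len(text)):
            c = self.code.get(text[i])
            if c is None:
                break
            nxt = self.base[state] + c
            if nxt >= len(self.check) or self.check[nxt] != state:
                break
            state = nxt
            if state in self.terminal:
                end = i + 1
        return end


def extract_specs(text, trie, separators=":：", stops=" ,;，；\n"):
    """Collect (name, value) pairs; separators and stops are assumed conventions."""
    results = []
    i = 0
    while i < len(text):
        end = trie.longest_match(text, i)
        if end != -1 and end < len(text) and text[end] in separators:
            j = end + 1
            while j < len(text) and text[j] not in stops:
                j += 1
            value = text[end + 1:j]
            if value:  # simple constraint: reject candidates with an empty value
                results.append((text[i:end], value))
                i = j
                continue
        i += 1
    return results


# Hypothetical parameter names and input text.
trie = DoubleArrayTrie(["color", "weight", "width"])
print(extract_specs("color:red weight:2kg note width 10", trie))
# [('color', 'red'), ('weight', '2kg')]
```

In this sketch "width 10" is rejected because the name is not followed by a separator, which stands in for the kind of constraint-based filtering the abstract mentions. Reaching a strictly linear scan would require failure links in the style of Aho-Corasick, which the sketch leaves out.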
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2013, No. 5, pp. 206-208, 223 (4 pages in total)
Computer Science
Funding
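Supported by the National Natural Science Foundation of China (61175048, 60875029) and the Innovation Method Special Project of the Ministry of Science and Technology (2010IM020900).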