Trie Tree 介绍_综合

概述

Trie树不同于通常的基于键比较的方法, 直接利用键的数字序列直接定位, 通常用于字符串匹配, 特别对公共前缀查找, 非常有效. 朴素的Trie树使用转移矩阵表示,简单易懂, 查找速度快(O(k), k表示键的长度).唯一的缺点是空间利用效率低下. 于是有两种压缩方案被提出, 一种叫后缀压缩, 把non-branching的后缀统一存储起来; 另一种使用链表来表示转移矩阵, 能有效压缩存储空间, 但由于线性查询, 又会造成了效率低下. 后来, 新的数据结构Triple Array Trie被提出, 既兼顾效率, 又考虑性能. 最后, Double Array Trie作为Triple Array Trie的改进版被提出.

引言

Trie 来源于单词 re**triev**e, 表示信息的获取. 它不同于基于键比较的查找方式, 它利用键的数字或字符序列直接look-at, 速度很快. 比如在字典查找中, 就用到了这个概念, 查找键(bachelor), 直接翻到字母B所在区域, 再进行进一步查找. 把这个概念加以推广就形成了Trie树.

转移矩阵

最简单的表示Trie树的方式莫过于转移矩阵了, 矩阵的行表示Key的字符集合, 矩阵的列表示状态(state), 当从state(1)开始, 通过字符B, 进入到 state(2), 可以给出以下定义

g(s, c) = t (其中s就是state1, c就是字符B, t就是state2)

下面表格中, 展示了4个Key的转移矩阵: bachelor, baby, badge, jar, 如果要查找bachelor, 可以从state(1)开始, Key的第1个字符是b, 于是进入state(2), Key的第2个字符是a, 于是进入state(4), Key的第3个字符是c, 于是找到bachelor.

Array-Structured Trie

实现中, Trie树可以作为一颗M叉树, 利用数组作为节点, 表示state; 利用字符作为下标, 链接到下一个数组或者子节点, 比如下图表示了4个key的Trie树: bachelor, baby, badge, jar. 根节点(s)通过字符j, 链接到子节点(t).

代码实现如下:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>#define TREE_WIDTH 256
#define WORD_LEN_MAX 128struct trie_node_t {int count;int pass;struct trie_node_t *next[TREE_WIDTH];
};static struct trie_node_t root={
   0,0, {NULL}};int insert(const char *word)
{struct trie_node_t *current;struct trie_node_t *new_node;if (word[0] == '\0') {return 0;}current = &root;for (int i=0; ; ++i) {
    if (word[i] == '\0') {break;}current->pass++;if (current->next[word[i]] == NULL) {new_node = (struct trie_node_t *)malloc(sizeof(struct trie_node_t));memset(new_node, 0, sizeof(struct trie_node_t));current->next[word[i]] = new_node;}current = current->next[word[i]];}current->count ++;return 0;
};int do_travel(struct trie_node_t *pRoot)
{static char word_dump[WORD_LEN_MAX];static int pos = 0;if (pRoot == NULL) {return 0;}if (pRoot->count > 0) {word_dump[pos] = '\0';fprintf (stdout, "word: %s, pass: %d, count: %d\n", word_dump, pRoot->pass, pRoot->count);}for (int i=0; i<TREE_WIDTH; ++i) {
    word_dump[pos++] = i;do_travel(pRoot->next[i]);pos--;}return 0;
}
int main(void)
{static char *text = "bechelor baby badge jar";char buffer[128];snprintf (buffer, 128, "%s", text);char *ptr = buffer;while (1) {char *word = strsep(&ptr, " ");if (word == NULL) {break;}if (word[0] == '\0') {continue;}insert(word);}do_travel(&root);// do freereturn 0;
}

list-structured Trie

基于数组结构Trie树, 空间消耗很大, 虽然可以把无子节点的链接置空, 以节约空间. 比如根节点(s)除了b和j相关的链接指向子节点, 其余链接全部置为NULL. 但是数组空间的利用率依然低下, 比如根节点的利用率为1/13. 鉴于此, 有人认为, 应当压缩空链接, 于是就形成 list structured Trie.

链表节点的定义:

struct trie_node_t {char arc;struct trie_node_t *child;struct trie_node_t *sibling;
};

下图展示了4个Key(bachelor, baby, badge, jar)的list-structured存储结构, 如果要匹配baby, 首先定位到根节点, 然后顺着child链接,找到a节点, 接着通过child链接找到c节点, 因为c != b, 于是沿着sibling链接找到b节点, 最后找到y节点, 匹配成功.

代码如下:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>#define TREE_WIDTH 256
#define WORD_LEN_MAX 128struct trie_node_t {int count;int pass;char arc;struct trie_node_t *child;struct trie_node_t *sibling;
};static struct trie_node_t root = {
   0, 0, 0, NULL, NULL};int insert(const char *word)
{struct trie_node_t *current;struct trie_node_t *prev;if (word[0] == '\0') {return 0;}current = &root;prev = current;int i = 0;while (1) {if (current == NULL) {current = new trie_node_t();current->pass = 0;current->count = 0;current->arc = word[i];current->child = NULL;current->sibling = NULL;prev->child = current;}while (current != NULL) {if (current->arc == word[i]) {break;}prev = current;current = current->sibling;}if (current == NULL) {current = new trie_node_t();current->pass = 0;current->count = 0;current->arc = word[i];current->child = NULL;current->sibling = NULL;prev->sibling = current;}i++;if (word[i] == '\0') {break;}current->pass++;prev = current;current = current->child;}current->count ++;return 0;
};int do_travel(struct trie_node_t *pRoot)
{static char word_dump[WORD_LEN_MAX];static int pos = 0;if (pRoot == NULL) {return 0;}if (pRoot == NULL) {return 0;}if (pRoot->count > 0) {word_dump[pos] = pRoot->arc;word_dump[pos+1] = '\0';fprintf (stdout, "word: %s, pass: %d, count: %d\n", word_dump, pRoot->pass, pRoot->count);}struct trie_node_t *current;for (current = pRoot; current != NULL; current = current->sibling) {
    word_dump[pos++] = current->arc;do_travel(current->child);pos--;}return 0;
}int main(void)
{static char *text = "bechelor baby badge jar";char buffer[128];snprintf (buffer, 128, "%s", text);char *ptr = buffer;while (1) {char *word = strsep(&ptr, " ");if (word == NULL) {break;}if (word[0] == '\0') {continue;}insert(word);}do_travel(root.sibling);// freereturn 0;
}

Reduced Tried(后缀压缩)

K表示key的集合, 在节点s中, 如果字符c能唯一的在K标识key, 那么

t = g(s, c)

节点t就被称为*separate node*. 从节点t到最后一个节点组成的字符串就称为节点t的*single string*, trie 树可以分成两个部分, 根节点到*separate node*还放在list-structured数据结构中, single string 另外存储在tail数组中. 从而进一步压缩空间.

假设 K = { bachelor, baby, badge, jar}, 从上文list-structured trie图中可以看出, bachelor 只需要存储b,a,c3个节点, 就能在K中唯一定位到bachelor, c节点就称为*separate node*, 后缀helor其实就是*single string*, 直接存储在tail数组中.

总结

本文从基础出发, 介绍了Trie树的原理, 以及两种简单的压缩优化方案, 当然, 上面这些都比较简单, 真正复杂的是Triple-Array Trie, 特别是Double-Array Trie, 等了解清楚了, 再撰文以飨读者.

参考文档

An Efficient Implementation of Trie Structures
An Implementation of Double-Array Trie
The Art of Computer Programming Vol. 3, Sorting and Searching.