当前位置：首页 > 技术 > 正文内容

字符串处理算法：哈希、KMP、Trie及AC自动机

访客技术 2026年7月4日 1

本篇内容将介绍四种常用的字符串处理技术：哈希（Hash）、KMP算法、Trie（字典树）以及AC自动机。

哈希 (Hash)

哈希算法通过一个哈希函数将数据映射到一个固定大小的数值，称为哈希值。这使得快速匹配和查找成为可能。

字符串哈希

字符串哈希常用于快速判断两个字符串是否相等，或统计子串出现的次数。逐字符比较长字符串效率低下，但通过将字符串转换为数值，可以实现 O(1) 的比较。

常用的字符串哈希函数构造方式如下：选择一对互质的数 Base 和 Mod (Base < Mod)。对于字符串 S = s_1s_2...s_n，前缀 i 的哈希值可定义为：
Hash(i) = (s_1 * Base^(i-1) + s_2 * Base^(i-2) + ... + s_i * Base^0) mod Mod
Base 的幂次可以预先计算并存储在数组 power[] 中，以实现 O(1) 查询。
power[i] = Base^i
hashValue[i] = (hashValue[i-1] * Base + s_i) mod Mod

技巧：使用 unsigned long long 类型，利用其自然溢出特性，可以省略取模操作。

代码示例：

// 预计算 Base 的幂次
unsigned long long power[MAX_LEN];
// 预计算字符串的前缀哈希值
unsigned long long hashValue[MAX_LEN];

void precompute(const std::string& s, unsigned long long base, unsigned long long mod) {
    power[0] = 1;
    for (int i = 1; i < s.length(); ++i) {
        power[i] = (power[i - 1] * base) % mod;
    }

    hashValue[0] = 0;
    for (int i = 0; i < s.length(); ++i) {
        hashValue[i + 1] = (hashValue[i] * base + (s[i] - 'a' + 1)) % mod; // s[i]-'a'+1 假设是小写字母
    }
}

// 获取子串 s[l...r] 的哈希值
unsigned long long getSubstringHash(int l, int r, unsigned long long base, unsigned long long mod) {
    unsigned long long len = r - l + 1;
    unsigned long long subHash = (hashValue[r + 1] - (hashValue[l] * power[len]) % mod + mod) % mod;
    return subHash;
}

哈希表

哈希表是一种高效的数据结构，提供近乎常数时间的查找效率。它通过哈希函数将键映射到数组索引，从而快速定位数据。当发生哈希冲突（不同键映射到同一索引）时，通常使用链表等方式来解决。

STL 中的 std::unordered_map 内部实现就是一个哈希表。

简化的哈希表实现示例：

const int TABLE_SIZE = 10000; // 示例大小
std::vector<int> hashTable[TABLE_SIZE];

void insert(int key) {
    int hashIndex = key % TABLE_SIZE;
    // 避免重复插入
    for (int val : hashTable[hashIndex]) {
        if (val == key) return;
    }
    hashTable[hashIndex].push_back(key);
}

bool query(int key) {
    int hashIndex = key % TABLE_SIZE;
    for (int val : hashTable[hashIndex]) {
        if (val == key) return true;
    }
    return false;
}

KMP 算法

KMP (Knuth-Morris-Pratt) 算法是一种改进的字符串匹配算法，用于在一个文本串中查找一个模式串。其核心思想是利用模式串自身的结构，避免不必要的比较。

KMP 算法通过构建一个 next 数组（或称 lps - longest proper prefix suffix 数组），记录当模式串失配时，应该从模式串的哪个位置重新开始匹配。这个数组的值表示在当前位置之前，最长的既是前缀又是后缀的长度。

构建 next 数组的过程：

std::vector<int> computeLPS(const std::string& pattern) {
    int m = pattern.length();
    std::vector<int> lps(m, 0);
    int length = 0; // 长度，也是当前最长公共前后缀的末尾位置
    int i = 1;

    while (i < m) {
        if (pattern[i] == pattern[length]) {
            length++;
            lps[i] = length;
            i++;
        } else {
            if (length != 0) {
                length = lps[length - 1]; // 回溯
            } else {
                lps[i] = 0;
                i++;
            }
        }
    }
    return lps;
}

KMP 匹配过程：

int kmpSearch(const std::string& text, const std::string& pattern) {
    int n = text.length();
    int m = pattern.length();
    if (m == 0) return 0; // 空模式串

    std::vector<int> lps = computeLPS(pattern);
    int count = 0;
    int i = 0; // index for text
    int j = 0; // index for pattern

    while (i < n) {
        if (pattern[j] == text[i]) {
            i++;
            j++;
        }
        if (j == m) {
            count++; // 找到一个匹配
            j = lps[j - 1]; // 继续查找下一个匹配
        } else if (i < n && pattern[j] != text[i]) {
            if (j != 0) {
                j = lps[j - 1]; // 失配，回溯 j
            } else {
                i++; // j 为 0，无法回溯，移动 i
            }
        }
    }
    return count;
}

Trie (字典树)

Trie，也称为字典树或前缀树，是一种用于高效存储和检索字符串集合的数据结构。每个节点代表一个前缀，从根节点到某个节点的路径构成一个字符串。

Trie 的结构：每个节点可以有多个子节点，每个子节点对应一个字符。通常，节点会存储一个标记，表示以该节点为终点的字符串是否存在。

Trie 的构建：从根节点开始，为每个字符串的每个字符创建或遍历到对应的子节点。如果节点不存在，则创建新节点。在字符串的末尾节点标记为结束。

Trie 的查找：从根节点开始，沿着与查询字符串字符对应的路径向下查找。如果路径中断，则字符串不存在；如果到达字符串末尾并且该节点被标记为结束，则字符串存在。

Trie 实现示例：

const int ALPHABET_SIZE = 26;

struct TrieNode {
    TrieNode* children[ALPHABET_SIZE];
    bool isEndOfWord;

    TrieNode() : isEndOfWord(false) {
        for (int i = 0; i < ALPHABET_SIZE; ++i) {
            children[i] = nullptr;
        }
    }
    
    // 析构函数，释放内存
    ~TrieNode() {
        for (int i = 0; i < ALPHABET_SIZE; ++i) {
            delete children[i];
        }
    }
};

class Trie {
private:
    TrieNode* root;

public:
    Trie() {
        root = new TrieNode();
    }

    ~Trie() {
        delete root;
    }

    void insert(const std::string& word) {
        TrieNode* current = root;
        for (char ch : word) {
            int index = ch - 'a'; // 假设是小写字母
            if (!current->children[index]) {
                current->children[index] = new TrieNode();
            }
            current = current->children[index];
        }
        current->isEndOfWord = true;
    }

    bool search(const std::string& word) {
        TrieNode* current = root;
        for (char ch : word) {
            int index = ch - 'a';
            if (!current->children[index]) {
                return false;
            }
            current = current->children[index];
        }
        return current != nullptr && current->isEndOfWord;
    }

    bool startsWith(const std::string& prefix) {
        TrieNode* current = root;
        for (char ch : prefix) {
            int index = ch - 'a';
            if (!current->children[index]) {
                return false;
            }
            current = current->children[index];
        }
        return current != nullptr;
    }
};

AC 自动机

AC 自动机（Aho-Corasick automaton）结合了 Trie 和 KMP 的思想，能够在一个文本串中高效地匹配多个模式串。

AC 自动机在 Trie 的基础上增加了"失配指针"（fail 指针）。fail 指针指向当前状态所代表的前缀的最长后缀所对应的状态。当在 Trie 中匹配失配时，通过 fail 指针跳转，类似于 KMP 的回溯。

构建 fail 指针：通常使用 BFS 遍历 Trie。对于节点 u，其父节点为 p，边字符为 c。如果 p 的 fail 指针指向 v，且 v 存在指向 w 的字符 c 的边，那么 u 的 fail 指针就指向 w。如果不存在，则继续沿着 v 的 fail 指针查找。

AC 自动机的构建过程实质上是将 Trie 转化为一个有限状态自动机，其中每个状态代表一个模式串的前缀，fail 指针用于处理模式串之间的包含关系（后缀匹配）。

AC 自动机匹配过程：遍历文本串，根据当前字符和当前状态，沿着 Trie 的边转移。如果失配，则沿着 fail 指针跳转，直到找到匹配的边或回到根节点。同时，在每次转移到新状态时，需要检查该状态及其所有 fail 指针指向的状态是否为某个模式串的结尾，从而统计匹配次数。

AC 自动机构建（BFS）：

const int ALPHABET_SIZE = 26;
const int MAX_NODES = 100005; // 示例最大节点数

struct ACNode {
    int next[ALPHABET_SIZE];
    int fail;
    int isEnd; // 标记是否为某个模式串的结尾，或存储模式串数量
    // ... 其他可能需要的字段，如模式串 ID 等

    ACNode() : fail(0), isEnd(0) {
        for (int i = 0; i < ALPHABET_SIZE; ++i) {
            next[i] = 0;
        }
    }
};

ACNode acTree[MAX_NODES];
int nodeCount; // 当前节点数量

void insertPattern(const std::string& pattern) {
    int currentNode = 0;
    for (char ch : pattern) {
        int index = ch - 'a';
        if (!acTree[currentNode].next[index]) {
            acTree[currentNode].next[index] = ++nodeCount;
        }
        currentNode = acTree[currentNode].next[index];
    }
    acTree[currentNode].isEnd++; // 标记为模式串结尾，可以统计出现次数
}

void buildAC() {
    std::queue<int> q;
    for (int i = 0; i < ALPHABET_SIZE; ++i) {
        if (acTree[0].next[i]) {
            q.push(acTree[0].next[i]);
        }
    }

    while (!q.empty()) {
        int u = q.front();
        q.pop();

        for (int i = 0; i < ALPHABET_SIZE; ++i) {
            if (acTree[u].next[i]) {
                int v = acTree[u].next[i];
                acTree[v].fail = acTree[acTree[u].fail].next[i]; // 计算 fail 指针
                acTree[v].isEnd += acTree[acTree[v].fail].isEnd; // 累加 fail 指针指向的模式串计数
                q.push(v);
            } else {
                acTree[u].next[i] = acTree[acTree[u].fail].next[i]; // 构建字典图
            }
        }
    }
}

int queryText(const std::string& text) {
    int currentNode = 0;
    int totalMatches = 0;
    for (char ch : text) {
        int index = ch - 'a';
        currentNode = acTree[currentNode].next[index]; // 转移
        totalMatches += acTree[currentNode].isEnd; // 累加匹配数量
    }
    return totalMatches;
}

// 初始化 AC 自动机
void initAC() {
    nodeCount = 0;
    // 清空 acTree 数组
    for(int i = 0; i <= MAX_NODES; ++i) {
        acTree[i] = ACNode(); // 调用默认构造函数重置
    }
}

标签: 哈希 KMP Trie AC自动机

返回列表

上一篇：ROS核心概念解析与系统架构

下一篇：中国航天运载火箭系列详解

老程序员博客

字符串处理算法：哈希、KMP、Trie及AC自动机

哈希 (Hash)

字符串哈希

哈希表

KMP 算法

Trie (字典树)

AC 自动机

相关文章

Linux crontab 详解

富文本里可以允许的 HTML 属性

Mac 安装 Node.js 指南

Dom\HTML_NO_DEFAULT_NS 的副作用：自动加闭合标签

Laravel 事件和监听器创建

自定义域名解析神器 dnsmasq

发表评论

Copyright © agingcoder.cn

Powered By Z-BlogPHP. Theme by TOYEAN.

老程序员博客

字符串处理算法：哈希、KMP、Trie及AC自动机

哈希 (Hash)

字符串哈希

哈希表

KMP 算法

Trie (字典树)

AC 自动机

相关文章

Linux crontab 详解

富文本里可以允许的 HTML 属性

Mac 安装 Node.js 指南

Dom\HTML_NO_DEFAULT_NS 的副作用：自动加闭合标签

Laravel 事件和监听器创建

自定义域名解析神器 dnsmasq

发表评论取消回复

Copyright © agingcoder.cn

Powered By Z-BlogPHP. Theme by TOYEAN.

发表评论