Boyer-Moore algorithm - コードの恵み

文字列照合のアルゴリズムは理解しても結果は java.lang.String#indexOf と変わらない。
ただ String#indexOf が O(NM) であるのに対して O(N) で処理できる*1。
Boyer-Moore algorithm (BM法) は KMP 法や Z algorithm と同じく O(N) だが平均比較回数は優れている。
最悪の場合は KMP 法より比較回数が多いが同程度に改良されたアルゴリズムがあるらしい。

BM 法の3つのポイント

Pattern の後方から照合する
不一致だったテキスト側の文字を利用して次の照合位置を決める (Bad character rule)
パターンの一致している部分を利用して次の照合位置を決める (Good suffix rule)

3に関しては KMP 法や Z algorithm と同じ。
3のおかげで worst case でも照合が O(N) でできる。
KMP 法で2の処理を行うこともできるが BM 法ほど効果が期待できない。
2は1と組み合わせることによって威力を発揮する。
ただ1のために3の前処理が KMP 法より複雑になっている。

メイン処理

def bm_search(text, pattern, int i=0) {
  final int N = text.size()
  final int M = pattern.size()
  def bc_table = bc_table(pattern)    // Bad character rule
  def gs_table = gs_table(pattern)    // Good suffix rule
  while (i <= N-M) {
    // pattern は右から左へ走査する (right to left scan)
    int j = M-1
    def c = text[i+j]
    while (c == pattern[j]) {
      // pattern が見つかった場合、pattern の開始位置を返す
      if (j == 0) return i
      j--
      c = text[i+j]
    }
    // pattern が見つからなかった場合、次の走査の開始位置を決める
    i += Math.max(gs_table[j], j-bc_table[c])
  }
  return -1
}

i は pattern の開始位置
j は pattern 中の照合位置

Bad character rule と Good suffix rule でより照合位置を進められる方を適用する。

Bad character rule

// bc_table[c] := pattern.lastIndexOf(c)
//
// 次の pattern の照合開始位置を i' は
// i' + bc_table[c] == i + j
// i' == i + j - bc_table[c]
def bc_table(pattern) {
  final int M = pattern.size()
  return (0..<M).collectEntries{ [pattern[it], it] }.withDefault{ -1 }
}

計算量は O(M)。
このテーブルを使えば最善の場合 O(N/M) で照合が完了する。
これは照合した文字がパターンに含まれない場合なので、文字の種類が多いほど有効である。

Good suffix rule

いくつかあるがすべて結果は同じで、gs_table[j] はパターン照合を j で失敗したときのずらし量である。

まずは富豪的に O(M^3) ?

def gs_table(pattern) {
  final int M = pattern.size()
  def table = new int[M]
  // j より後ろで prefix に一致する最も長い suffix を探す
  for (int j = M-1; j >= 0; j--) {
    table[j] = (j+1..<M-1).find{ pattern.startsWith(pattern[it..<M]) } ?: M
  }
  // j より後ろの文字列を pattern から探す
  //   複数存在する場合は、後ろにある方を優先する
  //   見つけた pattern の前の文字が pattern[j] と一致してはいけない
  //   (pattern[j] と同じ文字なら次の照合でも不一致になるのは明らか)
  for (int j = M-1; j >= 0; j--) {
    def suffix = pattern[j+1..<M]
    int p = pattern.lastIndexOf(suffix, j)
    while (p > 0 && pattern[p-1] == pattern[j]) {
      p = pattern.lastIndexOf(suffix, p-1)
    }
    if (p > 0) table[j] = j+1-p
  }
  return table
}

全ての位置の suffix 長がわかっていれば O(M) で準備できる

def gs_table(pattern) {
  final int M = pattern.size()
  def table = new int[M]
  // O(M)
  def suff  = suffixes(pattern)

  // O(M)
  // j より後ろで prefix に一致する最も長い suffix の開始位置を探す
  // ずらし量 s
  int s = M
  for (int j = M-1; j >= 0; j--) {
    int size = M-1-j
    // prefix と suffix が一致しているか判定
    if (size > 0 && suff[size-1] == size) {
      // suffix は j の次から始まる
      s = j+1
    }
    // 次は suffix の先頭位置に pattern の先頭位置を合わせてから照合する
    table[j] = s
  }

  // O(M)
  // j より後ろの文字列を pattern から探す
  for (int k = 0; k < M-1; k++) {
    table[M-1-suff[k]] = M-1-k
  }
  return table
}

// 参考 http://igm.univ-mlv.fr/~lecroq/string/node14.html
// suffixes だけ
def suffixes(pattern) {
  final int M = pattern.size()
  def suff = new int[M]
  int L = R = M-1
  for (int i = M-2; i >= 0; --i) {
    if (i > R && suff[i+M-1-L] < i-R)
      suff[i] = suff[i+M-1-L]
    else {
      if (i < R) R = i
      L = i
      while (R >= 0 && pattern[R] == pattern[R+M-1-L])
        --R
      suff[i] = L-R
    }
  }
  return suff
}

ずらし量でループする O(M^2)

// O(M^2)
def gs_table(pattern) {
  final int M = pattern.size()
  def table = new int[M]
  // s はずらし量
  for (int s = M; s >= 1; s--) {
    // pattern の照合位置が j で失敗した場合を考える
    int j = M-1
    while (j-s >= 0 && pattern[j-s] == pattern[j]) j--
    if (j >= s) {
      // pattern の先頭からではなく途中から good suffix に一致する場合
      table[j] = s
    } else while (j >= 0) {
      // pattern の先頭から good suffix に一致する場合
      table[j] = s
      j--
    } // else  pattern に good suffix に一致する箇所がなかった場合
  }
  return table
}

理解がいまいちな処理 O(M)

def gs_table(pattern) {
  final int M = pattern.size()
  // 照合位置からのずらし量
  def table = new int[M]

  // case 3 一致しない場合
  Arrays.fill(table, M)

  // case 1 途中から一致する場合
  // k は tagged border の prefix の1つ前の文字
  // j は tagged border の suffix の1つ前の文字 (照合位置)
  // g[k] == j
  def g = new int[M]
  int j = M
  for (int k = M-1; k >= 0; k--) {
    g[k] = j
    while (j < M && pattern[j] != pattern[k]) {
      table[j] = Math.min(table[j], j-k)
      // border の再帰的構造を利用する
      j = g[j]
    }
    j--
  }

  // case 2 先頭から一致する場合
  int s = j
  for (j = 0; j < M; j++) {
    table[j] = Math.min(table[j], s+1)
    if (j >= s) s = g[s]
  }
  return table
}

Horspool algorithm と Quick Search algorithm

Bad character rule のみを使用し、基準にする文字の位置を変えた簡易版のアルゴリズム。
参照する文字は、照合に失敗した位置でなくても実はよかった。
Good suffix rule を使用しないので worst case が O(NM) になるが平均は O(N) のまま。
ただし Hoospool の場合は bc_table[c] := pattern[0..

// 0 1 2 3 4 5 6 7 8 9 ...  j
// a b c a b d a a c b a    text
// b c a a b                pattern
// 各アルゴリズムで bc_table を参照したときの次のパターンの照合位置 (戻る場合は1進める)
//     b c a a b            Boyer-Moore algorithm は照合に失敗した文字を参照する ('c')
//         b c a a b        Horspool algorithm は pattern の末尾と照合した文字を参照する ('b')
//             b c a a b    Quick Search algorithm は pattern の次の文字を参照する ('d')

参考

『岩波講座ソフトウェア科学 3 アルゴリズムとデータ構造』
- アルゴリズムの入門書。解説がわかりやすい
2002$BG/(B $BE_3X4|(B $B:#0f8&NXFI(B
- 『Algorithms on Strings,Trees,and Sequences』
ESMAJ
String matching algorithms

*1:前処理にO(M)かかるが同じパターンなら使い回せる