OK, so that I don’t sound like an idiot, I’m going to state the problem/requirements more explicitly:
- Needle (pattern) and haystack (text to search) are both C-style null-terminated strings. No length information is provided; if needed, it must be computed.
- Function should return a pointer to the first match, or `NULL` if no match is found.
- Failure cases are not allowed. This means any algorithm with non-constant (or large constant) storage requirements will need to have a fallback case for allocation failure (and performance in the fallback case thereby contributes to worst-case performance).
- Implementation is to be in C, although a good description of the algorithm (or link to such) without code is fine too.
…as well as what I mean by “fastest”:
- Deterministic `O(n)` where `n` = haystack length. (But it may be possible to use ideas from algorithms which are normally `O(nm)` (for example rolling hash) if they’re combined with a more robust algorithm to give deterministic `O(n)` results).
- Never performs (measurably; a couple clocks for `if (!needle[1])` etc. are okay) worse than the naive brute force algorithm, especially on very short needles which are likely the most common case. (Unconditional heavy preprocessing overhead is bad, as is trying to improve the linear coefficient for pathological needles at the expense of likely needles.)
- Given an arbitrary needle and haystack, comparable or better performance (no worse than 50% longer search time) versus any other widely-implemented algorithm.
- Aside from these conditions, I’m leaving the definition of “fastest” open-ended. A good answer should explain why you consider the approach you’re suggesting “fastest”.
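For reference, the naive brute force algorithm used as the baseline in these conditions can be sketched as follows. This is only a minimal illustration under the stated contract (null-terminated inputs, pointer to first match or `NULL`); the name `naive_strstr` is made up for this example and is not taken from any particular library.

```c
#include <stddef.h>

/* Naive O(n*m) baseline: try every haystack position in turn.
 * naive_strstr is a hypothetical name used only for illustration. */
char *naive_strstr(const char *haystack, const char *needle)
{
    if (!*needle)
        return (char *)haystack;      /* empty needle matches at the start */

    for (; *haystack; haystack++) {
        const char *h = haystack;
        const char *n = needle;

        /* Hitting the haystack terminator ends this loop too,
         * because *n is nonzero whenever the comparison runs. */
        while (*n && *h == *n) {
            h++;
            n++;
        }
        if (!*n)
            return (char *)haystack;  /* whole needle consumed: match */
    }
    return NULL;
}
```

Any candidate algorithm would need to match or beat this on short needles, where it has essentially no setup cost.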
My current implementation runs anywhere from roughly 10% slower to 8 times faster (depending on the input) than glibc’s implementation of Two-Way.
Update: My current optimal algorithm is as follows:
- For needles of length 1, use `strchr`.
- For needles of length 2-4, use machine words to compare 2-4 bytes at once as follows: Preload needle in a 16- or 32-bit integer with bitshifts and cycle old byte out/new bytes in from the haystack at each iteration. Every byte of the haystack is read exactly once and incurs a check against 0 (end of string) and one 16- or 32-bit comparison.
- For needles of length >4, use Two-Way algorithm with a bad shift table (like Boyer-Moore) which is applied only to the last byte of the window. To avoid the overhead of initializing a 1kb table, which would be a net loss for many moderate-length needles, I keep a bit array (32 bytes) marking which entries in the shift table are initialized. Bits that are unset correspond to byte values which never appear in the needle, for which a full-needle-length shift is possible.
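To make the last two items concrete, here is a rough sketch of both ideas. It is not my actual implementation: `twobyte_strstr`, `struct shift_table`, `shift_table_init` and `shift_for` are hypothetical names chosen for this example, only the length-2 case of the word trick is shown (lengths 3 and 4 would use a 32-bit word in the same way), and the Two-Way search loop itself is omitted. The shift stored for a byte is assumed to be the distance from its rightmost occurrence in the needle to the needle's end, so a byte equal to the needle's last byte gets shift 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length-2 needle: keep the last two haystack bytes in a 16-bit word. */
static char *twobyte_strstr(const unsigned char *h, const unsigned char *n)
{
    uint16_t nw = (uint16_t)(n[0] << 8 | n[1]);
    uint16_t hw;

    if (!h[0])
        return NULL;
    hw = (uint16_t)(h[0] << 8 | h[1]);
    /* Shift the old byte out and the next haystack byte in. */
    for (h++; *h && hw != nw; )
        hw = (uint16_t)(hw << 8 | *++h);
    return *h ? (char *)(h - 1) : NULL;
}

/* Shift table whose entries are valid only if the matching bit is set. */
struct shift_table {
    size_t shift[256];        /* left uninitialized on purpose */
    unsigned char seen[32];   /* 256 bits, one per byte value */
};

static void shift_table_init(struct shift_table *t,
                             const unsigned char *needle, size_t m)
{
    memset(t->seen, 0, sizeof t->seen);   /* clear 32 bytes, not the table */
    for (size_t i = 0; i < m; i++) {
        unsigned char c = needle[i];
        t->seen[c / 8] |= (unsigned char)(1u << (c % 8));
        t->shift[c] = m - 1 - i;          /* rightmost occurrence wins */
    }
}

/* Shift to apply when byte c is the last byte of the current window. */
static size_t shift_for(const struct shift_table *t, unsigned char c, size_t m)
{
    if (t->seen[c / 8] & (1u << (c % 8)))
        return t->shift[c];
    return m;   /* c never appears in the needle: skip a full needle length */
}
```

The point of the bit array is that setup cost is proportional to the needle length plus 32 bytes of clearing, rather than a full pass over all 256 table entries.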
The big questions left in my mind are:
- Is there a way to make better use of the bad shift table? Boyer-Moore makes best use of it by scanning backwards (right-to-left) but Two-Way requires a left-to-right scan.
- The only two viable candidate algorithms I’ve found for the general case (no out-of-memory or quadratic performance conditions) are Two-Way and String Matching on Ordered Alphabets. But are there easily-detectable cases where different algorithms would be optimal? Certainly many of the `O(m)` (where `m` is needle length) in space algorithms could be used for `m<100` or so. It would also be possible to use algorithms which are worst-case quadratic if there’s an easy test for needles which provably require only linear time.
Bonus points for:
- Can you improve performance by assuming the needle and haystack are both well-formed UTF-8? (With characters of varying byte lengths, well-formed-ness imposes some string alignment requirements between the needle and haystack and allows automatic 2-4 byte shifts when a mismatching head byte is encountered. But do these constraints buy you much/anything beyond what maximal suffix computations, good suffix shifts, etc. already give you with various algorithms?)
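To illustrate the kind of shift I have in mind, here is a minimal sketch that derives the length of a UTF-8 sequence from its head byte. It assumes the input is well-formed UTF-8 and that the byte passed in sits on a character boundary; `utf8_seq_len` is a hypothetical helper name. On a head-byte mismatch, a search could in principle advance by this many bytes instead of one.

```c
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence that starts with `lead`.
 * Assumes well-formed input; utf8_seq_len is a hypothetical helper. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;   /* 0xxxxxxx: ASCII */
    if (lead >= 0xF0)
        return 4;   /* 11110xxx */
    if (lead >= 0xE0)
        return 3;   /* 1110xxxx */
    if (lead >= 0xC0)
        return 2;   /* 110xxxxx */
    return 1;       /* continuation byte: not expected at a boundary */
}
```

Whether this buys anything over the shifts the general algorithms already compute is exactly the open question.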
Note: I’m well aware of most of the algorithms out there, just not how well they perform in practice. Here’s a good reference so people don’t keep giving me references on algorithms as comments/answers: http://www-igm.univ-mlv.fr/~lecroq/string/index.html