Struct aho_corasick::AhoCorasickBuilder
source · [−]pub struct AhoCorasickBuilder { /* private fields */ }
Expand description
A builder for configuring an Aho-Corasick automaton.
Implementations
Create a new builder for configuring an Aho-Corasick automaton.
If you don’t need fine grained configuration or aren’t sure which knobs
to set, try using
AhoCorasick::new_auto_configured
instead.
pub fn build<I, P>(&self, patterns: I) -> AhoCorasick where
I: IntoIterator<Item = P>,
P: AsRef<[u8]>,
pub fn build<I, P>(&self, patterns: I) -> AhoCorasick where
I: IntoIterator<Item = P>,
P: AsRef<[u8]>,
Build an Aho-Corasick automaton using the configuration set on this builder.
A builder may be reused to create more automatons.
This method will use the default for representing internal state
identifiers, which is usize
. This guarantees that building the
automaton will succeed and is generally a good default, but can make
the size of the automaton 2-8 times bigger than it needs to be,
depending on your target platform.
Examples
Basic usage:
use aho_corasick::AhoCorasickBuilder;
let patterns = &["foo", "bar", "baz"];
let ac = AhoCorasickBuilder::new()
.build(patterns);
assert_eq!(Some(1), ac.find("xxx bar xxx").map(|m| m.pattern()));
pub fn build_with_size<S, I, P>(
&self,
patterns: I
) -> Result<AhoCorasick<S>, Error> where
S: StateID,
I: IntoIterator<Item = P>,
P: AsRef<[u8]>,
pub fn build_with_size<S, I, P>(
&self,
patterns: I
) -> Result<AhoCorasick<S>, Error> where
S: StateID,
I: IntoIterator<Item = P>,
P: AsRef<[u8]>,
Build an Aho-Corasick automaton using the configuration set on this
builder with a specific state identifier representation. This only has
an effect when the dfa
option is enabled.
Generally, the choices for a state identifier representation are
u8
, u16
, u32
, u64
or usize
, with usize
being the default.
The advantage of choosing a smaller state identifier representation
is that the automaton produced will be smaller. This might be
beneficial for just generally using less space, or might even allow it
to fit more of the automaton in your CPU’s cache, leading to overall
better search performance.
Unlike the standard build
method, this can report an error if the
state identifier representation cannot support the size of the
automaton.
Note that the state identifier representation is determined by the
S
type variable. This requires a type hint of some sort, either
by specifying the return type or using the turbofish, e.g.,
build_with_size::<u16, _, _>(...)
.
Examples
Basic usage:
use aho_corasick::{AhoCorasick, AhoCorasickBuilder};
let patterns = &["foo", "bar", "baz"];
let ac: AhoCorasick<u8> = AhoCorasickBuilder::new()
.build_with_size(patterns)?;
assert_eq!(Some(1), ac.find("xxx bar xxx").map(|m| m.pattern()));
Or alternatively, with turbofish:
use aho_corasick::AhoCorasickBuilder;
let patterns = &["foo", "bar", "baz"];
let ac = AhoCorasickBuilder::new()
.build_with_size::<u8, _, _>(patterns)?;
assert_eq!(Some(1), ac.find("xxx bar xxx").map(|m| m.pattern()));
Automatically configure the settings on this builder according to the patterns that will be used to construct the automaton.
The idea here is to balance space and time automatically. That is, when searching a small number of patterns, this will attempt to use the fastest possible configuration since the total space required will be small anyway. As the number of patterns grows, this will fall back to slower configurations that use less space.
This is guaranteed to never set match_kind
, but any other option may
be overridden.
Examples
Basic usage:
use aho_corasick::AhoCorasickBuilder;
let patterns = &["foo", "bar", "baz"];
let ac = AhoCorasickBuilder::new()
.auto_configure(patterns)
.build(patterns);
assert_eq!(Some(1), ac.find("xxx bar xxx").map(|m| m.pattern()));
Set the desired match semantics.
The default is MatchKind::Standard
, which corresponds to the match
semantics supported by the standard textbook description of the
Aho-Corasick algorithm. Namely, matches are reported as soon as they
are found. Moreover, this is the only way to get overlapping matches
or do stream searching.
The other kinds of match semantics that are supported are
MatchKind::LeftmostFirst
and MatchKind::LeftmostLongest
. The former
corresponds to the match you would get if you were to try to match
each pattern at each position in the haystack in the same order that
you give to the automaton. That is, it returns the leftmost match
corresponding the earliest pattern given to the automaton. The latter
corresponds to finding the longest possible match among all leftmost
matches.
For more details on match semantics, see the
documentation for MatchKind
.
Examples
In these examples, we demonstrate the differences between match
semantics for a particular set of patterns in a specific order:
b
, abc
, abcd
.
Standard semantics:
use aho_corasick::{AhoCorasickBuilder, MatchKind};
let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";
let ac = AhoCorasickBuilder::new()
.match_kind(MatchKind::Standard) // default, not necessary
.build(patterns);
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("b", &haystack[mat.start()..mat.end()]);
Leftmost-first semantics:
use aho_corasick::{AhoCorasickBuilder, MatchKind};
let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";
let ac = AhoCorasickBuilder::new()
.match_kind(MatchKind::LeftmostFirst)
.build(patterns);
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("abc", &haystack[mat.start()..mat.end()]);
Leftmost-longest semantics:
use aho_corasick::{AhoCorasickBuilder, MatchKind};
let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";
let ac = AhoCorasickBuilder::new()
.match_kind(MatchKind::LeftmostLongest)
.build(patterns);
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("abcd", &haystack[mat.start()..mat.end()]);
Enable anchored mode, which requires all matches to start at the first position in a haystack.
This option is disabled by default.
Examples
Basic usage:
use aho_corasick::AhoCorasickBuilder;
let patterns = &["foo", "bar"];
let haystack = "foobar";
let ac = AhoCorasickBuilder::new()
.anchored(true)
.build(patterns);
assert_eq!(1, ac.find_iter(haystack).count());
When searching for overlapping matches, all matches that start at the beginning of a haystack will be reported:
use aho_corasick::AhoCorasickBuilder;
let patterns = &["foo", "foofoo"];
let haystack = "foofoo";
let ac = AhoCorasickBuilder::new()
.anchored(true)
.build(patterns);
assert_eq!(2, ac.find_overlapping_iter(haystack).count());
// A non-anchored search would return 3 matches.
Enable ASCII-aware case insensitive matching.
When this option is enabled, searching will be performed without
respect to case for ASCII letters (a-z
and A-Z
) only.
Enabling this option does not change the search algorithm, but it may increase the size of the automaton.
NOTE: In the future, support for full Unicode case insensitivity may be added, but ASCII case insensitivity is comparatively much simpler to add.
Examples
Basic usage:
use aho_corasick::AhoCorasickBuilder;
let patterns = &["FOO", "bAr", "BaZ"];
let haystack = "foo bar baz";
let ac = AhoCorasickBuilder::new()
.ascii_case_insensitive(true)
.build(patterns);
assert_eq!(3, ac.find_iter(haystack).count());
Set the limit on how many NFA states use a dense representation for their transitions.
A dense representation uses more space, but supports faster access to transitions at search time. Thus, this setting permits the control of a space vs time trade off when using the NFA variant of Aho-Corasick.
This limit is expressed in terms of the depth of a state, i.e., the number of transitions from the starting state of the NFA. The idea is that most of the time searching will be spent near the starting state of the automaton, so states near the start state should use a dense representation. States further away from the start state would then use a sparse representation, which uses less space but is slower to access transitions at search time.
By default, this is set to a low but non-zero number.
This setting has no effect if the dfa
option is enabled.
Compile the standard Aho-Corasick automaton into a deterministic finite automaton (DFA).
When this is disabled (which is the default), then a non-deterministic finite automaton (NFA) is used instead.
The main benefit to a DFA is that it can execute searches more quickly than a NFA (perhaps 2-4 times as fast). The main drawback is that the DFA uses more space and can take much longer to build.
Enabling this option does not change the time complexity for
constructing the Aho-Corasick automaton (which is O(p)
where
p
is the total number of patterns being compiled). Enabling this
option does however reduce the time complexity of non-overlapping
searches from O(n + p)
to O(n)
, where n
is the length of the
haystack.
In general, it’s a good idea to enable this if you’re searching a small number of fairly short patterns (~1000), or if you want the fastest possible search without regard to compilation time or space usage.
Enable heuristic prefilter optimizations.
When enabled, searching will attempt to quickly skip to match candidates using specialized literal search routines. A prefilter cannot always be used, and is generally treated as a heuristic. It can be useful to disable this if the prefilter is observed to be sub-optimal for a particular workload.
This is enabled by default.
👎 Deprecated since 0.7.16: not carrying its weight, will be always enabled, see: https://github.com/BurntSushi/aho-corasick/issues/57
not carrying its weight, will be always enabled, see: https://github.com/BurntSushi/aho-corasick/issues/57
Shrink the size of the transition alphabet by mapping bytes to their
equivalence classes. This only has an effect when the dfa
option is
enabled.
When enabled, each a DFA will use a map from all possible bytes
to their corresponding equivalence class. Each equivalence class
represents a set of bytes that does not discriminate between a match
and a non-match in the DFA. For example, the patterns bar
and baz
have at least five equivalence classes: singleton sets of b
, a
, r
and z
, and a final set that contains every other byte.
The advantage of this map is that the size of the transition table can
be reduced drastically from #states * 256 * sizeof(id)
to
#states * k * sizeof(id)
where k
is the number of equivalence
classes. As a result, total space usage can decrease substantially.
Moreover, since a smaller alphabet is used, compilation becomes faster
as well.
The disadvantage of this map is that every byte searched must be passed through this map before it can be used to determine the next transition. This has a small match time performance cost. However, if the DFA is otherwise very large without byte classes, then using byte classes can greatly improve memory locality and thus lead to better overall performance.
This option is enabled by default.
👎 Deprecated since 0.7.16: not carrying its weight, will be always enabled, see: https://github.com/BurntSushi/aho-corasick/issues/57
not carrying its weight, will be always enabled, see: https://github.com/BurntSushi/aho-corasick/issues/57
Premultiply state identifiers in the transition table. This only has
an effect when the dfa
option is enabled.
When enabled, state identifiers are premultiplied to point to their
corresponding row in the transition table. That is, given the i
th
state, its corresponding premultiplied identifier is i * k
where k
is the alphabet size of the automaton. (The alphabet size is at most
256, but is in practice smaller if byte classes is enabled.)
When state identifiers are not premultiplied, then the identifier of
the i
th state is i
.
The advantage of premultiplying state identifiers is that is saves a multiplication instruction per byte when searching with a DFA. This has been observed to lead to a 20% performance benefit in micro-benchmarks.
The primary disadvantage of premultiplying state identifiers is that they require a larger integer size to represent. For example, if the DFA has 200 states, then its premultiplied form requires 16 bits to represent every possible state identifier, where as its non-premultiplied form only requires 8 bits.
This option is enabled by default.
Trait Implementations
Returns the “default value” for a type. Read more
Auto Trait Implementations
impl RefUnwindSafe for AhoCorasickBuilder
impl Send for AhoCorasickBuilder
impl Sync for AhoCorasickBuilder
impl Unpin for AhoCorasickBuilder
impl UnwindSafe for AhoCorasickBuilder
Blanket Implementations
Mutably borrows from an owned value. Read more