Iterators which split strings on Grapheme Cluster, Word or Sentence boundaries, according to the Unicode Standard Annex #29 rules.

extern crate unicode_segmentation;

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "a̐éö̲\r\n";
    // `true` requests extended grapheme clusters; "\r\n" stays one cluster.
    let g = UnicodeSegmentation::graphemes(s, true).collect::<Vec<&str>>();
    let b: &[_] = &["a̐", "é", "ö̲", "\r\n"];
    assert_eq!(g, b);

    let s = "The quick (\"brown\") fox can't jump 32.3 feet, right?";
    // unicode_words() keeps only segments that contain alphabetic or numeric characters.
    let w = s.unicode_words().collect::<Vec<&str>>();
    let b: &[_] = &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"];
    assert_eq!(w, b);

    let s = "The quick (\"brown\")  fox";
    // split_word_bounds() returns every segment, including whitespace and punctuation.
    let w = s.split_word_bounds().collect::<Vec<&str>>();
    let b: &[_] = &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", "  ", "fox"];
    assert_eq!(w, b);
}

no_std

unicode-segmentation does not depend on libstd, so it can be used in crates with the #![no_std] attribute.

crates.io

You can use this package in your project by adding the following to your Cargo.toml:

[dependencies]
unicode-segmentation = "1.7.1"

Structs

GraphemeCursor: Cursor-based segmenter for grapheme clusters.

GraphemeIndices: External iterator for grapheme clusters and byte offsets.

Graphemes: External iterator for a string’s grapheme clusters.

USentenceBoundIndices: External iterator for sentence boundaries and byte offsets.

USentenceBounds: External iterator for a string’s sentence boundaries.

UWordBoundIndices: External iterator for word boundaries and byte offsets.

UWordBounds: External iterator for a string’s word boundaries.

UnicodeSentences: An iterator over the substrings of a string which, after splitting the string on sentence boundaries, contain any characters with the Alphabetic property, or with General_Category=Number.

UnicodeWordIndices: An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with General_Category=Number. This iterator also provides the byte offsets for each substring.

UnicodeWords: An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with General_Category=Number.

Enums

GraphemeIncomplete: An error return indicating that not enough content was available in the provided chunk to satisfy the query, and that more content must be provided.

Constants

UNICODE_VERSION: The version of Unicode that this version of unicode-segmentation is based on.

Traits

UnicodeSegmentation: Methods for segmenting strings according to Unicode Standard Annex #29.