# segtok

Segtok is a fast, rule-based sentence segmentation and tokenization library for well-orthographed texts, particularly in English, German, and Romance languages.

- Unicode support
- High precision for well-orthographed texts
- Minimal false positives
- Handles complex sentence boundaries
- Handles technical texts and URLs

Segtok is lightweight, easy for developers to customize, and integrates well into Unix-based command-line workflows. It is best suited to structured, regular texts where precision and speed are crucial.
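As a minimal sketch of that kind of command-line use, the segmenter from the example below can be pointed at standard input. The stdin wrapper and the printed count are illustrative only and not part of the crate's own examples; `split_multi` and `SegmentConfig` are taken from the example further down.

```rust
use std::io::Read;

use segtok::segmenter::*;

fn main() {
    // Read a whole document from standard input,
    // e.g. `./sentence_count < article.txt` (hypothetical file name).
    let mut input = String::new();
    std::io::stdin()
        .read_to_string(&mut input)
        .expect("failed to read stdin");

    // One entry per detected sentence, using the default configuration.
    let sentences: Vec<_> = split_multi(&input, SegmentConfig::default())
        .into_iter()
        .collect();

    println!("{} sentences", sentences.len());
}
```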

This crate is a port of the Python segtok package (no longer maintained) and fixes a few bugs that remain unfixed there. You may want to read about why segtok was made.

## Example

```rust
use segtok::{segmenter::*, tokenizer::*};

fn main() {
    let input = include_str!("../tests/test_google.txt");

    // Split the text into sentences, then tokenize each sentence with the
    // web-aware tokenizer and split contractions into separate tokens.
    let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
        .into_iter()
        .map(|span| split_contractions(web_tokenizer(&span)).collect())
        .collect();
}
```
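The collected result is a `Vec` of sentences, each itself a `Vec` of tokens. As a small follow-up sketch using the same input file, the summary below only relies on plain `Vec` methods, so it does not depend on the token type; the printed output format is illustrative.

```rust
use segtok::{segmenter::*, tokenizer::*};

fn main() {
    let input = include_str!("../tests/test_google.txt");

    let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
        .into_iter()
        .map(|span| split_contractions(web_tokenizer(&span)).collect())
        .collect();

    // Summarize the result: number of sentences and tokens per sentence.
    println!("{} sentences", sentences.len());
    for (i, tokens) in sentences.iter().enumerate() {
        println!("sentence {}: {} tokens", i + 1, tokens.len());
    }
}
```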