This repo implements an N-gram language model in Rust and uses it to generate synthetic excerpts from a hypothetical Presidential "State of the Union" address. Below are some samples:
- And the whole world looks to us for protection.
- Industry is always necessary to keep economies in balance.
- Everything is a possibility.
- We will continue along the path toward a balanced budget.
- This Administration will be remembered for support of the workers who are not truly disabled.
(Generated using the entire SOTU corpus, a Quad-gram model and the Probabilistic text generation mode.)
A "State of the Union" (SOTU) address is a speech delivered annually by the President of the United States to Congress. Typically, the president outlines their administration's achievements from the previous year and goals for the coming one. It's an important event, serving as a demonstration of the President's priorities and focus for the coming year.
The N-gram model uses 244 SOTU addresses from the American Presidency Project by UC Santa Barbara.
N-gram language models are statistical models commonplace in NLP and linguistics. They use the frequency of word sequences (grams) to predict future text. They rely on the independence assumption: the probability of a word depends only on a fixed number of preceding words (the history).
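Concretely, the full conditioning history is truncated to the previous n - 1 words:

$$
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
$$

so a bigram model (n = 2), for example, predicts each word from the single preceding word.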
This implementation allows the history size (the number of grams used) to be varied, hence it is an N-gram model; a minimal illustrative sketch follows the reference list below. The implementation relies heavily on the following sources:
- Speech and Language Processing. Daniel Jurafsky & James H. Martin, 2023. Link
- Foundations of Natural Language Processing, N-gram language models. Alex Lascarides, 2020. Link
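The core counting and estimation step can be sketched in a few lines of Rust. The struct and method names below are illustrative only and are not taken from this repository's actual API:

```rust
use std::collections::HashMap;

/// Illustrative n-gram counter: maps a history of n-1 words to the
/// observed next words and their counts.
struct NGramModel {
    n: usize,
    counts: HashMap<Vec<String>, HashMap<String, u32>>,
}

impl NGramModel {
    fn new(n: usize) -> Self {
        Self { n, counts: HashMap::new() }
    }

    /// Count every n-gram in a tokenised text.
    fn train(&mut self, tokens: &[String]) {
        if tokens.len() < self.n {
            return;
        }
        for window in tokens.windows(self.n) {
            let history = window[..self.n - 1].to_vec();
            let next = window[self.n - 1].clone();
            *self
                .counts
                .entry(history)
                .or_default()
                .entry(next)
                .or_insert(0) += 1;
        }
    }

    /// Maximum-likelihood estimate:
    /// P(word | history) = count(history, word) / count(history).
    fn probability(&self, history: &[String], word: &str) -> f64 {
        match self.counts.get(history) {
            Some(next_counts) => {
                let total: u32 = next_counts.values().sum();
                let count = next_counts.get(word).copied().unwrap_or(0);
                count as f64 / total as f64
            }
            None => 0.0,
        }
    }
}

fn main() {
    let tokens: Vec<String> =
        "the state of the union is strong the state of our economy is strong"
            .split_whitespace()
            .map(str::to_string)
            .collect();

    // A trigram model (history of two words).
    let mut model = NGramModel::new(3);
    model.train(&tokens);

    let history = vec!["state".to_string(), "of".to_string()];
    println!("P(the | state of) = {}", model.probability(&history, "the"));
    println!("P(our | state of) = {}", model.probability(&history, "our"));
}
```

Generation in the probabilistic mode then presumably amounts to repeatedly sampling the next word in proportion to these counts, sliding the history window forward after each draw.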
The repository also includes the following sample texts, used for tests and debugging:
- `shakespeare_alllines.txt`: Shakespeare plays
- `biden_sotu_2024.txt`: Joe Biden State of the Union, 2024
- `biden_sotu_2022.txt`: Joe Biden State of the Union, 2022