Commit d984a8b

feat: initial commit, 🚀
0 parents  commit d984a8b

41 files changed, +33158 -0 lines changed

.vscode/settings.json

+4
@@ -0,0 +1,4 @@
{
  "deno.enable": true,
  "deno.unstable": true
}

LICENSE

+65
@@ -0,0 +1,65 @@
License for snowball-ts
=======================

MIT License

Copyright (c) 2022 Claudiu Ceia

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


License for Snowball
====================
Copyright (c) 2001, Dr Martin Porter
Copyright (c) 2004,2005, Richard Boulton
Copyright (c) 2013, Yoshiki Shibukawa
Copyright (c) 2006,2007,2009,2010,2011,2014-2019, Olly Betts
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

1. Redistributions of source code must retain the above copyright notice,
   this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.
3. Neither the name of the Snowball project nor the names of its contributors
   may be used to endorse or promote products derived from this software
   without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


License for sample data files
=============================

The [`words_short.json`](./data/ro/words_short.json) file is used for testing
and it's a trimmed and modified version of [this file](https://raw.githubusercontent.com/snowballstem/snowball-data/master/romanian/voc.txt),
covered by the same BSD License mentioned above.

README.md

+60
@@ -0,0 +1,60 @@
# snowball-ts

This is a TypeScript interface to the [stemming algorithms from the Snowball project](https://snowballstem.org/).
The library includes the original Snowball algorithms compiled to JavaScript, and provides a very thin layer on top
of that in order to:

* Provide a way to load only the desired algorithm
* Provide typings
* Hide methods meant to be private, exposing only a single `stem` function

## Usage

```ts
// Adjust the import path (or URL) to wherever mod.ts lives in your setup
import { getStemmer } from "./mod.ts";

// load the desired algorithm
const porter = await getStemmer("porter");
const stemmed = porter.stem("cars");
```

Since the `getStemmer` function uses dynamic imports, you'll need to pass the `--allow-read` permission to Deno.
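
The lazy loading described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the `loaders` registry, `loadStemmer`, `loadCounts`, and `ToyEnglishStemmer` are all hypothetical names, and the loader resolves an in-memory stand-in where a real registry would presumably call `import()` on a compiled algorithm file.

```ts
// Shape assumed for a compiled stemmer module: a default-exported
// constructor whose instances expose stemWord().
type StemmerModule = { default: new () => { stemWord(word: string): string } };

// Toy rule, NOT a Snowball algorithm: drop a single trailing "s".
class ToyEnglishStemmer {
  stemWord(word: string): string {
    return word.endsWith("s") ? word.slice(0, -1) : word;
  }
}

// Counts how often each loader runs, to show that loading is lazy.
const loadCounts: Record<string, number> = {};

// Hypothetical registry. In a real project each entry would be a thunk
// around a dynamic import of the compiled algorithm (path not shown here).
const loaders = {
  toy: (): Promise<StemmerModule> => {
    loadCounts["toy"] = (loadCounts["toy"] ?? 0) + 1;
    return Promise.resolve({ default: ToyEnglishStemmer });
  },
};

async function loadStemmer(name: keyof typeof loaders) {
  const mod = await loaders[name](); // the module is only loaded here
  const instance = new mod.default();
  // Expose a single function and keep the instance private.
  return { stem: (word: string) => instance.stemWord(word) };
}

loadStemmer("toy").then((stemmer) => console.log(stemmer.stem("cars")));
```

Keeping each loader as a function means nothing is loaded until a caller asks for that specific algorithm, which is the point of the first bullet in the list above.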

## Contributing

This library should strive to match the Snowball release calendar, so whenever a new Snowball version is released,
the source files should be updated. Minor versions can be released to include new algorithms (if required, and without breaking
changes to existing algorithms).

**Updating source algorithms:**

* Compile the algorithms [using the Snowball CLI](https://snowballstem.org/runtime/use.html)
* Modify the generated file to:
  - Import the base stemmer: `import BaseStemmer from "./base/base-stemmer.js";`
  - Export the function: replace `Algorithm = function() {` with `export default function() {`
  - If it's a new algorithm, update the mapping in [`stemmers.ts`](./src/stemmers.ts)

**Changes to the interface:**

* Breaking changes to the interface should mirror Snowball major releases
* Non-breaking additions can be made at any point if justified in a GitHub issue, with a minor release
* Bug fixes and other improvements to the existing interface can be made at any point, with a patch release

PRs are welcome. If unsure about the scope of the changes, feel free to open a GitHub issue first, to discuss.

## What is Snowball?

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it.

It was originally designed and built by Martin Porter. Martin retired from development in 2014 and Snowball is now maintained as a community project. Martin originally chose the name Snowball as a tribute to SNOBOL, the excellent string handling language from the 1960s. It now also serves as a metaphor for how the project grows by gathering contributions over time.

The Snowball compiler translates a Snowball program into source code in another language. Currently Ada, ISO C, C#, Go, Java, JavaScript, Object Pascal, Python and Rust are supported.

## What is Stemming?

Stemming maps different forms of the same word to a common "stem". For example, the English stemmer maps connection, connections, connective, connected, and connecting to connect, so a search for connected would also find documents which only have the other forms.

The stem is often a word itself, but not always, since text search systems (the intended field of use) do not require it. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem), and over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word, then Snowball's stemming algorithms likely aren't the right answer.
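
To make the search use case concrete, here is a small sketch of an inverted index keyed by stem. It assumes a toy suffix stripper (`toyStem`) that is deliberately much cruder than any Snowball algorithm; `buildIndex`, `search`, and the sample documents are likewise made up for illustration.

```ts
// Toy suffix stripper, NOT a Snowball algorithm: removes the first
// matching suffix from the list below.
const SUFFIXES = ["ions", "ion", "ive", "ing", "ed", "s"];

function toyStem(word: string): string {
  const lower = word.toLowerCase();
  for (const suffix of SUFFIXES) {
    // Keep at least three characters so short words are left alone.
    if (lower.endsWith(suffix) && lower.length - suffix.length >= 3) {
      return lower.slice(0, -suffix.length);
    }
  }
  return lower;
}

// Inverted index: stem -> ids of documents containing any form of it.
function buildIndex(docs: Record<string, string>): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const id of Object.keys(docs)) {
    for (const token of docs[id].split(/\W+/).filter(Boolean)) {
      const stem = toyStem(token);
      if (!index.has(stem)) index.set(stem, new Set<string>());
      index.get(stem)!.add(id);
    }
  }
  return index;
}

// Stem the query the same way as the documents, then look it up.
function search(index: Map<string, Set<string>>, query: string): string[] {
  const hits = index.get(toyStem(query));
  return hits ? Array.from(hits).sort() : [];
}

const docs: Record<string, string> = {
  a: "a stable connection",
  b: "connecting the dots",
  c: "unrelated text",
};
const index = buildIndex(docs);
console.log(search(index, "connected")); // matches documents a and b
```

Neither document contains the literal word "connected", yet both are found because the query and the indexed tokens reduce to the same stem.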

## License

snowball-ts is copyright (c) 2022, Claudiu Ceia, and is licensed under the MIT license: see the [`LICENSE`](./LICENSE) file for the full text.

The Snowball algorithms and the Snowball library are [licensed under the BSD license, included in the `source` directory](./src/source/COPYING).

data/ro/words_short.json

+93
@@ -0,0 +1,93 @@
[
  "a",
  "ab",
  "abajur",
  "abajurul",
  "abandon",
  "abandoneze",
  "abat",
  "abate",
  "abatere",
  "abateri",
  "abator",
  "abătut",
  "abătută",
  "abătuţi",
  "abces",
  "aberaţie",
  "aberaţii",
  "aberaţiune",
  "aberaţiuni",
  "abia",
  "abil",
  "abilă",
  "abile",
  "abilitate",
  "abilitatea",
  "abnegaţie",
  "abonamentul",
  "abruptă",
  "absent",
  "absentă",
  "absente",
  "absenţa",
  "absenţă",
  "absenţi",
  "absolut",
  "absoluta",
  "absolută",
  "absolute",
  "absolutul",
  "absolutului",
  "absoluţi",
  "absolve",
  "absolvenţi",
  "absolvenţii",
  "absolvi",
  "absolvire",
  "absolvit",
  "absolvită",
  "absolviţi",
  "absorbant",
  "absorbantă",
  "absorbi",
  "absorbit",
  "absorbite",
  "absorbiţi",
  "absorbţia",
  "abstinent",
  "abstract",
  "abstractă",
  "abstracte",
  "abstractiza",
  "abstractizare",
  "abstractizat",
  "abstractizăm",
  "abstracto",
  "abstracţia",
  "abstracţii",
  "abstracţiune",
  "abstracţiuni",
  "abstrage",
  "absurd",
  "absurdă",
  "absurde",
  "absurditate",
  "absurdităţi",
  "absurdităţilor",
  "absurdul",
  "abţină",
  "abţinem",
  "abţinere",
  "abundent",
  "abundentă",
  "abur",
  "aburi",
  "aburit",
  "aburitoare",
  "aburoase",
  "aburul",
  "abuza",
  "abuzez",
  "abuziv"
]

deno.json

+7
@@ -0,0 +1,7 @@
{
  "lint": {
    "files": {
      "exclude": ["src/source/"]
    }
  }
}

mod.ts

+20
@@ -0,0 +1,20 @@
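import { stemmers } from "./src/stemmers.ts";

// Public surface of a stemmer: only `stem` is exposed.
type Stemmer = {
  stem: (word: string) => string;
};

export async function getStemmer(
  language: keyof typeof stemmers
): Promise<Stemmer> {
  // Runtime guard for untyped callers. Checks own keys only, so inherited
  // names such as "toString" are rejected as well.
  if (!Object.prototype.hasOwnProperty.call(stemmers, language)) {
    throw new Error(`Language ${language} is not supported`);
  }

  // Load the module for the requested algorithm and instantiate it.
  const stemmer = await stemmers[language]();
  const instance = new stemmer.default();

  // Wrap the instance so that only `stem` is visible to callers.
  return {
    stem: (word: string) => instance.stemWord(word),
  };
}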
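
Because `getStemmer` accepts only `keyof typeof stemmers`, a language name that arrives as a plain string (a CLI argument, a query parameter) has to be narrowed first. The sketch below shows one way to do that with a type guard. It is an assumption-laden illustration: the `stemmers` object here is a local placeholder with made-up keys and an echo stemmer, and `isSupportedLanguage` and `parseLanguage` are hypothetical helpers that are not part of this commit.

```ts
// Placeholder standing in for a compiled stemmer module (illustration only).
class EchoStemmer {
  stemWord(word: string): string {
    return word; // no real stemming
  }
}
const load = () => Promise.resolve({ default: EchoStemmer });

// Hypothetical registry; the real keys live in src/stemmers.ts.
const stemmers = { porter: load, romanian: load };
type Language = keyof typeof stemmers;

// Type guard: true only for keys defined directly on the registry, so
// inherited names such as "toString" are rejected.
function isSupportedLanguage(name: string): name is Language {
  return Object.prototype.hasOwnProperty.call(stemmers, name);
}

// Normalise and narrow an arbitrary string before passing it on.
function parseLanguage(input: string): Language {
  const name = input.trim().toLowerCase();
  if (!isSupportedLanguage(name)) {
    throw new Error(`Language ${name} is not supported`);
  }
  return name;
}

console.log(parseLanguage(" Porter ")); // narrowed to the Language union
```

After `parseLanguage` returns, the value has the `Language` type, so it can be passed to a function with the same signature as `getStemmer` without a cast.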

src/source/COPYING

+29
@@ -0,0 +1,29 @@
Copyright (c) 2001, Dr Martin Porter
Copyright (c) 2004,2005, Richard Boulton
Copyright (c) 2013, Yoshiki Shibukawa
Copyright (c) 2006,2007,2009,2010,2011,2014-2019, Olly Betts
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

1. Redistributions of source code must retain the above copyright notice,
   this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.
3. Neither the name of the Snowball project nor the names of its contributors
   may be used to endorse or promote products derived from this software
   without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
