Commit c109c85

committed Jul 3, 2023
feat: expand dataset to include words / phrases from Spanish
1 parent 2263bc0 commit c109c85

3 files changed, +63 -8 lines changed

‎.changeset/popular-comics-jog.md

+5
@@ -0,0 +1,5 @@
+---
+"ytid": minor
+---
+
+expand dataset to include words / phrases from Spanish

‎README.md

+57 -7
@@ -106,12 +106,62 @@ As a result, ytid doesn't generate IDs like ```7-GoToHell3``` or ```shit9RcYjcM`

 The dataset of offensive / profane words is a combination of various datasets -

-| Dataset | Source | Instances (Rows) |
-| --- | --- | --- |
-| Google's ["what do you love" project](https://en.wikipedia.org/wiki/WDYL_(search_engine)) | https://gist.github.com/jamiew/1112488 | 451 |
-| Bad Bad Words | https://www.kaggle.com/datasets/nicapotato/bad-bad-words | 1617
-| [Surge AI](https://www.surgehq.ai/)'s The Obscenity List | https://github.com/surge-ai/profanity | 1599 |
-| washyourmouthoutwithsoap | https://github.com/thisandagain/washyourmouthoutwithsoap | 147 |
+<table>
+<thead>
+<tr>
+<th>Dataset</th>
+<th>Source</th>
+<th>Language</th>
+<th>Instances (Rows)</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Google's <a href="https://en.wikipedia.org/wiki/WDYL_(search_engine)">&quot;what do you love&quot;
+project</a></td>
+<td><a href="https://gist.github.com/jamiew/1112488">https://gist.github.com/jamiew/1112488</a></td>
+<td rowspan="4">English</td>
+<td>451</td>
+</tr>
+<tr>
+<td>Bad Bad Words</td>
+<td><a
+href="https://www.kaggle.com/datasets/nicapotato/bad-bad-words">https://www.kaggle.com/datasets/nicapotato/bad-bad-words</a>
+</td>
+<td>1617</td>
+</tr>
+<tr>
+<td>Surge AI's The Obscenity List</td>
+<td><a href="https://github.com/surge-ai/profanity">https://github.com/surge-ai/profanity</a></td>
+<td>1598</td>
+</tr>
+<tr>
+<td>washyourmouthoutwithsoap</td>
+<td><a href="https://github.com/thisandagain/washyourmouthoutwithsoap">https://github.com/thisandagain/washyourmouthoutwithsoap</a></td>
+<td>147</td>
+</tr>
+<tr>
+<td>Multilingual swear profanity</td>
+<td><a href="https://www.kaggle.com/datasets/miklgr500/jigsaw-multilingual-swear-profanity">https://www.kaggle.com/datasets/miklgr500/jigsaw-multilingual-swear-profanity</a>
+</td>
+<td rowspan="3">Spanish</td>
+<td>366</td>
+</tr>
+<tr>
+<td>Surge AI's Spanish Dataset</td>
+<td>
+<a href="https://www.surgehq.ai/datasets/spanish-profanity-list">https://www.surgehq.ai/datasets/spanish-profanity-list</a>
+</td>
+<td>178</td>
+</tr>
+<tr>
+<td>washyourmouthoutwithsoap</td>
+<td><a href="https://github.com/thisandagain/washyourmouthoutwithsoap">https://github.com/thisandagain/washyourmouthoutwithsoap</a>
+</td>
+<td>125</td>
+</tr>
+</tbody>
+</table>

 These datasets undergo the following preprocessing steps -

@@ -124,7 +174,7 @@ These datasets undergo the following preprocessing steps -
 5. Then, duplicate values are removed from this new dataset.
 6. Finally, only the instances that match the regex pattern ```^[A-Za-z0-9_-]{0,11}$``` are kept, while the rest are removed. This keeps the number of instances to a minimum by removing unnecessary words or phrases.

-Preprocessing yields a dataset of 2885 instances, that helps ensure the generated IDs are safe for using in URLs and for sharing on social media platforms.
+Preprocessing yields a dataset of 3279 instances, that helps ensure the generated IDs are safe for using in URLs and for sharing on social media platforms.

 The preprocessing was done on this [Colab Jupyter notebook](https://colab.research.google.com/drive/1LRA3_Qa_0qCL9bkfo06ztjWkr-aP4rz1).
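
Note on preprocessing steps 5 and 6 shown in the hunk above: the following is a minimal TypeScript sketch of what deduplication followed by the regex filter might look like. It is an illustration only, not the code from the linked Colab notebook (which is not part of this commit). The names `ID_SAFE_PATTERN` and `filterDataset` are hypothetical, and steps 1 to 4 are omitted because this diff does not show them.

```typescript
// Hypothetical sketch of steps 5 and 6; names are illustrative, not from the ytid repository.

// Pattern quoted in the README: up to 11 characters drawn from A-Z, a-z, 0-9, "_" and "-".
const ID_SAFE_PATTERN = /^[A-Za-z0-9_-]{0,11}$/;

function filterDataset(words: string[]): string[] {
  // Step 5: remove duplicate values (Set preserves first-seen order).
  const unique = Array.from(new Set(words));
  // Step 6: keep only the entries that match the pattern.
  return unique.filter((word) => ID_SAFE_PATTERN.test(word));
}

// Placeholder sample data, not taken from the real datasets.
const sample = ["hell", "hell", "go to hell", "GoToHell", "averyveryverylongword", "niño"];

// Keeps "hell" and "GoToHell"; drops the duplicate, the phrase containing spaces,
// the entry longer than 11 characters, and the entry with a non-ASCII letter.
console.log(filterDataset(sample));
```

Because the quantifier is `{0,11}`, an empty string also matches the pattern as written, so a real pipeline would presumably need to drop empty entries in an earlier step. Entries with spaces or accented characters (common in the Spanish sources) are discarded by this filter unless an earlier step normalizes them.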

‎datasets/profaneWords.ts

+1 -1
Large diffs are not rendered by default.
