1 file changed, +31 -0 lines changed

@@ -237,6 +237,37 @@ Perform BERT for NER supervised training and test/cross-validation.
bert-ner --help
```

+ ## BERT-Pre-training:
+
+ ### collectcorpus
+
+ ```
+ collectcorpus --help
+
+ Usage: collectcorpus [OPTIONS] FULLTEXT_FILE SELECTION_FILE CORPUS_FILE
+
+   Reads the fulltext from a CSV or SQLITE3 file (see also altotool) and
+   write it to one big text file.
+
+   FULLTEXT_FILE: The CSV or SQLITE3 file to read from.
+
+   SELECTION_FILE: Consider only a subset of all pages that is defined by the
+   DataFrame that is stored in <selection_file>.
+
+   CORPUS_FILE: The output file that can be used by bert-pregenerate-trainingdata.
+
+ Options:
+   --chunksize INTEGER     Process the corpus in chunks of <chunksize>.
+                           default:10**4
+
+   --processes INTEGER     Number of parallel processes. default: 6
+   --min-line-len INTEGER  Lower bound of line length in output file.
+                           default:80
+
+   --help                  Show this message and exit.
+
+ ```
+
### bert-pregenerate-trainingdata

Generate data for BERT pre-training from a corpus text file where
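The help text added in this diff lists three positional arguments and three tunable options for `collectcorpus`. As a rough illustration of how those pieces fit together, the sketch below assembles the command line in Python using only the flags and defaults shown in the help output. The helper name `build_collectcorpus_cmd` and the file names (`fulltext.sqlite3`, `selection.pkl`, `corpus.txt`) are hypothetical placeholders, not part of the project; in particular, the storage format of the selection DataFrame is an assumption.

```python
import shlex
from typing import List


def build_collectcorpus_cmd(
    fulltext_file: str,
    selection_file: str,
    corpus_file: str,
    chunksize: int = 10**4,   # default taken from the help text
    processes: int = 6,       # default taken from the help text
    min_line_len: int = 80,   # default taken from the help text
) -> List[str]:
    """Assemble an argv list for collectcorpus (hypothetical helper)."""
    return [
        "collectcorpus",
        "--chunksize", str(chunksize),
        "--processes", str(processes),
        "--min-line-len", str(min_line_len),
        # Positional arguments, in the order given by the usage line.
        fulltext_file,
        selection_file,
        corpus_file,
    ]


# File names below are placeholders chosen for illustration only.
cmd = build_collectcorpus_cmd("fulltext.sqlite3", "selection.pkl", "corpus.txt")
print(shlex.join(cmd))
```

Assuming `collectcorpus` is installed and on the `PATH`, the resulting list could be passed to `subprocess.run(cmd, check=True)`. According to the help text, the `CORPUS_FILE` it writes is the input expected by `bert-pregenerate-trainingdata`.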