Commit 9057148

fix README
1 parent 5c2b5ec commit 9057148

1 file changed: +31 -0 lines changed

README.md

@@ -237,6 +237,37 @@ Perform BERT for NER supervised training and test/cross-validation.
 bert-ner --help
 ```
 
+## BERT-Pre-training:
+
+### collectcorpus
+
+```
+collectcorpus --help
+
+Usage: collectcorpus [OPTIONS] FULLTEXT_FILE SELECTION_FILE CORPUS_FILE
+
+Reads the fulltext from a CSV or SQLITE3 file (see also altotool) and
+write it to one big text file.
+
+FULLTEXT_FILE: The CSV or SQLITE3 file to read from.
+
+SELECTION_FILE: Consider only a subset of all pages that is defined by the
+DataFrame that is stored in <selection_file>.
+
+CORPUS_FILE: The output file that can be used by bert-pregenerate-trainingdata.
+
+Options:
+--chunksize INTEGER Process the corpus in chunks of <chunksize>.
+default:10**4
+
+--processes INTEGER Number of parallel processes. default: 6
+--min-line-len INTEGER Lower bound of line length in output file.
+default:80
+
+--help Show this message and exit.
+
+```
+
 ### bert-pregenerate-trainingdata
 
 Generate data for BERT pre-training from a corpus text file where
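
Reviewer note: the help text added in this hunk documents `--chunksize` and `--min-line-len` but does not say what happens to short lines. The sketch below is one plausible reading, not the actual `collectcorpus` implementation. It assumes that page texts are processed in batches of `chunksize` and that lines shorter than `min_line_len` characters are dropped from the output. The function names `chunks`, `keep_long_lines`, and `build_corpus` are hypothetical, and the parallelism behind `--processes` is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunks(items: Iterable[str], chunksize: int) -> Iterator[List[str]]:
    """Yield successive lists holding at most `chunksize` items each."""
    it = iter(items)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


def keep_long_lines(texts: Iterable[str], min_line_len: int = 80) -> Iterator[str]:
    """Yield stripped lines that are at least `min_line_len` characters long.

    Assumption: shorter lines are discarded entirely.
    """
    for text in texts:
        for line in text.splitlines():
            line = line.strip()
            if len(line) >= min_line_len:
                yield line


def build_corpus(pages: Iterable[str],
                 chunksize: int = 10**4,
                 min_line_len: int = 80) -> List[str]:
    """Collect the surviving lines of all pages, one chunk at a time."""
    corpus: List[str] = []
    for chunk in chunks(pages, chunksize):
        corpus.extend(keep_long_lines(chunk, min_line_len))
    return corpus
```

The defaults mirror the values shown in the help text (`10**4` and `80`). If the real tool handles short lines differently, for example by merging them with neighbouring lines, the README would benefit from saying so.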
