This repository provides the code for BioSEPBERT, a neuroscience language representation model designed for brain region text mining tasks such as named entity recognition (NER) and relation extraction (RE).
Install the dependencies from requirements.txt as follows (Python >= 3.8 required):
pip install -r requirements.txt
If you want the CUDA-enabled build of PyTorch, install it with:
pip install torch==1.8.1+cu101 torchvision==0.9.1+cu101 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
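After installing, you can confirm that the CUDA build is active with a standard PyTorch check (not specific to BioSEPBERT):

    # Sanity check: confirm the CUDA build of PyTorch is installed and a GPU is visible.
    import torch

    print(torch.__version__)           # should report 1.8.1+cu101 after the command above
    print(torch.cuda.is_available())   # True means the CUDA build found a usable GPU
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))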
We provide two versions of pre-trained weights:
- BioSEPBERT-NER - fine-tuned on the WhiteText corpus
- BioSEPBERT-RE - fine-tuned on the WhiteText connectivity corpus
You can also use other pre-trained weights (see the loading sketch below):
- BioBERT - pre-trained on biomedical corpora
- PubMedBERT - pre-trained on biomedical corpora
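Assuming the checkpoints follow the standard Hugging Face transformers layout, loading any of them looks like the sketch below; the local path and the Hub model IDs are illustrative assumptions, not identifiers pinned by this repository.

    # Minimal loading sketch (assumes a standard Hugging Face transformers layout;
    # the local path and Hub IDs are illustrative, not pinned by this repository).
    from transformers import AutoModel, AutoTokenizer

    # BioSEPBERT weights unpacked locally (see the model/ directory below)
    tokenizer = AutoTokenizer.from_pretrained("../model/")
    model = AutoModel.from_pretrained("../model/")

    # Other biomedical checkpoints can be pulled from the Hugging Face Hub, e.g.:
    # AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
    # AutoModel.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")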
We provide a pre-processed version of the benchmark dataset for each task:
- Named Entity Recognition (36.3 MB): a dataset for brain region named entity recognition
- Relation Extraction (46.6 MB): a dataset for brain region connectivity relation extraction

All datasets are available in the dataset folder.
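The exact file layout is defined by the files under dataset/; as a starting point, here is a minimal reader sketch assuming the common CoNLL-style token-per-line NER format (token and tag per line, blank line between sentences). The file name train.txt is hypothetical; check the actual split names in the folder.

    # Minimal reader sketch, ASSUMING a CoNLL-style token-per-line NER format.
    # The file name "train.txt" is hypothetical; check dataset/NER/ for the real files.
    def read_conll(path):
        sentences, tokens, tags = [], [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:              # blank line marks a sentence boundary
                    if tokens:
                        sentences.append((tokens, tags))
                        tokens, tags = [], []
                    continue
                parts = line.split()
                tokens.append(parts[0])   # first column: token
                tags.append(parts[-1])    # last column: entity tag
        if tokens:                        # flush a final sentence with no trailing blank line
            sentences.append((tokens, tags))
        return sentences

    sents = read_conll("../dataset/NER/1/train.txt")
    print(len(sents), "sentences; first:", sents[0])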
After downloading one of the pre-trained weights, unpack it into the model/ directory.
You can change model_name and model_type to PubMedBERT or biobert to use other pre-trained weights; the other parameters can also be adjusted, as described in our paper.
The following command runs the NER fine-tuning code with default arguments.
python run_ner.py \
    --task_name=BioSEPBERT \
    --data_dir=../dataset/NER/1 \
    --model_dir=../model/ \
    --model_name=BioSEPBERT \
    --model_type=BioSEPBERT \
    --output_dir=../ \
    --max_length=512 \
    --train_batch_size=16 \
    --eval_batch_size=16 \
    --learning_rate=5e-5 \
    --epochs=3 \
    --logging_steps=-1 \
    --save_steps=10 \
    --seed=2022 \
    --do_train \
    --do_predict
The following command runs the RE fine-tuning code with default arguments.
python run_re.py \
    --task_name=BioSEPBERT \
    --data_dir=../dataset/RE/1 \
    --model_dir=../model/ \
    --model_name=BioSEPBERT \
    --model_type=BioSEPBERT \
    --output_dir=../ \
    --max_length=512 \
    --train_batch_size=16 \
    --eval_batch_size=16 \
    --learning_rate=5e-5 \
    --epochs=3 \
    --warmup_proportion=0.1 \
    --earlystop_patience=100 \
    --max_grad_norm=0.0 \
    --logging_steps=-1 \
    --save_steps=1 \
    --seed=2021 \
    --do_train \
    --do_predict
The following command runs evaluation only (prediction without training).
python run_ner.py \
    --task_name=BioSEPBERT \
    --data_dir=../dataset/NER/1 \
    --model_dir=../model/ \
    --model_name=BioSEPBERT \
    --model_type=BioSEPBERT \
    --output_dir=../ \
    --max_length=512 \
    --train_batch_size=16 \
    --eval_batch_size=16 \
    --learning_rate=5e-5 \
    --epochs=3 \
    --logging_steps=-1 \
    --save_steps=10 \
    --seed=2022 \
    --do_predict
For questions, additional information, or suggestions, please contact us at aali@hust.edu.cn.