We provide the source code for the paper "Structure-Infused Copy Mechanisms for Abstractive Summarization", accepted at COLING'18. If you find the code useful, please cite the following paper.
Author = {Kaiqiang Song and Lin Zhao and Fei Liu},
Title = {Structure-Infused Copy Mechanisms for Abstractive Summarization},
Booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING)},
Year = {2018}}
Our system seeks to re-write a lengthy sentence, often the 1st sentence of a news article, to a concise, title-like summary. The average input and output lengths are 31 words and 8 words, respectively.
The code takes as input a text file with one sentence per line. It writes the output to a text file in the same directory as the input, with a name ending in ".result.summary", where each source sentence is replaced by a title-like summary.
Example input and output are shown below.
An estimated 4,645 people died in Hurricane Maria and its aftermath in Puerto Rico , according to an academic report published Tuesday in a prestigious medical journal .
hurricane maria kills 4,645 in puerto rico .
The code is written in Python (v2.7) and Theano (v1.0.1). We suggest the following environment:
- A Linux machine (Ubuntu) with GPU (Cuda 8.0)
- Python (v2.7)
- Theano (v1.0.1)
- Stanford CoreNLP
- Pyrouge
To install Python (v2.7), run the command:
$ wget https://repo.continuum.io/archive/Anaconda2-5.0.1-Linux-x86_64.sh
$ bash Anaconda2-5.0.1-Linux-x86_64.sh
$ source ~/.bashrc
To install Theano and its dependencies, run the commands below (you may want to add export MKL_THREADING_LAYER=GNU to "~/.bashrc" for future use).
$ conda install numpy scipy mkl nose sphinx pydot-ng
$ conda install theano pygpu
To download the Stanford CoreNLP toolkit and use it as a server, run the command below. The CoreNLP toolkit helps derive structure information (part-of-speech tags, dependency parse trees) from source sentences.
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
$ unzip stanford-corenlp-full-2018-02-27.zip
$ cd stanford-corenlp-full-2018-02-27
$ nohup java -mx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 &
$ cd -
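Once the server is running, it can be queried over HTTP. The sketch below shows one way to request part-of-speech tags and dependency parses for a sentence and to read them out of the JSON reply. The function names `annotate` and `extract_structure` are hypothetical, and this is not necessarily how toolkit.py talks to the server.

```python
import json

try:  # Python 3
    from urllib.parse import quote
    from urllib.request import Request, urlopen
except ImportError:  # Python 2.7
    from urllib import quote
    from urllib2 import Request, urlopen

SERVER_URL = "http://localhost:9000"  # port chosen in the command above


def annotate(sentence, server_url=SERVER_URL):
    """Send one sentence to the CoreNLP server and return the parsed JSON reply."""
    props = {"annotators": "tokenize,ssplit,pos,depparse", "outputFormat": "json"}
    url = server_url + "/?properties=" + quote(json.dumps(props))
    request = Request(url, data=sentence.encode("utf-8"))  # data makes it a POST
    response = urlopen(request, timeout=15)
    return json.loads(response.read().decode("utf-8"))


def extract_structure(reply):
    """Collect (word, POS) pairs and (relation, head, dependent) triples."""
    tags, arcs = [], []
    for sent in reply["sentences"]:
        tags.extend((tok["word"], tok["pos"]) for tok in sent["tokens"])
        arcs.extend((d["dep"], d["governorGloss"], d["dependentGloss"])
                    for d in sent["basicDependencies"])
    return tags, arcs
```

Calling `extract_structure(annotate("some sentence ."))` requires the server started above to be reachable on port 9000.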
To install Pyrouge, run the command below. Pyrouge is a Python wrapper for the ROUGE toolkit, an automatic metric used for summary evaluation.
$ pip install pyrouge
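To give a rough idea of what ROUGE measures, the sketch below computes a simplified ROUGE-1 score (clipped unigram overlap) between a system summary and a reference. It is a stand-in for illustration only: the official ROUGE toolkit applies its own tokenization and options, so its numbers can differ. The function name `rouge_1` is hypothetical.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, F1) of clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / float(sum(cand.values()))
    recall = overlap / float(sum(ref.values()))
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Use pyrouge itself for any scores that are reported or compared against published results.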
Clone this repo. Download this TAR file (
) containing vocabulary files and pretrained models. Move the TAR file to the folder "struct_infused_summ" and uncompress it.
$ git clone https://github.com/KaiQiangSong/struct_infused_summ/
$ mv model_coling18.tar.gz struct_infused_summ
$ cd struct_infused_summ
$ tar -xvzf model_coling18.tar.gz
$ rm model_coling18.tar.gz
Extract structural features from a list of input files. The file ./test_data/test_filelist.txt contains absolute (or relative) paths to individual files (test_000.txt and test_001.txt are toy files). Each file contains a number of source sentences, one sentence per line. Then, execute the command:
$ python toolkit.py -f ./test_data/test_filelist.txt
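To run the system on your own sentences, you need input files in the same layout plus a file list that names them. A minimal sketch of preparing both is given below. The helper `write_file_list` and the file names `input_000.txt` and `my_filelist.txt` are hypothetical and not part of the repo.

```python
import io
import os


def write_file_list(sentence_groups, out_dir, list_name="my_filelist.txt"):
    """Write each group of sentences to its own file, one sentence per line,
    and return the path of a file list naming all of them."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    paths = []
    for i, sentences in enumerate(sentence_groups):
        path = os.path.abspath(os.path.join(out_dir, "input_%03d.txt" % i))
        with io.open(path, "w", encoding="utf-8") as f:
            for sentence in sentences:
                # Collapse internal whitespace so a sentence never spans lines.
                f.write(u" ".join(sentence.split()) + u"\n")
        paths.append(path)
    list_path = os.path.join(out_dir, list_name)
    with io.open(list_path, "w", encoding="utf-8") as f:
        f.write(u"\n".join(paths) + u"\n")
    return list_path
```

The returned path can then be passed to toolkit.py with the -f option in place of ./test_data/test_filelist.txt.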
Generate the model configuration file in the ./settings/ folder.
$ python genTestDataSettings.py ./test_data/test_filelist.txt ./settings/my_test_settings
After that, you need to modify the "dataset" field of the
file to point it to the new settings file: 'dataset':'settings/my_test_settings.json'.
Run the testing script. The summary files are written to the same directory as the input, with names ending in ".result.summary".
$ python generate.py
struct_edge is the default model. It corresponds to the "2way+relation" architecture described in the paper. You can modify the file generate.py (Line 152-153) by globally replacing struct_edge with struct_node to enable the "2way+word" architecture.
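After generation, it is often convenient to view each source sentence next to its summary. The sketch below pairs them up for every file named in a file list. It assumes that the summary file name is the input path with ".result.summary" appended and that summaries are line-aligned with the sources; the README only states the suffix, so check both against your output. The function names are hypothetical.

```python
import io

SUMMARY_SUFFIX = ".result.summary"  # suffix named in this README


def read_lines(path):
    with io.open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def pair_with_summaries(file_list_path):
    """Yield (source sentence, summary) pairs for every file in the list."""
    for input_path in read_lines(file_list_path):
        input_path = input_path.strip()
        if not input_path:
            continue
        sources = read_lines(input_path)
        # Assumption: the summary file is the input path plus the suffix.
        summaries = read_lines(input_path + SUMMARY_SUFFIX)
        if len(sources) != len(summaries):
            raise ValueError("line count mismatch for %s" % input_path)
        for pair in zip(sources, summaries):
            yield pair
```

Relative paths in the file list are resolved against the current working directory, so run the sketch from the same directory in which generate.py was run.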
Create a folder to save the model files. ./model/struct_node is for the "2way+word" architecture and ./model/struct_edge for the "2way+relation" architecture.
$ mkdir -p ./model/struct_node ./model/struct_edge
Extract structural features from the input files. source_file.txt and summary_file.txt in the ./train_data/ folder are toy files containing source and summary sentences, one sentence per line. Often, tens of thousands of (source, summary) pairs are required for training.
$ python toolkit.py ./train_data/source_file.txt
$ python toolkit.py ./train_data/summary_file.txt
Adjust file names using the commands below. train.Ndocument, train.dfeature, and train.Nsummary respectively contain the source sentences, structural features of source sentences, and summary sentences.
$ cd ./train_data/
$ mv source_file.txt.Ndocument train.Ndocument
$ mv source_file.txt.feature train.dfeature
$ mv summary_file.txt.Ndocument train.Nsummary
$ cd -
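Before training, it can help to confirm that the renamed files still line up. The sketch below checks that the source and summary files under a common prefix have the same number of lines. It assumes the two files are line-aligned (one pair per line). The layout of the .dfeature file is not described here, so it is left out of the default check. The function names are hypothetical.

```python
import io


def count_lines(path):
    with io.open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(prefix, suffixes=("Ndocument", "Nsummary")):
    """Return the shared line count of the prefix.<suffix> files, or raise."""
    counts = {s: count_lines("%s.%s" % (prefix, s)) for s in suffixes}
    if len(set(counts.values())) != 1:
        raise ValueError("line counts differ: %r" % counts)
    return next(iter(counts.values()))
```

For the toy data, the call would be `check_parallel("./train_data/train")`.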
Repeat the previous two steps for the validation data, which are used for early stopping. The ./valid_data/ folder contains toy files.
$ python toolkit.py ./valid_data/source_file.txt
$ python toolkit.py ./valid_data/summary_file.txt
$ cd ./valid_data/
$ mv source_file.txt.Ndocument valid.Ndocument
$ mv source_file.txt.feature valid.dfeature
$ mv summary_file.txt.Ndocument valid.Nsummary
$ cd -
Generate the model configuration file in the ./settings/ folder.
$ python genTrainDataSettings.py ./train_data/train ./valid_data/valid ./settings/my_train_settings
After that, you need to modify the "dataset" field of the
file to point to the new settings file: 'dataset':'settings/my_train_settings.json'.
Download the GloVe embeddings and uncompress.
$ wget http://nlp.stanford.edu/data/glove.6B.zip
$ unzip glove.6B.zip
$ rm glove.6B.zip
Modify the "vocab_emb_init_path" field in the file
from "vocab_emb_init_path": "../../vocab/glove.6B.100d.txt"
to "vocab_emb_init_path": "glove.6B.100d.txt".
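A quick way to confirm that the embedding path resolves and that the file has the expected dimensionality is to load it once by hand. The sketch below reads the GloVe text format (a word followed by space-separated floats on each line) into a dictionary. The function name `load_glove` is hypothetical, and the training code may load the file differently.

```python
import io


def load_glove(path, dim=100):
    """Read GloVe text vectors into a {word: [float, ...]} dictionary."""
    vectors = {}
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

For the file used above, `load_glove("glove.6B.100d.txt")` should return a non-empty dictionary whose values each hold 100 floats.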
Create a vocabulary file from
. Words appearing fewer than 5 times are excluded.
$ python get_vocab.py my_vocab
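The frequency cutoff can be illustrated in a few lines. The sketch below keeps only the words that occur at least `min_count` times in a list of whitespace-tokenized sentences. It is not the implementation in get_vocab.py and does not reproduce the .Vocab file format; the function name `build_vocab` is hypothetical.

```python
from collections import Counter


def build_vocab(sentences, min_count=5):
    """Return words seen at least min_count times, most frequent first."""
    counts = Counter(word for s in sentences for word in s.split())
    kept = [(w, c) for w, c in counts.items() if c >= min_count]
    # Sort by descending count, then alphabetically, for a stable word order.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in kept]
```

With the default `min_count=5`, a word that appears four times or fewer is dropped, which matches the rule stated above.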
Modify the path to the vocabulary file in
from Vocab_Giga = loadFromPKL('../../dataset/gigaword_eng_5/giga_new.Vocab')
to Vocab_Giga = loadFromPKL('my_vocab.Vocab').
To train the model, run the command below.
$ THEANO_FLAGS='floatX=float32' python train.py
The training program stops when it reaches the maximum number of epochs (30 epochs). This number can be modified by changing the
field in ./settings/training.json. The model files are saved in the folder ./model/.
"2way+relation" is the default architecture. It uses the settings file
. You can modify the 'network' field of the options_loader.py file to train the "2way+word" architecture.
(Optional) train the model with early stopping.
You might want to change the parameters used for early stopping. These are specified in
and explained below. If early stopping is enabled, the best model files, model_best.npz, will be saved in the ./model/struct_edge/ folder.
"sample":true, # enable model checkpoint
"sampleMin":10000, # the first checkpoint occurs after 10K batches
"sampleFreq":2000, # there is a checkpoint every 2K batches afterwards
"earlyStop":true, # enable early stopping
"earlyStop_method":"valid_err", # based on validation loss
"earlyStop_bound":62000, # the training program stops if the valid loss has no improvement after 62K batches
"rate_bound":24000 # halve the learning rate if the valid loss has no improvement after 24K batches
62K batches (used for earlyStop_bound) correspond to about 1 epoch for our dataset. 24K batches (used for rate_bound) are slightly less than half of an epoch.
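The way these parameters interact can be sketched as a small state machine. The code below is one reading of the comments above, not the logic in the training script. The class `EarlyStopper`, the function `checkpoints`, the use of ">=" in the comparisons, and the choice to halve the rate again after each further rate_bound batches without improvement are all assumptions.

```python
class EarlyStopper(object):
    """Track validation loss at checkpoints and decide what to do next."""

    def __init__(self, lr, early_stop_bound=62000, rate_bound=24000):
        self.lr = lr
        self.early_stop_bound = early_stop_bound
        self.rate_bound = rate_bound
        self.best_loss = float("inf")
        self.best_batch = 0   # batch index of the best validation loss
        self.last_decay = 0   # batch index of the last improvement or decay

    def update(self, batch, valid_loss):
        """Return 'improved', 'halve', 'stop', or 'wait' for this checkpoint."""
        if valid_loss < self.best_loss:
            self.best_loss = valid_loss
            self.best_batch = self.last_decay = batch
            return "improved"  # a trainer would save the best model here
        if batch - self.best_batch >= self.early_stop_bound:
            return "stop"
        if batch - self.last_decay >= self.rate_bound:
            self.lr *= 0.5
            self.last_decay = batch
            return "halve"
        return "wait"


def checkpoints(sample_min=10000, sample_freq=2000, limit=20000):
    """Batch indices at which the validation loss would be measured."""
    return list(range(sample_min, limit + 1, sample_freq))
```

A training loop would call `update` once per checkpoint returned by `checkpoints` and act on the returned string.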
You will switch to the file
. Modify the path to the vocabulary file in train_2.py
from Vocab_Giga = loadFromPKL('../../dataset/gigaword_eng_5/giga_new.Vocab')
to Vocab_Giga = loadFromPKL('my_vocab.Vocab')
to point it to your vocabulary file.
Run the below command to perform the 2nd-stage training. Two files
will be generated, containing the best model parameters and system configurations for the "2way+relation" architecture.
$ python train_2.py
This project is licensed under the BSD License - see the LICENSE.md file for details.
We gratefully acknowledge the work of Kelvin Xu, whose code in part inspired this project.