The best way to learn is by doing. This project is a from-scratch Transformer implementation, built using PyTorch with help from the following resources:
- Attention Is All You Need
- Stanford CS224N 2019 Readings
- The Annotated Transformer
- An even more annotated transformer
- Building a Transformer with PyTorch
This model comes with some starter training code, using the Gigaword dataset and the uncased DistilBERT tokenizer. It uses the following hyperparameter configuration (a rough sketch of wiring these up follows the list):
- Embedding and hidden layer dimensionality: 256
- Inner feedforward layer dimensionality: 1024
- Number of attention heads: 8
- Number of encoder and decoder layers: 8
- Dropout rate: 0.1
- Label smoothing rate: 0.1
- Learning rate: 0.0001 (AdamW), weight decay: 0.05
- Gradient clipping norm: 2.0
- Temperature: 2.0
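For orientation, here is roughly how those settings map onto the model and optimizer setup. This is a minimal sketch: the constructor argument names follow the printed Transformer instance further down, the values follow the list above, and the built-in label-smoothed cross-entropy is just one way to get the 0.1 smoothing; the actual code in this repo may wire things differently.

```python
import torch
import torch.nn as nn

# Argument names follow the printed Transformer instance below; values follow
# the hyperparameter list above. The real constructor may differ slightly.
model = Transformer(
    src_vocab_size=32100, tgt_vocab_size=32100,
    dim=256, dim_inner=1024, heads=8, num_layers=8,
    max_seq_length=256, dropout_rate=0.1, device="mps",
)

# AdamW with lr=1e-4 and weight decay 0.05, as listed above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

# Label smoothing 0.1; PAD (token id 0) is excluded from the loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)
```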
The beam search implementation was copied from mikecvet/beam and slightly modified.
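For reference, the core loop of a beam search decoder along these lines looks something like the sketch below. This is my own minimal illustration rather than the code from mikecvet/beam; the model's `(src, tgt)` call signature, the BOS/EOS ids, and the temperature-scaled log-softmax are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, src_ids, bos_id, eos_id, beam_width=3, max_len=64, temperature=2.0):
    """Keep the `beam_width` best partial sequences, extending each by one token per step."""
    beams = [(torch.tensor([bos_id]), 0.0)]            # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1].item() == eos_id:               # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            # Assumed call signature: model(src, tgt) -> (1, tgt_len, vocab) logits.
            logits = model(src_ids.unsqueeze(0), seq.unsqueeze(0))
            log_probs = F.log_softmax(logits[0, -1] / temperature, dim=-1)
            top_lp, top_ids = log_probs.topk(beam_width)
            for lp, tok in zip(top_lp, top_ids):
                candidates.append((torch.cat([seq, tok.view(1)]), score + lp.item()))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1].item() == eos_id for seq, _ in beams):
            break
    return beams[0][0]                                 # highest-scoring sequence of token ids
```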
Printing an instance of the Transformer class:
Model: Transformer(src_vocab_size=32100, tgt_vocab_size=32100, dim=256, dim_inner=1024, heads=8, num_layers=4,
max_seq_length=256, dropout_rate=Dropout(p=0.1, inplace=False), device=mps)
--> PositionalEncoder(dim=256, max_seq_len=256, device=mps)
--> Embedding(src_vocab_size=32100, dim=256) (encoder)
--> Embedding(tgt_vocab_size=32100, dim=256) (decoder)
--> [EncoderBlock(dim=256, num_attn_heads=8, num_layers=4, dim_inner=1024, dropout_rate=0.1, device=mps)]
--> (x4) [Encoder(dim=256, num_attn_heads=8, dim_inner=1024, dropout_rate=0.1, device=mps)]
--> [DecoderBlock(dim=256, num_attn_heads=8, num_layers=4, dim_inner=1024, dropout_rate=0.1, device=mps)]
--> (x4) [Decoder(dim=256, num_attn_heads=8, dim_inner=1024, dropout_rate=0.1, device=mps)]
--> Linear(dim=256, tgt_vocab_size=32100)
--> Dropout(dropout_rate=0.1)
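To show how those printed modules fit together, here is a rough stand-in built from PyTorch's stock nn.Transformer layers instead of this repo's hand-rolled EncoderBlock/DecoderBlock. It only illustrates the composition (embeddings + positions --> encoder stack --> decoder stack --> output projection), not the actual implementation; the learned positional parameter is also just a placeholder for the sinusoidal PositionalEncoder.

```python
import math
import torch
import torch.nn as nn

class MiniTransformer(nn.Module):
    """Stand-in mirroring the printed module list, using PyTorch's stock layers."""
    def __init__(self, vocab=32100, dim=256, dim_inner=1024, heads=8,
                 layers=4, max_len=256, dropout=0.1):
        super().__init__()
        self.dim = dim
        self.src_emb = nn.Embedding(vocab, dim)
        self.tgt_emb = nn.Embedding(vocab, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))   # placeholder positional encoding
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim_inner, dropout, batch_first=True), layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, dim_inner, dropout, batch_first=True), layers)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src_ids, tgt_ids):
        src = self.drop(self.src_emb(src_ids) * math.sqrt(self.dim) + self.pos[:, :src_ids.size(1)])
        tgt = self.drop(self.tgt_emb(tgt_ids) * math.sqrt(self.dim) + self.pos[:, :tgt_ids.size(1)])
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1)).to(tgt_ids.device)
        memory = self.encoder(src)                                 # (batch, src_len, dim)
        return self.out(self.decoder(tgt, memory, tgt_mask=causal))  # (batch, tgt_len, vocab)
```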
It was difficult to sufficiently train a model of this size on my MacBook Air, but the summarization results are ~alright. I experimented with various configurations of attention heads, layers, and dimensionality, and found that the settings above seemed to work best for this dataset specifically:
$ python3 src/main.py --temperature 2.0 --train 20
epoch loss: 2122.236578941345
Saved model data to transformer_state.data
epoch loss: 1960.4350719451904
Saved model data to transformer_state.data
epoch loss: 1945.9114050865173
Saved model data to transformer_state.data
epoch loss: 1935.8571014404297
(etc)
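The training loop behind those "epoch loss" lines looks roughly like the following sketch (my own illustration under assumed names, not the exact code in src/main.py): teacher-forced forward pass, label-smoothed loss, gradient clipping at norm 2.0, then an AdamW step.

```python
import torch

def train_epoch(model, loader, optimizer, criterion, clip_norm=2.0, device="mps"):
    """One pass over the data; returns the accumulated loss reported per epoch."""
    model.train()
    epoch_loss = 0.0
    for batch in loader:
        src = batch["input_ids"].to(device)
        tgt = batch["labels"].to(device)

        optimizer.zero_grad()
        # Predict token t from tokens < t (shift the target sequence by one).
        logits = model(src, tgt[:, :-1])
        loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss
```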
test struct: {
'document': 'the ruble fell to #,### here on friday from #,### on friday and the central bank intervened by selling ##.# million dollars , dealers said .',
'summary': 'ruble falls to #,### to the dollar',
'input_ids': tensor([ 101, 1996, 14548, 2571, 3062, 2000, 1001, 1010, 1001, 1001,
1001, 2182, 2006, 5958, 2013, 1001, 1010, 1001, 1001, 1001,
2006, 5958, 1998, 1996, 2430, 2924, 21116, 2011, 4855, 1001,
1001, 1012, 1001, 2454, 6363, 1010, 16743, 2056, 1012, 102,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, (etc)]),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, (etc)]),
'labels': tensor([ 101, 14548, 2571, 4212, 2000, 1001, 1010, 1001, 1001, 1001,
2000, 1996, 7922, 102, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, (etc)])
}
labels: ['[CLS]', 'rub', '##le', 'falls', 'to', '#', ',', '#', '#', '#', 'to', 'the', 'dollar', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', (etc)]
(etc)
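Batches of that shape come straight out of the Hugging Face tokenizer; the preprocessing looks roughly like the sketch below (assuming the transformers and datasets libraries; the max_length value and helper name are mine, the 'document'/'summary' fields match the struct above).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(example, max_length=256):
    """Tokenize one Gigaword example into input_ids / attention_mask / labels."""
    enc = tokenizer(example["document"], padding="max_length",
                    truncation=True, max_length=max_length)
    enc["labels"] = tokenizer(example["summary"], padding="max_length",
                              truncation=True, max_length=max_length)["input_ids"]
    return enc

# e.g. with the datasets library:
# dataset = load_dataset("gigaword").map(preprocess)
```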
next step candidates:
-9.81064919: [dollar rub french #, in falls to. at dollars million percent on - of billion euros january jobs]
-9.81797765: [dollar rub french #, in falls to. at dollars million on percent march - euros billion of trade]
-9.82840920: [dollar rub french #, in falls to. at dollars million percent on - of billion euros january]
(more omitted)
generated sequence IDs: [101, 7922, 14548, 2413, 1001, 1010, 1999, 4212, 2000, 1012, 2012, 6363, 2454, 3867, 2006, 1011, 1997, 4551, 19329, 2254, 5841, 102]
tokens: ['[CLS]', 'dollar', 'rub', 'french', '#', ',', 'in', 'falls', 'to', '.', 'at', 'dollars', 'million', 'percent', 'on', '-', 'of', 'billion', 'euros', 'january', 'jobs', '[SEP]']
expected tokens (labels): ['[CLS]', 'rub', '##le', 'falls', 'to', '#', ',', '#', '#', '#', 'to', 'the', 'dollar', '[SEP]', '[PAD]', '[PAD]',
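The token strings are just the tokenizer's view of the generated ids, e.g. (ids taken from the run above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# First few generated sequence ids from the run above:
ids = [101, 7922, 14548, 2413, 1001, 1010, 1999, 4212, 2000, 1012]

print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'dollar', 'rub', 'french', '#', ',', 'in', 'falls', 'to', '.']

print(tokenizer.decode(ids, skip_special_tokens=True))  # readable, detokenized form
```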