Here we explain the data format for LogicNLG. Note that we pre-select a subset of columns beforehand to reduce irrelevant input and to keep the input within the model's length limit. The method used to link this subset of columns is described in the parser documentation.
The three files (train_lm, val_lm, test_lm) are used for training and testing all the models, in the following format:
{
    table_id: [
        [
            sent1,
            linked columns1,
            table title,
            template1
        ],
        [
            sent2,
            linked columns2,
            table title,
            template2
        ],
        ...
    ],
    table_id: [
        ...
    ]
}
The template sentence is generated using the entity linking file, which is not 100% accurate and may miss some numbers or entities. In addition, to speed up data loading, we preprocess the training file into train_lm_preprocessed.json, which appends the linearized table to each sentence.
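The per-table layout above can be traversed with a few lines of Python. The sketch below builds a miniature in-memory example in the documented format (the table id, sentence, linked columns, title, and template values are hypothetical) and flattens it into per-sentence tuples; with a real file you would replace `json.loads` with `json.load` on train_lm.json:

```python
import json

# Miniature, made-up example mimicking the train_lm.json layout:
# each table id maps to a list of [sentence, linked columns, title, template].
sample = json.loads("""
{
  "2-12345678-1.html.csv": [
    ["the team scored the most points in 2004",
     [0, 2],
     "team season results",
     "the team scored the most [ENT] in [ENT]"]
  ]
}
""")

def iter_examples(data):
    """Yield one (table_id, sentence, linked_columns, title, template) per entry."""
    for table_id, entries in data.items():
        for sent, cols, title, template in entries:
            yield table_id, sent, cols, title, template

examples = list(iter_examples(sample))
print(len(examples))  # 1
```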
The files val_lm_pos_neg.json and test_lm_pos_neg.json are used for adversarial evaluation, where each sentence is paired with an adversarial example containing a minor modification, to test the model's sensitivity to logic errors. The data is in the following format:
{
    table_id: [
        {
            pos: [
                sent1,
                linked columns1,
                table title,
                template1
            ],
            neg: [
                sent1-adv,
                linked columns1,
                table title,
                template1-adv
            ]
        },
        {
            pos: [
                sent2,
                linked columns2,
                table title,
                template2
            ],
            neg: [
                sent2-adv,
                linked columns2,
                table title,
                template2-adv
            ]
        },
        ...
    ],
    table_id: [
        {
            ...
        },
        {
            ...
        },
        ...
    ],
    ...
}
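The pos/neg pairing above is what an adversarial evaluation loop consumes: for each pair, a model should prefer the positive sentence over its minimally modified negative. A minimal traversal sketch, again on a hypothetical in-memory miniature of the format (all field values are made up):

```python
import json

# Made-up miniature of the *_lm_pos_neg.json layout; each pair holds a
# "pos" and "neg" entry of [sentence, linked columns, title, template].
sample = json.loads("""
{
  "2-12345678-1.html.csv": [
    {
      "pos": ["the team won 10 games", [1], "team season results", "the team won [ENT] games"],
      "neg": ["the team won 12 games", [1], "team season results", "the team won [ENT] games"]
    }
  ]
}
""")

def iter_pairs(data):
    """Yield (table_id, positive_entry, negative_entry) for each adversarial pair."""
    for table_id, table_pairs in data.items():
        for pair in table_pairs:
            yield table_id, pair["pos"], pair["neg"]

pairs = list(iter_pairs(sample))
```

A typical metric is then the fraction of pairs where the model assigns a higher score (e.g. sentence log-likelihood) to `pos` than to `neg`.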
vocab.json and full_vocab.json are for the Transformer model with the copy mechanism, freq_list.json and stop_words.json are for the entity linking model, and tabfact_bootstrap.json is for training the semantic parser.
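As an illustration of how a vocabulary file like vocab.json is typically consumed, the sketch below encodes tokens into ids with an out-of-vocabulary fallback. This assumes a token-to-id mapping with an `<unk>` entry; the actual schema of vocab.json in the repository may differ:

```python
import json

# Hypothetical token -> id vocabulary (not the real contents of vocab.json).
vocab = json.loads('{"<unk>": 0, "the": 1, "team": 2, "scored": 3}')

def encode(tokens, vocab):
    """Map tokens to integer ids, falling back to the <unk> id for OOV words."""
    unk = vocab["<unk>"]
    return [vocab.get(t, unk) for t in tokens]

ids = encode(["the", "team", "won"], vocab)
print(ids)  # [1, 2, 0]
```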