Here we explain the data format for LogicNLG. Please note that we select only a subset of each table's columns beforehand, which reduces irrelevant input and alleviates the problem of oversized inputs. The method for linking the subset of columns is described in the parser documentation.

The training/dev/test LM files

The three files (train_lm, val_lm, test_lm) are used for training and testing all the models. Each file has the following format:

{
  table_id: [ 
    [
      sent1,
      linked columns1,
      table title,
      template1
    ],
    [
      sent2,
      linked columns2,
      table title,
      template2
    ],
    ...
  ],
  table_id: [
    ...
  ]
}
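
The layout above can be traversed with a few lines of Python. The following is a minimal sketch, not code from the LogicNLG repository: the names `LMExample`, `iter_lm_examples`, and `load_lm_file` are hypothetical, and the sketch assumes each file is a JSON object mapping a table id to a list of four-field entries, exactly as drawn in the schema. The type of the linked-columns field is not specified here, so it is left untyped.

```python
import json
from typing import Any, Dict, Iterator, List, NamedTuple


class LMExample(NamedTuple):
    """One entry of an LM file (hypothetical helper type, not part of LogicNLG)."""
    table_id: str
    sentence: str
    linked_columns: Any  # exact type is not specified in this README
    title: str
    template: str


def load_lm_file(path: str) -> Dict[str, List[list]]:
    """Load an LM file from a path supplied by the caller."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_lm_examples(data: Dict[str, List[list]]) -> Iterator[LMExample]:
    """Yield one LMExample per [sent, linked columns, table title, template] entry."""
    for table_id, entries in data.items():
        for entry in entries:
            # Only the first four fields are described in the schema;
            # any additional fields are ignored by this sketch.
            sentence, linked_columns, title, template = entry[:4]
            yield LMExample(table_id, sentence, linked_columns, title, template)
```

Flattening the nested structure into named records keeps the positional indices (0 for the sentence, 3 for the template) in one place, so downstream code does not need to repeat them.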

The template sentences are generated using the entity linking file, which is not 100% accurate and may miss some numbers or entities. In addition, to speed up data loading, we preprocess the training file into train_lm_preprocessed.json, which appends the "linearized table" to each sentence entry.

The adversarial evaluation files

These files (val_lm_pos_neg.json, test_lm_pos_neg.json) are used for adversarial evaluation. Each sentence is paired with an adversarial example that differs from it by a mild modification, in order to test the model's sensitivity to logic errors. The data is in the following format:

{
  table_id: [ 
    {
      pos:[
        sent1,
        linked columns1,
        table title,
        template1        
      ],
      neg:[
        sent1-adv,
        linked columns1,
        table title,
        template1-adv        
      ]
    },
    {
      pos:[
        sent2,
        linked columns2,
        table title,
        template2  
      ],
      neg:[
        sent2-adv,
        linked columns2,
        table title,
        template2-adv
      ]    
    },
    ...
  ],
  table_id: [
    {
      ...
    },
    {
      ...    
    },
    ...
  ],
  ...
}
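
One plausible way to consume these pairs is to check how often a model prefers the original sentence over its adversarial counterpart. The sketch below is an illustration under stated assumptions, not the evaluation script from the LogicNLG repository: `iter_adversarial_pairs` and `adversarial_accuracy` are hypothetical names, the keys are assumed to be the strings "pos" and "neg", and `score` is a caller-supplied function (for example, a model's log-likelihood of a sentence given a table) that this README does not define.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def iter_adversarial_pairs(
    data: Dict[str, List[dict]],
) -> Iterator[Tuple[str, str, str]]:
    """Yield (table_id, positive sentence, adversarial sentence) triples."""
    for table_id, pairs in data.items():
        for pair in pairs:
            # Index 0 of each entry is the sentence, as in the schema above.
            yield table_id, pair["pos"][0], pair["neg"][0]


def adversarial_accuracy(
    data: Dict[str, List[dict]],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where score(table_id, pos) > score(table_id, neg).

    `score` is a hypothetical callable; higher values mean the model
    finds the sentence more plausible for the given table.
    """
    total = 0
    correct = 0
    for table_id, pos_sentence, neg_sentence in iter_adversarial_pairs(data):
        total += 1
        if score(table_id, pos_sentence) > score(table_id, neg_sentence):
            correct += 1
    if total == 0:
        raise ValueError("no adversarial pairs found")
    return correct / total
```

Passing the scorer in as a function keeps the traversal independent of any particular model, so the same loop works for any scoring scheme the caller chooses.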

Other files

vocab.json and full_vocab.json are for the Transformer model with copy mechanism; freq_list.json and stop_words.json are for the entity linking model; tabfact_bootstrap.json is for training the semantic parser.