RDV-CNN model for document-level novelty detection. Comparison of our model with baselines on three popular datasets:
- Original paper: https://www.aclweb.org/anthology/C18-1237/
- Datasets used: DLND, Webis-CPC-11, and APWSJ
- Python 2.7
- InferSent (https://github.com/facebookresearch/InferSent): InferSent is used to train a sentence encoder on the SNLI corpus. The required files are already present in the sentence_encoder directory; please use them and do not use the files from the git repo, since those have been updated and are no longer compatible with our scripts. A pretrained model is also available in the sentence_encoder/encoder directory (see the encoder sketch after this list).
- PyTorch (for training the sentence encoder and inferring sentence embeddings)
- Version: 1.3.0
- Keras (for the BiLSTM + MLP baseline)
- Version: 2.3.1
- TensorFlow (for the Keras backend)
- Version: 1.14.0
- Theano (for the RDV-CNN model)
- Version: 1.0.0 (Upgrade as necessary if you face any issues)
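For reference, this is roughly how a pretrained InferSent encoder is loaded and used to embed sentences. The class name, constructor parameters, and file paths below follow the public facebookresearch/InferSent code and are assumptions here; the copy bundled in sentence_encoder may differ slightly, so treat this as a sketch rather than the exact interface of our scripts.

```python
# Minimal sketch of loading a pretrained InferSent encoder and encoding sentences.
# Names and paths follow the public InferSent repo and are assumptions; the bundled
# copy in sentence_encoder/ may differ.
import torch
from models import InferSent  # models.py ships with the InferSent code

params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': 1}
encoder = InferSent(params_model)
encoder.load_state_dict(torch.load('sentence_encoder/encoder/infersent1.pkl'))  # assumed path
encoder.set_w2v_path('glove.840B.300d.txt')  # word vectors used to build the vocabulary

sentences = ['A man is playing a guitar.', 'Someone plays music.']
encoder.build_vocab(sentences, tokenize=True)
embeddings = encoder.encode(sentences, tokenize=True)  # numpy array, one row per sentence
```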
- extract_sentence_embedding.py: Produces pre-trained sentence embeddings for the DLND data. Dependencies: the ../infersent directory and the DLND corpus must be present. Creates a pickle file containing the sentence embeddings.
- rdv.py: Produces the relative document matrix from the sentence embeddings, for input to the CNN. Input: the name of the pickle file containing the sentence embeddings (this path is hardcoded). A rough sketch of the idea follows this entry.
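The exact relative representation is defined inside rdv.py; as a rough illustration, the sketch below assumes each target sentence is paired with its most similar source sentence (by cosine similarity) and represented relative to it, with the per-sentence vectors stacked into a matrix. The feature composition here is an assumption, not necessarily what rdv.py computes.

```python
# Illustrative sketch of a relative document matrix built from sentence embeddings.
# The pairing rule and feature composition are assumptions; see rdv.py for the
# actual definition.
import numpy as np

def relative_document_matrix(target_embs, source_embs):
    """target_embs: (n_target, d) array, source_embs: (n_source, d) array."""
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    s = source_embs / np.linalg.norm(source_embs, axis=1, keepdims=True)
    sims = t.dot(s.T)              # cosine similarity, shape (n_target, n_source)
    nearest = sims.argmax(axis=1)  # most similar source sentence for each target sentence
    rows = []
    for i, j in enumerate(nearest):
        u, v = target_embs[i], source_embs[j]
        rows.append(np.concatenate([np.abs(u - v), u * v]))  # assumed relative features
    return np.vstack(rows)         # shape (n_target, 2 * d)
```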
- process.py: Takes the RDV file and converts it to a format suitable for input to the CNN program; produces an mr_dlnd.p pickle file.
- conv_net_sentences.py: The most important file: the main CNN program. Pass the path of the mr_dlnd.p file as a command-line argument. It creates the output file containing the predictions for each target and source document pair (a usage sketch follows).
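A typical invocation looks like the sketch below; check the argument parsing at the top of conv_net_sentences.py in case additional flags are expected for your setup.

```python
# Sketch: run the CNN on the DLND relative document matrices from Python.
# Equivalent to running `python conv_net_sentences.py mr_dlnd.p` in a shell.
import subprocess
subprocess.call(['python', 'conv_net_sentences.py', 'mr_dlnd.p'])
```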
- webis_data_preprocessing.py: Converts the Webis-CPC-11 data to .pickle format containing the source and target sentences as well as the gold values (see the sketch after this entry).
- Input: Webis-CPC-11 directory should be present in the working directory
- Output: webis_data.pickle
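The sketch below shows one way such a pickle can be produced, assuming the standard Webis-CPC-11 layout of <id>-original.txt, <id>-paraphrase.txt, and <id>-metadata.txt files and a "Paraphrase: Yes/No" field in the metadata. The dictionary keys are illustrative, not necessarily the keys used by webis_data_preprocessing.py.

```python
# Sketch of converting Webis-CPC-11 into a pickle of source/target sentences
# plus gold labels. File layout and dict keys are assumptions.
import glob
import os
import pickle
from nltk.tokenize import sent_tokenize

data = []
for meta_path in sorted(glob.glob(os.path.join('Webis-CPC-11', '*-metadata.txt'))):
    prefix = meta_path[:-len('-metadata.txt')]
    with open(prefix + '-original.txt') as f:
        source = f.read().strip()
    with open(prefix + '-paraphrase.txt') as f:
        target = f.read().strip()
    with open(meta_path) as f:
        meta = f.read()
    if not source or not target:
        continue  # some corpus entries are empty
    data.append({'source_sentences': sent_tokenize(source),
                 'target_sentences': sent_tokenize(target),
                 'gold': 1 if 'Paraphrase: Yes' in meta else 0})

with open('webis_data.pickle', 'wb') as f:
    pickle.dump(data, f)
```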
- webis_sentence_embedding.py: Produces sentence embeddings for the Webis data.
- Input: webis_data.pickle should exist in the cwd
- Output: webis_embeddings_data_{1024/2048}_attn.p
- process.py: Takes the sentence embeddings and converts them to a format suitable for input to the CNN program.
- Input: webis_embeddings_data_{1024/2048}_attn.p
- Output: mr_webis_1024_attn.p
- webis_baselines.py: Produces class probabilities for the various baselines (see the doc2vec sketch after this entry).
- Input: webis_data.pickle and doc2vec.bin
- Output: webis_baselines_class_probs.p
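As an illustration of one such baseline, the sketch below scores each source/target pair by the cosine similarity of doc2vec vectors inferred with gensim. The actual set of baselines in webis_baselines.py may be broader, the exact call needed to load doc2vec.bin depends on how that model was saved, and the dictionary keys follow the preprocessing sketch above, so all of these are assumptions.

```python
# Sketch of a doc2vec cosine-similarity baseline over the Webis pairs.
# Loading details and data keys are assumptions.
import pickle
import numpy as np
from gensim.models import Doc2Vec

model = Doc2Vec.load('doc2vec.bin')

with open('webis_data.pickle', 'rb') as f:
    data = pickle.load(f)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

scores = []
for item in data:
    src = ' '.join(item['source_sentences']).lower().split()
    tgt = ' '.join(item['target_sentences']).lower().split()
    scores.append(cosine(model.infer_vector(src), model.infer_vector(tgt)))
```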
- webis_bilstm_mlp_baseline.py: Runs the BiLSTM + MLP model on the Webis sentence embeddings and evaluates it with 10-fold cross-validation; saves the result after each fold and prints the overall result once all folds are done (see the sketch after this entry).
- Input: webis_embeddings_data_{1024/2048}_attn.p
- Output: webis_ten_fold_progress_bilstm_mlp_baseline.p
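A minimal version of such a model and evaluation loop is sketched below; layer sizes, padding scheme, and training settings are assumptions, so refer to webis_bilstm_mlp_baseline.py for the real configuration.

```python
# Sketch of a BiLSTM + MLP classifier over per-sentence embeddings with
# 10-fold cross-validation. Hyperparameters are assumptions.
import numpy as np
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense, Dropout, Masking
from sklearn.model_selection import StratifiedKFold

def build_model(max_sents, emb_dim):
    model = Sequential([
        Masking(mask_value=0.0, input_shape=(max_sents, emb_dim)),
        Bidirectional(LSTM(128)),      # BiLSTM over the sentence sequence
        Dense(64, activation='relu'),  # MLP on top
        Dropout(0.5),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

def ten_fold(X, y):
    """X: (n_docs, max_sents, emb_dim) zero-padded array; y: (n_docs,) labels."""
    accs = []
    for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True).split(X, y):
        model = build_model(X.shape[1], X.shape[2])
        model.fit(X[train_idx], y[train_idx], epochs=5, batch_size=32, verbose=0)
        accs.append(model.evaluate(X[test_idx], y[test_idx], verbose=0)[1])
    return np.mean(accs)
```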
- conv_net_sentences.py: The most important file: the main CNN program. Pass the path of the mr_webis_1024.p file as a command-line argument. It creates the output file containing the predictions for each target and source document pair.
- Input: mr_webis_1024.p
- Output: webis_1024_cnn_output.pickle
- make_prc_curves.py: Analyzes the results of the various baselines and the BiLSTM + MLP method, produces various scores, and plots a precision-recall curve. Also stores the class probabilities for each technique in a pickle file (see the sketch after this entry).
- Input: webis_1024_cnn_output.pickle, webis_baselines_class_probs.p, webis_ten_fold_progress_bilstm_mlp_baseline.p
- Output: webis_prc_curves_data.p
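The curve itself can be produced along the lines of the sketch below; the structure assumed for the stored class probabilities (a dict of method name to true labels and scores) is an assumption about the pickle contents.

```python
# Sketch of plotting precision-recall curves from stored class probabilities.
# The pickle structure assumed here ({method: (y_true, y_score)}) is an assumption.
import pickle
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

with open('webis_baselines_class_probs.p', 'rb') as f:
    probs = pickle.load(f)

for name, (y_true, y_score) in probs.items():
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    ap = average_precision_score(y_true, y_score)
    plt.plot(recall, precision, label='%s (AP=%.3f)' % (name, ap))

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.savefig('webis_prc_curves.png')
```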
- analyze_cnn_output.py: Analyzes the output of conv_net_sentences.py to display the results (precision, recall, etc.) of the RDV-CNN model (see the sketch after this entry).
- Input: webis_1024_cnn_output.pickle
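Scoring the CNN output boils down to something like the sketch below; the pickle structure assumed here (parallel lists of gold labels and predictions) is a guess, so check analyze_cnn_output.py for the real format.

```python
# Sketch of computing precision/recall/F1 from the CNN output pickle.
# The (gold, predicted) structure is an assumption.
import pickle
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

with open('webis_1024_cnn_output.pickle', 'rb') as f:
    gold, predicted = pickle.load(f)

p, r, f1, _ = precision_recall_fscore_support(gold, predicted, average='binary')
print('Accuracy: %.4f  Precision: %.4f  Recall: %.4f  F1: %.4f'
      % (accuracy_score(gold, predicted), p, r, f1))
```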
- make_sentence_embedding.py: Produces pre-trained sentence embeddings for the documents in the apwsj_parsed_documents directory. Dependencies: the /novelty/infersent directory and the apwsj_parsed_documents directory must be present. Creates a pickle file containing the sentence embeddings.
- make_rdvs.py: Generates the relative document matrix (RDV file) from the sentence embeddings, for input to the CNN. Input: the name of the pickle file containing the sentence embeddings; output: the RDV file.
- process.py: Converts the RDV file to a format suitable for input to the CNN program; produces an mr_apwsj.p pickle file.
- conv_net_sentences.py: The most important file: the main CNN program. Pass the path of the mr_apwsj.p file as a command-line argument. It creates the output file containing the predictions for each target and source document pair.