- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Label-Agnostic Sequence Labeling by Copying Nearest Neighbors
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
Simplifies encoder self-attention of Transformer-based NMT models by replacing all but one attention head with fixed positional attention patterns that require neither training nor external knowledge. Improves translation quality in low-resource settings thanks to the strong positional prior injected into the model.
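A minimal sketch of one such fixed pattern, assuming a "previous token" head (just for illustration, not the paper's exact set of patterns; note that nothing here is trained):

```python
import numpy as np

def previous_token_pattern(seq_len: int) -> np.ndarray:
    """Fixed attention pattern: each position puts all of its weight on the
    previous token (the first token attends to itself). Nothing is learned."""
    weights = np.zeros((seq_len, seq_len))
    weights[0, 0] = 1.0
    for i in range(1, seq_len):
        weights[i, i - 1] = 1.0
    return weights

# Such a hard-coded matrix stands in for the softmax(QK^T / sqrt(d)) weights of a
# learned head; the head's value projection is still applied as usual.
values = np.random.randn(5, 8)                    # (seq_len, d_head) dummy values
context = previous_token_pattern(5) @ values
```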
- Attention Is All You Need
Introduces the Transformer architecture, the first to do away with RNNs and CNNs (recurrence in particular is computationally expensive and cannot be parallelised across time steps). BERT is based on the Transformer.
- BERT
Uses Transformers to build a deeply bidirectional model that encodes both left-to-right and right-to-left context. Simplifies task-specific architectures by reusing the pre-training architecture for fine-tuning (all parameters are updated end-to-end, and fine-tuning is fast). Works well for feature-based approaches as well. Pre-trained with the MLM and NSP objectives, chosen for their utility on downstream tasks.
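A minimal sketch of the MLM corruption rule (15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged); the token id and vocabulary size below are the usual BERT-base values but are assumptions here:

```python
import random

MASK_ID = 103        # assumed [MASK] id in BERT's WordPiece vocabulary
VOCAB_SIZE = 30522   # assumed BERT-base vocabulary size

def mlm_corrupt(token_ids):
    """Return (corrupted_ids, labels); labels are -100 at positions not selected."""
    corrupted, labels = [], []
    for tok in token_ids:
        if random.random() < 0.15:           # select 15% of tokens for prediction
            labels.append(tok)
            r = random.random()
            if r < 0.8:                      # 80%: replace with [MASK]
                corrupted.append(MASK_ID)
            elif r < 0.9:                    # 10%: replace with a random token
                corrupted.append(random.randrange(VOCAB_SIZE))
            else:                            # 10%: keep the original token
                corrupted.append(tok)
        else:
            labels.append(-100)              # ignored by the loss
            corrupted.append(tok)
    return corrupted, labels
```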
- GPT-2
The GPT-1 Transformer uses constrained self-attention in which every token can only attend to context on its left. This way of using the Transformer is called a "Transformer decoder", since it supports text generation (auto-regressive decoding).
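A minimal sketch of the causal mask behind this decoder-style self-attention (plain NumPy, toy shapes):

```python
import numpy as np

def causal_self_attention(q, k, v):
    """Scaled dot-product attention where position i may only attend to j <= i."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                        # (seq, seq)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -1e9                                  # block attention to the right
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over allowed positions
    return weights @ v

x = np.random.randn(6, 16)                               # (seq_len, d_model) toy input
out = causal_self_attention(x, x, x)
```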
- THIEVES ON SESAME STREET! MODEL EXTRACTION OF BERT-BASED APIS
- Course Webpage
- Variational and Information Theoretic Principles in Neural Networks
- Generating Sentences from a Continuous Space
- Expectation Maximization (EM)
- Reading Wikipedia to Answer Open-Domain Questions
- Latent Retrieval for Weakly Supervised Open Domain Question Answering
- A Primer in BERTology: What we know about how BERT works
- A Structural Probe for Finding Syntax in Word Representations
- Do NLP Models Know Numbers? Probing Numeracy in Embeddings
- Knowledge Enhanced Contextual Word Representations
- Designing and Interpreting Probes with Control Tasks
- Language Models as Knowledge Bases?
- On Extractive and Abstractive Neural Document Summarization with Transformer Language Models
- Faithful to the Original: Fact Aware Neural Abstractive Summarization
Augments the attention mechanism of neural models with factual triples extracted by an open information extraction system.
- Ensure the Correctness of the Summary: Incorporate Entailment Knowledge into Abstractive Sentence Summarization
Entailment-aware encoder (via multi-task learning) and entailment-aware decoder (via entailment reward maximisation).
- Evaluating the Factual Consistency of Abstractive Text Summarization
- Assessing The Factual Accuracy of Generated Text
Compares different information extraction systems for evaluating the factual accuracy of generated text.
- Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports
- Retrieve, Rerank and Rewrite: Soft Template Based Neural Summarization
- Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference
Studies whether existing natural language inference systems can be used to evaluate the factual correctness of generated summaries, and finds models trained on existing NLI datasets to be inadequate for this task.
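A hedged sketch of the kind of NLI-based scoring being evaluated: treat the source as the premise and the summary as the hypothesis, and read off the entailment probability. The checkpoint name is an assumption (any MNLI-style classifier would do), and the entailment label index is looked up from the model config rather than hard-coded:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"                     # assumed off-the-shelf NLI checkpoint
tok = AutoTokenizer.from_pretrained(name)
nli = AutoModelForSequenceClassification.from_pretrained(name).eval()

def entailment_score(source: str, summary: str) -> float:
    """Probability that the source (premise) entails the summary (hypothesis)."""
    enc = tok(source, summary, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**enc).logits.softmax(dim=-1)[0]
    ent_idx = next(i for i, lab in nli.config.id2label.items() if "entail" in lab.lower())
    return probs[ent_idx].item()
```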
- GENERATING WIKIPEDIA BY SUMMARIZING LONG SEQUENCES
We also found this very relevant paper that does something similar to our core idea: extract the crucial parts of long documents, then summarize them abstractively. However, they use simple extractive methods for the first-stage filtering, whereas our ideas can be fancier and incorporate more intuition about 'facts'. They also use older state-of-the-art models for step 2 (abstractive summarization), whereas better alternatives are available now. They hint at "..results...suggesting future work in improving the extraction step could result in significant improvements. One possibility is to train a supervised model to predict relevance which we leave as future work". Their major contribution is modifying the Transformer architecture to get a decoder that supports really long documents, although they did this for the multi-document summarization scenario. They also mention that "for our task optimizing for perplexity correlates with increased ROUGE and human judgment. As perplexity decreases we see improvements in the model outputs, in terms of fluency, factual accuracy, and narrative complexity", so proceeding with the perplexity idea (for the extractor) we discussed last time could be a good direction.
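A rough sketch of that perplexity-based extractor idea, assuming GPT-2 as the scoring LM and "keep the k most predictable sentences" as the (debatable) selection heuristic:

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_perplexity(sentence: str) -> float:
    """Per-token perplexity of a sentence under GPT-2."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss      # mean next-token cross-entropy
    return math.exp(loss.item())

def extract(sentences, k=5):
    """First-stage filter: keep the k sentences the LM scores as most predictable."""
    return sorted(sentences, key=sentence_perplexity)[:k]
```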
- YOUR CLASSIFIER IS SECRETLY AN ENERGY BASED MODEL AND YOU SHOULD TREAT IT LIKE ONE
- A Simple Framework for Contrastive Learning of Visual Representations
- XLNet and Transformer-XL
A vanilla Transformer takes fixed-length sequences as input and has no notion of "memory": all of its computations are stateless (statelessness was actually one of the Transformer's major selling points, since no state means computation can be parallelised), so there is an upper limit on the distance of the relationships it can model. Transformer-XL is a simple extension that resolves this problem. The idea: what if we added recurrence to the Transformer? Adding recurrence at the word level would just make it an RNN, so instead recurrence is added at the segment level, i.e. state is carried between consecutive sequences of computations. Transformer-XL accomplishes this by caching the hidden states of the previous segment and passing them as extra keys/values when processing the current segment.
Transformer-XL also introduces relative positional embeddings. Instead of an embedding representing the absolute position of a word, it uses an embedding that encodes the relative distance between words. This embedding is used while computing the attention score between any two words, so the model learns how to compute attention for words that are n positions before or after the current word.
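A minimal single-head sketch of that segment-level recurrence: cache the previous segment's hidden states and prepend them, with gradients stopped, to the keys/values of the current segment (relative positional embeddings are omitted here):

```python
import torch

def attend_with_memory(h_curr, h_mem, w_q, w_k, w_v):
    """h_curr: (cur_len, d) current segment; h_mem: (mem_len, d) cached previous segment."""
    h_ext = torch.cat([h_mem.detach(), h_curr], dim=0)   # no gradient into the cache
    q = h_curr @ w_q                                     # queries only from the current segment
    k, v = h_ext @ w_k, h_ext @ w_v                      # keys/values also see the cache
    att = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return att @ v

d = 32
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
mem = torch.zeros(0, d)                                  # empty memory for the first segment
for segment in torch.randn(3, 8, d):                     # three consecutive segments of 8 tokens
    out = attend_with_memory(segment, mem, w_q, w_k, w_v)
    mem = segment                                        # this segment becomes the next cache
```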
The XLNet model is forced to model bidirectional dependencies via permutation language modeling: in expectation, it learns dependencies between all combinations of inputs, in contrast to traditional language models that only learn dependencies in one direction. The conceptual difference between BERT and XLNet: XLNet learns to predict the words in an arbitrary order, but autoregressively and sequentially (not necessarily left-to-right), whereas BERT predicts all masked words simultaneously. In permutation language modeling we do not change the actual order of words in the input sentence; we only change the order in which we predict them.
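A minimal sketch of how a sampled factorization order turns into an attention mask (conceptually what permutation language modeling does; XLNet's actual two-stream attention is more involved):

```python
import numpy as np

def permutation_mask(seq_len: int, rng=np.random):
    """mask[i, j] = True means position i may attend to position j, i.e. j is
    predicted before i in the sampled factorization order."""
    order = rng.permutation(seq_len)         # e.g. [2, 0, 3, 1]: the prediction order
    rank = np.empty(seq_len, dtype=int)
    rank[order] = np.arange(seq_len)         # rank[pos] = step at which pos is predicted
    mask = rank[None, :] < rank[:, None]     # attend only to earlier-predicted positions
    return order, mask

order, mask = permutation_mask(4)
# The actual word order of the input is untouched; only the prediction order
# (and hence the visible context for each prediction) follows `order`.
```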
- RNN – Andrej Karpathy's blog: The Unreasonable Effectiveness of Recurrent Neural Networks
- LSTM – Christopher Olah's blog: Understanding LSTM Networks, and R2RT's Written Memories: Understanding, Deriving and Extending the LSTM
RNNs on sequential data: when we don't need any further context (it's pretty obvious the next word is going to be "sky"), i.e. the gap between the relevant information and the place where it's needed is small, RNNs can learn to use the past information. Unfortunately, as that gap grows, RNNs become unable to learn to connect the information. LSTMs are a special kind of RNN capable of learning such long-term dependencies. The cell state is like a conveyor belt: it runs straight down the entire chain with only minor linear interactions, so it is very easy for information to flow along it unchanged. The LSTM can remove or add information to the cell state, carefully regulated by structures called gates. This helps gradient flow, and the structured gates add more flexibility to the model.
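A minimal NumPy sketch of a single LSTM step, showing the gated update of the "conveyor belt" cell state (weights are random; shapes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One time step. W: (4*hidden, input+hidden), b: (4*hidden,)."""
    z = W @ np.concatenate([x, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget / input / output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g                         # cell state: mostly flows through unchanged
    h = o * np.tanh(c)                             # exposed hidden state
    return h, c

hidden, inp = 8, 4
W = np.random.randn(4 * hidden, inp + hidden) * 0.1
b = np.zeros(4 * hidden)
h = c = np.zeros(hidden)
for x in np.random.randn(5, inp):                  # run over a toy sequence
    h, c = lstm_step(x, h, c, W, b)
```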
- Attention – Christopher Olah: Attention and Augmented Recurrent Neural Networks
Discusses the use of attention for various applications like translation, image captioning and audio transcription.
- Seq2Seq – Nathan Lintz: Sequence Modeling With Neural Networks
Why Seq2Seq: since the decoder sees an encoded representation of the input sequence as well as the translation so far, it can make more intelligent predictions about future words. For example, a standard language model might see the word "crane" and not be sure whether the next word should be about the bird or about heavy machinery. Given an encoder context, however, the decoder might realize that the input sequence was about construction, not flying animals, choose the appropriate next word, and produce a more accurate translation.
Without attention: compressing an entire input sequence into a single fixed vector is quite challenging, and the context is biased towards the end of the encoder sequence, so it can miss important information at the start of the sequence.
With attention: the mechanism holds onto all encoder states and gives the decoder a weighted average of them for each element of the decoder sequence. The decoder can take "glimpses" into the encoder sequence to figure out which element it should output next, using different portions of the encoder sequence as context while processing the decoder sequence instead of a single fixed representation of the input. This lets the network focus on the most important parts of the input, producing smarter predictions for the next word in the decoder sequence, and also improves backpropagation into the individual encoder states.
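A minimal sketch of that weighted average with plain dot-product scores (real systems usually add learned projections on top):

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """decoder_state: (d,); encoder_states: (src_len, d).
    Returns the context vector and the attention weights over source positions."""
    scores = encoder_states @ decoder_state        # one score per encoder state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax: where to 'glimpse'
    context = weights @ encoder_states             # weighted average of encoder states
    return context, weights

enc = np.random.randn(7, 16)                       # 7 source positions, dim 16
dec = np.random.randn(16)
ctx, attn = attention_context(dec, enc)            # ctx conditions the next decoder step
```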
- Transformer – Google Blog
Why the Transformer is needed: recurrent models, due to their sequential nature (computation tied to the position of each symbol in the input and output), do not allow parallelisation within training and therefore have trouble learning long-term dependencies from memory.
The constraint of sequential computation has also been attacked with CNN models. However, in those CNN-based approaches the number of operations needed to relate an input position to an output position grows with the distance between them: O(n) for ConvS2S and O(n log n) for ByteNet, which makes it harder to learn dependencies between distant positions.
The Transformer reduces the number of sequential operations needed to relate two symbols from the input/output sequences to a constant, O(1). It achieves this with the multi-head attention mechanism, which can model dependencies regardless of their distance in the input or output sentence.
The novel move of the Transformer, however, is to eliminate recurrence completely and replace it with attention to handle the dependencies between input and output. It drops not only recurrence but also convolution in favour of self-attention (a.k.a. intra-attention), and it leaves more room for parallelisation. The authors claim the Transformer is the first model to rely entirely on self-attention to compute representations of its input and output. The encoder-decoder model is auto-regressive at each step, i.e. it uses previously generated symbols as extra input while generating the next symbol: x_i + y_{i−1} → y_i.
At each step, the Transformer applies a self-attention mechanism that directly models relationships between all words in a sentence, regardless of their respective positions. In the earlier example "I arrived at the bank after crossing the river", to determine that the word "bank" refers to the shore of a river and not a financial institution, the Transformer can learn to immediately attend to the word "river" and make this decision in a single step.
Positional embeddings: the authors settle on a fixed variant using sin and cos functions to give the network information about the relative positions of tokens in the sequence, motivating the sinusoidal functions by the model's ability to generalise to sequences longer than those encountered during training.
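A minimal sketch of those fixed encodings, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)), assuming an even model dimension:

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed positional encodings that get added to the token embeddings."""
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

pe = sinusoidal_positions(50, 512)                       # one vector per position
```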
The Transformer reduces the number of operations required to relate (especially distant) positions in the input and output sequences to O(1). However, this comes at the cost of reduced effective resolution, because of averaging over attention-weighted positions. To offset this cost the authors propose multi-head attention.
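A minimal single-example sketch of multi-head attention: project into h heads, run scaled dot-product attention in each, concatenate, project back (dimensions are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, n_heads, Wq, Wk, Wv, Wo):
    """x: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Project, then split the feature dimension into heads: (n_heads, seq, d_head)
    q, k, v = ((x @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
               for W in (Wq, Wk, Wv))
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # per-head weights
    heads = att @ v                                            # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)    # re-join the heads
    return concat @ Wo

d, h = 64, 8
Wq, Wk, Wv, Wo = (np.random.randn(d, d) / np.sqrt(d) for _ in range(4))
out = multi_head_attention(np.random.randn(10, d), h, Wq, Wk, Wv, Wo)
```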
Self-attention: in the encoder, self-attention layers process queries, keys and values that all come from the same place, i.e. the output of the previous encoder layer. Each position in the encoder can attend to all positions in the previous layer of the encoder.
In the encoder phase, the Transformer first generates an initial representation/embedding for each word in the input sentence. Then, for each word, self-attention aggregates information from all other words in the context of the sentence and creates a new representation. Successively building new representations on top of previous ones is repeated multiple times, in parallel for every word.
The decoder acts similarly, generating one word at a time in a left-to-right pattern. It attends to the previously generated decoder words and to the final representation of the encoder.
- Reformer – Google Blog
Focuses primarily on how the self-attention operation scales with sequence length, and proposes an alternative attention mechanism to incorporate information from much longer contexts into language models.
- BERT – Google Blog
Deeply bidirectional, vs. ELMo's shallow way of plugging two separately trained representations together.
Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding for each word in the vocabulary; for example, the word "bank" would have the same context-free representation in "bank account" and "bank of the river". Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence "I accessed the bank account", a unidirectional contextual model would represent "bank" based on "I accessed the" but not "account". BERT, however, represents "bank" using both its previous and next context ("I accessed the ... account"), starting from the very bottom of a deep neural network, making it deeply bidirectional.
- Autoregressive Models
At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
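A toy sketch of that loop; `next_token_logits` is a made-up stand-in for any real autoregressive model:

```python
import numpy as np

rng = np.random.default_rng(0)

def next_token_logits(prefix, vocab_size=10):
    """Stand-in for a real model: scores over the vocabulary given the prefix."""
    return rng.normal(size=vocab_size)

def generate(bos_id=0, eos_id=9, max_len=20):
    tokens = [bos_id]
    for _ in range(max_len):
        logits = next_token_logits(tokens)   # condition on everything generated so far
        nxt = int(np.argmax(logits))         # greedy choice; sampling is also common
        tokens.append(nxt)                   # the new symbol becomes part of the input
        if nxt == eos_id:
            break
    return tokens

print(generate())
```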
Self Supervised Learning
SVM, Kernel and kernel Functions
K-means, PCA, SVD
Bagging, Boosting
Feature Selection
Model Selection
Optimization Algorithms
HMM
Transformer
Active Learning
Dependency Parsing
POS tagging
AdaBoost, AdaGrad, Ensembles: check the ML/NLP WhatsApp group
Random Forests