Posted by Avinava Dubey, Research Scientist, Google Research

Natural language processing (NLP) models based on Transformers, such as BERT, RoBERTa, T5, or GPT3, are successful for a wide variety of tasks and a mainstay of modern NLP research. The versatility and robustness of Transformers are the primary drivers behind their wide-scale adoption, leading them to be easily adapted for a diverse range of sequence-based tasks - as a seq2seq model for translation, summarization, generation, and others, or as a standalone encoder for sentiment analysis, POS tagging, machine reading comprehension, etc. The key innovation in Transformers is the introduction of a self-attention mechanism, which computes similarity scores for all pairs of positions in an input sequence, and can be evaluated in parallel for each token of the input sequence, avoiding the sequential dependency of recurrent neural networks, and enabling Transformers to vastly outperform previous sequence models like LSTMs.

A limitation of existing Transformer models and their derivatives, however, is that the full self-attention mechanism has computational and memory requirements that are quadratic with the input sequence length. With commonly available current hardware and model sizes, this typically limits the input sequence to roughly 512 tokens, and prevents Transformers from being directly applicable to tasks that require larger context, like question answering, document summarization or genome fragment classification. Two natural questions arise: 1) Can we achieve the empirical benefits of quadratic full Transformers using sparse models with computational and memory requirements that scale linearly with the input sequence length? 2) Is it possible to show theoretically that these linear Transformers preserve the expressivity and flexibility of the quadratic full Transformers?

We address both of these questions in a recent pair of papers. In “ETC: Encoding Long and Structured Inputs in Transformers”, presented at EMNLP 2020, we present the Extended Transformer Construction (ETC), a novel method for sparse attention in which one uses structural information to limit the number of computed pairs of similarity scores. This reduces the quadratic dependency on input length to linear and yields strong empirical results in the NLP domain. Then, in “Big Bird: Transformers for Longer Sequences”, presented at NeurIPS 2020, we introduce another sparse attention method, called BigBird, that extends ETC to more generic scenarios where prerequisite domain knowledge about the structure present in the source data may be unavailable. Moreover, we show theoretically that our proposed sparse attention mechanism preserves the expressivity and flexibility of the quadratic full Transformers.
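To make the scaling contrast concrete, the following is a minimal NumPy sketch, an illustration only and not the actual ETC or BigBird implementation. It compares the dense score matrix of full self-attention, whose size grows as n^2, with a sparse attention pattern in the spirit of these methods (a local sliding window plus a few global tokens), whose number of computed pairs grows linearly in n. The function names and the window and num_global parameters are illustrative assumptions.

import numpy as np

def full_attention_scores(q, k):
    # Full self-attention: similarity scores for ALL pairs of positions.
    # q, k have shape (n, d); the score matrix is (n, n), so compute and
    # memory grow quadratically with the sequence length n.
    return q @ k.T / np.sqrt(q.shape[-1])

def sparse_attention_mask(n, window=3, num_global=2):
    # Boolean (n, n) mask for a sparse pattern: each token attends to a
    # local window of neighbors plus a few designated global tokens, and
    # the global tokens attend to (and are attended by) every position.
    # The number of True entries is O(n * (window + num_global)), i.e.
    # linear in n for a fixed window size and number of global tokens.
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True      # local sliding window
    mask[:, :num_global] = True    # every token attends to global tokens
    mask[:num_global, :] = True    # global tokens attend to every token
    return mask

n, d = 16, 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))

scores = full_attention_scores(q, k)   # dense (n, n): n * n = 256 pairs
mask = sparse_attention_mask(n)
print(f"full pairs: {n * n}, sparse pairs: {int(mask.sum())}")

A practical implementation would never materialize the full (n, n) mask; the point of the sketch is only that the number of score computations per token stays constant as the sequence grows, which is what turns the quadratic dependency into a linear one.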