Large language models have taken the public attention by storm – no pun intended. In just half a decade, large language models – transformers – have almost completely changed the field of natural language processing. Moreover, they have also begun to revolutionize fields such as computer vision and computational biology.

Since transformers have such a big impact on everyone's research agenda, I wanted to flesh out a short reading list (an extended version of a tweet I shared yesterday) for machine learning researchers and practitioners getting started.

The list below is meant to be read mostly chronologically, and I am entirely focusing on academic research papers. Of course, there are many additional resources out there that are useful as well.

PS: An extended version of this reading list, featuring more papers, can be found here: https://magazine.choicefinancialwealthmanagement.com/p/understanding-large-language-models.

Understanding the Main Architecture and Tasks

If you are new to transformers / large language models, it makes the most sense to start at the beginning.

(1) Neural Machine Translation by Jointly Learning to Align and Translate (2014) by Bahdanau, Cho, and Bengio, https://arxiv.org/abs/1409.0473

I recommend beginning with the above paper if you have a few minutes to spare. It introduces an attention mechanism for recurrent neural networks (RNNs) to improve long-range sequence modeling capabilities. This allows RNNs to translate longer sentences more accurately – a motivation for developing the original transformer architecture later.
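
If you'd like to see the core idea in code, below is a minimal PyTorch sketch of additive (Bahdanau-style) attention. The class name and tensor shapes are my own illustration, and it simplifies the paper's formulation (which conditions on the previous decoder hidden state inside an RNN decoder):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention: score(s, h) = v^T tanh(W_s s + W_h h)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_s = nn.Linear(hidden_dim, hidden_dim, bias=False)  # decoder state projection
        self.W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)  # encoder state projection
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, hidden), enc_states: (batch, src_len, hidden)
        scores = self.v(torch.tanh(self.W_s(dec_state).unsqueeze(1) + self.W_h(enc_states)))
        weights = torch.softmax(scores.squeeze(-1), dim=-1)    # (batch, src_len)
        context = torch.bmm(weights.unsqueeze(1), enc_states)  # weighted sum of encoder states
        return context.squeeze(1), weights

attn = AdditiveAttention(hidden_dim=16)
ctx, w = attn(torch.randn(2, 16), torch.randn(2, 5, 16))
print(ctx.shape, w.shape)  # torch.Size([2, 16]) torch.Size([2, 5])
```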

Source: https://arxiv.org/abs/1409.0473




(2) Attention Is All You Need (2017) by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin, https://arxiv.org/abs/1706.03762

The paper above introduces the original transformer architecture consisting of an encoder and a decoder part that will become relevant as separate modules later. Moreover, this paper introduces concepts such as the scaled dot-product attention mechanism, multi-head attention blocks, and positional input encodings that remain the foundation of modern transformers.
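
The scaled dot-product attention at the heart of this paper, softmax(QK^T / sqrt(d_k)) V, fits in a few lines. Here is a bare-bones sketch (names and shapes are illustrative, not an optimized implementation):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # attention weights sum to 1 per query
    return weights @ v

q = k = v = torch.randn(1, 8, 10, 64)  # 8 heads, sequence length 10, head dim 64
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 10, 64])
```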

Source: https://arxiv.org/abs/1706.03762




(3) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) by Devlin, Chang, Lee, and Toutanova, https://arxiv.org/abs/1810.04805

Following the original transformer architecture, large language model research started to bifurcate into two directions: encoder-style transformers for predictive modeling tasks such as text classification and decoder-style transformers for generative modeling tasks such as translation, summarization, and other forms of text creation.

The BERT paper above introduces the original concepts of masked-language modeling and next-sentence prediction, and it remains an influential encoder-style architecture. If you are interested in this research branch, I recommend following up with RoBERTa, which simplified the pretraining objectives by removing the next-sentence prediction task.
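
To illustrate the masked-language-modeling objective, here is a rough sketch of BERT's 80/10/10 masking recipe (roughly 15% of positions become prediction targets; of those, 80% are replaced with a mask token, 10% with a random token, and 10% are left unchanged). The function name and the -100 ignore-index convention are my choices, not part of the paper:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """BERT-style masking sketch (ignores special tokens for simplicity)."""
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~masked] = -100  # unmasked positions are excluded from the loss

    input_ids = input_ids.clone()
    # 80% of selected positions -> [MASK]
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id
    # half of the remaining 20% -> random token; the rest stay unchanged
    random_tok = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
    input_ids[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]
    return input_ids, labels

ids = torch.randint(5, 100, (2, 12))  # toy token IDs
masked_ids, labels = mask_tokens(ids, mask_token_id=103, vocab_size=100)
```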

Source: https://arxiv.org/abs/1810.04805




(4) Improving Language Understanding by Generative Pre-Training (2018) by Radford and Narasimhan, https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035

The original GPT paper introduced the popular decoder-style architecture and pretraining via next-word prediction. Where BERT can be considered a bidirectional transformer due to its masked language model pretraining objective, GPT is a unidirectional, autoregressive model. While GPT embeddings can also be used for classification, the GPT approach is at the core of today's most influential LLMs, such as ChatGPT.
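
A tiny sketch of what next-word prediction means in practice: a lower-triangular causal mask so each position only attends to earlier positions, plus a shift-by-one cross-entropy loss. All names and the random logits below are purely illustrative:

```python
import torch
import torch.nn.functional as F

seq_len = 6
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # position t sees positions <= t
print(causal_mask)

# Toy logits standing in for a decoder's output, and the shifted LM loss:
vocab_size, batch = 50, 2
tokens = torch.randint(vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions at positions 0..T-2
    tokens[:, 1:].reshape(-1),               # targets are the next tokens
)
```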

If you are interested in this research branch, I recommend following up with the GPT-2 and GPT-3 papers. These two papers illustrate that LLMs are capable of zero- and few-shot learning and highlight the emergent abilities of LLMs. GPT-3 is also still a popular baseline and base model for training current-generation LLMs such as ChatGPT – we will cover the InstructGPT approach that led to ChatGPT later as a separate entry.

Source: https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035




(5) BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (2019) by Lewis, Liu, Goyal, Ghazvininejad, Mohamed, Levy, Stoyanov, and Zettlemoyer, https://arxiv.org/abs/1910.13461.

As mentioned earlier, BERT-type encoder-style LLMs are usually preferred for predictive modeling tasks, whereas GPT-type decoder-style LLMs are better at generating texts. To get the best of both worlds, the BART paper above combines both the encoder and decoder parts (not unlike the original transformer – the second paper in this list).
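
As a toy illustration of BART's denoising setup, here is one of the paper's noising schemes (text infilling: a contiguous span is replaced by a single mask token, and the decoder must reconstruct the original sequence) in deliberately simplified form; the paper samples span lengths from a Poisson distribution rather than fixing them:

```python
import random

def text_infilling(tokens, mask_token="<mask>", span_len=3):
    """BART-style text infilling sketch: hide a contiguous span behind one <mask>."""
    start = random.randrange(max(1, len(tokens) - span_len))
    corrupted = tokens[:start] + [mask_token] + tokens[start + span_len:]
    return corrupted, tokens  # encoder input, decoder target (the original text)

src = "the quick brown fox jumps over the lazy dog".split()
corrupted, target = text_infilling(src)
print(corrupted)  # e.g. ['the', 'quick', '<mask>', 'over', 'the', 'lazy', 'dog']
```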

Source: https://arxiv.org/abs/1910.13461

Scaling Laws and Improving Efficiency

If you want to learn more about the various techniques to improve the efficiency of transformers, I recommend the 2020 Efficient Transformers: A Survey paper followed by the 2023 A Survey on Efficient Training of Transformers paper.

In addition, below are papers that I found particularly interesting and worth reading.

(6) FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022) by Dao, Fu, Ermon, Rudra, and Ré, https://arxiv.org/abs/2205.14135.

While most transformer papers don't bother with replacing the original scaled dot-product mechanism for implementing self-attention, FlashAttention is the one mechanism I have seen most often referenced lately.
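
If you want to try a fused attention kernel without writing CUDA yourself, recent PyTorch versions (2.0+) route torch.nn.functional.scaled_dot_product_attention to a FlashAttention backend when the hardware and dtypes allow it; this sketch assumes such a PyTorch build and silently falls back to the plain math backend otherwise:

```python
import torch
import torch.nn.functional as F

# q, k, v: (batch, heads, seq_len, head_dim); half precision on a recent GPU
# is where the fused FlashAttention kernel typically kicks in.
q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

# PyTorch selects a backend (FlashAttention, memory-efficient, or math)
# automatically; the result matches the standard softmax(QK^T/sqrt(d))V.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```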

Source: https://arxiv.org/abs/2205.14135




(7) Cramming: Training a Language Model on a Single GPU in One Day (2022) by Geiping and Goldstein, https://arxiv.org/abs/2212.14034.

In this paper, the researchers trained a masked language model / encoder-style LLM (here: BERT) for 24 hours on a single GPU. For comparison, the original 2018 BERT paper trained it on 16 TPUs for four days. An interesting insight is that while smaller models have higher throughput, smaller models also learn less efficiently. Thus, larger models do not require more training time to reach a specific predictive performance threshold.

Source: https://arxiv.org/abs/2212.14034




(8) Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning (2022) by Lialin, Deshpande, and Rumshisky, https://arxiv.org/abs/2303.15647.

Modern large language models that are pretrained on large datasets show emergent abilities and perform well on various tasks, including language translation, summarization, coding, and Q&A. However, if we want to improve the ability of transformers on domain-specific data or specialized tasks, it's worthwhile to finetune transformers. This survey reviews more than 40 papers on parameter-efficient finetuning methods (including popular techniques such as prefix tuning, adapters, and low-rank adaptation) to make finetuning (very) computationally efficient.
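
To give a flavor of one of these techniques, below is a minimal sketch of low-rank adaptation (LoRA): the pretrained weight stays frozen, and only two small matrices are trained. The class name, initialization, and hyperparameters are illustrative choices, not the survey's reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adaptation sketch: y = W x + (alpha / r) * B A x, with W frozen."""
    def __init__(self, in_dim, out_dim, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)  # stands in for a pretrained layer
        for p in self.base.parameters():
            p.requires_grad_(False)              # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank A and B matrices are trained (12,288 params here)
```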

Source: https://arxiv.org/abs/2303.15647




(9) Training Compute-Optimal Large Language Models (2022) by Hoffmann, Borgeaud, Mensch, Buchatskaya, Cai, Rutherford, de Las Casas, Hendricks, Welbl, Clark, Hennigan, Noland, Millican, van den Driessche, Damoc, Guy, Osindero, Simonyan, Elsen, Rae, Vinyals, and Sifre, https://arxiv.org/abs/2203.15556.

This paper introduces the 70-billion-parameter Chinchilla model that outperforms the popular 175-billion-parameter GPT-3 model on generative modeling tasks. However, its main punchline is that contemporary large language models are "significantly undertrained."

The paper defines the linear scaling law for large language model training. For example, while Chinchilla is only half the size of GPT-3, it outperformed GPT-3 because it was trained on 1.4 trillion (instead of just 300 billion) tokens. In other words, the number of training tokens is as vital as the model size.
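
A quick back-of-the-envelope check of the often-quoted rule of thumb distilled from this paper – roughly 20 training tokens per model parameter (the exact coefficient depends on which of the paper's fits you use):

```python
# Chinchilla rule of thumb: compute-optimal training uses roughly
# 20 tokens per model parameter (an approximation of the paper's scaling fits).
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    return n_params * tokens_per_param

print(f"{chinchilla_optimal_tokens(70e9):.2e}")   # 1.40e+12 -> 1.4T tokens for 70B params
# GPT-3: 175B params but only 300B training tokens, far below the ~3.5T this suggests.
print(f"{chinchilla_optimal_tokens(175e9):.2e}")  # 3.50e+12
```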

Source: https://arxiv.org/abs/2203.15556

Alignment – Steering Large Language Models to Intended Goals and Interests

In recent years, we have seen many relatively capable large language models that can generate realistic texts (for example, GPT-3 and Chinchilla, among others). It seems that we have reached a ceiling in terms of what we can achieve with the commonly used pretraining paradigms.

To make language models more helpful and reduce misinformation and harmful speech, researchers designed additional training paradigms to fine-tune the pretrained base models.

(10) Training Language Models to Follow Instructions with Human Feedback (2022) by Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, Schulman, Hilton, Kelton, Miller, Simens, Askell, Welinder, Christiano, Leike, and Lowe, https://arxiv.org/abs/2203.02155.

In this so-called InstructGPT paper, the researchers use a reinforcement learning mechanism with humans in the loop (RLHF). They start with a pretrained GPT-3 base model and fine-tune it further using supervised learning on prompt-response pairs generated by humans (step 1). Next, they ask humans to rank model outputs to train a reward model (step 2). Finally, they use the reward model to update the pretrained and fine-tuned GPT-3 model using reinforcement learning via proximal policy optimization (step 3).
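
For step 2, the reward model is trained on pairwise human comparisons, and the core of that loss is just -log sigmoid(r_chosen - r_rejected). A minimal sketch (the tensor values are toy data):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise ranking loss over scalar rewards for preferred vs. dispreferred
    responses: -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

loss = reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
print(loss)  # lower when chosen responses score above rejected ones
```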

As a side note, this paper is also known as the paper describing the idea behind ChatGPT – according to recent rumors, ChatGPT is a scaled-up version of InstructGPT that has been fine-tuned on a larger dataset.

Source: https://arxiv.org/abs/2203.02155




(11) Constitutional AI: Harmlessness from AI Feedback (2022) by Yuntao, Saurav, Sandipan, Amanda, Jackson, Jones, Chen, Anna, Mirhoseini, McKinnon, Chen, Olsson, Olah, Hernandez, Drain, Ganguli, Li, Tran-Johnson, Perez, Kerr, Mueller, Ladish, Landau, Ndousse, Lukosuite, Lovitt, Sellitto, Elhage, Schiefer, Mercado, DasSarma, Lasenby, Larson, Ringer, Johnston, Kravec, El Showk, Fort, Lanham, Telleen-Lawton, Conerly, Henighan, Hume, Bowman, Hatfield-Dodds, Mann, Amodei, Joseph, McCandlish, Brown, and Kaplan, https://arxiv.org/abs/2212.08073.

In this paper, the researchers are taking the alignment idea one step further, proposing a training mechanism for creating a "harmless" AI system. Instead of direct human supervision, the researchers propose a self-training mechanism that is based on a list of rules (which are provided by a human). Similar to the InstructGPT paper mentioned above, the proposed method uses a reinforcement learning approach.

Source: https://arxiv.org/abs/2212.08073

Bonus: Introduction to Reinforcement Learning with Human Feedback (RLHF)

While RLHF (reinforcement learning with human feedback) may not completely solve the current issues with LLMs, it is currently considered the best option available, especially when compared to previous-generation LLMs. It is likely that we will see more creative ways to apply RLHF to LLMs in other domains.

The two papers above, InstructGPT and Constitutional AI, make use of RLHF, and since it is going to be an influential method in the near future, this section includes additional resources if you want to learn about RLHF. (To be technically correct, the Constitutional AI paper uses AI instead of human feedback, but it follows a similar concept using RL.)

(12) Asynchronous Methods for Deep Reinforcement Learning (2016) by Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu (https://arxiv.org/abs/1602.01783) introduces policy gradient methods as an alternative to Q-learning in deep learning-based RL.

(13) Proximal Policy Optimization Algorithms (2017) by Schulman, Wolski, Dhariwal, Radford, and Klimov (https://arxiv.org/abs/1707.06347) presents a modified proximal policy-based reinforcement learning procedure that is more data-efficient and scalable than the vanilla policy optimization algorithm above.
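
The paper's clipped surrogate objective is compact enough to sketch directly; the function below is a simplified single-update version (no value or entropy terms, toy inputs):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate: maximize min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r is the probability ratio between the new and old policies."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negated: optimizers minimize

loss = ppo_clip_loss(
    logp_new=torch.tensor([-1.0, -0.5]),
    logp_old=torch.tensor([-1.2, -0.4]),
    advantages=torch.tensor([0.8, -0.3]),
)
```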

(14) Fine-Tuning Language Models from Human Preferences (2020) by Ziegler, Stiennon, Wu, Brown, Radford, Amodei, Christiano, and Irving (https://arxiv.org/abs/1909.08593) illustrates the concept of PPO and reward learning applied to pretrained language models with KL regularization to prevent the policy from diverging too far from natural language.
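
The KL regularization amounts to shaping the reward with a penalty on how far the tuned policy's log-probabilities drift from the pretrained reference model; a schematic per-token version (the coefficient beta and the values are illustrative):

```python
import torch

def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.02):
    """Reward with a KL penalty keeping the tuned policy close to the
    pretrained reference: r - beta * (log pi(a|s) - log pi_ref(a|s))."""
    return reward - beta * (logp_policy - logp_ref)

r = kl_shaped_reward(torch.tensor(0.7), torch.tensor(-1.1), torch.tensor(-1.4))
print(r)  # 0.7 - 0.02 * 0.3 = 0.694
```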

(15) Learning to Summarize from Human Feedback (2022) by Stiennon, Ouyang, Wu, Ziegler, Lowe, Voss, Radford, Amodei, and Christiano (https://arxiv.org/abs/2009.01325) introduces the popular RLHF three-step procedure:

  1. pretraining GPT-3
  2. fine-tuning it in a supervised fashion, and
  3. training a reward model also in a supervised fashion. This fine-tuned model is then trained using the reward model with proximal policy optimization.

This paper also shows that reinforcement learning with proximal policy optimization results in better models than just using regular supervised learning.

Source: https://arxiv.org/abs/2009.01325


(16) Training Language Models to Follow Instructions with Human Feedback (2022) by Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, Schulman, Hilton, Kelton, Miller, Simens, Askell, Welinder, Christiano, Leike, and Lowe (https://arxiv.org/abs/2203.02155), also known as the InstructGPT paper, uses a similar three-step procedure for RLHF as above, but instead of summarizing text, it focuses on generating text based on human instructions. Also, it uses a labeler to rank the outputs from best to worst (instead of just a binary comparison between human- and AI-generated texts).

Conclusion and Further Reading

I tried to keep the list above nice and concise, focusing on the top-10 papers (plus 3 bonus papers on RLHF) to understand the design, constraints, and evolution behind contemporary large language models.

For further reading, I suggest following the references in the papers mentioned above. Or, to give you some additional pointers, here are some additional resources:

Open-source alternatives to GPT

ChatGPT alternatives

Large language models in computational biology