We introduce the Sheared-LLaMA models, the strongest 1.3B and 2.7B public base large language models (LLMs). Our models are produced by LLM-Shearing, an efficient method that constructs LLMs by first pruning a larger existing model and then continuing to pre-train it. Sheared-LLaMA models are pruned from the LLaMA2-7B model and then trained on only 50B tokens, i.e., 5% of the token budget of the previous strongest public 3B model.
Highlights of our results
- Sheared-LLaMA-2.7B achieves better performance than existing open-source models of the same scale with 3% (1/32) of the compute.
- The trajectory shows a compelling case that if we invest more tokens and compute, the capability of Sheared-LLaMA can be further improved.
Swift Iterations of Open-Source LLMs
|Model|Date|Model Scale|Training Tokens|Training Corpora|
|---|---|---|---|---|
|Pythia|02/13/2023|70M - 12B|300B|The Pile|
|LLaMA|02/27/2023|7B - 70B|1T|RedPajama*|
|INCITE|05/05/2023|3B - 7B|800B|RedPajama|
|OpenLLaMA-v1|06/07/2023|3B - 13B|1T|RedPajama|
|OpenLLaMA-v2|07/07/2023|3B - 13B|1T|RedPajama, StarCoder, RefinedWeb|
|LLaMA2|07/18/2023|7B - 70B|2T|Unknown|
*RedPajama is a public reproduction of the LLaMA training data.
Various institutions are actively and consistently releasing more capable open-source LLMs, trained with an increasing amount of compute. Despite being smaller than proprietary models (GPT-4, Claude, PaLM), these open-source models remain costly to train. To put it into perspective, training a LLaMA2-7B model demands a substantial 184,320 A100 GPU hours. In this blog post, we introduce our methodology to accelerate pre-training by pruning existing strong LLMs.
Can we produce a smaller, general-purpose, and competitive LLM by leveraging existing pre-trained LLMs, while using much less compute than training one from scratch?
Our answer is yes! And surprisingly, the compute savings are tremendous. Specifically, we use structured pruning to achieve this goal. To connect our approach to past work:
- The utilization of pre-existing larger language models for training is not new
- Reduce, Reuse, Recycle (Blakeney et al., 2022) shows that distilling from an existing LM is more compute-efficient
- Large Language Model Distillation Doesn’t Need a Teacher (Jha et al., 2023) shows that the prune-and-continue-pre-training recipe is more cost-effective than distillation
- Pruning has been widely studied as a model compression technique
- Task-specific structured pruning (Xia et al., 2022; Kuric et al., 2023) retains the performance of large models at moderate sparsity
Our Approach: LLM-Shearing
We propose two techniques in LLM-Shearing:
Targeted structured pruning: We prune a source model to a pre-specified target architecture (e.g., an existing model's config), while maximizing the pruned model’s performance
Dynamic batch loading: Pruning results in varying information retainment across domains. Inspired by (Xie et al., 2023), we load more data for domains that recover slowly, with the loading proportions decided dynamically on the fly.
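The core of dynamic batch loading can be sketched as a simple re-weighting step: domains whose current loss lags their reference loss the most get up-weighted for the next data-loading interval. The function and parameter names below (`update_domain_weights`, `alpha`) are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def update_domain_weights(current_loss, reference_loss, prev_weights, alpha=1.0):
    """One sketch step of dynamic batch loading (hypothetical names).

    Domains whose loss is furthest above their reference loss are
    sampled more heavily in the next training interval.
    """
    # Loss gap per domain: how far recovery lags the target (clipped at 0).
    gap = np.maximum(np.asarray(current_loss) - np.asarray(reference_loss), 0.0)
    # Up-weight slow-recovering domains multiplicatively in log space.
    logits = np.log(np.asarray(prev_weights)) + alpha * gap
    # Renormalize into a valid sampling distribution.
    w = np.exp(logits - logits.max())
    return w / w.sum()
```

For example, if two domains start at equal weights but one sits 0.5 nats above its reference loss, that domain receives a larger share of the next batch of data.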
Combining these two steps allows us to produce a smaller model
- With a specified model shape
- Reaching a target performance more efficiently
- Investing more compute further improves performance.
- The stronger the source model, the stronger the resulting pruned model will be; pre-training a new model from scratch to outperform it is not a cost-effective approach.
- Our approach is applicable to models of varying architectures and scales.
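To make the targeted pruning step above concrete, here is a minimal sketch of selecting substructures (e.g., attention heads) to match a pre-specified target architecture. This is a simplified illustration based on importance scores; the actual method learns pruning masks jointly with a performance objective, and the function name `prune_to_target` is our own:

```python
import numpy as np

def prune_to_target(head_scores, target_heads):
    """Sketch: keep exactly `target_heads` attention heads so the pruned
    model matches a target config, retaining the highest-scoring heads."""
    scores = np.asarray(head_scores, dtype=float)
    # Indices of the top-scoring heads, enough to fill the target shape.
    keep = np.argsort(scores)[::-1][:target_heads]
    # Binary mask over heads: 1 = keep, 0 = prune.
    mask = np.zeros_like(scores)
    mask[keep] = 1.0
    return mask
```

The same idea extends to other structured units (layers, hidden dimensions, intermediate FFN dimensions), which is what lets the pruned model land exactly on an existing model's shape.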
We evaluate on an extensive set of downstream tasks including reasoning, reading comprehension, language modeling, and knowledge-intensive tasks. Our Sheared-LLaMA models outperform existing LLMs of comparable sizes.
(Table: models compared by # pre-training tokens and average downstream performance.)
We instruction-tuned Sheared-LLaMA and other public LMs of similar scale on ShareGPT data and evaluated their open-ended generation ability using GPT-4 as the judge. We show that Sheared-LLaMA’s instruction-following ability is also better.
When comparing continued pre-training of an existing LM against continued pre-training of a pruned model under the same compute budget, we find that the pruned model consistently reaches better performance. When a larger source model exists that is significantly stronger than all available smaller ones (e.g., LLaMA2-7B is superior to all 3B models), pruning from the larger model is more cost-efficient than continuing to train existing small models.
Consider using it!
We propose LLM-Shearing, a pruning approach that
- Accelerates pre-training: more cost-effective than training models from scratch when competitive large LLMs are readily available
- Is extensible to new models: easily adaptable to newly emerged and strong LLMs such as Mistral-7B, and to bigger and better-curated pre-training data, such as Dolma, RedPajama, and SlimPajama
- Is extensible to models of any scale: though we only conduct experiments at the 7B scale, this method is applicable to LMs of any scale
If you pre-train stronger LLMs with better data compositions or new data:
- Consider starting from a strong existing model, and prune it down to your target scale
If you are an LLM practitioner looking for strong small-scale LLMs to prototype your experiments:
- Check out our Sheared-LLaMA series on HuggingFace!
- We will keep producing small-scale models when more LLMs are released!