Deploy LLM on any device

Running LLaMA and LLaMA-based models on your workstation, your laptop, or even your Synology NAS

Posted by Jingbiao on April 10, 2023, Reading time: 4 minutes.

Ubuntu Docker setup for Synology NAS

  1. Install Docker on Synology NAS as an add-on package
    • Tutorial here
    • Open Package Center
    • Search Docker
    • Install Docker
  2. Download the Ubuntu 20.04 docker image

  3. Create a new Docker container with Ubuntu 20.04

  4. Install Anaconda on the docker container

  5. Install compiler tools
```
apt-get update
apt-get install make
apt-get install gcc
apt-get install build-essential
```

Install LLaMA.cpp to run native LLaMA model

Clone the LLaMA.cpp repo

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

Compile LLaMA.cpp

```
make
```

Prepare data and run inference

```
# obtain the original LLaMA model weights and place them in ./models
# the model weights can be found from the original LLaMA repo
# at https://github.com/facebookresearch/llama
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the 7B model to ggml FP16 format
python3 convert.py models/7B/

# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0

# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
```

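The `q4_0` step above is what makes a 7B model fit on small devices: each weight is stored in roughly 4 bits plus a shared per-block scale. Below is a minimal pure-Python sketch of the idea — this is illustrative only, not the exact ggml `q4_0` layout (which packs two 4-bit values per byte and stores an fp16 scale per 32-weight block):

```python
def quantize_block(block):
    """Quantize a block of floats to signed 4-bit ints plus one scale (simplified sketch)."""
    m = max(abs(x) for x in block)
    scale = m / 7 if m else 1.0
    # clamp to the signed 4-bit range [-8, 7]
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate floats from the 4-bit codes."""
    return [scale * v for v in q]

block = [0.05 * i for i in range(-8, 8)]  # 16 example weights
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))
# rounding error is bounded by half the scale step
print(max_err <= scale / 2 + 1e-9)  # prints True
```

At roughly 4.5 bits per weight (4-bit codes plus the shared scales), a 7B-parameter model shrinks to about 7e9 × 4.5 / 8 ≈ 3.9 GB on disk, which is why it can run on a laptop or NAS.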
Using interactive mode

```
./main -m ./models/7B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
```

Using Alpaca


Alpaca is an instruction-following model fine-tuned from LLaMA

Recover official weights:

  • The weight diff between Alpaca-7B and LLaMA-7B is located here. To recover the original Alpaca-7B weights, follow these steps:
  1. Convert Meta's released weights into Hugging Face format. Follow this guide: https://huggingface.co/docs/transformers/main/model_doc/llama
  2. Make sure you have cloned the released weight diff to your local machine. The weight diff is located at: https://huggingface.co/tatsu-lab/alpaca-7b/tree/main
  3. Run the recovery function with the correct paths, e.g.:

```
python weight_diff.py recover --path_raw <path_to_step_1_dir> --path_diff <path_to_step_2_dir> --path_tuned <path_to_store_recovered_weights>
```

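Conceptually, the `recover` step just adds the released diff back onto the raw LLaMA tensors (tuned = raw + diff), since the published diff is the element-wise difference between the fine-tuned Alpaca weights and the original LLaMA weights. A toy sketch of that idea, with short lists standing in for real tensors:

```python
# toy values standing in for real model weight tensors
raw_llama = [0.10, -0.20, 0.30]        # original LLaMA weights
released_diff = [0.01, 0.05, -0.02]    # published diff = alpaca - llama

# recovery adds the diff back element-wise
recovered_alpaca = [r + d for r, d in zip(raw_llama, released_diff)]
```

The real `weight_diff.py` does this over the whole model state dict (with an integrity check), but the arithmetic is the same.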
For step 1 in detail: you need to download the conversion script from the transformers library and install protobuf from here: https://github.com/protocolbuffers/protobuf/tree/main/python#installation

```
pip install protobuf==3.20.0
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
```

```
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
```

The weights obtained this way should be similar to the ones in this repo: https://huggingface.co/chavinlo/alpaca-native

Note that that repo uses a slightly modified training procedure

Unofficial weights for LLaMA based models and Alpaca

  1. alpaca-native-7B by chavinlo
    • Mostly similar to the official weights, but trained with a slightly different procedure (FSDP)
  2. 4-bit quantised version of alpaca-native-7B by chavinlo
  3. alpaca-native-13B by chavinlo
    • Similar to 1., but based on the 13B-parameter LLaMA model
  4. GPT4-x-alpaca by chavinlo
  5. LoRA by tloen
    • Alpaca reproduced with LoRA (low-rank adaptation), which trains small low-rank update matrices instead of the full weights, greatly reducing the cost of fine-tuning
  6. Vicuna-13b by eachadea

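The LoRA entry above (item 5) is worth a quick illustration. Instead of learning a new d×k weight matrix during fine-tuning, LoRA learns two small factors B (d×r) and A (r×k), with rank r much smaller than d and k, and applies W + BA at inference time. A minimal sketch of the parameter saving, using a plausible 4096×4096 attention projection and rank 8 as example numbers:

```python
def lora_param_count(d, k, r):
    """Trainable parameters: full fine-tuning vs a rank-r LoRA update of a d x k matrix."""
    full = d * k        # full fine-tuning touches every weight
    lora = r * (d + k)  # B is d x r, A is r x k
    return full, lora

# e.g. one 4096 x 4096 projection with LoRA rank 8
full, lora = lora_param_count(4096, 4096, 8)
print(full // lora)  # prints 256 -- the low-rank update is 256x smaller
```

This is why LoRA variants of Alpaca can be trained (and shared as tiny adapter files) on consumer hardware.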
GPT4-x-alpaca

gpt4-x-alpaca’s HuggingFace page states that it is based on the Alpaca 13B model, fine-tuned with GPT4 responses for 3 epochs.

GPTeacher

A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.

Vicuna-13b

Vicuna-13B is an open-source chatbot trained by fine-tuning LLaMA on 70K user-shared conversations collected from ShareGPT. GPT-4 was then used as a judge to evaluate its performance. The results show that Vicuna-13B can achieve more than 90%* of the quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90%* of cases. The code and weights, along with an online demo, are publicly available for non-commercial use.

Relative Response Quality Assessed by GPT-4

More models to try out

  • Koala
  • WizardLM

Reference

  1. https://agi-sphere.com/install-llama-mac/
  2. https://github.com/ggerganov/llama.cpp
  3. https://agi-sphere.com/llama-models/
  4. LLaMA paper
  5. Introduction to Vicuna
  6. ShareGPT
  7. Koala blog page
  8. WizardLM paper
  9. WizardLM Github page