Technology

Run Meta’s LLaMA locally on a MacBook

It is now possible to run Meta's LLaMA language model locally on an M1 Mac!

Laxman Vijay · Mar 13, 2023

Last month, Meta announced that they are releasing a new LLM for research purposes. The model, named LLaMA (Large Language Model Meta AI), is a state-of-the-art LLM intended to let researchers run large models in a local environment (possibly on a single machine) and advance research across the LLM domain.

This week, Bulgaria-based Georgi Gerganov released a repo that runs the model on an M1 Mac. The port, called llama.cpp, uses 4-bit quantization to reduce the model's memory requirements so it can run on a single Mac.
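To get a feel for why 4-bit quantization matters, here is a rough back-of-the-envelope calculation (approximate figures, not measurements from the repo): the 7B model stored as 16-bit floats needs roughly 14 GB just for the weights, while packed into 4 bits it shrinks to around 3.5 GB, which fits in the RAM of a typical M1 Mac.

python3 -c "print('fp16 ~', 7e9 * 2 / 1e9, 'GB  |  4-bit ~', 7e9 * 0.5 / 1e9, 'GB')"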

Running the actual LLaMA requires the steps described in the official repo: https://github.com/facebookresearch/llama. You must submit a request to Meta to download the model weights; Meta provides an official Google form for this.

The weights have, however, leaked online. It is not legal to download and use them for commercial purposes, but you can use this magnet to download the weights:

magnet:?xt=urn:btih:b8287ebfa04f879b048d4d4404108cf3e8014352&dn=LLaMA&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce

Note: The magnet above was posted by one of the original requesters to whom Meta sent a download link. Technically, it is still not legal to download files intended for someone else, but it should be fine for personal experiments and never for commercial use.
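If you decide to go ahead for experimentation and don't have a BitTorrent client handy, one option is the aria2 command-line downloader (this assumes you have aria2 installed, e.g. via brew install aria2):

aria2c --seed-time=0 --dir=./models "<magnet-uri-from-above>"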

Refer to the LLaMA model card for more info: llama/MODEL_CARD.md at main · facebookresearch/llama · GitHub

Here are the steps to run the model once you have obtained the weights:

  • Clone the repo:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
  • Place the weights (obtained by one of the methods above) in the models folder:
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model
  • Install dependencies:
python3 -m pip install torch numpy sentencepiece
  • Convert the weights to ggml format (a tensor file format developed by the same author):
python3 convert-pth-to-ggml.py models/7B/ 1
  • Quantize:
./quantize.sh 7B
  • Run inference (see the prompted example after this list):
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128
  • Profit!
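As a quick sanity check after the convert and quantize steps, and to give the model an actual prompt via the -p flag (exact file names can vary slightly between versions of the repo):

ls ./models/7B
# the convert step should have produced ggml-model-f16.bin and the quantize step ggml-model-q4_0.bin
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 -p "Building a website can be done in 10 simple steps:"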

The above steps are just too much!

If you don't want to hassle with the steps above, there is a simpler alternative in this repo: cocktailpeanut/dalai: The simplest way to run LLaMA on your local machine (github.com)

Simply run:

npx dalai llama
npx dalai serve

The two commands above do the following:

  1. First installs the 7B model (the default)
  2. Then starts a web/API server on port 3000
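Once the server is up, the web UI should be reachable in your browser (port 3000 is the default; the command below is for macOS):

open http://localhost:3000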

It will be interesting to watch what people come up with using LLaMA!

Laxman Vijay

Staff Writer

A software engineer who likes writing.
