To set up Llama-3 locally, we will use Ollama, an open-source framework that runs open-source Large Language Models (LLMs) locally on your computer.
CPU: Any modern CPU with at least 4 cores is recommended for running smaller models. For 13B models, a CPU with at least 8 cores is recommended. A GPU is optional for Ollama but, if available, can improve performance drastically.
RAM: At least 8 GB of RAM to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
Disk Capacity: At least 12 GB of free disk space is recommended to install Ollama and the base models. Additional space is required for each further model you plan to install.
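The guidance above can be turned into a quick pre-flight check. The following is only a minimal sketch: the names `RAM_GUIDE_GB`, `recommended_ram_gb` and `free_disk_gb` are made up for illustration, the thresholds are the ones listed above, and rounding an unlisted size (such as 8B) up to the next listed tier is a conservative assumption rather than official Ollama guidance.

```python
import os
import shutil

# RAM guidance from the text above: model size (billions of parameters) -> GB of RAM.
RAM_GUIDE_GB = {7: 8, 13: 16, 33: 32}


def recommended_ram_gb(model_size_b):
    """Return the RAM (GB) listed for the smallest tier that covers model_size_b.

    Sizes between tiers are rounded up to the next tier (an assumption).
    """
    for size in sorted(RAM_GUIDE_GB):
        if model_size_b <= size:
            return RAM_GUIDE_GB[size]
    raise ValueError(f"no guidance listed for {model_size_b}B models")


def free_disk_gb(path="."):
    """Free disk space at `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    print("CPU cores:", os.cpu_count())
    print("Free disk (GB):", round(free_disk_gb(), 1))
    print("RAM suggested for a 13B model (GB):", recommended_ram_gb(13))
```

Comparing the printed free disk space against the 12 GB figure above gives a rough idea of whether the install and a base model will fit.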
Downloaded the Ollama installation file from https://ollama.com/download
On a Mac, this is the Ollama dmg file. Installed it and ran the command below, which downloaded the model file:
ollama run llama3
This is the 8B instruct model of Llama-3.
To download a specific model variant, use a tag such as llama3:70b.
The Llama-3 8B Instruct model is about a 4.7 GB download.
On Mac, the model file is stored under ~/.ollama/models
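To see how much disk space the downloaded models occupy, the directory above can be measured with a few lines of Python. This is a rough sketch: `dir_size_gb` is a hypothetical helper name, and it simply sums the file sizes under whatever path it is given.

```python
from pathlib import Path


def dir_size_gb(root):
    """Sum the sizes of all regular files under `root`, in GB (10^9 bytes)."""
    root = Path(root).expanduser()
    if not root.exists():
        return 0.0
    total = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total / 1e9


if __name__ == "__main__":
    # Path taken from the text above; it is the Mac location.
    print(f"{dir_size_gb('~/.ollama/models'):.1f} GB used by downloaded models")
```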
To update the model to its latest version (this updates the model, not Ollama itself), use the command below:
ollama pull llama3
To remove the model, use the command below:
ollama rm llama3
There are multiple prompting options.
command-line: This is the simplest of all the options. As we saw above, after the run command the Ollama command line is ready to accept prompt messages. Type the prompt message there to get Llama-3 responses. To exit the conversation, type the command /bye.
REST API (HTTP Request): Once installed and running, Ollama serves inference API requests on local HTTP port 11434 (the default). You can send an HTTP POST request containing the prompt message payload to the inference API endpoint. Here is an example of a curl request for a prompt, followed by the response:
curl -X POST http://localhost:11434/api/generate -d "{\"model\": \"llama3\", \"prompt\":\"Tell me a good joke?\", \"stream\": false}"
{"model":"llama3","created_at":"2024-06-27T02:26:06.929468Z","response":"Here's one:\n\nWhy couldn't the bicycle stand up by itself?\n\n(wait for it...)\n\nBecause it was two-tired!\n\nHope that made you smile! Do you want to hear another one?","done":true,"done_reason":"stop","context":[128006,882,128007,271,41551,757,264,1695,22380,30,128009,128006,78191,128007,271,8586,596,832,1473,10445,7846,956,279,36086,2559,709,555,5196,1980,65192,369,433,62927,18433,433,574,1403,2442,2757,2268,39115,430,1903,499,15648,0,3234,499,1390,311,6865,2500,832,30,128009],"total_duration":10178937333,"load_duration":8604434667,"prompt_eval_count":16,"prompt_eval_duration":159149000,"eval_count":40,"eval_duration":1408267000}
Note that the “stream” flag is set to “false” in the curl request to get the whole response at once. The default value for “stream” is true, in which case you will receive a stream of JSON objects, each carrying a part of the generated tokens. In the last object of the stream, the “done” attribute is returned as “true”.
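The same request can be made from Python with only the standard library. The sketch below is illustrative, not a definitive client: the helper names `build_request`, `collect_stream` and `generate` are made up here, and it assumes the streamed reply arrives as one JSON object per line, each with a "response" fragment and a "done" flag as in the output above.

```python
import json
import urllib.request

# Endpoint and default port taken from the curl example above.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3", stream=True):
    """Build the HTTP POST request for the generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": stream})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def collect_stream(lines):
    """Join the "response" fragments from streamed JSON lines until "done" is true."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between objects
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(prompt, model="llama3"):
    """Send a streaming request to a locally running Ollama server and return the full text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        # The HTTP response object can be iterated line by line.
        return collect_stream(resp)
```

With the Ollama server running, `print(generate("Tell me a good joke?"))` should print the complete answer once the final object with "done" set to true arrives.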
To run a prompt from a Python script, install the LangChain community package and call the model as follows:
pip install langchain-community
from langchain_community.llms import Ollama
llm = Ollama(model="llama3")
prompt = "Tell me a joke about llama"
result = llm.invoke(prompt)
print(result)
# 'Why did the llama go to the party?\n\nBecause it was a hair-raising experience!'
References:
https://medium.com/@renjuhere/llama-3-running-locally-in-just-2-steps-e7c63216abe7