Wednesday, June 26, 2024

Setting up Llama-3 locally using Ollama

To set up Llama-3 locally, we will use Ollama, an open-source framework that enables open-source Large Language Models (LLMs) to run locally on your computer. The recommended hardware requirements are:

CPU: Any modern CPU with at least 4 cores is recommended for running smaller models. For running 13B models, a CPU with at least 8 cores is recommended. A GPU is optional for Ollama, but if one is available it can improve performance drastically.

RAM: At least 8 GB of RAM to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.

Disk Capacity: At least 12 GB of free disk space is recommended to install Ollama and the base models. Additional space will be required for any further models you plan to install.
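
If you want to sanity-check a machine against these recommendations before installing, a short Python snippet along the following lines can be used (a rough sketch; the thresholds simply mirror the numbers above, and the disk check assumes the models will live on the root volume):

# Rough check of the machine against the recommendations above.
import os
import shutil

cores = os.cpu_count()
free_gb = shutil.disk_usage("/").free / (1024 ** 3)

print(f"CPU cores: {cores} (>= 4 for smaller models, >= 8 for 13B models)")
print(f"Free disk: {free_gb:.1f} GB (>= 12 GB for Ollama and the base models)")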

Download the Ollama installation file from this link: https://ollama.com/download

On Mac, this downloads the Ollama dmg file. Install it and run the command below, which downloads the model file:


ollama run llama3

This downloads the 8B instruct model of Llama-3, which is the default for the llama3 tag.

To download a specific model variant, a tag can be used, for example ollama run llama3:70b for the 70B model.

The Llama-3 8B Instruct model is about a 4.7 GB download.

On Mac, the model files are stored under ~/.ollama/models
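
To see what was actually downloaded there (and confirm the size), something like the sketch below can be run; it just walks that directory and prints file sizes:

# List the files Ollama has stored under ~/.ollama/models with their sizes.
from pathlib import Path

models_dir = Path.home() / ".ollama" / "models"
for path in sorted(models_dir.rglob("*")):
    if path.is_file():
        size_gb = path.stat().st_size / (1024 ** 3)
        print(f"{size_gb:6.2f} GB  {path.relative_to(models_dir)}")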

To update the llama3 model to its latest version, the command below can be used:


ollama pull llama3


To remove the model, the command below can be used:


ollama rm llama3
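
As a quick way to check which models are currently installed and how much space they take, recent Ollama versions also provide a list command:

ollama list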


There are multiple options for prompting the model.


Command line: This is the simplest option of all. As we saw above, after the run command, the Ollama command line is ready to accept prompt messages. We can type the prompt message there to get Llama-3 responses. To exit the conversation, type the command /bye.


REST API (HTTP request): Once Ollama is installed and running, it serves inference API requests on local HTTP port 11434 (the default). You can hit the inference API endpoint with an HTTP POST request containing the prompt message as the payload. Here is an example of a curl request for a prompt, followed by the JSON response:


 curl -X POST http://localhost:11434/api/generate -d "{\"model\": \"llama3\",  \"prompt\":\"Tell me a good joke?\", \"stream\": false}"

{"model":"llama3","created_at":"2024-06-27T02:26:06.929468Z","response":"Here's one:\n\nWhy couldn't the bicycle stand up by itself?\n\n(wait for it...)\n\nBecause it was two-tired!\n\nHope that made you smile! Do you want to hear another one?","done":true,"done_reason":"stop","context":[128006,882,128007,271,41551,757,264,1695,22380,30,128009,128006,78191,128007,271,8586,596,832,1473,10445,7846,956,279,36086,2559,709,555,5196,1980,65192,369,433,62927,18433,433,574,1403,2442,2757,2268,39115,430,1903,499,15648,0,3234,499,1390,311,6865,2500,832,30,128009],"total_duration":10178937333,"load_duration":8604434667,"prompt_eval_count":16,"prompt_eval_duration":159149000,"eval_count":40,"eval_duration":1408267000}%            



The "stream" flag is set to "false" in the curl request to get the whole response at once. The default value for "stream" is true, in which case you will receive multiple HTTP responses containing a streaming sequence of tokens. For the last response of the streaming results, the "done" attribute will be returned as "true".
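
The same endpoint can also be called from Python with the requests library (an extra dependency: pip install requests). The sketch below sends the prompt with streaming enabled and prints the tokens as they arrive; each streamed line is a JSON object, and the last one has "done" set to true, as described above.

# Stream a response from the local Ollama inference API.
# (Sketch using the requests library: pip install requests)
import json
import requests

payload = {"model": "llama3", "prompt": "Tell me a good joke?", "stream": True}

with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)                      # one JSON object per streamed line
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):                         # final chunk of the stream
            print()
            break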


To prompt the model from a Python script, the LangChain community package can be used. Install it first:


pip install langchain-community


from langchain_community.llms import Ollama

# Point LangChain at the llama3 model served by the locally running Ollama
llm = Ollama(model="llama3")

prompt = "Tell me a joke about llama"

# Send the prompt and get the full response back as a string
result = llm.invoke(prompt)

print(result)

# 'Why did the llama go to the party?\n\nBecause it was a hair-raising experience!'
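
To get the same token-by-token streaming behaviour from Python, the LangChain LLM interface also exposes a stream method; a minimal sketch, reusing the llm object and prompt from above:

# Print the response as it is generated instead of waiting for the full text
for chunk in llm.stream(prompt):
    print(chunk, end="", flush=True)
print()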


References:

https://medium.com/@renjuhere/llama-3-running-locally-in-just-2-steps-e7c63216abe7
