Wednesday, June 26, 2024

Setting up Llama-3 locally using Ollama

To set up Llama-3 locally, we will use Ollama, an open-source framework that enables open-source Large Language Models (LLMs) to run locally on your computer. The recommended hardware requirements are:

CPU: Any modern CPU with at least 4 cores is recommended for running smaller models. For running 13B models, a CPU with at least 8 cores is recommended. A GPU is optional for Ollama, but if one is available it can improve performance drastically.

RAM: At least 8 GB of RAM to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.

Disk Capacity: At least 12 GB of free disk space is recommended to install Ollama and the base models. Additional space will be required for any further models you plan to install.
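
If you want to sanity-check a machine against these recommendations before installing, a short Python snippet along the following lines can be used (a rough sketch; the thresholds simply mirror the numbers above, and the disk check assumes the models will live on the root volume):

# Rough check of the machine against the recommendations above.
import os
import shutil

cores = os.cpu_count()
free_gb = shutil.disk_usage("/").free / (1024 ** 3)

print(f"CPU cores: {cores} (>= 4 for smaller models, >= 8 for 13B models)")
print(f"Free disk: {free_gb:.1f} GB (>= 12 GB for Ollama and the base models)")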

Download the Ollama installation file from this link: https://ollama.com/download

On Mac, this downloads the Ollama dmg file. Install it and run the command below, which downloads the model file:


ollama run llama3

This downloads the 8B instruct model of Llama-3, which is the default for the llama3 tag.

To download a specific model variant, a tag can be used, for example ollama run llama3:70b for the 70B model.

The Llama-3 8B Instruct model is about a 4.7 GB download.

On Mac, the model files are stored under ~/.ollama/models
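
To see what was actually downloaded there (and confirm the size), something like the sketch below can be run; it just walks that directory and prints file sizes:

# List the files Ollama has stored under ~/.ollama/models with their sizes.
from pathlib import Path

models_dir = Path.home() / ".ollama" / "models"
for path in sorted(models_dir.rglob("*")):
    if path.is_file():
        size_gb = path.stat().st_size / (1024 ** 3)
        print(f"{size_gb:6.2f} GB  {path.relative_to(models_dir)}")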

To update the llama3 model to its latest version, the command below can be used:


ollama pull llama3


To remove the model, the command below can be used:


ollama rm llama3
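
As a quick way to check which models are currently installed and how much space they take, recent Ollama versions also provide a list command:

ollama list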


There are multiple options for prompting the model.


Command line: This is the simplest option of all. As we saw above, after the run command, the Ollama command line is ready to accept prompt messages. We can type the prompt message there to get Llama-3 responses. To exit the conversation, type the command /bye.


REST API (HTTP request): Once Ollama is installed and running, it serves inference API requests on local HTTP port 11434 (the default). You can hit the inference API endpoint with an HTTP POST request containing the prompt message as the payload. Here is an example of a curl request for a prompt, followed by the JSON response:


 curl -X POST http://localhost:11434/api/generate -d "{\"model\": \"llama3\",  \"prompt\":\"Tell me a good joke?\", \"stream\": false}"

{"model":"llama3","created_at":"2024-06-27T02:26:06.929468Z","response":"Here's one:\n\nWhy couldn't the bicycle stand up by itself?\n\n(wait for it...)\n\nBecause it was two-tired!\n\nHope that made you smile! Do you want to hear another one?","done":true,"done_reason":"stop","context":[128006,882,128007,271,41551,757,264,1695,22380,30,128009,128006,78191,128007,271,8586,596,832,1473,10445,7846,956,279,36086,2559,709,555,5196,1980,65192,369,433,62927,18433,433,574,1403,2442,2757,2268,39115,430,1903,499,15648,0,3234,499,1390,311,6865,2500,832,30,128009],"total_duration":10178937333,"load_duration":8604434667,"prompt_eval_count":16,"prompt_eval_duration":159149000,"eval_count":40,"eval_duration":1408267000}%            



The "stream" flag is set to "false" in the curl request to get the whole response at once. The default value for "stream" is true, in which case you will receive multiple HTTP responses containing a streaming sequence of tokens. For the last response of the streaming results, the "done" attribute will be returned as "true".
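
The same endpoint can also be called from Python with the requests library (an extra dependency: pip install requests). The sketch below sends the prompt with streaming enabled and prints the tokens as they arrive; each streamed line is a JSON object, and the last one has "done" set to true, as described above.

# Stream a response from the local Ollama inference API.
# (Sketch using the requests library: pip install requests)
import json
import requests

payload = {"model": "llama3", "prompt": "Tell me a good joke?", "stream": True}

with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)                      # one JSON object per streamed line
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):                         # final chunk of the stream
            print()
            break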


To prompt the model from a Python script, the LangChain community package can be used. Install it first:


pip install langchain-community


from langchain_community.llms import Ollama

# Point LangChain at the llama3 model served by the locally running Ollama
llm = Ollama(model="llama3")

prompt = "Tell me a joke about llama"

# Send the prompt and get the full response back as a string
result = llm.invoke(prompt)

print(result)

# 'Why did the llama go to the party?\n\nBecause it was a hair-raising experience!'
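
To get the same token-by-token streaming behaviour from Python, the LangChain LLM interface also exposes a stream method; a minimal sketch, reusing the llm object and prompt from above:

# Print the response as it is generated instead of waiting for the full text
for chunk in llm.stream(prompt):
    print(chunk, end="", flush=True)
print()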


References:

https://medium.com/@renjuhere/llama-3-running-locally-in-just-2-steps-e7c63216abe7
