We can run an LLM on the CPU with the help of CTransformers. The CTransformers library is a Python package that provides access to Transformer models implemented in C/C++ using the GGML library.
GGML
GGML is a C library for machine learning. It helps run large language models (LLMs) on regular computer chips (CPUs) and uses a binary file format to distribute these models. To make them work well on common hardware, GGML relies on a technique called quantization. This technique comes in different levels, such as 4-bit, 5-bit, and 8-bit quantization, and each level has its own balance between efficiency and quality.
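As a rough back-of-the-envelope illustration (not an exact formula; real GGUF files add metadata and keep some tensors in higher precision), the quantization level largely determines how much RAM the weights need:

# Rough weight-size estimate for a 7B-parameter model at different quantization levels.
# These are approximations; actual GGUF file sizes differ somewhat.
params = 7_000_000_000
for bits in (16, 8, 5, 4):
    size_gb = params * bits / 8 / 1024**3
    print(f"{bits}-bit: ~{size_gb:.1f} GB")
# 16-bit: ~13.0 GB, 8-bit: ~6.5 GB, 5-bit: ~4.1 GB, 4-bit: ~3.3 GB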
Drawbacks
The main drawback here is latency. Since this model will run on CPUs, we won't receive responses as quickly as from a model deployed on a GPU, but the latency is not excessive. On average, it takes around 1 minute to generate 140–150 tokens on a Hugging Face Space (roughly 2–2.5 tokens per second). It actually performed quite well on a local system with a 16-core CPU, providing responses in less than 15 seconds.
The goal is to deploy the Zephyr-7B-Beta SOTA model.
You can find various sizes of Zephyr in GGUF quantized format on Hugging Face: https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/tree/main
GGUF is an updated version of GGML, offering more flexibility, extensibility, and compatibility. It aims to simplify the user experience and accommodate various models. GGML, while a valuable early effort, had limitations that GGUF seeks to overcome.
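One way to fetch the quantized file is with the huggingface_hub package (a minimal sketch; huggingface_hub is not in the requirements list below and you can also download the file manually from the repository page linked above):

# Download the 4-bit (Q4_K_S) GGUF file from the Hugging Face Hub.
# Assumes `pip install huggingface_hub` has been run.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/zephyr-7B-beta-GGUF",
    filename="zephyr-7b-beta.Q4_K_S.gguf",
    local_dir=".",  # place it next to main.py
)
print(model_path)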
1. Deploying the LLM as an API
Deployment structure
LLM_Deployment_at_zerocost
├── Dockerfile
├── main.py
├── requirements.txt
└── zephyr-7b-beta.Q4_K_S.gguf
The requirements.txt file is as below:
python-multipart
fastapi
pydantic
uvicorn
requests
python-dotenv
ctransformers
The main.py file contains a FastAPI function that returns a Zephyr-7B completion.
from ctransformers import AutoModelForCausalLM
from fastapi import FastAPI
from pydantic import BaseModel

# Load the quantized Zephyr model (Zephyr is a fine-tune of Mistral,
# so the model_type is 'mistral')
llm = AutoModelForCausalLM.from_pretrained(
    "zephyr-7b-beta.Q4_K_S.gguf",
    model_type="mistral",
    max_new_tokens=1096,
    threads=3,
)

# Pydantic object describing the request body
class validation(BaseModel):
    prompt: str

# Fast API
app = FastAPI()

@app.post("/llm_on_cpu")
async def stream(item: validation):
    # Wrap the user's input in the Zephyr chat template
    system_prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    E_INST = "</s>"
    user, assistant = "<|user|>", "<|assistant|>"
    prompt = f"{system_prompt}{E_INST}\n{user}\n{item.prompt}{E_INST}\n{assistant}\n"
    return llm(prompt)
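Once the server is running (for example with uvicorn main:app --host 0.0.0.0 --port 7860, matching the port used in the Dockerfile below), the endpoint can be called with the requests package; the prompt here is just an illustration:

import requests

# Call the /llm_on_cpu endpoint; assumes the server is listening on port 7860.
response = requests.post(
    "http://localhost:7860/llm_on_cpu",
    json={"prompt": "Explain quantization in one paragraph."},
)
print(response.json())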
To dockerize the app, the Dockerfile below can be used:
FROM python:3.9
WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
COPY ./zephyr-7b-beta.Q4_K_S.gguf /code/zephyr-7b-beta.Q4_K_S.gguf
COPY ./main.py /code/main.py
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
References:
https://gathnex.medium.com/how-to-deploy-llm-for-free-of-cost-6e7947d9b64a