We can run an LLM on the CPU with the help of CTransformers. The CTransformers library is a Python package that provides access to Transformer models implemented in C/C++ using the GGML library.
GGML
GGML is a C library for machine learning. It helps run large language models (LLMs) on regular computer chips (CPUs) and uses a binary file format to distribute these models. To make them work well on common hardware, GGML relies on a technique called quantization. This technique comes in different levels, such as 4-bit, 5-bit, and 8-bit quantization, and each level has its own balance between efficiency and quality.
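As a rough back-of-the-envelope illustration (not an exact formula; real GGUF files add metadata and keep some tensors in higher precision), the quantization level largely determines how much RAM the weights need:

# Rough weight-size estimate for a 7B-parameter model at different quantization levels.
# These are approximations; actual GGUF file sizes differ somewhat.
params = 7_000_000_000
for bits in (16, 8, 5, 4):
    size_gb = params * bits / 8 / 1024**3
    print(f"{bits}-bit: ~{size_gb:.1f} GB")
# 16-bit: ~13.0 GB, 8-bit: ~6.5 GB, 5-bit: ~4.1 GB, 4-bit: ~3.3 GB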
Drawbacks
The main drawback here is latency. Since this model will run on CPUs, we won't receive responses as quickly as from a model deployed on a GPU, but the latency is not excessive. On average, it takes around 1 minute to generate 140–150 tokens on a Hugging Face Space (roughly 2–2.5 tokens per second). It actually performed quite well on a local system with a 16-core CPU, providing responses in less than 15 seconds.
The goal is to deploy the Zephyr-7B-Beta SOTA model.
You can find various sizes of Zephyr in GGUF quantized format on Hugging Face: https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/tree/main
GGUF is an updated version of GGML, offering more flexibility, extensibility, and compatibility. It aims to simplify the user experience and accommodate various models. GGML, while a valuable early effort, had limitations that GGUF seeks to overcome.
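One way to fetch the quantized file is with the huggingface_hub package (a minimal sketch; huggingface_hub is not in the requirements list below and you can also download the file manually from the repository page linked above):

# Download the 4-bit (Q4_K_S) GGUF file from the Hugging Face Hub.
# Assumes `pip install huggingface_hub` has been run.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/zephyr-7B-beta-GGUF",
    filename="zephyr-7b-beta.Q4_K_S.gguf",
    local_dir=".",  # place it next to main.py
)
print(model_path)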
1. Deploying the LLM as an API
Deployment structure
LLM_Deployment_at_zerocost
├── Dockerfile
├── main.py
├── requirements.txt
└── zephyr-7b-beta.Q4_K_S.gguf
The requirements.txt file is as below:
python-multipart
fastapi
pydantic
uvicorn
requests
python-dotenv
ctransformers
The main.py file contains a FastAPI function that returns a Zephyr-7B completion.
from ctransformers import AutoModelForCausalLM
from fastapi import FastAPI
from pydantic import BaseModel

# Load the quantized Zephyr model (Zephyr is a fine-tune of Mistral,
# so the model_type is 'mistral')
llm = AutoModelForCausalLM.from_pretrained(
    "zephyr-7b-beta.Q4_K_S.gguf",
    model_type="mistral",
    max_new_tokens=1096,
    threads=3,
)

# Pydantic object describing the request body
class validation(BaseModel):
    prompt: str

# Fast API
app = FastAPI()

@app.post("/llm_on_cpu")
async def stream(item: validation):
    # Wrap the user's input in the Zephyr chat template
    system_prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    E_INST = "</s>"
    user, assistant = "<|user|>", "<|assistant|>"
    prompt = f"{system_prompt}{E_INST}\n{user}\n{item.prompt}{E_INST}\n{assistant}\n"
    return llm(prompt)
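Once the server is running (for example with uvicorn main:app --host 0.0.0.0 --port 7860, matching the port used in the Dockerfile below), the endpoint can be called with the requests package; the prompt here is just an illustration:

import requests

# Call the /llm_on_cpu endpoint; assumes the server is listening on port 7860.
response = requests.post(
    "http://localhost:7860/llm_on_cpu",
    json={"prompt": "Explain quantization in one paragraph."},
)
print(response.json())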
To dockerize the app, the Dockerfile below can be used:
FROM python:3.9
WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
COPY ./zephyr-7b-beta.Q4_K_S.gguf /code/zephyr-7b-beta.Q4_K_S.gguf
COPY ./main.py /code/main.py
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
References:
https://gathnex.medium.com/how-to-deploy-llm-for-free-of-cost-6e7947d9b64a