Monday, December 4, 2023

Running Llama 2 in Docker

The Llama 2 model comes in multiple sizes. You will see three versions of the model: 7B, 13B, and 70B, where B stands for billions of parameters.

Besides the model size, there are two model types:


Llama 2, the text-completion version of the model, which doesn't require a specific prompt template

Llama 2 Chat, the fine-tuned version of the model, which was trained to follow instructions and act as a chatbot. This version needs a specific prompt template to perform at its best, which is shown right below.


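For reference, the Llama 2 Chat prompt template wraps a system message and a user message in special tags, like this:

<s>[INST] <<SYS>>
{system_message}
<</SYS>>

{user_message} [/INST]
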
All Llama 2 models are available on HuggingFace. To access them, you have to accept the terms and conditions and request access. Take Llama 2 7B Chat as an example: after opening the model page you will see a form where you can apply for model access. Once your request is approved, you can download the model using your HuggingFace access token.
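
For example, once your request is approved, the gated model files can be downloaded programmatically with the huggingface_hub library. This is just a minimal sketch; the token string is a placeholder for your own access token:

from huggingface_hub import snapshot_download

# Placeholder: your personal HuggingFace access token
hf_token = "hf_..."

# Download the gated Llama 2 7B Chat repository into the local HuggingFace cache
snapshot_download(repo_id="meta-llama/Llama-2-7b-chat-hf", token=hf_token)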


Thanks to the excellent work of the community, we have llama.cpp, which allows running Llama models solely on your CPU. llama.cpp applies a custom quantization approach to compress the models into the GGUF format, which reduces their size and resource requirements.


The llama-2-7b-chat.Q2_K.gguf file is the most compressed version of the 7B chat model and requires the least resources (roughly 2.8 GB versus about 13 GB for the fp16 weights). It can be downloaded from https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/tree/main
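
If you prefer to script the download, the same file can be fetched with the huggingface_hub library (a small sketch; local_dir is wherever you want the file to land):

from huggingface_hub import hf_hub_download

# Download the Q2_K quantized chat model into the current directory
model_file = hf_hub_download(
    repo_id="TheBloke/Llama-2-7b-Chat-GGUF",
    filename="llama-2-7b-chat.Q2_K.gguf",
    local_dir=".",
)
print(model_file)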


Python bindings for llama.cpp: a project called llama-cpp-python provides them and can be installed with "pip install llama-cpp-python".


The following code can be used to run the model:


from llama_cpp import Llama

# Put the location of the GGUF model that you've downloaded from HuggingFace here
model_path = "**path to your llama-2-7b-chat.Q2_K.gguf**"

# Create a llama model
model = Llama(model_path=model_path)

# Prompt creation
system_message = "You are a helpful assistant"
user_message = "Generate a list of 5 funny dog names"

prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>

{user_message} [/INST]"""

# Model parameters
max_tokens = 100

# Run the model
output = model(prompt, max_tokens=max_tokens, echo=True)

# Print the model output
print(output)
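
The returned output is a dictionary in an OpenAI-style completion format; the generated text itself sits in the choices list:

# Print only the generated text instead of the full response dictionary
print(output["choices"][0]["text"])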



To serve the model over HTTP with Flask, first install Flask:


pip install Flask


Save the following server code as llama_cpu_server.py:

from flask import Flask, request, jsonify
from llama_cpp import Llama

# Create a Flask object
app = Flask("Llama server")
model = None


@app.route('/llama', methods=['POST'])
def generate_response():
    global model

    try:
        data = request.get_json()

        # Check if the required fields are present in the JSON data
        if 'system_message' in data and 'user_message' in data and 'max_tokens' in data:
            system_message = data['system_message']
            user_message = data['user_message']
            max_tokens = int(data['max_tokens'])

            # Prompt creation (the template lines are deliberately not indented,
            # so that leading spaces do not become part of the prompt)
            prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>

{user_message} [/INST]"""

            # Create the model if it was not previously created
            if model is None:
                # Put the location of the GGUF model that you've downloaded from HuggingFace here
                model_path = "**path to your llama-2-7b-chat.Q2_K.gguf**"

                # Create the model
                model = Llama(model_path=model_path)

            # Run the model
            output = model(prompt, max_tokens=max_tokens, echo=True)

            return jsonify(output)

        else:
            return jsonify({"error": "Missing required parameters"}), 400

    except Exception as e:
        return jsonify({"error": str(e)}), 500


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)


Start the server:

python llama_cpu_server.py



Then test the endpoint from another terminal with curl:

curl -X POST -H "Content-Type: application/json" -d '{
  "system_message": "You are a helpful assistant",
  "user_message": "Generate a list of 5 funny dog names",
  "max_tokens": 100
}' http://127.0.0.1:5000/llama
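
The same request can also be sent from Python with the requests library (a small sketch, assuming the server is running locally on port 5000):

import requests

payload = {
    "system_message": "You are a helpful assistant",
    "user_message": "Generate a list of 5 funny dog names",
    "max_tokens": 100,
}

# Send the prompt to the Flask endpoint and print only the generated text
response = requests.post("http://127.0.0.1:5000/llama", json=payload)
print(response.json()["choices"][0]["text"])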



To dockerize the server, create a Dockerfile like this in the same directory as llama_cpu_server.py and the GGUF model file:


# Use Python as the base image
FROM python

# Set the working directory in the container
WORKDIR /app

# Copy the server code and the GGUF model into the container at /app
COPY ./llama_cpu_server.py /app/llama_cpu_server.py
COPY ./llama-2-7b-chat.Q2_K.gguf /app/llama-2-7b-chat.Q2_K.gguf

# Install the needed packages
RUN pip install llama-cpp-python
RUN pip install Flask

# Expose port 5000 of the container
EXPOSE 5000

# Run llama_cpu_server.py when the container launches
CMD ["python", "llama_cpu_server.py"]



References:

https://medium.com/@penkow/how-to-run-llama-2-locally-on-cpu-docker-image-731eae6398d1
