The Llama 2 model comes in multiple sizes. You are going to see three versions of the model: 7B, 13B, and 70B, where B stands for billions of parameters.
Besides the model size, you will see that there are 2 model types:
Llama 2, the text completion version of the model, which doesn't have a specific prompt template
Llama 2 Chat, the fine-tuned version of the model, which was trained to follow instructions and act as a chatbot. This version needs a specific prompt template in order to perform best, which we are going to discuss below.
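For reference, the Llama 2 Chat prompt template wraps a system message and a user message in special tags, roughly as shown below (the names in curly braces are placeholders, not literal text):
<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]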
All Llama 2 models are available on Hugging Face. In order to access them, you will have to apply for access by accepting the terms and conditions. Let's take Llama 2 7B Chat as an example. After opening the model page you will see a form where you can apply for model access. Once your request is approved, you will be able to download the model using your Hugging Face access token.
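As an illustration, the gated model files can also be fetched programmatically once your request is approved. The snippet below is a minimal sketch, assuming the huggingface_hub package is installed and your access token is stored in the HF_TOKEN environment variable; the repo id matches the Llama 2 7B Chat page on Hugging Face.
import os
from huggingface_hub import hf_hub_download

# Download a single file from the gated meta-llama repository,
# authenticating with your approved Hugging Face access token.
local_path = hf_hub_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    filename="config.json",  # illustrative file; weights are fetched the same way
    token=os.environ["HF_TOKEN"],
)
print(local_path)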
Thanks to the excellent work of the community, we have llama.cpp, which makes it possible to run Llama models solely on your CPU. llama.cpp applies a custom quantization approach to compress the models into the GGUF format, which reduces their size and resource requirements.
The llama-2-7b-chat.Q2_K.gguf file is the most compressed version of the 7B chat model and requires the least resources. It can be downloaded from https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/tree/main
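If you prefer to script the download, the snippet below is a minimal sketch, again assuming the huggingface_hub package is installed; the repo id and filename are taken from the page linked above, and this repository is not gated.
from huggingface_hub import hf_hub_download

# Download the quantized GGUF file from TheBloke's repository
model_file = hf_hub_download(
    repo_id="TheBloke/Llama-2-7b-Chat-GGUF",
    filename="llama-2-7b-chat.Q2_K.gguf",
)
print(model_file)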
Python bindings for llama.cpp: there is a project called llama-cpp-python, which can be installed with "pip install llama-cpp-python".
The code below can be used to run the model:
from llama_cpp import Llama
# Put the location of the GGUF model that you've downloaded from Hugging Face here
model_path = "**path to your llama-2-7b-chat.Q2_K.gguf**"
# Create a llama model
model = Llama(model_path=model_path)
# Prompt creation
system_message = "You are a helpful assistant"
user_message = "Generate a list of 5 funny dog names"
prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]"""
# Model parameters
max_tokens = 100
# Run the model
output = model(prompt, max_tokens=max_tokens, echo=True)
# Print the model output
print(output)
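The call returns a dictionary in an OpenAI-style completion format, so the generated text itself can be extracted from the output. The lines below are a sketch based on that structure:
# The output is a dict with a "choices" list; the generated text is in the first choice
generated_text = output["choices"][0]["text"]
print(generated_text)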
To serve the model over HTTP with Flask, the following can be done. First, install Flask:
pip install Flask
from flask import Flask, request, jsonify
from llama_cpp import Llama
# Create a Flask object
app = Flask("Llama server")
model = None
@app.route('/llama', methods=['POST'])
def generate_response():
    global model
    try:
        data = request.get_json()
        # Check if the required fields are present in the JSON data
        if 'system_message' in data and 'user_message' in data and 'max_tokens' in data:
            system_message = data['system_message']
            user_message = data['user_message']
            max_tokens = int(data['max_tokens'])
            # Prompt creation
            prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]"""
            # Create the model if it was not previously created
            if model is None:
                # Put the location of the GGUF model that you've downloaded from Hugging Face here
                model_path = "**path to your llama-2-7b-chat.Q2_K.gguf**"
                # Create the model
                model = Llama(model_path=model_path)
            # Run the model
            output = model(prompt, max_tokens=max_tokens, echo=True)
            return jsonify(output)
        else:
            return jsonify({"error": "Missing required parameters"}), 400
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)
Save the server code as llama_cpu_server.py and start it:
python llama_cpu_server.py
You can then test the endpoint with curl:
curl -X POST -H "Content-Type: application/json" -d '{
"system_message": "You are a helpful assistant",
"user_message": "Generate a list of 5 funny dog names",
"max_tokens": 100
}' http://127.0.0.1:5000/llama
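The same request can also be sent from Python. The snippet below is a sketch assuming the requests package is installed and the server is running locally:
import requests

# Send the same payload as the curl example to the local Flask server
payload = {
    "system_message": "You are a helpful assistant",
    "user_message": "Generate a list of 5 funny dog names",
    "max_tokens": 100,
}
response = requests.post("http://127.0.0.1:5000/llama", json=payload)
print(response.json())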
To dockerize the server, the following Dockerfile can be used:
# Use python as base image
FROM python
# Set the working directory in the container
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY ./llama_cpu_server.py /app/llama_cpu_server.py
COPY ./llama-2-7b-chat.Q2_K.gguf /app/llama-2-7b-chat.Q2_K.gguf
# Install the needed packages
RUN pip install llama-cpp-python
RUN pip install Flask
# Expose port 5000 outside of the container
EXPOSE 5000
# Run llama_cpu_server.py when the container launches
CMD ["python", "llama_cpu_server.py"]
References:
https://medium.com/@penkow/how-to-run-llama-2-locally-on-cpu-docker-image-731eae6398d1