* llama.cpp (by Georgi Gerganov)
    * GGUF (new)
    * GGML (old)
* Transformers (by Hugging Face)
    * bin (unquantized)
    * safetensors (safer unquantized)
    * safetensors (quantized using the GPTQ algorithm via AutoGPTQ)
* AutoGPTQ (quantization library based on the GPTQ algorithm, also available via Transformers)
    * safetensors (quantized using the GPTQ algorithm)
* koboldcpp (fork of llama.cpp)
    * bin (GGML format)
* ExLlama v2 (extremely optimized GPTQ backend for LLaMA models)
    * safetensors (quantized using the GPTQ algorithm)
* AWQ (low-bit quantization, INT3/4)
    * safetensors (quantized using the AWQ algorithm)
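Since several of these formats share file extensions (a `.bin` can be a PyTorch checkpoint or a GGML file), it can help to look at the first bytes of a file instead of its name. The following is only a rough sketch: the function name `sniff_model_format` and the 100 MB header sanity bound are my own choices, it assumes a GGUF file starts with the ASCII magic `GGUF`, a safetensors file starts with an 8-byte little-endian header length followed by a JSON header, and a recent `torch.save()` checkpoint is a zip archive. Legacy GGML files are not handled and come back as "unknown".

```python
import json
import struct


def sniff_model_format(path):
    """Guess a model file's container format from its leading bytes (heuristic)."""
    with open(path, "rb") as f:
        head = f.read(8)
        if head[:4] == b"GGUF":
            return "gguf"
        if head[:2] == b"PK":
            # Recent torch.save() checkpoints (*.bin) are zip archives.
            return "pytorch-bin (zip)"
        if len(head) == 8:
            # safetensors: 8-byte little-endian header length, then a JSON header.
            (header_len,) = struct.unpack("<Q", head)
            # Sanity bound (assumption) so garbage input does not trigger a huge read.
            if 0 < header_len < 100_000_000:
                raw = f.read(header_len)
                try:
                    if len(raw) == header_len and isinstance(json.loads(raw), dict):
                        return "safetensors"
                except ValueError:
                    # Covers both invalid JSON and undecodable bytes.
                    pass
    return "unknown"
```

Note that this only identifies the container. Whether a safetensors file holds GPTQ, AWQ, or unquantized weights is not visible from the first bytes and usually has to be read from the accompanying config files.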
Notes:
* GGUF contains all the metadata it needs in the model file (no need for other files like tokenizer_config.json) except the prompt template
* llama.cpp has a script to convert *.safetensors model files into *.gguf
* Transformers and llama.cpp both support CPU, GPU, and Apple Silicon (Metal/MPS) inference
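To make the first note concrete, here is a small sketch that reads the metadata key/value section straight out of a GGUF file using only the standard library. It assumes the GGUF v2/v3 layout as I understand it: the magic `GGUF`, a uint32 version, a uint64 tensor count, a uint64 key/value count, then the key/value pairs, all little-endian. The helper names (`read_gguf_metadata` and friends) are hypothetical, not part of llama.cpp, and this is not a replacement for the project's own reader.

```python
import struct

# GGUF value type id -> struct format for fixed-size scalars (little-endian).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read and unpack exactly one value with the given struct format."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    # GGUF string: uint64 byte length followed by UTF-8 bytes (no terminator).
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        # Array: uint32 element type, uint64 count, then the elements.
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(path):
    """Return (version, tensor_count, metadata dict) from a GGUF v2+ file."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version = _read(f, "<I")
        if version < 2:
            raise ValueError("GGUF v1 uses 32-bit counts; not handled here")
        tensor_count = _read(f, "<Q")
        kv_count = _read(f, "<Q")
        metadata = {}
        for _ in range(kv_count):
            key = _read_string(f)
            vtype = _read(f, "<I")
            metadata[key] = _read_value(f, vtype)
    return version, tensor_count, metadata
```

On a real model file, the returned dictionary typically includes keys such as `general.architecture` and `tokenizer.ggml.tokens`, which is why a separate tokenizer file is not needed. Reading a full vocabulary this way is slow but fine for inspection.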