Tiktoken is an open-source library developed by OpenAI for tokenizing text.
Tokenization means splitting a text string into a list of tokens. Tokens can be single characters, parts of words, or whole words (depending on the text and its language).
For example, “I’m playing with AI models” can be transformed into this list: ["I", "'m", " playing", " with", " AI", " models"].
These tokens can then be encoded as integers.
OpenAI uses a technique called byte pair encoding (BPE) for tokenization. BPE originated as a data compression algorithm that repeatedly replaces the most frequent pair of adjacent bytes with a single new symbol. Applied to tokenization, each merged pair becomes a token in the vocabulary, so common character sequences are represented by one token. This shortens the sequence the model has to process.
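To make the merging idea concrete, here is a minimal toy sketch of BPE-style merges in pure Python. It is not tiktoken's implementation: the function names (`most_frequent_pair`, `merge_pair`, `toy_bpe`) are hypothetical, and real tokenizers learn their merges from a large corpus rather than from the input string itself.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Start from single bytes and apply up to `num_merges` greedy merges."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens


print(toy_bpe("aaabdaaabac", 2))
```

Each merge shortens the token list while the concatenated bytes stay identical to the input, which is the property that lets a tokenizer be reversed.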
Counting tokens with tiktoken is useful for two reasons:
You need to know whether the text you are using is too long to be processed by the model.
You need an estimate of OpenAI API call costs (pricing is per token).
For example, with the gpt-3.5-turbo model you were charged $0.002 per 1K tokens at the time of writing.
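Both reasons reduce to simple arithmetic once you have a token count. The sketch below is illustrative only: the helper names are hypothetical, the default price is the $0.002 per 1K tokens quoted above, and the context limit is left as a parameter because it depends on the model.

```python
def estimate_cost_usd(num_tokens, price_per_1k_tokens=0.002):
    """Estimate the cost in USD of processing `num_tokens` tokens."""
    return num_tokens / 1000 * price_per_1k_tokens


def fits_in_context(num_tokens, context_limit):
    """Return True if the token count does not exceed the model's limit."""
    return num_tokens <= context_limit


print(f"${estimate_cost_usd(1500):.4f}")  # $0.0030
```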
How to count the number of tokens using tiktoken?
pip install tiktoken
import tiktoken
Encoding
OpenAI models use different encodings: cl100k_base, p50k_base, gpt2.
These encodings depend on the model you are using:
For gpt-4, gpt-3.5-turbo, text-embedding-ada-002, you need to use cl100k_base.
This mapping is built into tiktoken, so you don’t need to memorize it. You can load an encoding in two ways:
If you know the exact encoding name:
encoding = tiktoken.get_encoding("cl100k_base")
Alternatively, you can let tiktoken pick the right encoding for the model you are using:
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(encoding)
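Conceptually, looking up an encoding by model name behaves like a table lookup. The sketch below mimics that idea with a hypothetical dictionary restricted to the models named in this article. It is an assumption-laden illustration, not tiktoken's source code, and the prefix rule for versioned names such as gpt-3.5-turbo-0301 is a simplification.

```python
# Hypothetical table, limited to the models named in this article.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
}


def encoding_name_for(model):
    """Return the encoding name for `model`, or raise KeyError if unknown."""
    if model in MODEL_TO_ENCODING:
        return MODEL_TO_ENCODING[model]
    # Assume a versioned name shares the encoding of its base model.
    for base, name in MODEL_TO_ENCODING.items():
        if model.startswith(base + "-"):
            return name
    raise KeyError(f"No encoding known for model {model!r}")


print(encoding_name_for("gpt-3.5-turbo-0301"))  # cl100k_base
```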
Tokenization
Let’s tokenize this text:
text = "I'm playing with AI models"
This returns a list of integer tokens:
tokens_integer = encoding.encode(text)
tokens_integer
[40, 2846, 5737, 449, 15592, 4211]
print(f"{len(tokens_integer)} is the number of tokens in my text")
6 is the number of tokens in my text
Note that we can get the string corresponding to each integer token with encoding.decode_single_token_bytes(). Each result is a bytes object, shown with a b prefix:
tokens_string = [encoding.decode_single_token_bytes(token) for token in tokens_integer]
tokens_string
[b'I', b"'m", b' playing', b' with', b' AI', b' models']
Notice the space before each word? The leading space is part of the token itself. This is how OpenAI’s encodings split text in tiktoken.
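Because each token carries its own leading space, concatenating the byte strings reproduces the original text exactly. The short sketch below reuses the byte strings shown above as a literal list, so it runs without tiktoken. The `errors="replace"` argument is a precaution: a single token can contain only part of a multi-byte UTF-8 character.

```python
# Token bytes copied from the decode_single_token_bytes() output above.
token_bytes = [b"I", b"'m", b" playing", b" with", b" AI", b" models"]

# Joining the raw bytes and decoding once recovers the original text.
text = b"".join(token_bytes).decode("utf-8")
print(text)  # I'm playing with AI models

# Decode token by token for display; invalid fragments become U+FFFD.
readable = [t.decode("utf-8", errors="replace") for t in token_bytes]
print(readable)
```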
Count the number of tokens in a message to be sent through the API:
message = [{
    "role": "user",
    "content": "Explain to me how tolenization is working in OpenAi models?",
}]
tokens_per_message = 4
# every message follows <|start|>{role/name}\n{content}<|end|>\n
num_tokens = 0
num_tokens += tokens_per_message
for key, value in message[0].items():
    # encode each field once and reuse the count
    n_field_tokens = len(encoding.encode(value))
    num_tokens += n_field_tokens
    print(f"{n_field_tokens} is the number of tokens included in {key}")
num_tokens += 3
# every reply is primed with <|start|>assistant<|message|>
print(f"{num_tokens} number of tokens to be sent in our request")
1 is the number of tokens included in role
15 is the number of tokens included in content
23 number of tokens to be sent in our request
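The same bookkeeping can be wrapped in a reusable function that handles several messages. This is a minimal sketch with a hypothetical name (`count_chat_tokens`). The overheads of 4 tokens per message and 3 tokens for reply priming are the values used in the walkthrough above and may differ for other models. The tokenizer is passed in as a callable, so the example runs with a whitespace stand-in instead of tiktoken.

```python
def count_chat_tokens(messages, encode, tokens_per_message=4, reply_priming=3):
    """Count prompt tokens for a list of chat messages.

    `encode` is any callable that maps a string to a list of tokens,
    for example the `encode` method of a tiktoken encoding.
    """
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message  # per-message framing overhead
        for value in message.values():
            num_tokens += len(encode(value))
    return num_tokens + reply_priming  # the reply is primed with a few tokens


# Stand-in tokenizer for demonstration only: split on whitespace.
demo = [{"role": "user", "content": "a b c"}]
print(count_chat_tokens(demo, str.split))  # 4 + 1 + 3 + 3 = 11
```

With tiktoken installed, calling `count_chat_tokens(message, encoding.encode)` on the message defined earlier should reproduce the total computed step by step above.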
import openai
openai.api_key = 'YOUR_API_KEY'
# ChatCompletion is the interface of the openai Python package before version 1.0
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo-0301',
    messages=message,
    temperature=0,
    max_tokens=200
)
num_tokens_api = response["usage"]["prompt_tokens"]
print(f"{num_tokens_api} number of tokens used by the API")
23 number of tokens used by the API
The number of tokens is the same as what we calculated using ‘tiktoken’.
Furthermore, let’s count the number of tokens in ChatGPT’s answer, which the request capped with max_tokens=200:
resp = response["choices"][0]["message"]["content"]
len(encoding.encode(resp))
200
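A count equal to the max_tokens value suggests the answer was cut off at the limit. One way to check is the finish_reason field of the response, which the API sets to "length" when generation stops because of the token limit. The sketch below uses a hypothetical helper and a hand-written stand-in dictionary in place of a real API response, so the values in it are illustrative rather than actual output.

```python
def reply_was_truncated(response):
    """Return True if the first choice stopped because of the token limit."""
    return response["choices"][0]["finish_reason"] == "length"


# Hand-written stand-in that mirrors the shape of a chat completion response.
fake_response = {
    "choices": [
        {
            "finish_reason": "length",
            "message": {"role": "assistant", "content": "(truncated text)"},
        }
    ],
    "usage": {"prompt_tokens": 23, "completion_tokens": 200, "total_tokens": 223},
}

print(reply_was_truncated(fake_response))  # True
```

The usage block also reports completion_tokens, which can be compared with the count obtained from tiktoken.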