Sunday, July 2, 2023

What is the Tiktoken Library?

Tiktoken is an open-source library developed by OpenAI for tokenizing text.

Tokenization is the process of splitting a text string into a list of tokens. Tokens can be single characters, words, or parts of words (depending on the language of the text).



For example, "I'm playing with AI models" can be split into the list ["I", "'m", " playing", " with", " AI", " models"].


These tokens can then be encoded as integers.


OpenAI uses a technique called byte pair encoding (BPE) for tokenization. BPE is a data compression algorithm that repeatedly replaces the most frequent pair of adjacent symbols in a text with a single new symbol. This shrinks the representation of the text and makes it easier to process.
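The merge step at the heart of BPE can be sketched in a few lines. This is a toy illustration only, not OpenAI's actual implementation; `most_frequent_pair` and `merge_pair` are hypothetical helper names:

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("abababcd")
pair = most_frequent_pair(symbols)   # ('a', 'b') occurs 3 times
print(merge_pair(symbols, pair))     # ['ab', 'ab', 'ab', 'c', 'd']
```

Real BPE tokenizers repeat this merge step many times on a large corpus, building up a vocabulary of frequent symbol sequences.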




Counting tokens with tiktoken is useful because:


You need to know whether the text you are using is too long to be processed by the model.

You need an idea of the cost of your OpenAI API calls (pricing is per token).

For example, the gpt-3.5-turbo model is priced at $0.002 / 1K tokens.
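Since pricing is per token, a quick back-of-the-envelope helper makes cost estimates easy (`estimate_cost` is an illustrative name, using the gpt-3.5-turbo price quoted above):

```python
def estimate_cost(n_tokens, price_per_1k_tokens=0.002):
    """Cost of a request given its token count, at $0.002 per 1K tokens
    (the gpt-3.5-turbo price quoted above, at the time of writing)."""
    return n_tokens / 1000 * price_per_1k_tokens

# 6 tokens, as counted for "I'm playing with AI models" later in this post:
print(f"${estimate_cost(6):.6f}")   # $0.000012
```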


How to count the number of tokens using tiktoken?


pip install tiktoken


import tiktoken



Encoding

OpenAI models use several different encodings: cl100k_base, p50k_base, gpt2.


The right encoding depends on the model you are using:


For gpt-4, gpt-3.5-turbo, text-embedding-ada-002, you need to use cl100k_base.


All of this information is already built into the tiktoken library, so you don't need to memorize it. You can load an encoding in two ways:

If you know the exact encoding name:

encoding = tiktoken.get_encoding("cl100k_base")



Alternatively, you can let tiktoken pick the right encoding for the model you are using:


encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

print(encoding)



Tokenization

Let’s tokenize this text:



text = "I'm playing with AI models"


This returns a list of integer token IDs:


tokens_integer = encoding.encode(text)

tokens_integer


[40, 2846, 5737, 449, 15592, 4211]


print(f"{len(tokens_integer)} is the number of tokens in my text")

6 is the number of tokens in my text



It's worth mentioning that we can obtain the token string corresponding to each integer token by using the `encoding.decode_single_token_bytes()` function (each string is returned as a bytes object, hence the `b` prefix):




tokens_string = [encoding.decode_single_token_bytes(token) for token in tokens_integer]

tokens_string


[b'I', b"'m", b' playing', b' with', b' AI', b' models']


Notice the space before each word? That is how tiktoken handles whitespace: the space is attached to the start of the token that follows it.



Count the number of tokens in a message to be sent via the API:


message = [{
   "role": "user",
   "content": "Explain to me how tolenization is working in OpenAi models?",
}]


tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n


num_tokens = 0

num_tokens += tokens_per_message


for key, value in message[0].items():
   num_tokens += len(encoding.encode(value))
   print(f"{len(encoding.encode(value))} is the number of tokens included in {key}")


num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>


print(f"{num_tokens} tokens to be sent in our request")



1 is the number of tokens included in role

15 is the number of tokens included in content

23 tokens to be sent in our request




import openai


openai.api_key='YOUR_API_KEY'


response = openai.ChatCompletion.create(
     model='gpt-3.5-turbo-0301',
     messages=message,
     temperature=0,
     max_tokens=200
)


num_tokens_api = response["usage"]["prompt_tokens"]


print(f"{num_tokens_api} tokens used by the API")



23 tokens used by the API


The number of tokens is the same as what we calculated using ‘tiktoken’.



Furthermore, let's count the number of tokens in ChatGPT's answer:


resp = response["choices"][0]["message"]["content"]

len(encoding.encode(resp))


200

The answer contains exactly 200 tokens because the completion hit the max_tokens=200 limit we set, so the response was cut off at that point.

