Saturday, July 8, 2023

Why OpenAI's token vocabulary is so large

import tiktoken

# Get the encoding for the davinci GPT-3 model, which is the "r50k_base" encoding.
encoding = tiktoken.encoding_for_model("davinci")

text = "We need to stop anthropomorphizing ChatGPT."
print(f"text: {text}")

# The vocabulary size: how many distinct tokens the encoding knows about.
print(f"vocabulary size: {encoding.n_vocab}")

# Encode the text into token integers, then map each one back to its byte string.
token_integers = encoding.encode(text)
print(f"token integers: {token_integers}")

token_strings = [encoding.decode_single_token_bytes(token) for token in token_integers]
print(f"token strings: {token_strings}")

print(f"number of tokens in text: {len(token_integers)}")

# Decoding the token integers recovers the original text.
encoded_decoded_text = encoding.decode(token_integers)
print(f"encoded-decoded text: {encoded_decoded_text}")


With this simple string, the vocabulary size printed is around 50K (50,257 tokens for the r50k_base encoding), while the sentence itself encodes into just 11 tokens. OpenAI chooses a very large token vocabulary!
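As a quick check, tiktoken can also report the vocabulary size of other OpenAI encodings; the snippet below is a small sketch that prints a few of them (these encoding names are ones tiktoken ships with):

import tiktoken

# Print the vocabulary size of a few OpenAI encodings bundled with tiktoken.
for name in ["r50k_base", "p50k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {enc.n_vocab} tokens in the vocabulary")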


One simple alternative would be to treat each letter as a token. However, with letter-based tokens we can't encode nearly as much information as with OpenAI's approach. In the example above, 11 letter-based tokens could only encode “We need to”, while 11 of OpenAI's tokens encode the entire sentence. It turns out that current language models have a limit on the maximum number of tokens they can receive, so we want to pack as much information as possible into each token.
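To make the comparison concrete, here is a small sketch, assuming the same sentence and the r50k_base encoding as above, that contrasts letter-based tokens with tiktoken's subword tokens:

import tiktoken

encoding = tiktoken.get_encoding("r50k_base")
text = "We need to stop anthropomorphizing ChatGPT."

# Letter-based tokens: one token per character.
letter_tokens = list(text)
print(f"first 11 letter tokens: {letter_tokens[:11]}")  # covers only "We need to "

# Subword (BPE) tokens from tiktoken cover the whole sentence in 11 tokens.
bpe_tokens = encoding.encode(text)
print(f"number of BPE tokens for the full sentence: {len(bpe_tokens)}")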



Now let's consider the scenario where each word is a token. Compared to OpenAI's approach, we would only need seven tokens to represent the same sentence, which seems more efficient. And splitting by word is also straightforward to implement. However, language models need to have a complete list of tokens that they might encounter, and that's not feasible for whole words: not only because there are so many words in the dictionary, but also because it would be difficult to keep up with domain-specific terminology and any new words that are invented.
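For comparison, a naive word-level tokenizer is easy to sketch; the regular expression below is just an illustrative choice that treats punctuation as its own token, and it yields the seven tokens mentioned above:

import re

text = "We need to stop anthropomorphizing ChatGPT."

# Naive word-level tokenization: words and punctuation marks as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)
print(f"word tokens: {word_tokens}")
print(f"number of word tokens: {len(word_tokens)}")  # 7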


If you’ve played with OpenAI’s ChatGPT, you know that it produces many tokens, not just a single token. That’s because this basic idea is applied in an expanding-window pattern. You give it n tokens in, it produces one token out, then it incorporates that output token as part of the input of the next iteration, produces a new token out, and so on. This pattern keeps repeating until a stopping condition is reached, indicating that it finished generating all the text you need.
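In rough Python, the expanding-window pattern looks like the sketch below; predict_next_token and stop_token are hypothetical placeholders standing in for the model and its end-of-text marker, not a real API:

# A sketch of the expanding-window generation loop. `predict_next_token` and
# `stop_token` are hypothetical placeholders for the model and its end-of-text
# marker; they are not part of any real library.
def generate(prompt_tokens, predict_next_token, stop_token, max_new_tokens=100):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # n tokens in, one token out
        if next_token == stop_token:             # stopping condition reached
            break
        tokens.append(next_token)                # the output joins the next input
    return tokens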


While playing with ChatGPT, you may also have noticed that the model is not deterministic: if you ask it the exact same question twice, you’ll likely get two different answers. That’s because the model doesn’t actually produce a single predicted token; instead it returns a probability distribution over all the possible tokens. In other words, it returns a vector in which each entry expresses the probability of a particular token being chosen. The model then samples from that distribution to generate the output token.
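Here is a toy sketch of that last step; the vocabulary and probabilities below are made up for illustration, but the sampling call shows how a distribution like this turns into a single output token:

import random

# Toy example: a tiny vocabulary and a made-up probability distribution over it.
# A real model returns one probability per token in its full vocabulary.
vocab = ["We", " need", " to", " stop", "."]
probabilities = [0.05, 0.10, 0.60, 0.20, 0.05]

# Sampling from the distribution: repeated calls can return different tokens,
# which is why the same prompt can produce different answers.
next_token = random.choices(vocab, weights=probabilities, k=1)[0]
print(next_token)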



References:

https://towardsdatascience.com/how-gpt-models-work-b5f4517d5b5
