As large language models continue to scale, tokens have become a fundamental unit of both computation and cost. For training, more tokens mean more data and more optimization steps. For inference, more tokens translate directly into higher latency and higher API expenses. Since most LLM providers bill per thousand or per million tokens, token count is also a direct financial concern.
This made me wonder: if English dominates most LLM usage, does switching to another language help reduce token usage? In particular, could Chinese, which is often considered more compact in written form, require fewer tokens than English?
In this article, I will run a controlled experiment comparing token efficiency across multiple languages, focusing primarily on Chinese and English, and analyze how modern tokenizers handle different linguistic systems.
For this experiment, I will use OpenAI's open-source tokenizer tiktoken, which is widely used in many LLM implementations. I will compare token counts for the same passage rendered in several languages, starting from an English original.
The English passage I will use (generated by ChatGPT):
Large language models process text by breaking it into smaller units called tokens. These tokens are not the same as words or characters. Instead, they are generated by algorithms such as Byte Pair Encoding or SentencePiece, which split text into statistically frequent subword segments.
Because most modern LLM APIs charge based on token usage, the number of tokens directly affects cost and latency. A shorter character count does not necessarily mean fewer tokens. Different languages may produce different tokenization patterns due to variations in writing systems, morphology, and training data distribution.
To understand whether language choice influences token efficiency, we must measure token counts under controlled conditions rather than relying on intuition.
To start, we need a simple script that counts the tokens in each language version. Below is the code for the experiment:
```python
import tiktoken

# Use the tokenizer associated with gpt-4o (the o200k_base encoding)
enc = tiktoken.encoding_for_model("gpt-4o")

text = "..."  # the passage in one of the tested languages
tokens = enc.encode(text)
print(len(tokens))
```

Running this for each language version gives the results below. The percentage difference is calculated as (token count − English token count) / English token count × 100%:
| Language | Token Count | Difference vs English | % Difference |
|---|---|---|---|
| English | 129 | 0 | 0% |
| Chinese | 162 | +33 | +25.6% |
| Japanese | 240 | +111 | +86.0% |
| German | 178 | +49 | +38.0% |
| French | 188 | +59 | +45.7% |
| Spanish | 180 | +51 | +39.5% |
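The percentage differences in the table can be reproduced with a few lines of Python. The token counts are hard-coded here from the measurements above, so this sketch runs without tiktoken installed:

```python
# Token counts measured above; English is the baseline
counts = {
    "English": 129, "Chinese": 162, "Japanese": 240,
    "German": 178, "French": 188, "Spanish": 180,
}
base = counts["English"]

for lang, n in counts.items():
    diff = n - base                # absolute difference vs. English
    pct = diff / base * 100        # percentage difference vs. English
    print(f"{lang:<10} {n:>4} {diff:+5d} {pct:+6.1f}%")
```

Running this reproduces the "Difference vs English" and "% Difference" columns exactly.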
Surprisingly, the results show that English has the lowest token count of all the languages tested. Why is that?
Most BPE tokenizers (Byte Pair Encoding is a common tokenization algorithm that iteratively merges the most frequent pairs of characters or subwords to build a vocabulary of tokens) are trained on predominantly English corpora, which means they are optimized for English text. This leads to more efficient tokenization for English, while other languages may require more tokens to represent the same content. For example, Chinese words may not be merged as consistently.
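To make the merging idea concrete, here is a toy implementation of a single BPE merge step on character-level tokens. This is an illustration of the principle only, not tiktoken's actual implementation; real tokenizers apply thousands of learned merge rules:

```python
from collections import Counter

def bpe_merge_step(tokens):
    """Merge every occurrence of the most frequent adjacent token pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)   # replace the pair with one merged token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# "aa" is the most frequent pair, so it becomes a single token
print(bpe_merge_step(list("aaabdaaabac")))
# → ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

A trained tokenizer repeats this step many times on its training corpus; pairs that are frequent in English get merged into single tokens, while equally common Chinese sequences may not, simply because they appeared less often during training.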
English also benefits from spaces. In byte-level BPE tokenizers, tokens are often learned together with leading spaces. For example, tokens such as " language" or " models" are common because the space character becomes part of the merge rule, allowing frequent English words to be represented as single tokens that include both the word and its preceding space.
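A simplified sketch of this pre-tokenization behavior is shown below. The regex here is an illustrative assumption, not tiktoken's actual (much more complex) pattern, but it captures the key idea: an optional leading space stays attached to each word.

```python
import re

# Simplified GPT-style pre-tokenization: the space before a word
# is kept as part of that word's unit, so " language" is one piece.
pat = re.compile(r" ?\w+|\s+|[^\w\s]+")
print(pat.findall("Large language models process text"))
# → ['Large', ' language', ' models', ' process', ' text']
```

Because each space-prefixed unit is a single candidate for BPE merges, a frequent English word and its preceding space can end up as one token.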
Languages such as Chinese and Japanese, on the other hand, do not use whitespace to separate words. As a result, tokenizers must rely solely on character-level or subword-level statistical patterns, which can lead to less efficient merges.
In practice, this means that English often maps cleanly to fewer tokens, while languages without explicit word boundaries may require more token splits.
Another factor may be how text is encoded at the byte level. UTF-8 is a variable-length character encoding that uses one to four bytes per character. English characters are typically represented using a single byte, whereas Chinese and Japanese characters usually require three bytes each.
Since many modern tokenizers operate at the byte level before performing subword merges, languages that require multi-byte representations may produce more granular token segments.
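The byte-length difference is easy to verify with Python's built-in UTF-8 encoder:

```python
# Number of bytes each character occupies in UTF-8
for ch in ["a", "é", "语", "モ"]:
    print(f"{ch!r}: {len(ch.encode('utf-8'))} byte(s)")
# 'a' needs 1 byte, 'é' needs 2, while the Chinese character '语'
# and the Japanese katakana 'モ' each need 3 bytes
```

A byte-level tokenizer therefore starts from three raw bytes per Chinese or Japanese character, and must learn merges just to get back to one unit per character before it can merge multi-character words.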
A common follow-up question is whether fewer tokens always mean lower cost. While LLM providers typically charge based on token usage, the relationship between token count and overall cost is more nuanced.
In most API pricing models, cost is directly proportional to the number of input and output tokens. Under identical conditions, fewer tokens do indeed translate into lower billing cost.
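As a sketch of this linear relationship, the per-request cost can be computed as below. The prices are assumptions for illustration only, not any provider's actual rates:

```python
# Assumed prices in USD per one million tokens (illustrative only)
PRICE_INPUT = 2.50
PRICE_OUTPUT = 10.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Billing cost of one request under linear per-token pricing."""
    return (input_tokens / 1_000_000) * PRICE_INPUT \
         + (output_tokens / 1_000_000) * PRICE_OUTPUT

# The 129-token English prompt vs. the 162-token Chinese prompt,
# each assumed to produce a 300-token response
print(f"${request_cost(129, 300):.6f}")
print(f"${request_cost(162, 300):.6f}")
```

Under this model, the extra 33 input tokens of the Chinese prompt cost proportionally more, and nothing else about the language matters for billing.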
However, token efficiency does not fully determine overall system cost, for several reasons:
First, output length may vary across languages. Even if a prompt in English uses fewer tokens, the model's response in another language might be longer or more verbose, increasing total token usage.
Second, reasoning behavior can differ. Some languages may trigger longer explanatory chains or stylistic differences in generation.
Third, model performance across languages is not perfectly uniform. If a model performs better in English due to training distribution, it may require fewer clarification prompts or retries, indirectly reducing total tokens consumed in practice.
With that said, token count is a necessary factor in cost calculation, but it is not a sufficient measure of actual efficiency.
The experiment above shows that, under modern byte-level BPE tokenizers, English is currently more token-efficient than the other languages tested. However, this efficiency does not originate from linguistic compactness or inherent simplicity. Instead, it reflects how tokenizers are trained and optimized.
While fewer tokens often translate into lower billing under standard API pricing models, real-world system efficiency depends on the full interaction loop, which may require more or fewer tokens depending on a language's information density and the question itself.
Copyright © 2026 Sicheng Ouyang. All rights reserved.
This article may not be reproduced, redistributed, or republished without permission.