Sunday, May 18, 2025

What are the Cost, Throughput, and Latency of LLM Inference? What factors affect these three metrics, and how do you compute them?

Throughput = queries / sec => maximise for batch-job speed or to serve more concurrent users.

Latency = sec / token => minimise for user experience (how responsive the application feels). Users read roughly 200-300 words per minute (about 5-7 tokens per second), so as long as tokens are generated faster than that, the application appears to perform well.

Cost: cheaper is better.

Let's break down the cost, throughput, and latency of LLM inference.

Cost of LLM Inference

What it is: The expense associated with running an LLM to generate responses or perform tasks based on input prompts.


Factors Affecting Cost:


Model Size: Larger models with more parameters generally require more computational resources, leading to higher costs.

Number of Tokens: Most LLM APIs charge based on the number of input and output tokens processed. Longer prompts and longer generated responses increase costs. Output tokens are often more expensive than input tokens.

Complexity of the Task: More complex tasks might require more processing and thus incur higher costs.

Hardware Used: The type and amount of hardware (GPUs, CPUs) used for inference significantly impact costs, especially for self-hosted models. Cloud-based services abstract this but factor it into their pricing.

Pricing Model: Different providers have varying pricing models (per token, per request, compute time, etc.).

Model Provider: Different providers offer the same or similar models at different price points.

Mixture of Experts (MoE) Models: These models might be priced based on the total number of parameters or the number of active parameters during inference.

How to Compute Cost:


The cost calculation depends on the pricing model of the LLM service or the infrastructure cost if self-hosting.


Per Token Pricing (Common for API services):

Cost = (Input Tokens / 1000 * Input Price per 1K tokens) + (Output Tokens / 1000 * Output Price per 1K tokens). Many providers now quote prices per 1M tokens instead; the calculation is the same with the divisor changed. (A small sketch of this calculation follows this list.)

Self-Hosting: This involves calculating the cost of hardware (amortized over time), electricity, data center costs, and potentially software licenses. This is more complex and depends on your specific infrastructure.

Cloud Inference Services: These typically provide a per-token cost, and you can estimate based on your expected token usage. Some might have per-request fees as well.
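
As a concrete illustration of the per-token pricing formula above, here is a minimal Python sketch. The prices are placeholder values, not any particular provider's actual rates.

def estimate_cost(input_tokens, output_tokens,
                  input_price_per_1k, output_price_per_1k):
    # Per-token pricing: (tokens / 1000) * price per 1K tokens,
    # summed over input and output.
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost

# Example: 1,500 input tokens and 500 output tokens at hypothetical
# rates of $0.0005 per 1K input tokens and $0.0015 per 1K output tokens.
print(f"${estimate_cost(1500, 500, 0.0005, 0.0015):.4f}")  # -> $0.0015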

Key Considerations:


Input vs. Output Tokens: Be mindful of the different costs for input and output tokens.

Context Length: Longer context windows can lead to higher token usage and thus higher costs.

Tokenization: Different models tokenize text differently, affecting the number of tokens for the same input.
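
Because token counts drive cost, it can help to count tokens before sending a request. Below is a minimal sketch using the tiktoken library; the cl100k_base encoding is just an illustrative choice, and other models use different tokenizers, so the count for the same text will differ.

import tiktoken

# cl100k_base is one widely used encoding; use the encoding that
# actually matches the model you are calling.
enc = tiktoken.get_encoding("cl100k_base")

prompt = "Explain the difference between throughput and latency."
print(len(enc.encode(prompt)), "tokens")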

Throughput of LLM Inference

What it is: The rate at which an LLM can process inference requests. It's often measured in:


Tokens per second (TPS): The number of input and/or output tokens the model can process or generate in one second. This is a common metric.

Requests per second (RPS): The number of independent inference requests the model can handle in one second. This depends on the total generation time per request.

Factors Affecting Throughput:


Model Size and Architecture: Smaller, less complex models generally have higher throughput.

Hardware: More powerful GPUs or CPUs with higher memory bandwidth lead to higher throughput. Using multiple parallel processing units (GPUs) significantly increases throughput.

Batch Size: Processing multiple requests together (batching) can significantly improve throughput by better utilizing the hardware. However, very large batch sizes increase per-request latency and are eventually limited by available memory.

Input/Output Length: Shorter prompts and shorter generated responses lead to higher throughput.

Optimization Techniques: Techniques like quantization, pruning, key-value caching, and efficient attention mechanisms (e.g., FlashAttention, Group Query Attention) can significantly boost throughput.

Parallelism: Techniques like tensor parallelism and pipeline parallelism distribute the model and its computations across multiple devices, improving throughput for large models.

Memory Bandwidth: The speed at which data (model weights, activations) can be transferred to the processing units is a crucial bottleneck for throughput.
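
A rough back-of-envelope estimate of the memory-bandwidth bottleneck: at batch size 1, decoding each token requires reading all model weights once, so per-token decode time is roughly model size in bytes divided by memory bandwidth. The numbers below (a 7B-parameter model in FP16 on a GPU with about 1 TB/s of bandwidth) are illustrative assumptions, not measurements.

params = 7e9            # assumed 7B-parameter model
bytes_per_param = 2     # FP16 weights
bandwidth = 1.0e12      # assumed ~1 TB/s of GPU memory bandwidth

weight_bytes = params * bytes_per_param       # 14 GB of weights
time_per_token = weight_bytes / bandwidth     # ~0.014 s to read them once
tokens_per_sec = 1 / time_per_token           # ~70 tokens/s per sequence

print(f"~{tokens_per_sec:.0f} tokens/s decode ceiling at batch size 1")

Batching amortises these weight reads across requests, which is why larger batch sizes raise aggregate throughput.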

How to Compute Throughput:


Tokens per Second:


Measure the total number of tokens processed (input + output) or generated (output only) over a specific time period.

TPS = Total Tokens / Total Time (in seconds)

Requests per Second:


Measure the total number of inference requests completed over a specific time period.

RPS = Total Requests / Total Time (in seconds)

RPS is also related to the average total generation time per request:


RPS ≈ 1 / Average Total Generation Time per Request

(A small measurement sketch for these formulas appears at the end of this section.)

Key Considerations:


Input vs. Output Tokens: Specify whether the throughput refers to input, output, or the sum of both.

Concurrency: Throughput often increases with the number of concurrent requests, up to a certain point.

Latency Trade-off: Increasing batch size to improve throughput can increase the latency for individual requests.
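
To put the TPS and RPS formulas above into practice, here is a minimal timing sketch. The call_llm function is a hypothetical stand-in for whatever client you use; it is assumed to take a prompt and return the number of output tokens generated.

import time

def measure_throughput(prompts, call_llm):
    # call_llm is a stand-in for your real client; it is assumed to
    # take a prompt and return the number of output tokens generated.
    start = time.perf_counter()
    total_tokens = 0
    for prompt in prompts:
        total_tokens += call_llm(prompt)
    elapsed = time.perf_counter() - start

    tps = total_tokens / elapsed     # output tokens per second
    rps = len(prompts) / elapsed     # requests per second
    return tps, rps

Note that this sends requests one at a time; to see the effect of concurrency on throughput, you would issue requests in parallel (for example, with a thread pool) and divide by the same wall-clock window.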

Latency of LLM Inference

What it is: The delay between sending an inference request to the LLM and receiving the response. It's a critical factor for user experience, especially in real-time applications. Common metrics include:


Time to First Token (TTFT): The time it takes for the model to generate the very first token of the response after receiving the prompt. This is crucial for perceived responsiveness.

Time Per Output Token (TPOT) / Inter-Token Latency (ITL): The average time taken to generate each subsequent token after the first one.

Total Generation Time / End-to-End Latency: The total time from sending the request to receiving the complete response.

Total Generation Time = TTFT + (TPOT * Number of Output Tokens)
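
For example, with a TTFT of 200 ms, a TPOT of 20 ms, and 500 output tokens (illustrative numbers only), the total generation time is roughly 0.2 + (0.02 * 500) = 10.2 seconds.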

Factors Affecting Latency:


Model Size and Complexity: Larger models generally have higher latency due to the increased computations required.

Input/Output Length: Longer prompts require more processing time before the first token can be generated (affecting TTFT). Longer desired output lengths naturally increase the total generation time.

Hardware: Faster GPUs or CPUs with lower memory access times reduce latency.

Batch Size: While batching improves throughput, it can increase the latency for individual requests as they wait to be processed in a batch.

Optimization Techniques: Model compression (quantization, pruning) and optimized attention mechanisms can reduce the computational overhead and thus lower latency.

Network Conditions: For cloud-based APIs, network latency between the user and the inference server adds to the overall latency. Geographical distance to the server matters.

System Load: High load on the inference server can lead to queuing and increased latency.

Cold Starts: The first inference request after a period of inactivity might experience higher latency as the model and necessary data are loaded into memory.

Tokenization: The time taken to tokenize the input prompt also contributes to the initial latency (TTFT).

How to Compute Latency:


Time to First Token (TTFT): Measure the time difference between sending the request and receiving the first token.

Time Per Output Token (TPOT): Measure the time taken to generate the entire response (excluding TTFT) and divide it by the number of output tokens.

TPOT = (Total Generation Time - TTFT) / Number of Output Tokens

Total Generation Time: Measure the time difference between sending the request and receiving the last token of the response.
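
Here is a minimal sketch of measuring TTFT, TPOT, and total generation time around a streaming client. stream_llm is a hypothetical generator that yields one output token at a time; substitute your actual streaming API.

import time

def measure_latency(prompt, stream_llm):
    # stream_llm is a stand-in for your real streaming client; it is
    # assumed to be a generator that yields output tokens one by one.
    start = time.perf_counter()
    first_token_time = None
    num_tokens = 0

    for token in stream_llm(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter()
        num_tokens += 1

    total = time.perf_counter() - start
    ttft = first_token_time - start
    # Matches the formula above; some definitions divide by
    # (num_tokens - 1) instead, since the first token is covered by TTFT.
    tpot = (total - ttft) / max(num_tokens, 1)
    return ttft, tpot, total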

Key Considerations:


TTFT Importance: For interactive applications, minimizing TTFT is crucial for a good user experience.

Trade-off with Throughput: Optimizations for higher throughput (like large batch sizes) can negatively impact latency.

Variability: Latency can vary depending on the specific prompt, the model's state, and the server load. It's often useful to measure average and percentile latencies.
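
Because latency varies from request to request, it is worth reporting percentiles rather than only the mean. A minimal sketch over a list of measured latencies (the sample values are made up):

import statistics

def summarize_latencies(samples):
    # Nearest-rank percentiles: simple and good enough for reporting.
    ordered = sorted(samples)
    def pct(p):
        idx = min(int(p / 100 * len(ordered)), len(ordered) - 1)
        return ordered[idx]
    return {"mean": statistics.mean(ordered),
            "p50": pct(50), "p95": pct(95), "p99": pct(99)}

# Example with made-up end-to-end latencies in seconds.
print(summarize_latencies([1.2, 0.9, 1.4, 3.8, 1.1, 1.0, 2.5, 1.3]))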

Understanding these three aspects – cost, throughput, and latency – and the factors that influence them is crucial for effectively deploying and utilizing LLMs in various applications. There's often a trade-off between these metrics, and the optimal balance depends on the specific use case and requirements.

