Monday, November 18, 2024

What is Promptim - LangChain prompt optimization

Promptim is an experimental prompt optimization library to help you systematically improve your AI systems.

Promptim automates the process of improving prompts on specific tasks. You provide an initial prompt, a dataset, and custom evaluators (and, optionally, human feedback), and Promptim runs an optimization loop to produce a refined prompt that aims to outperform the original.

From evaluation-driven development to prompt optimization

A core responsibility of AI engineers is prompt engineering. This involves manually tweaking the prompt to produce better results.

A useful way to approach this is through evaluation-driven development. This involves first creating a dataset of inputs (and optionally, expected outputs) and then defining a number of evaluation metrics. Every time you make a change to the prompt, you can run it over the dataset and score the outputs. In this way, you can measure the performance of your prompt and make sure it's improving, or at the very least not regressing. Tools like LangSmith help with dataset curation and evaluation.


The idea behind prompt optimization is to use these well-defined datasets and evaluation metrics to automatically improve the prompt. Changes to the prompt can be proposed automatically and then scored with the same evaluation method. Tools like DSPy have been pioneering efforts like this for a while.

How Promptim works

We're excited to release our first attempt at prompt optimization. It is an open source library (promptim) that integrates with LangSmith, which we use for dataset management, prompt management, tracking results, and (optionally) human labeling.


The core algorithm is as follows:

Specify a LangSmith dataset, a prompt in LangSmith, and evaluators defined locally. Optionally, you can specify train/dev/test dataset splits.

We run the initial prompt over the dev (or full) dataset to get a baseline score.

We then loop over all examples in the train (or full) dataset. We run the prompt over all examples and score the outputs. We then pass the results (inputs, outputs, expected outputs, scores) to a metaprompt and ask it to suggest changes to the current prompt.

We then use the updated prompt to compute metrics again on the dev split.

If the metrics show improvement, the updated prompt is retained. If there is no improvement, the original prompt is kept.

This is repeated N times; a minimal sketch of the loop is shown below.

Optionally, you can add a step where you leave human feedback. This is useful when you don't have good automated metrics, or want to optimize the prompt based on feedback beyond what the automated metrics can provide. This uses LangSmith's Annotation Queues.
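
To make the loop concrete, here is a minimal, illustrative sketch in Python. This is not the actual promptim API; run_prompt, evaluate, and propose_revision are hypothetical callables you would supply for "run the prompt", "score the outputs", and "ask the metaprompt for a revision".

def optimize_prompt(prompt, train_examples, dev_examples,
                    run_prompt, evaluate, propose_revision, n_rounds=5):
    # Baseline: score the initial prompt on the dev split
    best_prompt = prompt
    best_score = evaluate(run_prompt(best_prompt, dev_examples), dev_examples)

    for _ in range(n_rounds):
        # Run the current prompt over the training examples and score the outputs
        outputs = run_prompt(best_prompt, train_examples)
        scores = evaluate(outputs, train_examples)

        # Ask a metaprompt (an LLM) to suggest a revised prompt, given the results
        candidate = propose_revision(best_prompt, train_examples, outputs, scores)

        # Keep the candidate only if it improves the dev-split score
        candidate_score = evaluate(run_prompt(candidate, dev_examples), dev_examples)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score

    return best_prompt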

References:

https://blog.langchain.dev/promptim/



What is OpenAI Operator?

According to a recent Bloomberg report, OpenAI is developing an AI assistant called “Operator” that can perform computer-based tasks like coding and travel booking on users’ behalf. The company reportedly plans to release it in January as a research preview and through its API.


This development aligns with a broader industry trend toward AI agents that can execute complex tasks with minimal human oversight. Anthropic has unveiled new capabilities for its GenAI model Claude, allowing it to manipulate desktop environments, a significant step toward more independent systems. Meanwhile, Salesforce introduced next-generation AI agents focused on automating intricate tasks for businesses, signaling a broader adoption of AI-driven workflows. These developments underscore a growing emphasis on creating AI systems that can carry out advanced, goal-oriented functions on their own.


AI agents are software programs that can independently perform complex sequences of tasks on behalf of users, such as booking travel or writing code, by understanding context and making decisions. These agents represent an evolution beyond simple chatbots or models, as they can actively interact with computer interfaces and web services to accomplish real-world goals with minimal human supervision.


As quoted in the referenced article: “AI can help you track your order, issue refunds, or help prevent cancellations; this frees up human agents to become product experts. By automating with AI, human support agents become product experts to help guide customers through which products to buy, ultimately driving better revenue and customer happiness.”


References:

https://www.pymnts.com/artificial-intelligence-2/2024/openai-readies-operator-agent-with-ecommerce-web-browsing-capabilities/


Saturday, November 16, 2024

LLM Cost: A Bit of Basics

In the context of large language models, a token is a unit of text that the model processes. A token can be as small as a single character or as large as a word or punctuation mark. The exact size of a token depends on the specific tokenization algorithm used by the model. For example:

The word “computer” is one token.

The sentence “Hello, how are you?” consists of 6 tokens: “Hello”, “,”, “how”, “are”, “you”, “?”

Typically, the model splits longer texts into smaller components (tokens) for efficient processing, making it easier to understand, generate, and manipulate text at a granular level.

For many LLMs, including OpenAI’s GPT models, usage costs are determined by the number of tokens processed, which includes both input tokens (the text prompt given to the model) and output tokens (the text generated by the model). Since the computational cost of running these models is high, token-based pricing provides a fair and scalable way to charge for usage.

Calculating Tokens in a Request

Before diving into cost calculation, let’s break down how tokens are accounted for in a request:

Input Tokens:

The text or query sent to the model is split into tokens. For example, if you send a prompt like “What is the capital of France?”, this prompt will be tokenized, and each word will contribute to the token count.

Output Tokens:

The response generated by the model also consists of tokens. For example, if the model responds with “The capital of France is Paris.”, the words in this sentence are tokenized as well.

For instance:

Input: “What is the capital of France?” (7 tokens)

Output: “The capital of France is Paris.” (7 tokens)

Total tokens used in the request: 14 tokens
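
Exact token counts depend on the model's tokenizer, so the numbers above are illustrative. As a rough check, you can count tokens with OpenAI's tiktoken library; the cl100k_base encoding used here is an assumption and may not match your model.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "What is the capital of France?"
response = "The capital of France is Paris."

prompt_tokens = len(enc.encode(prompt))      # input tokens
response_tokens = len(enc.encode(response))  # output tokens
print(prompt_tokens, response_tokens, prompt_tokens + response_tokens)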

Step-by-Step Guide to Calculating the Cost

1. Tokenize the Input and Output

First, determine the number of tokens in your input text and the model’s output.

Example:

Input Prompt: “What is the weather like in New York today?” (8 tokens)

Output: “The weather in New York today is sunny with a high of 75 degrees.” (14 tokens)

Total Tokens: 8 + 14 = 22 tokens

2. Identify the Pricing for the Model

Pricing will vary depending on the model provider. For this example, let’s assume the pricing is:

$0.02 per 1,000 tokens

3. Calculate Total Cost Based on Tokens

Multiply the total number of tokens by the rate per 1,000 tokens:

TOTAL COST = (22 / 1,000) * $0.02 = $0.00044
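
The same arithmetic as a small Python helper; the $0.02 per 1,000 tokens rate is the illustrative figure used above, not a real price list.

def estimate_cost(input_tokens, output_tokens, rate_per_1k_tokens=0.02):
    # Cost = (total tokens / 1,000) * rate per 1,000 tokens
    total_tokens = input_tokens + output_tokens
    return (total_tokens / 1000) * rate_per_1k_tokens

print(estimate_cost(8, 14))  # 0.00044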

Factors Influencing Token Costs

Several factors can influence the number of tokens generated and therefore the overall cost:


Length of Input Prompts:


Longer prompts result in more input tokens, increasing the overall token count.

Length of Output Responses:


If the model generates lengthy responses, more tokens are used, leading to higher costs.

Complexity of the Task:


More complex queries that require detailed explanations or multiple steps will result in more tokens, both in the input and output.

Model Used:


Different models (e.g., GPT-3, GPT-4) may have different token limits and pricing structures. More advanced models typically charge higher rates per 1,000 tokens.

Token Limits Per Request:


Many LLM providers impose token limits on each request. For instance, a single request might be capped at 2,048 or 4,096 tokens, including both input and output tokens.


Reducing Costs When Using LLMs

Optimize Prompts:


Keep prompts concise but clear to minimize the number of input tokens. Avoid unnecessary verbosity.

Limit Response Length:


Control the length of the model’s output using the maximum tokens parameter. This prevents the model from generating overly long responses, saving on tokens (see the sketch after this list).

Batch Processing:


If possible, group related queries together to reduce the number of individual requests.

Choose the Right Model:


Use smaller models when applicable, as they are often cheaper per token compared to larger, more advanced models.
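
As a sketch of the “Limit Response Length” and “Choose the Right Model” tips, here is an example using the OpenAI Python client (openai >= 1.0). The model name and token cap are illustrative assumptions; check your provider’s documentation for current models and parameters.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",   # a smaller, cheaper model where it suffices
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}],
    max_tokens=100,        # cap output tokens to control cost
)
print(response.choices[0].message.content)
print(response.usage.total_tokens)  # tokens billed for this request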


 

What are Small Language Models (SLMs)?

Types:

1. Distilled Models

2. Pruned Models

3. Quantized Models

4. Models Trained from Scratch

Key Characteristics of Small Language Models

Model Size and Parameter Count

Small Language Models (SLMs) typically range from hundreds of millions to a few billion parameters, unlike Large Language Models (LLMs), which can have hundreds of billions of parameters. This smaller size allows SLMs to be more resource-efficient, making them easier to deploy on local devices such as smartphones or IoT devices.

Ranges from millions to a few billion parameters.

Suitable for resource-constrained environments.

Easier to run on personal or edge devices


Training Data Requirements

Require less training data overall.

Emphasize the quality of data over quantity.

Faster training cycles due to smaller model size.


Inference Speed

Reduced latency due to fewer parameters.

Suitable for real-time applications.

Can run offline on smaller devices like mobile phones or embedded systems.



Creating small language models involves different techniques, each with unique approaches and trade-offs. Here's a breakdown of the key differences among Distilled Models, Pruned Models, Quantized Models, and Models Trained from Scratch:


1. Distilled Models

Approach: Knowledge distillation involves training a smaller model (the student) to mimic the behavior of a larger, pre-trained model (the teacher). The smaller model learns by approximating the outputs or logits of the larger model, rather than directly training on raw data.

Key Focus: Reduce model size while retaining most of the teacher model's performance.

Use Case: When high accuracy is needed with a smaller computational footprint.

Advantages:

Retains significant accuracy compared to the teacher model.

Faster inference and reduced memory requirements.

Drawbacks:

The process depends on the quality of the teacher model.

May require additional resources for the distillation process.
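
A minimal sketch of the distillation objective in PyTorch: the student is trained against the teacher's softened output distribution plus the ground-truth labels. The temperature and loss weighting are illustrative choices, not a prescribed recipe.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between softened student and teacher distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard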

2. Pruned Models

Approach: Model pruning removes less significant weights, neurons, or layers from a large model based on predefined criteria, such as low weight magnitudes or redundancy.

Key Focus: Reduce the number of parameters and improve efficiency.

Use Case: When the original model is overparameterized, and optimization is required for resource-constrained environments.

Advantages:

Reduces computation and memory usage.

Can target specific hardware optimizations.

Drawbacks:

Risk of accuracy loss if pruning is too aggressive.

Pruning techniques can be complex to implement effectively.
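
A minimal sketch of magnitude-based pruning using PyTorch's pruning utilities; the tiny example model and the 30% sparsity level are illustrative.

import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Zero out the 30% of weights with the smallest L1 magnitude in each Linear layer
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

sparsity = (model[0].weight == 0).float().mean().item()
print(f"Layer 0 sparsity: {sparsity:.0%}")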

3. Quantized Models

Approach: Quantization reduces the precision of the model's parameters from floating-point (e.g., 32-bit) to lower-precision formats (e.g., 8-bit integers).

Key Focus: Improve speed and reduce memory usage, especially on hardware with low-precision support.

Use Case: Optimizing models for edge devices like smartphones or IoT devices.

Advantages:

Drastically reduces model size and computational cost.

Compatible with hardware accelerators like GPUs and TPUs optimized for low-precision arithmetic.

Drawbacks:

Can lead to accuracy degradation, especially for sensitive models.

May require fine-tuning to recover performance after quantization.
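
A minimal sketch of post-training dynamic quantization in PyTorch, which converts Linear layer weights to 8-bit integers. The toy model is illustrative; in practice you would quantize a trained model and then re-validate its accuracy.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Convert Linear layers to int8 weights with dynamic activation quantization
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface, smaller int8 weights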

4. Models Trained from Scratch

Approach: Building and training a model from the ground up, using a new or smaller dataset, rather than modifying a pre-trained large model.

Key Focus: Design a small model architecture tailored to the specific use case or dataset.

Use Case: When there is sufficient training data and computational resources to create a highly specialized model.

Advantages:

Customizable to specific tasks or domains.

No dependency on pre-trained models.

Drawbacks:

Resource-intensive training process.

Typically requires significant expertise in model design and optimization.

May underperform compared to fine-tuned pre-trained models on general tasks.


References: 

https://medium.com/@kanerika/why-small-language-models-are-making-big-waves-in-ai-0bb8e0b6f20c




What is State in LangGraph?

In LangGraph, State is a fundamental concept that represents the data being passed and transformed through nodes in the workflow. It acts as a shared data container for the graph, enabling nodes to read from and write to it during execution.

Breaking Down the Example


class State(TypedDict):

    # The operator.add reducer fn makes this append-only

    messages: Annotated[list, operator.add]

1. TypedDict

State is a subclass of Python's TypedDict. This allows you to define the expected structure (keys and types) of the state dictionary in a strongly typed manner.

Here, the state has one key, messages, which is a list.

2. Annotated

Annotated is a way to add metadata to a type. In this case:


Annotated[list, operator.add]

It indicates that messages is a list.

The operator.add is used as a reducer function.

3. operator.add

operator.add is a Python function that performs addition for numbers or concatenation for lists.

In this context, it is used as a reducer function for the messages list.

4. Reducer Function Behavior

A reducer function specifies how new values should be combined with the existing state during updates.

By using operator.add, the messages list becomes append-only, meaning any new items added to messages will concatenate with the current list instead of replacing it.

Why Use operator.add in State?

Append-Only Behavior:

Each node in the workflow can add to the messages list without overwriting previous values. This is useful for:

Logging messages from different nodes.

Maintaining a sequential record of events.

Thread Safety:

Using a reducer function ensures that state updates are predictable and consistent, even in concurrent workflows.

Flexibility in State Updates:

Reducer functions allow complex operations during state updates, such as appending, merging dictionaries, or performing custom logic.
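
To see the reducer in action, here is a minimal sketch using LangGraph's StateGraph (assuming a recent version of the langgraph package; node names and messages are illustrative). Each node returns only its new messages, and operator.add appends them to the shared list rather than replacing it.

import operator
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    # The operator.add reducer makes this append-only
    messages: Annotated[list, operator.add]

def greet(state: State) -> dict:
    return {"messages": ["hello from greet"]}

def respond(state: State) -> dict:
    return {"messages": ["hello from respond"]}

builder = StateGraph(State)
builder.add_node("greet", greet)
builder.add_node("respond", respond)
builder.add_edge(START, "greet")
builder.add_edge("greet", "respond")
builder.add_edge("respond", END)
graph = builder.compile()

result = graph.invoke({"messages": ["initial"]})
print(result["messages"])  # ['initial', 'hello from greet', 'hello from respond']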

References:

OpenAI 




Friday, November 15, 2024

What is Elixir?

Elixir is a dynamic, functional language for building scalable and maintainable applications.

Elixir runs on the Erlang VM, known for creating low-latency, distributed, and fault-tolerant systems. These capabilities and Elixir tooling allow developers to be productive in several domains, such as web development, embedded software, machine learning, data pipelines, and multimedia processing, across a wide range of industries.


Here is a peek:


iex> "Elixir" |> String.graphemes() |> Enum.frequencies()

%{"E" => 1, "i" => 2, "l" => 1, "r" => 1, "x" => 1}


Platform features

Scalability

All Elixir code runs inside lightweight threads of execution (called processes) that are isolated and exchange information via messages.


Due to their lightweight nature, you can run hundreds of thousands of processes concurrently in the same machine, using all machine resources efficiently (vertical scaling). Processes may also communicate with other processes running on different machines to coordinate work across multiple nodes (horizontal scaling).


Together with projects such as Numerical Elixir, Elixir scales across cores, clusters, and GPUs.


Fault-tolerance

The unavoidable truth about software in production is that things will go wrong. Even more when we take network, file systems, and other third-party resources into account.


To react to failures, Elixir supervisors describe how to restart parts of your system when things go awry, going back to a known initial state that is guaranteed to work:


children = [

  TCP.Pool,

  {TCP.Acceptor, port: 4040}

]


Supervisor.start_link(children, strategy: :one_for_one)

The combination of fault-tolerance and message passing makes Elixir an excellent choice for event-driven systems and robust architectures. Frameworks, such as Nerves, build on this foundation to enable productive development of reliable embedded/IoT systems.


Functional programming

Functional programming promotes a coding style that helps developers write code that is short, concise, and maintainable. For example, pattern matching allows us to elegantly match and assert specific conditions for some code to execute:


def drive(%User{age: age}) when age >= 16 do

  # Code that drives a car

end


drive(User.get("John Doe"))

#=> Fails if the user is under 16

Elixir relies on those features to ensure your software is working under the expected constraints. And when it is not, don't worry, supervisors have your back!


Extensibility and DSLs

Elixir has been designed to be extensible, allowing developers to naturally extend the language to particular domains in order to increase their productivity.


As an example, let's write a simple test case using Elixir's test framework called ExUnit:


defmodule MathTest do

  use ExUnit.Case, async: true


  test "can add two numbers" do

    assert 1 + 1 == 2

  end

end

The async: true option allows tests to run in parallel, using as many CPU cores as possible, while the assert functionality can introspect your code, providing great reports in case of failures.


Other examples include using Elixir to write SQL queries, compiling a subset of Elixir to the GPU, and more.


Tooling features

A growing ecosystem

Elixir ships with a great set of tools to ease development. Mix is a build tool that allows you to easily create projects, manage tasks, run tests and more:


$ mix new my_app

$ cd my_app

$ mix test

.


Finished in 0.04 seconds (0.04s on load, 0.00s on tests)

1 test, 0 failures

Mix also integrates with the Hex package manager for dependency management and hosting documentation for the whole ecosystem.


Interactive development

Tools like IEx (Elixir's interactive shell) leverage the language and platform to provide auto-complete, debugging tools, code reloading, as well as nicely formatted documentation:


$ iex

Interactive Elixir - press Ctrl+C to exit (type h() ENTER for help)

iex> h String.trim           # Prints the documentation

iex> i "Hello, World"        # Prints information about a data type

iex> break! String.trim/1    # Sets a breakpoint

iex> recompile               # Recompiles the current project

Code notebooks like Livebook allow you to interact with Elixir directly from your browser, including support for plotting, flowcharts, data tables, machine learning, and much more!


What are the Pros and Cons of the Erlang VM?

The Erlang Virtual Machine (VM), also known as BEAM, is the runtime system that executes Erlang and Elixir code. It's designed for building concurrent, distributed, and fault-tolerant systems. Below are the pros and cons of using the Erlang VM:


Pros of Erlang VM (BEAM)

1. Concurrency and Scalability

Lightweight Processes: Erlang VM supports millions of lightweight processes, which are independent and do not share memory. This is ideal for building highly concurrent systems.

Efficient Scheduling: BEAM uses preemptive scheduling to ensure fair execution among processes, making it well-suited for multi-core CPUs.

2. Fault Tolerance

Supervisor Trees: Built-in mechanisms allow processes to monitor each other and restart failed processes seamlessly.

Isolation: Processes are isolated, so a crash in one does not affect others.

3. Distributed Systems Support

Erlang VM has first-class support for distributed computing, enabling nodes to communicate over a network as easily as within the same system.

4. Real-Time Systems

Soft Real-Time Capabilities: The VM is designed to handle soft real-time requirements, ensuring timely responses in applications like telecommunications and messaging.

5. Hot Code Upgrades

BEAM allows code to be updated in a running system without downtime, which is crucial for high-availability systems.

6. Garbage Collection

Each process has its own heap and garbage collection, making memory management efficient and avoiding global pauses.

7. Built-in Tools

BEAM provides robust tools for debugging, profiling, and tracing (e.g., Observer, DTrace).

8. Community and Ecosystem

Languages like Elixir leverage BEAM, bringing modern syntax and tooling to its robust runtime.

9. Mature and Battle-Tested

BEAM has been used in production for decades, powering telecom systems, messaging platforms (e.g., WhatsApp), and databases (e.g., CouchDB).



Cons of Erlang VM (BEAM)

1. Performance Limitations

Single-threaded Execution per Scheduler: While great for concurrency, BEAM isn't optimized for raw CPU-bound tasks compared to VMs like JVM.

Limited Numerical Processing: It's less suited for heavy numerical computations or AI/ML tasks.

2. Memory Overhead

Lightweight processes consume more memory compared to raw threads in some other VMs, especially when the number of processes is extremely high.

3. Learning Curve

The functional programming paradigm, immutable data structures, and process model can be challenging for developers used to imperative programming.

4. Lack of Mainstream Libraries

While BEAM has excellent libraries for distributed systems, its ecosystem lacks the breadth of libraries available for JVM or Python.

5. Tooling

Although improving, the tooling (e.g., IDE support) may not be as polished as in more mainstream ecosystems like Java or JavaScript.

6. Latency in Large Distributed Systems

BEAM excels in small to medium-sized distributed systems but can encounter latency challenges when scaling across a very large number of nodes.

7. Limited Language Options

BEAM primarily supports Erlang and Elixir, limiting the variety of languages that can run on the VM compared to platforms like JVM or .NET.

8. Hot Code Loading Complexity

While powerful, hot code upgrades require careful planning and can introduce subtle bugs if not managed correctly.

9. Concurrency Debugging

Debugging concurrent processes and race conditions can be challenging due to the asynchronous nature of communication.

10. Not Mainstream

Erlang and Elixir are not as widely adopted as JavaScript, Python, or Java, which might make finding experienced developers or community support harder.