Wednesday, February 4, 2026

How AppSync can be used with Lambda resolvers for Bedrock inferencing

Using **AWS AppSync with Lambda resolvers** is a flexible way to integrate GraphQL with **Amazon Bedrock**. While AppSync now supports direct integration with Bedrock (no-code), using a Lambda resolver is still preferred when you need to perform **data validation, prompt engineering, or complex post-processing** before returning the AI's response to the client.


### The Architectural Flow


1. **Client Request:** A user sends a GraphQL query or mutation (e.g., `generateSummary(text: String!)`) to the AppSync endpoint.

2. **AppSync Resolver:** AppSync identifies the field and triggers the associated **Lambda Data Source**.

3. **Lambda Function:** The function receives the GraphQL arguments, constructs a prompt, and calls the **Bedrock Runtime API**.

4. **Bedrock Inference:** Bedrock processes the prompt and returns a JSON response.

5. **Return to Client:** Lambda parses the result and returns it to AppSync, which maps it back to the GraphQL schema.


---


### Step-by-Step Implementation


#### 1. Define the GraphQL Schema


In the AppSync console, define the types and the mutation that will trigger the AI.


```graphql

type AIResponse {

  content: String

  usage: String

}


type Mutation {

  askBedrock(prompt: String!): AIResponse

}


```


#### 2. Create the Lambda Resolver (Node.js Example)


The Lambda function acts as the "middleman." It uses the `@aws-sdk/client-bedrock-runtime` to communicate with the foundation models.


```javascript

import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";


const client = new BedrockRuntimeClient({ region: "us-east-1" });


export const handler = async (event) => {

  // Extract the prompt from the AppSync 'arguments' object

  const { prompt } = event.arguments;


  const input = {

    modelId: "anthropic.claude-3-haiku-20240307-v1:0",

    contentType: "application/json",

    accept: "application/json",

    body: JSON.stringify({

      anthropic_version: "bedrock-2023-05-31",

      max_tokens: 500,

      messages: [{ role: "user", content: prompt }],

    }),

  };


  try {

    const command = new InvokeModelCommand(input);

    const response = await client.send(command);

    

    // Decode and parse the binary response body

    const responseBody = JSON.parse(new TextDecoder().decode(response.body));

    

    return {

      content: responseBody.content[0].text,

      usage: "Success"

    };

  } catch (error) {

    console.error(error);

    throw new Error("Failed to invoke Bedrock");

  }

};


```


#### 3. Configure IAM Permissions


Your Lambda function's execution role **must** have permission to call the specific Bedrock model.


```json

{

  "Version": "2012-10-17",

  "Statement": [

    {

      "Effect": "Allow",

      "Action": "bedrock:InvokeModel",

      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"

    }

  ]

}


```


---


### Why use Lambda instead of AppSync's Direct Bedrock Integration?


While AppSync can now talk to Bedrock directly using specialized "Direct Data Source" resolvers, the **Lambda approach** is better for:


* **Prompt Orchestration:** You can fetch additional data from DynamoDB or a vector database (like Pinecone or OpenSearch) to augment the prompt (**RAG architecture**) before sending it to Bedrock (see the sketch after this list).

* **Response Sanitization:** You can filter the AI's output for PII (Personally Identifiable Information) or toxic content before it reaches the user.

* **Logging & Auditing:** You can easily log exact input/output tokens to CloudWatch for cost tracking and performance monitoring.

* **Error Handling:** You can provide custom "fallback" responses if the AI service is throttled or the prompt violates safety filters.
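For the prompt-orchestration case, here is a minimal sketch in Python (boto3) of a Lambda handler that augments the prompt with extra context before invoking Bedrock. The `Documents` table and the optional `docId` argument are hypothetical; the model ID matches the earlier example.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def handler(event, context):
    prompt = event["arguments"]["prompt"]

    # Hypothetical lookup: pull grounding text from a DynamoDB table (simple RAG-style augmentation)
    table = dynamodb.Table("Documents")  # assumed table name
    doc_id = event["arguments"].get("docId", "default")  # assumed optional argument
    item = table.get_item(Key={"id": doc_id}).get("Item", {})
    grounding_text = item.get("text", "")

    augmented_prompt = (
        f"Use the following context to answer.\n\nContext:\n{grounding_text}\n\nQuestion: {prompt}"
    )

    # Same Messages API call as the Node.js resolver above
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 500,
            "messages": [{"role": "user", "content": augmented_prompt}],
        }),
    )
    body = json.loads(response["body"].read())
    return {"content": body["content"][0]["text"], "usage": "Success"}
```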


### Handling Long-Running Queries


AppSync enforces a **30-second timeout** on query and mutation execution. If the model (like Claude 3 Opus) takes longer than that to generate a response, the request will fail. In those cases, it is recommended to use **AppSync Subscriptions** to stream the response back to the client token by token.
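A rough sketch of the streaming call itself, using boto3's `invoke_model_with_response_stream` (the publish-to-subscription step is only indicated in comments, since it depends on how your schema defines the subscription):

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def stream_completion(prompt):
    """Yield text chunks from Bedrock as they are generated."""
    response = bedrock.invoke_model_with_response_stream(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 500,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    for event in response["body"]:
        chunk = json.loads(event["chunk"]["bytes"])
        # Claude 3 streaming emits content_block_delta events carrying text deltas
        if chunk.get("type") == "content_block_delta":
            # In a real setup, each delta would be published to AppSync
            # (e.g., via a mutation wired to a subscription) instead of yielded.
            yield chunk["delta"].get("text", "")
```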


How to use Amazon Kinesis Data Analytics for GraphQL?

Using **Amazon Kinesis Data Analytics** (now called **Amazon Managed Service for Apache Flink**) to parse GraphQL is tricky because GraphQL queries are sent as **strings** inside a JSON payload. Unlike standard JSON, you cannot simply use "dot" notation to access fields inside the query; you must parse the GraphQL DSL (Domain Specific Language) itself.


There are three main ways to achieve this, depending on how much detail you need from the query.


---


### 1. The "Robust" Path: Apache Flink with a Parser Library


If you need to extract specific fields (e.g., "how many times was the `email` field requested?"), you should use the **Managed Service for Apache Flink** with a custom Java or Python application.


* **How it works:** You write a Flink application that includes a GraphQL parsing library (like `graphql-java` for Java or `graphql-core` for Python).

* **The Logic:**

1. Flink consumes the JSON record from the Kinesis Stream.

2. A `MapFunction` extracts the `query` string from the JSON.

3. The parser library converts that string into an **AST (Abstract Syntax Tree)**.

4. You traverse the tree to find the operation name, fragments, or specific leaf fields.



* **Best for:** Deep security auditing, complexity analysis, or fine-grained usage billing.
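As a sketch of steps 3–4, here is what the AST traversal looks like with `graphql-core` in Python (the same idea applies to `graphql-java` inside a Flink `MapFunction`); the sample query is illustrative:

```python
import json
from graphql import parse, FieldNode, OperationDefinitionNode

def extract_fields(record_json: str) -> dict:
    """Pull the 'query' string out of a Kinesis record and walk its AST."""
    payload = json.loads(record_json)
    document = parse(payload["query"])  # raises GraphQLSyntaxError on malformed queries

    operations, fields = [], []
    for definition in document.definitions:
        if isinstance(definition, OperationDefinitionNode):
            operations.append(definition.name.value if definition.name else "anonymous")
            # Depth-first walk of the selection set to collect field names
            stack = list(definition.selection_set.selections)
            while stack:
                node = stack.pop()
                if isinstance(node, FieldNode):
                    fields.append(node.name.value)
                    if node.selection_set:
                        stack.extend(node.selection_set.selections)
    return {"operations": operations, "fields": fields}

# Example: {"operations": ["GetUser"], "fields": ["user", "email", "id"]} (field order may vary)
print(extract_fields('{"query": "query GetUser { user(id: 1) { id email } }"}'))
```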


### 2. The "Simple" Path: Kinesis SQL with Regex


If you only need to extract the **Operation Name** or verify the presence of a specific keyword, you can use the Legacy SQL runtime (or Flink SQL).


* **How it works:** Use the `REGEXP_EXTRACT` function to find patterns within the query string.

* **Example SQL:**

```sql

SELECT 

    STREAM_NAME,

    REGEXP_EXTRACT(query_payload, 'query\s+(\w+)', 1) AS operation_name

FROM "SOURCE_SQL_STREAM_001";


```



* **Best for:** Real-time dashboards showing which queries (by name) are most popular.

* **Limitation:** This is very brittle. If a user changes their whitespace or uses aliases, the regex will likely fail.


### 3. The "Hybrid" Path: Lambda Pre-processing


The most common production pattern is to parse the GraphQL **before** it reaches Kinesis Analytics using a **Kinesis Data Firehose Transformation Lambda**.


1. **Ingest:** Data is sent to Kinesis Data Firehose.

2. **Transform:** Firehose triggers an AWS Lambda function.

3. **Parse:** The Lambda uses a standard GraphQL library to parse the query and flatten it into a standard JSON object (e.g., `{"operation": "GetUser", "fields": ["id", "name"]}`), as sketched after this list.

4. **Analyze:** The flattened JSON is sent to Kinesis Analytics, which can now use simple SQL to analyze the data because it is no longer a complex string.
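A minimal sketch of the transformation Lambda (steps 2–3), again leaning on `graphql-core`; the flattened shape below is just one reasonable choice:

```python
import base64
import json
from graphql import parse, OperationDefinitionNode

def handler(event, context):
    """Kinesis Data Firehose transformation: flatten GraphQL queries into plain JSON."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]).decode("utf-8"))
        try:
            document = parse(payload.get("query", ""))
            op = next(
                (d for d in document.definitions if isinstance(d, OperationDefinitionNode)),
                None,
            )
            flattened = {
                "operation": op.name.value if op and op.name else "anonymous",
                "fields": [
                    sel.name.value
                    for sel in (op.selection_set.selections if op else [])
                    if hasattr(sel, "name")
                ],
            }
            result = "Ok"
        except Exception:
            flattened, result = {"error": "unparseable query"}, "ProcessingFailed"

        output.append({
            "recordId": record["recordId"],
            "result": result,
            "data": base64.b64encode((json.dumps(flattened) + "\n").encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```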


---


### Comparison of Methods


| Feature | Flink + Parser Library | SQL + Regex | Lambda Pre-processor |
| --- | --- | --- | --- |
| **Parsing Depth** | Full (AST level) | Shallow (Pattern matching) | Full (JSON flattening) |
| **Complexity** | High (Custom Code) | Low (Standard SQL) | Moderate (Simple Lambda) |
| **Performance** | Highest (Native) | High | Moderate (Lambda overhead) |
| **Use Case** | Advanced Analytics | Basic Dashboards | General Purpose ETL |




What is the difference between Apollo and AppSync when integrating with Bedrock?

 


Integrating **Apollo GraphQL** with **AWS Bedrock** creates a powerful bridge between your frontend and generative AI models. While Apollo manages your "Data Graph," Bedrock provides the "Intelligence" layer.


In this architecture, Apollo acts as the **orchestrator**, translating GraphQL queries into Bedrock API calls and shaping the AI's response to match your application's schema.


---


### 1. The Architectural Flow


The most common way to integrate these is by hosting an **Apollo Server** (on AWS Lambda, ECS, or Fargate) that uses the **AWS SDK** to communicate with Bedrock.


1. **Client Query:** The frontend sends a GraphQL query (e.g., `askAI(prompt: "...")`).

2. **Apollo Resolver:** A specific function in your Apollo Server intercepts the query.

3. **Bedrock Runtime:** The resolver calls the `InvokeModel` or `Converse` API via the `@aws-sdk/client-bedrock-runtime`.

4. **Schema Mapping:** Apollo transforms the raw JSON response from the AI (like Claude or Llama) into the structured format defined in your GraphQL schema.


---


### 2. Implementation Patterns


#### A. The "Standard" Apollo Resolver


In this pattern, you define a `Mutation` or `Query` in your schema. The resolver is responsible for the "heavy lifting."


```javascript

// Example Resolver logic
import { InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

const resolvers = {
  Mutation: {
    generateResponse: async (_, { prompt }, { bedrockClient }) => {
      // Claude 3 models on Bedrock require the Messages API format
      const command = new InvokeModelCommand({
        modelId: "anthropic.claude-3-sonnet-20240229-v1:0",
        contentType: "application/json",
        accept: "application/json",
        body: JSON.stringify({
          anthropic_version: "bedrock-2023-05-31",
          max_tokens: 300,
          messages: [{ role: "user", content: prompt }],
        }),
      });

      const response = await bedrockClient.send(command);
      const resBody = JSON.parse(new TextDecoder().decode(response.body));
      return { text: resBody.content[0].text };
    },
  },
};


```


#### B. Streaming with Subscriptions


AI responses take time. To avoid timeouts and improve UX, you can use **GraphQL Subscriptions**.


* The client **subscribes** to a response channel.

* Apollo Server uses `InvokeModelWithResponseStream` to get tokens incrementally from Bedrock.

* As tokens arrive, Apollo "publishes" them to the subscription, appearing instantly on the user's screen.


---


### 3. Apollo vs. AWS AppSync for Bedrock


While you can build this manually with Apollo, AWS offers a managed GraphQL service called **AppSync** which has a native integration.


| Feature | Apollo Server (Self-Managed) | AWS AppSync (Managed) |
| --- | --- | --- |
| **Setup** | High control; requires hosting (Lambda/ECS). | Fully managed; serverless by default. |
| **Bedrock Integration** | Via **AWS SDK** in resolvers. | **Direct Bedrock Resolvers** (no code/Lambda needed). |
| **Streaming** | Requires WebSocket setup (Apollo Subscriptions). | Built-in via serverless WebSockets. |
| **Type Safety** | High (native GraphQL). | High (native GraphQL). |


---


### 4. Key Use Cases


* **Self-Documenting AI:** Bedrock Agents can use your Apollo GraphQL endpoint as an "Action Group." Because GraphQL is introspectable, the AI can "read" your schema to understand what data it can fetch.

* **Data Aggregation:** You can create a field like `aiSummary` on a `Product` type. When queried, Apollo fetches the product data from DynamoDB and simultaneously asks Bedrock to summarize it.



Explain how API Gateway with VTL transformations works for GraphQL

In AWS, using **API Gateway with VTL (Velocity Template Language) for GraphQL** usually refers to one of two patterns: either using a REST API Gateway as a "proxy" or "translator" in front of a GraphQL backend, or utilizing the VTL engine within **AWS AppSync** (which is AWS’s dedicated managed GraphQL service).


VTL is a logic-heavy templating engine that allows you to manipulate data as it flows through the gateway without needing to spin up a compute resource like AWS Lambda.


### 1. The Core Role of VTL


VTL acts as a **data transformer**. In a typical request/response flow:


* **Request Mapping:** When a client sends a GraphQL query (which is essentially a JSON POST request), VTL intercepts it and reshapes it into a format the backend understands (e.g., a specific DynamoDB JSON structure or a specialized HTTP request).

* **Response Mapping:** When the backend returns data, VTL takes that raw output and reshapes it to match the GraphQL schema’s expected format before sending it back to the client.


### 2. Common Implementation Patterns


#### Pattern A: AppSync Resolvers (Native GraphQL)


AppSync is the primary service where VTL is used for GraphQL. It uses **Resolvers** to connect fields in your GraphQL schema to data sources.


* **Logic without Code:** You can write VTL to perform validation, authorization checks, or conditional logic (e.g., "if the user is an Admin, return the full object; otherwise, hide the email field").

* **Direct Integration:** VTL can translate a GraphQL `getPost(id: "123")` query directly into a DynamoDB `GetItem` request, bypassing the need for a Lambda function entirely. This reduces latency and cost.


#### Pattern B: API Gateway (REST) as a Proxy


If you have a REST API Gateway sitting in front of a GraphQL server (like an Apollo server running on Fargate):


* **Request Sanitization:** Use VTL to extract specific headers or JWT claims and inject them into the GraphQL query's `variables` or `context`.

* **Payload Wrapping:** API Gateway can take a simple REST-style input and wrap it into a valid GraphQL `query` string using VTL to send it to a downstream GraphQL endpoint.


### 3. Key VTL Features used in GraphQL


* **`$context` (or `$ctx`)**: An object containing all request details, including arguments (`$ctx.args`), identity/auth info (`$ctx.identity`), and previous results in a pipeline.

* **`$util`**: A library of helper functions for common tasks like generating UUIDs (`$util.autoId()`), converting strings to JSON, or escaping characters.

* **Directives**: VTL supports `#if`, `#foreach`, and `#set`, allowing you to loop through a list of items returned by a database and format them as a GraphQL array.


### Why use VTL instead of Lambda?


* **Performance:** VTL runs natively within the API Gateway or AppSync service. There is no "cold start" associated with it.

* **Cost:** You are not charged for the execution time of VTL templates (you pay for the API request), whereas Lambda incurs a cost per execution and duration.

* **Simplicity:** For simple data mapping (e.g., renaming a field or extracting an ID), a few lines of VTL is often easier to maintain than a full deployment package of code.


You can learn more about the mechanics of data transformations in this [AWS API Gateway Tutorial](https://www.youtube.com/watch?v=-_nYddYkd7M). This video explains how mapping templates function to restructure requests and responses, which is the foundational concept for handling GraphQL payloads.

Friday, January 30, 2026

Create your own RAG evaluation strategy

 Building a RAG (Retrieval-Augmented Generation) evaluation from scratch is actually a great way to deeply understand where your pipeline is failing. While frameworks like Ragas or Arize Phoenix are popular, they are essentially just wrappers for specific prompts and math.

To do this manually, you need to evaluate the two distinct pillars of RAG: Retrieval (finding the right info) and Generation (using that info correctly).

1. The Evaluation Dataset

You can’t evaluate without a "Golden Dataset." Create a spreadsheet with 20–50 rows containing:

 * Question: What the user asks.

 * Context/Source: The specific document snippet that contains the answer.

 * Ground Truth: The ideal, "perfect" answer.
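For example, one row of such a dataset might look like this (wording is illustrative; the same keys are reused in the loop further down):

```python
golden_dataset = [
    {
        "question": "How do I delete a local git branch?",
        "context": "Command: git branch -d. Description: Delete a local branch. Examples: git branch -d old-feature",
        "ground_truth": "Run 'git branch -d <branch-name>', for example 'git branch -d old-feature'.",
    },
    # ... 20-50 rows like this
]
```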

2. Evaluating Retrieval (The "Search" Part)

This measures if your vector database is actually finding the right documents. You don't need an LLM for this; you just need basic math.

 * Hit Rate (Precision at K): Did the correct document appear in the top k results?

   * Calculation: (Number of queries where the right doc was found) / (Total queries).

 * Mean Reciprocal Rank (MRR): Measures where the right document appeared. It rewards the system more for having the correct answer at rank 1 than rank 5.

   * Formula: MRR = (1/Q) × Σ (1/rank_i), where Q is the number of queries and rank_i is the rank of the first correct document for query i.
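Both metrics are a few lines of Python. This sketch assumes each query gives you a ranked list of retrieved document IDs plus the ID of the golden document:

```python
def hit_rate_and_mrr(results, k=5):
    """results: list of (retrieved_ids, relevant_id) pairs, one per golden question."""
    hits, reciprocal_ranks = 0, []
    for retrieved_ids, relevant_id in results:
        top_k = retrieved_ids[:k]
        if relevant_id in top_k:
            hits += 1
            reciprocal_ranks.append(1.0 / (top_k.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return hits / len(results), sum(reciprocal_ranks) / len(results)

# Right doc at rank 1 for query 1, rank 3 for query 2 -> hit rate 1.0, MRR (1 + 1/3) / 2 ≈ 0.67
print(hit_rate_and_mrr([(["d1", "d9"], "d1"), (["d4", "d7", "d2"], "d2")]))
```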

3. Evaluating Generation (The "LLM-as-a-Judge" Method)

Since manual grading is slow, you can use a "Judge LLM" (like GPT-4o or Claude 3.5) to grade your RAG output. You feed the judge a specific prompt for each of these three metrics:

A. Faithfulness (Groundedness)

Does the answer stay true to the retrieved context, or did the LLM hallucinate?

 * The Prompt: "Given the following context and the generated answer, list every claim in the answer. For each claim, state if it is supported by the context. Score 1.0 if all claims are supported, 0.0 otherwise."

B. Answer Relevance

Does the answer actually address the user's question?

 * The Prompt: "On a scale of 1-5, how relevant is this response to the original user question? Ignore whether the facts are true for now; focus only on whether it addresses the user's intent."

C. Context Precision

Did the retrieval step provide "clean" information, or was it full of noise?

 * The Prompt: "Check the retrieved context. Is this information actually necessary to answer the user's question? Rate 1 for useful, 0 for irrelevant."

4. Simple Python Implementation Structure

You don't need a library; a simple loop will do:

```python
results = []

for item in golden_dataset:
    # 1. Run your RAG pipeline
    retrieved_docs = retriever.get_relevant_documents(item['question'])
    response = rag_chain.invoke(item['question'])

    # 2. Manual/LLM Scoring
    score = call_judge_llm(
        system_prompt="You are a grader...",
        user_content=f"Question: {item['question']}\nContext: {retrieved_docs}\nAnswer: {response}"
    )

    results.append({"question": item['question'], "score": score})

# 3. Calculate the mean score
final_grade = sum(r['score'] for r in results) / len(results)
```
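The `call_judge_llm` helper above is left abstract on purpose. One possible sketch uses Bedrock's Converse API (an assumption; any chat-completion client works, and the judge model ID is illustrative):

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def call_judge_llm(system_prompt: str, user_content: str) -> float:
    """Ask a judge model for a numeric grade between 0 and 1."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative judge model
        system=[{"text": system_prompt + " Reply with a single number between 0 and 1."}],
        messages=[{"role": "user", "content": [{"text": user_content}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    text = response["output"]["message"]["content"][0]["text"].strip()
    try:
        return float(text)
    except ValueError:
        return 0.0  # treat unparseable judge output as a failed grade
```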


Summary Table: What to Track

| Metric | What it tests | Success Criteria |
|---|---|---|
| Context Recall | Retrieval | Is the ground truth present in the chunks? |
| Faithfulness | Generation | Did the LLM make things up? |
| Answer Similarity | Generation | How close is the answer to the Ground Truth? (Use Semantic Similarity) |



Arize Phoenix embedding visualization and observability

 Arize Phoenix is different from Ragas or DeepEval because it is an observability tool. Instead of just giving you a score, it launches a local web dashboard that lets you visually inspect your CLI embeddings and trace exactly how your RAG pipeline is performing in real-time.

For your CLI project, Phoenix is incredibly helpful for seeing "clusters" of commands and finding out why a specific query retrieved the wrong CLI command.

1. Prerequisites

pip install arize-phoenix llama-index-callbacks-arize-phoenix


2. Implementation Code

This script connects LlamaIndex to Phoenix. Once you run this, a browser window will open showing your RAG "traces."

```python
import phoenix as px
import llama_index.core
from llama_index.core import VectorStoreIndex, Document, Settings
from llama_index.core.callbacks import CallbackManager
from llama_index.callbacks.arize_phoenix import ArizePhoenixCallbackHandler

# 1. Start the Phoenix Search & Trace server (launches a local web UI)
session = px.launch_app()

# 2. Setup LlamaIndex to send data to Phoenix
remote_callback_handler = ArizePhoenixCallbackHandler()
callback_manager = CallbackManager([remote_callback_handler])
Settings.callback_manager = callback_manager

# 3. Your CLI JSON Data
cli_data = [
    {"command": "git checkout -b", "description": "Create and switch to a new branch", "examples": ["git checkout -b feature-login"]},
    {"command": "git branch -d", "description": "Delete a local branch", "examples": ["git branch -d old-feature"]}
]

# 4. Standard LlamaIndex Ingestion
documents = [Document(text=f"{item['command']}: {item['description']}") for item in cli_data]
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# 5. Run a query
# After running this, check the Phoenix Dashboard!
response = query_engine.query("How do I make a new branch?")

print(f"Answer: {response}")
print(f"Phoenix Dashboard URL: {session.url}")

# Keep the script running so you can explore the UI
import time
time.sleep(1000)
```


What to look for in the Phoenix UI:

 * Traces: You will see a "timeline" of your query. You can click on it to see exactly what text was sent to the embedding model and what chunks were pulled from your JSON.

 * The Embedding Map: Phoenix can visualize your CLI commands as dots in a 3D space.

   * Example: You might see a cluster of "Docker" commands and a cluster of "Git" commands.

   * Insight: If "how do I delete a branch" pulls up a "Docker delete" command, you will see the query dot land in the wrong cluster, telling you that your embeddings need more technical context.

 * LLM Evaluation: Phoenix can run "Evals" in the background. It will flag queries that it thinks were "Unfaithful" or had "Poor Retrieval" based on its internal heuristics.

Comparison: When to use which?

| Use Case | Recommended Tool |
|---|---|
| "I want to know if my RAG is accurate." | Ragas |
| "I want to prevent breaking changes in my code." | DeepEval |
| "I want to see WHY my RAG is failing visually." | Arize Phoenix |



Using DeepEval

 DeepEval is often called the "Pytest for LLMs" because it allows you to write evaluation scripts that feel exactly like standard software unit tests.

For your CLI JSON project, DeepEval is particularly useful because it provides Reasoning. If a command fails the test, it will tell you exactly why (e.g., "The model suggested the --force flag, but the JSON context only mentions --recursive").

1. Prerequisites

pip install deepeval


2. The DeepEval Test File (test_cli_rag.py)

This script uses the RAG Triad (Faithfulness, Answer Relevancy, and Contextual Precision) to test your CLI commands.

```python
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric, ContextualPrecisionMetric

# 1. Setup the metrics with passing thresholds
# Threshold 0.7 means the score must be > 0.7 to "Pass" the unit test
faithfulness_metric = FaithfulnessMetric(threshold=0.7)
relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
precision_metric = ContextualPrecisionMetric(threshold=0.7)


def test_docker_ps_command():
    # --- SIMULATED RAG OUTPUT ---
    # In a real test, you would call your query_engine.query() here
    input_query = "How do I see all my containers, even stopped ones?"
    actual_output = "Use the command 'docker ps -a' to list all containers including stopped ones."
    expected_output = "Run 'docker ps -a' to list every container, including stopped ones."
    retrieval_context = [
        "Command: docker ps. Description: List running containers. Examples: docker ps -a"
    ]

    # 2. Create the Test Case
    # ContextualPrecisionMetric judges the ranking of retrieved chunks against
    # the expected output, so the test case needs expected_output set.
    test_case = LLMTestCase(
        input=input_query,
        actual_output=actual_output,
        expected_output=expected_output,
        retrieval_context=retrieval_context
    )

    # 3. Assert the test with multiple metrics
    assert_test(test_case, [faithfulness_metric, relevancy_metric, precision_metric])


def test_non_existent_command():
    input_query = "How do I hack into NASA?"
    actual_output = "I'm sorry, I don't have information on that."
    retrieval_context = []  # Nothing found in your CLI JSON

    test_case = LLMTestCase(
        input=input_query,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )

    assert_test(test_case, [relevancy_metric])
```


3. Running the Test

You run this from your terminal just like a normal python test:

deepeval test run test_cli_rag.py


4. Why DeepEval is better than Ragas for CLI:

 * The Dashboard: If you run deepeval login, all your results are uploaded to a web dashboard where you can see how your CLI tool's accuracy changes over time as you add more commands to your JSON.

 * Strict Flags: You can create a custom GEval metric in DeepEval specifically to check for "Flag Accuracy"—ensuring the LLM never hallucinates a CLI flag that isn't in your documentation (see the sketch after this list).

 * CI/CD Integration: You can block a GitHub Pull Request from merging if the "CLI Accuracy" score drops below 80%.
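A rough sketch of such a custom metric with DeepEval's GEval (the criteria wording and threshold are illustrative; double-check the current GEval signature against the DeepEval docs):

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

# Custom "Flag Accuracy" metric: fail if the answer mentions a CLI flag
# that does not appear in the retrieved documentation.
flag_accuracy_metric = GEval(
    name="Flag Accuracy",
    criteria=(
        "Identify every CLI flag (e.g. -a, --force) mentioned in the actual output. "
        "The output fails if any of those flags is not present in the retrieval context."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.8,
)

def test_flag_accuracy():
    test_case = LLMTestCase(
        input="How do I see all my containers, even stopped ones?",
        actual_output="Use 'docker ps -a' to list all containers including stopped ones.",
        retrieval_context=["Command: docker ps. Description: List running containers. Examples: docker ps -a"],
    )
    assert_test(test_case, [flag_accuracy_metric])
```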

Comparison: Ragas vs. DeepEval

| Feature | Ragas | DeepEval |
|---|---|---|
| Primary Use | Research / Bulk Data Eval | Engineering / Unit Testing |
| Output | Raw Scores (0.0 - 1.0) | Pass/Fail + Detailed Reasoning |
| Integration | Pandas / Notebooks | Pytest / GitHub Actions |
| UI | None (requires 3rd party) | Built-in Cloud Dashboard |
