Thursday, October 23, 2025

One Proportion vs Two Proportion Tests



## **One Proportion Test**


**Tests:** One sample proportion against a known/hypothesized population proportion


### **When to Use:**

- Comparing **one group** to a known standard or benchmark

- Testing if a **single proportion** differs from an expected value


### **Formula:**

```python

z = (p̂ - p₀) / √[p₀(1-p₀)/n]

```

Where:

- p̂ = sample proportion

- p₀ = hypothesized population proportion

- n = sample size
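
For intuition, the statistic can also be computed directly from this formula. A minimal sketch (using the 180-of-200 on-time deliveries vs. a 95% claim from Example 1 below; note that `statsmodels`' `proportions_ztest` estimates the standard error from the sample proportion by default, so its z value may differ slightly from this textbook form):

```python
# Hand computation of the one-proportion z-test from the formula above
import numpy as np
from scipy import stats

x, n, p0 = 180, 200, 0.95                # successes, sample size, hypothesized proportion
p_hat = x / n                            # sample proportion
se = np.sqrt(p0 * (1 - p0) / n)          # standard error under H0
z = (p_hat - p0) / se
p_value = 2 * stats.norm.sf(abs(z))      # two-sided p-value
print(f"z = {z:.3f}, p = {p_value:.4f}")
```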


## **Two Proportion Test**


**Tests:** Difference between proportions from two independent groups


### **When to Use:**

- Comparing **two different groups** to each other

- Testing if proportions differ between two populations


### **Formula:**

```python

z = (p̂₁ - p̂₂) / √[p̂_pool(1-p̂_pool)(1/n₁ + 1/n₂)]

```

Where:

- p̂₁, p̂₂ = sample proportions of groups 1 and 2

- n₁, n₂ = sample sizes; x₁, x₂ = success counts

- p̂_pool = (x₁ + x₂)/(n₁ + n₂), the pooled proportion
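
Again as a sketch, the pooled z statistic can be computed directly from the formula (Drug A/B numbers are taken from Example 2 below, purely for illustration; this should agree with the `proportions_ztest` call shown there):

```python
# Hand computation of the two-proportion z-test using the pooled proportion
import numpy as np
from scipy import stats

x1, n1 = 45, 50    # Drug A: successes, patients
x2, n2 = 35, 50    # Drug B: successes, patients

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                        # pooled proportion
se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))   # pooled standard error
z = (p1 - p2) / se
p_value = stats.norm.sf(z)                            # one-sided: H1 is p1 > p2
print(f"z = {z:.3f}, p = {p_value:.4f}")
```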


---


## **Decision Guide:**


```python

def choose_test():

    """Simple decision guide"""

    print("ASK YOURSELF: How many groups am I comparing?")

    print()

    print("šŸ” ONE PROPORTION TEST:")

    print("   Q: Is my SINGLE group different from a known standard?")

    print("   → Use when: Comparing to historical data/benchmark")

    print()

    print("šŸ” TWO PROPORTION TEST:") 

    print("   Q: Are these TWO GROUPS different from each other?")

    print("   → Use when: Comparing Group A vs Group B")

    

choose_test()

```


---


## **Real-World Examples:**


### **Example 1: One Proportion Test**

```python

# Scenario: Company Quality Claim

# "We deliver 95% of packages on time"

# Sample: 180 out of 200 packages delivered on time


# Question: "Does our actual performance match the 95% claim?"

# → ONE PROPORTION TEST (one group vs known standard)


from statsmodels.stats.proportion import proportions_ztest


# One proportion test

z_stat, p_value = proportions_ztest(count=180, nobs=200, value=0.95, alternative='two-sided')

print(f"One Proportion Test: z={z_stat:.3f}, p={p_value:.4f}")

```


### **Example 2: Two Proportion Test**

```python

# Scenario: Drug Effectiveness

# Drug A: 45 successes out of 50 patients

# Drug B: 35 successes out of 50 patients


# Question: "Is Drug A more effective than Drug B?"

# → TWO PROPORTION TEST (comparing two groups)


z_stat, p_value = proportions_ztest(count=[45, 35], nobs=[50, 50], value=0, alternative='larger')

print(f"Two Proportion Test: z={z_stat:.3f}, p={p_value:.4f}")

```


---


## **Detailed Comparison Table:**


| Aspect | One Proportion Test | Two Proportion Test |

|--------|---------------------|---------------------|

| **Groups Compared** | One sample vs known value | Two independent samples |

| **Research Question** | "Does our rate equal X%?" | "Are these two rates different?" |

| **Null Hypothesis** | H₀: p = p₀ | H₀: p₁ = p₂ |

| **Data Required** | p̂, n, p₀ | p̂₁, n₁, p̂₂, n₂ |

| **Common Use Cases** | Quality control, claim verification | A/B testing, treatment comparisons |


---


## **Medical Examples:**


### **One Proportion (Medical):**

```python

# Hospital Infection Rates

# National standard: Infection rate should be ≤ 2%

# Our hospital: 8 infections in 300 patients (2.67%)


# Question: "Does our hospital meet the national standard?"

# → ONE PROPORTION TEST


print("ONE PROPORTION TEST - Hospital Quality")

print("H₀: Our infection rate ≤ 2% (meets standard)")

print("H₁: Our infection rate > 2% (exceeds standard)")


z_stat, p_value = proportions_ztest(count=8, nobs=300, value=0.02, alternative='larger')
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")

```


### **Two Proportion (Medical):**

```python

# Smoking by Gender

# Males: 40 smokers out of 150

# Females: 20 smokers out of 100


# Question: "Do smoking rates differ by gender?"

# → TWO PROPORTION TEST


print("TWO PROPORTION TEST - Smoking by Gender")

print("H₀: p_male = p_female (no difference)")

print("H₁: p_male ≠ p_female (rates differ)")


z_stat, p_value = proportions_ztest(count=[40, 20], nobs=[150, 100], value=0, alternative='two-sided')
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")

```


---


## **Business Examples:**


### **One Proportion (Business):**

```python

# E-commerce Conversion Rate

# Industry benchmark: 3% conversion rate

# Our site: 45 conversions from 1200 visitors (3.75%)


# Question: "Is our conversion rate better than industry average?"

# → ONE PROPORTION TEST


z_stat, p_value = proportions_ztest(count=45, nobs=1200, value=0.03, alternative='larger')
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")

```


### **Two Proportion (Business):**

```python

# Marketing Campaign A/B Test

# Version A: 120 clicks from 2000 impressions (6%)

# Version B: 90 clicks from 2000 impressions (4.5%)


# Question: "Which ad version performs better?"

# → TWO PROPORTION TEST


z_stat, p_value = proportions_ztest(count=[120, 90], nobs=[2000, 2000], value=0, alternative='larger')
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")

```


---


## **Key Questions to Determine Which Test:**


### **Ask These Questions:**


#### **For One Proportion Test:**

1. "Am I comparing **one group** to a **known standard**?"

2. "Do I have a **historical benchmark** to compare against?"

3. "Is there a **target value** I'm trying to achieve?"

4. "Am I testing a **claim** about a single population?"


#### **For Two Proportion Test:**

1. "Am I comparing **two different groups**?"

2. "Do I want to know if **Group A differs from Group B**?"

3. "Am I running an **A/B test** or **treatment comparison**?"

4. "Are these **independent samples** from different populations?"


---


## **Complete Decision Framework:**


```python

def proportion_test_selector():

    """Interactive test selector"""

    

    print("PROPORTION TEST SELECTOR")

    print("=" * 40)

    

    questions = [

        "How many groups are you analyzing? (1/2)",

        "Do you have a known benchmark to compare against? (yes/no)", 

        "Are you comparing two different treatments/conditions? (yes/no)",

        "Is this quality control against a standard? (yes/no)",

        "Are you testing if two groups differ from each other? (yes/no)"

    ]

    

    print("\nAnswer these questions:")

    for i, question in enumerate(questions, 1):

        print(f"{i}. {question}")

    

    print("\nšŸŽÆ QUICK DECISION GUIDE:")

    print("• Known standard + One group → ONE PROPORTION TEST")

    print("• Two groups comparison → TWO PROPORTION TEST")

    print("• Quality control → ONE PROPORTION TEST") 

    print("• A/B testing → TWO PROPORTION TEST")


proportion_test_selector()

```


---


## **When to Use Each - Summary:**


### **✅ Use ONE PROPORTION TEST when:**

- Testing against **industry standards**

- **Quality control** checks

- Verifying **company claims**

- Comparing to **historical data**

- **Regulatory compliance** testing


### **✅ Use TWO PROPORTION TEST when:**

- **A/B testing** (website versions, ads, etc.)

- **Treatment comparisons** (drug A vs drug B)

- **Demographic comparisons** (male vs female, young vs old)

- **Geographic comparisons** (Region A vs Region B)

- **Time period comparisons** (before vs after campaign)


---


## **Statistical Note:**


```python

# Both tests rely on these assumptions:

assumptions = {

    'random_sampling': 'Data collected through random sampling',

    'independence': 'Observations are independent', 

    'sample_size': 'np ≥ 10 and n(1-p) ≥ 10 for each group',

    'normal_approximation': 'Sample size large enough for normal approximation'

}

```
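
A small helper along these lines can sanity-check the success/failure counts before running either test (a sketch, not part of statsmodels):

```python
def check_proportion_assumptions(counts, nobs, p0=None):
    """Rough check of the np >= 10 and n(1-p) >= 10 rule for each group."""
    ok = True
    for x, n in zip(counts, nobs):
        p = p0 if p0 is not None else x / n   # use the hypothesized p for a one-sample test
        successes, failures = n * p, n * (1 - p)
        print(f"n={n}: n*p = {successes:.1f}, n*(1-p) = {failures:.1f}")
        ok = ok and successes >= 10 and failures >= 10
    print("Normal approximation looks reasonable" if ok else "Sample may be too small")
    return ok

check_proportion_assumptions(counts=[180], nobs=[200], p0=0.95)   # one-proportion example
check_proportion_assumptions(counts=[45, 35], nobs=[50, 50])      # two-proportion example
```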


## **Bottom Line:**


**Choose One Proportion Test when comparing to a known standard. Choose Two Proportion Test when comparing two groups to each other.**


The key distinction is whether you have an **external benchmark** (one proportion) or are making an **internal comparison** (two proportions)!

What is the Open Semantic Interchange (OSI) initiative?

 The Open Semantic Interchange (OSI) initiative is a new, collaborative effort launched by companies like Snowflake, Salesforce, and dbt Labs to create a vendor-neutral, open standard for sharing semantic models across different AI and analytics tools. The goal is to solve the problem of fragmented data definitions and inconsistent business logic, which hinder data interoperability and make it difficult to trust AI-driven insights. By providing a common language for semantics, OSI aims to enhance interoperability, accelerate AI and BI adoption, and streamline operations for data teams. 

Key goals and features

Enhance interoperability: Create a shared semantic standard so that all AI, BI, and analytics tools can "speak the same language," allowing for greater flexibility in choosing best-of-breed technologies without sacrificing consistency. 

Accelerate AI and BI adoption: By ensuring semantic consistency across platforms, OSI builds trust in AI insights and makes it easier to scale AI and BI applications. 

Streamline operations: Eliminate the time data teams spend reconciling conflicting definitions or duplicating work by providing a common, open specification. 

Promote a model-first, metadata-driven architecture: OSI supports architectures where business meaning is defined in a central model, which can then be used consistently across various tools. 

Why it matters

Breaks down data silos: In today's complex data landscape, definitions are often scattered and inconsistent across different tools and platforms. OSI provides a universal way for these definitions to travel seamlessly between systems. 

Builds trust in AI: Fragmented semantics are a major roadblock to trusting AI-driven answers, as different tools may interpret the same business logic differently. A standard semantic layer ensures more accurate and trustworthy insights. 

Empowers organizations: A universal standard gives enterprises the freedom to adopt the best tools for their needs without worrying about semantic fragmentation, leading to greater agility and efficiency. 

What is Context Engineering?

Context engineering is “the art and science of filling the context window with just the right information at each step of an agent’s trajectory,” as Lance Martin of LangChain puts it.

Lance Martin breaks down context engineering into four categories: write, compress, isolate, and select. Agents need to write (or persist, or remember) information from task to task, just like humans. Agents will often have too much context as they go from task to task and need to compress or condense it somehow, usually through summarization or ‘pruning’. Rather than giving all of the context to the model, we can isolate it or split it across agents so they can, as Anthropic describes it, “explore different parts of the problem simultaneously”. And to select is to pull only the most relevant pieces of context (retrieved documents, memories, tools) back into the window at each step. Rather than risk context rot and degraded results, the idea is to not give the LLM enough rope to hang itself.
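
To make the four categories concrete, here is a purely illustrative sketch. The summarize/retrieve helpers are trivial stand-ins for an LLM summarization call and a vector-store lookup; none of this is any particular framework's API:

```python
# Illustrative only: three of the context-engineering operations as tiny functions.
scratchpad: list[str] = []            # notes the agent persists between steps

def summarize(chunks: list[str]) -> str:
    # stand-in for an LLM summarization call
    return "summary of: " + "; ".join(chunks)

def retrieve(query: str, notes: list[str], top_k: int = 3) -> list[str]:
    # stand-in for semantic search; here just naive keyword overlap
    scored = sorted(notes, key=lambda n: -len(set(query.split()) & set(n.split())))
    return scored[:top_k]

def write(note: str) -> None:
    """WRITE: persist information outside the context window for later reuse."""
    scratchpad.append(note)

def compress(history: list[str], keep_last: int = 5) -> list[str]:
    """COMPRESS: summarize older turns so the window stays small."""
    if len(history) <= keep_last:
        return history
    return [summarize(history[:-keep_last])] + history[-keep_last:]

def select(query: str) -> list[str]:
    """SELECT: pull only the most relevant notes back into context."""
    return retrieve(query, scratchpad)

# ISOLATE would split these calls across sub-agents rather than one shared window.
```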


Context engineering needs a semantic layer

What is a Semantic Layer?

A semantic layer is a way of attaching metadata to all data in a form that is both human and machine readable, so that people and computers can consistently understand, retrieve, and reason over it.
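
As a toy illustration of metadata that is both human and machine readable, a single semantic-layer entry might look something like this (the field names are invented for illustration and are not the OSI specification, which is still being defined):

```python
# Toy example of a semantic model entry: business meaning attached to a metric,
# readable by people and parseable by tools. Field names are illustrative only.
revenue_metric = {
    "name": "net_revenue",
    "description": "Recognized revenue net of refunds, in USD",
    "source": "warehouse.finance.orders",             # hypothetical table
    "expression": "SUM(order_total) - SUM(refund_total)",
    "grain": "daily",
    "owner": "finance-data-team",
    "synonyms": ["net sales", "recognized revenue"],
}
```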

There is a recent push from those in the relational data world to build a semantic layer over relational data. Snowflake even created an Open Semantic Interchange (OSI) initiative to attempt to standardize the way companies are documenting their data to make it ready for AI. 

Various types of re-rankers

“A re-ranker is, after you bring the facts, how do you decide what to keep and what to throw away, [and that] has a big impact.” Popular re-rankers are Cohere Rerank, Voyage AI Rerank, Jina Reranker, and BGE Reranker.

Re-ranking is not enough in today’s agentic world. The newest generation of RAG has become embedded into agents–something increasingly known as context engineering. 

Cohere Rerank, Voyage AI Rerank, Jina Reranker, and BGE Reranker are all models designed to improve the relevance of search results, particularly in Retrieval Augmented Generation (RAG) systems, by re-ordering a list of retrieved documents based on their semantic relevance to a given query. While their core function is similar, they differ in several key aspects:

1. Model Focus & Strengths:

Cohere Rerank: Known for its strong performance and general-purpose reranking capabilities across various data types (lexical, semantic, semi-structured, tabular). It also emphasizes multilingual support.

Voyage AI Rerank: Optimized for high-performance reranking, particularly in RAG and search applications. Recent versions (e.g., rerank-2.5) focus on instruction-following capabilities and improved context length.

Jina Reranker: Excels in multilingual support and offers high throughput, especially with its v2-base-multilingual model. It also supports agentic tasks and code retrieval.

BGE Reranker: Provides multilingual support and multi-functionality, including dense, sparse, and multi-vector (Colbert) retrieval. It can handle long input lengths (up to 8192 tokens). 

2. Performance & Accuracy:

Performance comparisons often show variations depending on the specific dataset and evaluation metrics. Voyage AI's rerank-2 and rerank-2-lite models, for instance, have shown improvements over Cohere v3 and BGE v2-m3 in certain benchmarks. Jina's multilingual model also highlights its strong performance in cross-lingual scenarios.

3. Features & Capabilities:

Multilingual Support: All models offer multilingual capabilities to varying degrees, with Jina and BGE specifically highlighting their strong multilingual performance.

Instruction Following: Voyage AI's rerank-2.5 and rerank-2.5-lite introduce instruction-following features, allowing users to guide the reranking process using natural language.

Context Length: BGE Reranker stands out with its ability to handle long input lengths (up to 8192 tokens). Voyage AI's newer models also offer increased context length.

Specific Use Cases: Jina emphasizes its suitability for agentic tasks and code retrieval, while Voyage AI focuses on RAG and general search.

4. Implementation & Accessibility:

Some rerankers are available as APIs, while others might offer open-source models for self-hosting. The ease of integration with existing systems (e.g., LangChain) can also be a differentiating factor.

5. Cost & Resources:

Model size and complexity directly impact computational cost and latency. Lighter models (e.g., Voyage AI rerank-2-lite) are designed for speed and efficiency, while larger models offer higher accuracy but demand more resources. Pricing models, such as token-based pricing, also vary between providers.

In summary, the choice of reranker depends on specific needs, including the required level of accuracy, multilingual support, context length, performance constraints, and integration preferences. Evaluating these factors against the strengths of each model is crucial for selecting the optimal solution.
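
As one concrete, hedged example of what re-ranking looks like in code: the open BGE reranker can be run locally as a cross-encoder via the sentence-transformers library (assuming that package and the BAAI/bge-reranker-base checkpoint are available; hosted options like Cohere or Voyage are called through their APIs with the same query-plus-candidates shape):

```python
# Sketch: re-order retrieved passages by relevance to the query with a cross-encoder.
from sentence_transformers import CrossEncoder

query = "What causes context rot in LLMs?"
candidates = [
    "Context rot is the degradation of an LLM's performance as input grows longer.",
    "DRIFT search combines global and local search over a knowledge graph.",
    "Long inputs make it harder for the model to track relationships between tokens.",
]

reranker = CrossEncoder("BAAI/bge-reranker-base")
scores = reranker.predict([(query, doc) for doc in candidates])   # one relevance score per pair

# Keep only the top passages for the generation step
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked[:2]:
    print(f"{score:.3f}  {doc}")
```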


What is Context Rot?

 Context rot is the degradation of an LLM's performance as the input or conversation history grows longer. It causes models to forget key information, become repetitive, or provide irrelevant or inaccurate answers, even on simple tasks, despite having a large context window. This happens because the model struggles to track relationships between all the "tokens" in a long input, leading to a decrease in performance. 

How context rot manifests

Hallucinations: The model may confidently state incorrect facts, even when the correct information is present in the prompt. 

Repetitive answers: The AI can get stuck in a loop, repeating earlier information or failing to incorporate new instructions. 

Losing focus: The model might fixate on minor details while missing the main point, resulting in generic or off-topic responses. 

Inaccurate recall: Simple tasks like recalling a name or counting can fail with long contexts. 

Why it's a problem

Diminishing returns: Even though models are built with large context windows, simply stuffing more information into them doesn't guarantee better performance and can actually hurt it. 

Impact on applications: This is a major concern for applications built on LLMs, as it can make them unreliable, especially in extended interactions like long coding sessions or conversations. 

How to mitigate context rot

Just-in-time retrieval: Instead of loading all data at once, use techniques that dynamically load only the most relevant information when it's needed. 

Targeted context: Be selective about what information is included in the prompt and remove unnecessary or stale data. 

Multi-agent systems: For complex tasks, consider breaking them down and using specialized sub-agents to avoid overwhelming a single context. 

What is DRIFT search?

DRIFT is a newer approach that combines characteristics of both global and local search methods. The technique begins by leveraging community information through vector search to establish a broad starting point for queries, then uses these community insights to refine the original question into detailed follow-up queries. This allows DRIFT to dynamically traverse the knowledge graph to retrieve specific information about entities, relationships, and other localized details, balancing computational efficiency with comprehensive answer quality.
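
In rough terms, the flow just described might look like the sketch below. Every helper is a trivial stand-in (not a real GraphRAG, Neo4j, or LlamaIndex API); only the control flow is meant to be illustrative:

```python
# Illustrative sketch of the DRIFT flow: broad community search, follow-up queries,
# local graph search, then synthesis. Helpers are dummy stand-ins.
def vector_search_communities(question, top_k=5):
    return [f"community summary {i} relevant to: {question}" for i in range(top_k)]

def local_graph_search(query):
    # returns (intermediate answer, follow-up queries discovered during traversal)
    return f"local finding for: {query}", []

def synthesize_answer(question, communities, findings):
    return f"Answer to '{question}' from {len(communities)} communities and {len(findings)} local findings."

def drift_search(question, max_rounds=2):
    # 1. Broad start: vector search over community summaries for global context
    communities = vector_search_communities(question)
    # 2. Refine the question into targeted follow-up queries using that context
    follow_ups = [f"{question} (drill-down via {c})" for c in communities]
    findings = []
    for _ in range(max_rounds):
        next_round = []
        for fq in follow_ups:
            # 3. Local search: entities/relationships near the follow-up query
            answer, new_queries = local_graph_search(fq)
            findings.append(answer)
            next_round.extend(new_queries)
        follow_ups = next_round
        if not follow_ups:
            break
    # 4. Aggregate global context and local findings into the final answer
    return synthesize_answer(question, communities, findings)

print(drift_search("How do the main characters relate to each other?"))
```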


DRIFT search presents an interesting strategy for balancing the breadth of global search with the precision of local search. By starting with community-level context and progressively drilling down through iterative follow-up queries, it avoids the computational overhead of processing all community reports while still maintaining comprehensive coverage.

However, there’s room for several improvements. The current implementation treats all intermediate answers equally, but filtering based on their confidence scores could improve final answer quality and reduce noise. Similarly, follow-up queries could be ranked by relevance or potential information gain before execution, ensuring the most promising leads are pursued first.

Another promising enhancement would be introducing a query refinement step that uses an LLM to analyze all generated follow-up queries, grouping similar ones to avoid redundant searches and filtering out queries unlikely to yield useful information. This could significantly reduce the number of local searches while maintaining answer quality.


https://towardsdatascience.com/implementing-drift-search-with-neo4j-and-llamaindex/


Sunday, October 19, 2025

Simple program for finding the p-value for rejecting the null hypothesis

```python
import numpy as np
import scipy.stats as stats

# energy expenditure (in mJ) and stature (0 = obese, 1 = lean)
energy = np.array([[9.21, 0], [7.53, 1], [7.48, 1], [8.08, 1], [8.09, 1], [10.15, 1],
                   [8.40, 1], [0.88, 1], [1.13, 1], [2.90, 1], [11.51, 0], [2.79, 0],
                   [7.05, 1], [1.85, 0], [19.97, 0], [7.48, 1], [8.79, 0], [9.69, 0],
                   [2.68, 0], [3.58, 1], [9.19, 0], [4.11, 1]])

# Separate the data into 2 groups
group1 = energy[energy[:, 1] == 0][:, 0]   # energy expenditure where stature == 0 (obese)
group2 = energy[energy[:, 1] == 1][:, 0]   # energy expenditure where stature == 1 (lean)

# Perform an independent two-sample t-test (equal variances assumed)
t_statistic, p_value = stats.ttest_ind(group1, group2, equal_var=True)

print("T-TEST RESULTS: Obese (0) vs Lean (1) Energy Expenditure")
print("=" * 55)
print(f"Obese group (n={len(group1)}): Mean = {np.mean(group1):.2f} mJ, Std = {np.std(group1, ddof=1):.2f} mJ")
print(f"Lean group (n={len(group2)}): Mean = {np.mean(group2):.2f} mJ, Std = {np.std(group2, ddof=1):.2f} mJ")
print(f"\nT-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05
print(f"\nINTERPRETATION (α = {alpha}):")
if p_value < alpha:
    print("✅ REJECT NULL HYPOTHESIS")
    print("   There is a statistically significant difference in energy expenditure")
    print("   between obese and lean individuals.")
else:
    print("❌ FAIL TO REJECT NULL HYPOTHESIS")
    print("   No statistically significant difference in energy expenditure")
    print("   between obese and lean individuals.")

# Show the actual data
print(f"\nOBESE GROUP ENERGY EXPENDITURE: {group1}")
print(f"LEAN GROUP ENERGY EXPENDITURE: {group2}")
```