Sunday, May 17, 2026

Friend prediction using PyTorch Geometric

Implementing a "Friend Prediction" (also known as **Link Prediction**) system with a GNN such as GraphSAGE or GAT follows a highly structured pipeline. In this setup, your users are **nodes**, existing friendships are **edges**, and the goal is to predict the probability that an edge *should* exist between two currently unconnected nodes.

Here is a step-by-step guide on how to design and implement this application.

## 1. The Core Architecture (The Encoder-Decoder Framework)

Most GNN-based link prediction models use an **Encoder-Decoder** workflow:

 1. **The Encoder (GNN / GraphSAGE / GAT):** Takes the graph structure and node features (e.g., user age, location, interests) and outputs a low-dimensional vector (embedding) for every single user.

 2. **The Decoder:** Takes the embeddings of two users (User A and User B) and computes a similarity score (using Dot Product or a small Multi-Layer Perceptron). A high score means they are likely to become friends.

```
[Graph Data: Nodes & Edges]
         │
         ▼
 ┌───────────────┐
 │    ENCODER    │ ──► Generates User Embeddings ($z_u, z_v$)
 │(SAGE/GAT/GCN) │
 └───────────────┘
         │
         ▼
 ┌───────────────┐
 │    DECODER    │ ──► Computes Link Score (e.g., $Score = z_u^T \cdot z_v$)
 │ (Dot Product) │
 └───────────────┘
         │
         ▼
 [Friend Prediction Probability]
```
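Conceptually, the decoder is just a dot product followed by a sigmoid. Here is a minimal sketch with made-up toy embeddings (the values are purely illustrative):

```python
import torch

# Toy 3-dimensional embeddings for two users (values are illustrative)
z_u = torch.tensor([0.9, -0.2, 0.4])
z_v = torch.tensor([0.8, -0.1, 0.5])

score = torch.dot(z_u, z_v)   # raw link score: 0.72 + 0.02 + 0.20 = 0.94
prob = torch.sigmoid(score)   # squash to a probability in (0, 1)
print(round(score.item(), 2), round(prob.item(), 2))
```

A higher dot product means the two users sit closer together in embedding space, which the model learns to correlate with friendship likelihood.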

## 2. Choosing the Right Layer for the Job

While the pipeline remains identical, changing the model type changes how the **Encoder** aggregates information:

 * **GraphSAGE (Best for Large Scale):** If your user base is massive or constantly growing, GraphSAGE is the practical choice. It samples a fixed-size subset of a user's current friends to update their embedding, preventing memory bottlenecks.

 * **GAT (Best for Feature-Driven Matches):** If you want the model to learn *why* people are friends (e.g., "User A and User B are friends because they share a niche hobby, ignoring the fact that they live in different cities"), GAT’s attention mechanism dynamically weights neighbor importance based on profile features.

## 3. Step-by-Step Implementation Workflow

If you are implementing this in Python, the gold standard libraries are **PyTorch Geometric (PyG)** or **DGL (Deep Graph Library)**.

### Step A: Graph Setup & Data Splitting

Unlike standard machine learning where you split rows of data, in link prediction, you must **split the edges**.

 * **Training Edges:** The friendships the GNN is allowed to "see" and message-pass through.

 * **Positive Validation/Test Edges:** Real friendships held out to evaluate if the model can predict them.

 * **Negative Validation/Test Edges:** Randomly sampled pairs of users who are *not* friends, used to teach the model what a "non-friendship" looks like.

### Step B: Defining the Model (PyTorch Geometric Style)

Here is a conceptual implementation using PyG. You can easily swap `SAGEConv` for `GATConv` or `GCNConv`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv


class FriendPredictor(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        # Encoder layers (using GraphSAGE as an example)
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)

    def encode(self, x, edge_index):
        # Generate node embeddings via two rounds of message passing
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)
        return x

    def decode(self, z, edge_label_index):
        # Decoder: dot product between source and target node embeddings
        src = z[edge_label_index[0]]
        dst = z[edge_label_index[1]]
        return (src * dst).sum(dim=-1)  # one similarity score per pair
```

### Step C: The Training Loop

To train the network, you need to pass both **positive edges** (real friends) and **negative edges** (random users) through the decoder, forcing the model to score positive edges close to 1 and negative edges close to 0.

```python
import torch
from torch_geometric.utils import negative_sampling

model = FriendPredictor(in_channels=num_features, hidden_channels=64, out_channels=32)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.BCEWithLogitsLoss()


def train():
    model.train()
    optimizer.zero_grad()

    # 1. Encode: pass the training graph structure to get node embeddings
    z = model.encode(train_data.x, train_data.edge_index)

    # 2. Decode positive edges (assumes edge_label_index holds the positive
    #    training pairs, e.g. from RandomLinkSplit with
    #    add_negative_train_samples=False)
    pos_out = model.decode(z, train_data.edge_label_index)

    # 3. Decode negative edges, sampled on the fly from non-connected pairs
    neg_edge_index = negative_sampling(
        edge_index=train_data.edge_index,
        num_nodes=train_data.num_nodes,
        num_neg_samples=pos_out.size(0),
    )
    neg_out = model.decode(z, neg_edge_index)

    # 4. Combine predictions and compute the binary cross-entropy loss
    predictions = torch.cat([pos_out, neg_out], dim=0)
    targets = torch.cat(
        [torch.ones(pos_out.size(0)), torch.zeros(neg_out.size(0))], dim=0
    )
    loss = criterion(predictions, targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```
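To check whether training is actually working, a standard metric for link prediction is ROC-AUC over the held-out positive and negative edges. A sketch (assuming `data` is a validation or test split carrying `edge_label_index` and `edge_label`, e.g. from `RandomLinkSplit`):

```python
import torch
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def evaluate(model, data):
    # `data` is assumed to be a val/test split whose edge_label_index holds
    # the held-out positive and negative pairs, labeled 1/0 in edge_label
    model.eval()
    z = model.encode(data.x, data.edge_index)
    scores = model.decode(z, data.edge_label_index).sigmoid()
    return roc_auc_score(data.edge_label.cpu().numpy(), scores.cpu().numpy())
```

An AUC of 0.5 means the model ranks random non-friends as highly as real held-out friends; the closer to 1.0, the better the ranking.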

## 4. Serving Recommendations in Production

Once trained, generating "People You May Know" recommendations for a specific user follows this deployment logic:

 1. Run the **Encoder** pass over your graph once (e.g., nightly or in mini-batches) to generate updated embeddings for all active users.

 2. To suggest friends for User A, extract their embedding (z_A).

 3. Calculate the dot product of z_A against the embeddings of candidates (e.g., friends-of-friends who aren't currently connected to User A).

 4. Sort the candidates by their score in descending order and serve the top 5 as friend recommendations.
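Steps 2 through 4 boil down to a dot product against a candidate set followed by a top-k. A minimal sketch with randomly initialized stand-in embeddings (in production these would come from the trained encoder's nightly pass):

```python
import torch

# Stand-in for the nightly encoder output: 32-dim embeddings for 1000 users
embeddings = torch.randn(1000, 32)

def recommend(user_id, candidate_ids, k=5):
    # Dot product of User A's embedding against every candidate, then top-k
    scores = embeddings[candidate_ids] @ embeddings[user_id]
    top = scores.topk(min(k, len(candidate_ids)))
    return [candidate_ids[i] for i in top.indices.tolist()]

# candidate_ids would typically be friends-of-friends not yet connected to user 0
print(recommend(user_id=0, candidate_ids=[7, 42, 99, 500, 777, 901]))
```

Restricting candidates to friends-of-friends keeps the scoring step cheap; scoring every user against every other user would be quadratic in the user count.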

