Sunday, April 6, 2025

Why does fine-tuning not make much of a difference compared to the model before fine-tuning?

The example dataset used for fine-tuning was the following:

train_data = [
    ("What factors contribute to a city's livability?", "How is the quality of life in a city determined?", 1),
    ("Vienna is often ranked as highly livable.", "Many surveys place Vienna among the top cities for quality of life.", 1),
    ("High healthcare standards improve a city's livability.", "Access to good medical care is a key aspect of a comfortable urban environment.", 1),
    ("A city with poor infrastructure is less livable.", "Substandard public transport negatively impacts urban living.", 1),
    ("The weather in a city affects its livability.", "Climate plays a role in how pleasant it is to live in a location.", 1),
    ("Economic growth leads to higher livability.", "A strong economy generally correlates with better living conditions.", 1),
    ("Cultural attractions enhance a city's appeal.", "Having museums and theaters makes a city more enjoyable.", 1),
    ("High crime rates decrease livability.", "Safety and security are crucial for a good quality of life.", 1),
    ("The capital of France is Paris.", "The Eiffel Tower is in Paris.", 0),  # Dissimilar topic
    ("Apples are a type of fruit.", "Cars are used for transportation.", 0),  # Dissimilar topics
    ("The ocean is vast and blue.", "The mountains are tall and majestic.", 0),  # Dissimilar topics
    ("Good education improves job prospects.", "Affordable housing is important for residents.", 1),  # Related aspects of livability
    ("A polluted environment reduces livability.", "Clean air and water contribute to a healthy city.", 1),
    ("Job opportunities attract people to a city.", "Employment prospects are a factor in urban migration.", 1),
    ("The price of housing impacts affordability.", "Expensive real estate can make a city less accessible.", 1),
]

This dataset, while conceptually relevant, might not be diverse or large enough to show a dramatic difference in similarity scores after just a few epochs of training with a general-purpose pre-trained model like bert-base-uncased. These models already have a broad understanding of language.
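You can see this by scoring one of the "livability" pairs with the base model before any fine-tuning: the cosine similarity is usually already high, so there is little room for visible movement. Below is a minimal sketch, assuming mean pooling over bert-base-uncased token embeddings (one common way to build sentence vectors; your own pooling may differ):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence):
    # Mean-pool the last hidden state over non-padding tokens.
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1).float() # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

s1, s2, _ = train_data[0]
score = torch.nn.functional.cosine_similarity(embed(s1), embed(s2)).item()
print(f"Base-model cosine similarity: {score:.3f}")  # typically already high before fine-tuning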

To demonstrate a significant difference, you need a dataset that:

Is Specifically Focused: Targets a particular type of semantic similarity or relationship that the base model might not perfectly capture.

Has Clear Positive and Negative Examples: Provides unambiguous pairs of similar and dissimilar sentences.

Is Reasonably Sized: Contains enough data for the model to learn the specific nuances of the task.

Here are some well-established datasets commonly used for training and evaluating sentence embedding models, particularly for Semantic Textual Similarity (STS), which would likely show a more noticeable difference after fine-tuning:

1. STS Benchmark (STS-B):

Description: A widely used dataset comprising sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is annotated with a similarity score from 0 to 5 (often normalized to 0 to 1).   

Why it's good: Specifically designed for evaluating semantic similarity. The annotations are human-generated and high quality.   

How to use: You can easily load this dataset using the datasets library from Hugging Face:

from datasets import load_dataset

sts_dataset = load_dataset("sentence-transformers/stsb", split="train")

train_data_sts = []
for example in sts_dataset:
    # Scores in the sentence-transformers/stsb dataset are already normalized to 0-1
    # (the raw STS-B annotations range from 0 to 5), so no further division is needed.
    train_data_sts.append((example['sentence1'], example['sentence2'], example['score']))
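If you fine-tune with the sentence-transformers library (one common choice; your own training loop may differ), these (sentence1, sentence2, score) tuples can be wrapped in InputExample objects and trained with CosineSimilarityLoss. A rough sketch, with illustrative hyperparameters only:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Builds a mean-pooling sentence model on top of bert-base-uncased.
st_model = SentenceTransformer("bert-base-uncased")

train_examples = [InputExample(texts=[s1, s2], label=score) for s1, s2, score in train_data_sts]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(st_model)

st_model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)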

2. Semantic Textual Similarity (STS) datasets from SemEval:

Description: A collection of yearly datasets from the SemEval (Semantic Evaluation Exercises) competition, focusing on STS. These cover various text domains.

Why it's good: Well-established and diverse, allowing you to test generalization across different types of text.

How to use: These can also be accessed through the datasets library, often as separate datasets (for example, the glue dataset's 'stsb' config provides STS-B, which itself was built from SemEval STS data). You might need to explore the Hugging Face Hub to find the individual SemEval STS datasets; a sketch of the glue route follows below.
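For example, loading STS-B through the glue config looks like this (the GLUE columns are sentence1, sentence2, and a 0-5 label, which you normalize yourself):

from datasets import load_dataset

glue_stsb = load_dataset("glue", "stsb", split="train")

train_data_glue = []
for example in glue_stsb:
    # GLUE's stsb labels range from 0 to 5, so normalize to 0-1 here.
    train_data_glue.append((example["sentence1"], example["sentence2"], example["label"] / 5.0))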

3. Quora Question Pairs:

Description: A dataset of question pairs from the Quora platform, labeled as duplicates or non-duplicates.

Why it's good: Focuses on semantic similarity in the context of questions, which can be useful for tasks like question answering or FAQ matching.   

How to use: Available through the datasets library:

qqp_dataset = load_dataset("quora", split="train")

train_data_qqp = []
for example in qqp_dataset:
    if example['is_duplicate'] is not None:
        # The "quora" dataset stores both questions in a nested "questions" field.
        q1, q2 = example['questions']['text']
        train_data_qqp.append((q1, q2, float(example['is_duplicate'])))
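Finally, to check whether fine-tuning actually moved the needle, evaluate on a held-out STS split rather than eyeballing a few pairs. A hedged sketch using sentence-transformers' EmbeddingSimilarityEvaluator (the exact return value, a single correlation or a metrics dict, depends on the library version):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

val = load_dataset("sentence-transformers/stsb", split="validation")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val["sentence1"],
    sentences2=val["sentence2"],
    scores=val["score"],
    name="stsb-dev",
)

baseline = SentenceTransformer("bert-base-uncased")  # model before fine-tuning
print("Baseline STS-B dev score:", evaluator(baseline))
# Compare with your fine-tuned model, e.g. evaluator(st_model) after training.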

