The fine-tuning was done on the example dataset below:
train_data = [
    ("What factors contribute to a city's livability?", "How is the quality of life in a city determined?", 1),
    ("Vienna is often ranked as highly livable.", "Many surveys place Vienna among the top cities for quality of life.", 1),
    ("High healthcare standards improve a city's livability.", "Access to good medical care is a key aspect of a comfortable urban environment.", 1),
    ("A city with poor infrastructure is less livable.", "Substandard public transport negatively impacts urban living.", 1),
    ("The weather in a city affects its livability.", "Climate plays a role in how pleasant it is to live in a location.", 1),
    ("Economic growth leads to higher livability.", "A strong economy generally correlates with better living conditions.", 1),
    ("Cultural attractions enhance a city's appeal.", "Having museums and theaters makes a city more enjoyable.", 1),
    ("High crime rates decrease livability.", "Safety and security are crucial for a good quality of life.", 1),
    ("The capital of France is Paris.", "The Eiffel Tower is in Paris.", 0),  # Dissimilar topic
    ("Apples are a type of fruit.", "Cars are used for transportation.", 0),  # Dissimilar topics
    ("The ocean is vast and blue.", "The mountains are tall and majestic.", 0),  # Dissimilar topics
    ("Good education improves job prospects.", "Affordable housing is important for residents.", 1),  # Related aspects of livability
    ("A polluted environment reduces livability.", "Clean air and water contribute to a healthy city.", 1),
    ("Job opportunities attract people to a city.", "Employment prospects are a factor in urban migration.", 1),
    ("The price of housing impacts affordability.", "Expensive real estate can make a city less accessible.", 1),
]
This dataset, while conceptually relevant, might not be diverse or large enough to show a dramatic difference in similarity scores after just a few epochs of training with a general-purpose pre-trained model like bert-base-uncased. These models already have a broad understanding of language.
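For context, (sentence1, sentence2, label) triples in this format are typically consumed by a fine-tuning loop along these lines. This is a minimal sketch assuming the sentence-transformers library and its model.fit API; the exact training code used in the original experiment may have differed:
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Wrap bert-base-uncased with a default mean-pooling head to get sentence embeddings.
model = SentenceTransformer("bert-base-uncased")

# Convert the (sentence1, sentence2, label) triples above into InputExample objects.
examples = [InputExample(texts=[s1, s2], label=float(score)) for s1, s2, score in train_data]

# CosineSimilarityLoss pushes the cosine similarity of each pair toward its label.
loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=4, warmup_steps=10)
model.save("./fine-tuned-model")  # hypothetical output path, reused in the comparison sketch later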
To demonstrate a significant difference, you need a dataset that:
Is Specifically Focused: Targets a particular type of semantic similarity or relationship that the base model might not perfectly capture.
Has Clear Positive and Negative Examples: Provides unambiguous pairs of similar and dissimilar sentences.
Is Reasonably Sized: Contains enough data for the model to learn the specific nuances of the task.
Here are some well-established datasets commonly used for training and evaluating sentence embedding models, particularly for Semantic Textual Similarity (STS), which would likely show a more noticeable difference after fine-tuning:
1. STS Benchmark (STS-B):
Description: A widely used dataset comprising sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is annotated with a similarity score from 0 to 5 (often normalized to 0 to 1).
Why it's good: Specifically designed for evaluating semantic similarity. The annotations are human-generated and high quality.
How to use: You can easily load this dataset using the datasets library from Hugging Face:
from datasets import load_dataset

sts_dataset = load_dataset("sentence-transformers/stsb", split="train")

train_data_sts = []
for example in sts_dataset:
    # Scores in sentence-transformers/stsb are already normalized to the 0-1 range.
    train_data_sts.append((example['sentence1'], example['sentence2'], example['score']))
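Note that, unlike the binary 0/1 labels in the hand-built train_data above, these are continuous similarity scores between 0 and 1, which pair naturally with a regression-style objective such as the CosineSimilarityLoss used in the earlier sketch.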
2. Semantic Textual Similarity (STS) datasets from SemEval:
Description: A collection of yearly datasets from the SemEval (Semantic Evaluation Exercises) competition, focusing on STS. These cover various text domains.
Why it's good: Well-established and diverse, allowing you to test generalization across different types of text.
How to use: These can also be accessed through the datasets library, though usually as separate datasets per year or task. The GLUE benchmark's 'stsb' configuration contains STS-B, which is itself drawn from the SemEval STS tasks; loading it looks like the snippet below. You may need to browse the Hugging Face Hub to find individual SemEval STS datasets.
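Here the raw label is on the original 0-5 scale, so it is scaled to 0-1 to match the format used above (this assumes the standard 'glue'/'stsb' column names sentence1, sentence2, and label):
from datasets import load_dataset

glue_stsb = load_dataset("glue", "stsb", split="train")

train_data_glue = []
for example in glue_stsb:
    # GLUE's stsb labels are human similarity ratings from 0 to 5; scale them to 0-1.
    train_data_glue.append((example['sentence1'], example['sentence2'], example['label'] / 5.0))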
3. Quora Question Pairs:
Description: A dataset of question pairs from the Quora platform, labeled as duplicates or non-duplicates.
Why it's good: Focuses on semantic similarity in the context of questions, which can be useful for tasks like question answering or FAQ matching.
How to use: Available through the datasets library:
qqp_dataset = load_dataset("quora", split="train")

train_data_qqp = []
for example in qqp_dataset:
    if example['is_duplicate'] is not None:
        # The 'quora' dataset stores both questions of a pair under a single 'questions' field.
        question1, question2 = example['questions']['text']
        train_data_qqp.append((question1, question2, float(example['is_duplicate'])))
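Whichever of these datasets you fine-tune on, a quick way to see whether it made a difference is to compare the base model's similarity score with the fine-tuned model's on a held-out pair. A minimal sketch, again assuming sentence-transformers and the hypothetical "./fine-tuned-model" checkpoint saved by the earlier training loop:
from sentence_transformers import SentenceTransformer, util

base_model = SentenceTransformer("bert-base-uncased")
tuned_model = SentenceTransformer("./fine-tuned-model")  # hypothetical path from the earlier sketch

s1 = "Reliable public transport makes a city more livable."
s2 = "Good transit options improve urban quality of life."

for name, model in [("base", base_model), ("fine-tuned", tuned_model)]:
    emb1, emb2 = model.encode([s1, s2], convert_to_tensor=True)
    print(f"{name}: cosine similarity = {util.cos_sim(emb1, emb2).item():.3f}")
After fine-tuning on a larger, focused dataset such as STS-B or Quora Question Pairs, the gap between the two scores on domain-relevant pairs should be noticeably larger than what the small hand-built dataset produces.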