The aim is to generate sentiment labels and scores from customer reviews using BERT-based models. The results are stored in LanceDB along with their metadata, and LanceDB is then queried to retrieve customer feedback on selected areas, for example cleanliness.
The first step is to load the dataset.
from datasets import load_dataset
# load the dataset and convert to pandas dataframe
df = load_dataset(
    "ashraq/hotel-reviews",
    split="train"
).to_pandas()
A simple pre-processing step keeps only the first 1000 characters of each review.
# keep only the first 1000 characters from the reviews
df["review"] = df["review"].str[:1000]
# glimpse the dataset
df.head()
For sentiment analysis, either a RoBERTa or a DistilBERT model fine-tuned for sentiment analysis is used.
import torch
# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# @title Select Sentiment Analysis Model and run this cell
model_id = "cardiffnlp/twitter-roberta-base-sentiment" # @param {type:"string"}
select_model = 'cardiffnlp/twitter-roberta-base-sentiment' # @param ["cardiffnlp/twitter-roberta-base-sentiment", "lxyuan/distilbert-base-multilingual-cased-sentiments-student"]
model_id = select_model
print("Selected Model: ", model_id)
The code below prepares a sentiment analysis pipeline.
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification
)
# load the model from huggingface
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=3
)
# load the tokenizer from huggingface
tokenizer = AutoTokenizer.from_pretrained(model_id)
# load the tokenizer and model into a sentiment analysis pipeline
nlp = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    device=device
)
The default RoBERTa model returns LABEL_0 for negative, LABEL_1 for neutral and LABEL_2 for positive. Mapping these through a dictionary makes the results easier to read.
With the analysis pipeline in place, the next step is to initialize the retriever.
A retriever model embeds passages and queries so that queries and passages with similar meanings are close together in the vector space.
from sentence_transformers import SentenceTransformer
# load the model from huggingface
retriever = SentenceTransformer(
    'sentence-transformers/all-MiniLM-L6-v2',
    device=device
)
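To make "close in the vector space" concrete, the sketch below computes cosine similarity, a common closeness measure for sentence embeddings, on hand-made 3-dimensional vectors. The vectors and the names `cosine_similarity`, `query` and `passages` are illustrative assumptions, not output of the retriever; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (assumed values, not model output).
query = [0.9, 0.1, 0.0]
passages = {
    "The room was spotless": [0.8, 0.2, 0.1],
    "Breakfast was overpriced": [0.1, 0.1, 0.9],
}

# Rank passages from most to least similar to the query.
ranked = sorted(
    passages,
    key=lambda text: cosine_similarity(query, passages[text]),
    reverse=True,
)
print(ranked[0])
```

A vector database such as LanceDB performs this kind of nearest-neighbour ranking over the stored embeddings, so the application code does not have to loop over every passage itself.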
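The code below connects to LanceDB and defines the helper that produces sentiment labels and scores. The embeddings, together with this metadata, are then inserted into LanceDB for querying.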
import lancedb
db = lancedb.connect("./.lancedb")
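# map the raw model labels to readable names
labels = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}
def get_sentiment(reviews):
    # pass the reviews through sentiment analysis pipeline
    sentiments = nlp(reviews)
    # extract only the label and score from the result;
    # labels that are not in the dictionary are kept as-is
    l = [labels.get(x["label"], x["label"]) for x in sentiments]
    s = [x["score"] for x in sentiments]
    return l, s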
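Once labels and scores are available, each batch of reviews can be turned into rows that pair the text and its sentiment metadata with an embedding. The following is a minimal sketch of that step, not the code from the referenced post: `batched`, `build_rows`, the batch size and the column names (`vector`, `review`, `label`, `score`) are assumptions, and the embeddings are passed in as plain lists so the snippet runs without a model.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_rows(reviews, labels, scores, vectors):
    """Combine parallel lists into one dict per review."""
    if not (len(reviews) == len(labels) == len(scores) == len(vectors)):
        raise ValueError("all inputs must have the same length")
    return [
        {"vector": v, "review": r, "label": l, "score": s}
        for r, l, s, v in zip(reviews, labels, scores, vectors)
    ]

# Hypothetical values standing in for get_sentiment() and retriever output.
reviews = ["Very clean room", "Rude staff", "Average stay"]
sent_labels = ["positive", "negative", "neutral"]
sent_scores = [0.97, 0.91, 0.64]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

rows = build_rows(reviews, sent_labels, sent_scores, vectors)
batches = list(batched(rows, 2))
```

In the real pipeline, `get_sentiment(batch)` and `retriever.encode(batch).tolist()` would presumably supply these values, and the resulting rows could then be written with LanceDB's `db.create_table(...)` or `table.add(...)`.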
references:
https://blog.lancedb.com/sentiment-analysis-using-lancedb-2da3cb1e3fa6