Friday, December 8, 2023

Semantic Search & LanceDB - Sentiment analysis

The aim is to generate sentiment labels and scores from customer reviews using BERT-based models, store them in LanceDB together with the review metadata, and then query LanceDB to retrieve customer feedback on selected areas, e.g. cleanliness.


The first step is to load the dataset.

from datasets import load_dataset

# load the dataset and convert to pandas dataframe
df = load_dataset(
    "ashraq/hotel-reviews",
    split="train"
).to_pandas()


A simple pre-processing step keeps only the first 1000 characters of each review.


# keep only the first 1000 characters from the reviews
df["review"] = df["review"].str[:1000]

# glimpse the dataset
df.head()


For sentiment analysis, either a RoBERTa or a DistilBERT model fine-tuned for sentiment analysis is used.


import torch

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# @title Select Sentiment Analysis Model and run this cell
select_model = 'cardiffnlp/twitter-roberta-base-sentiment' # @param ["cardiffnlp/twitter-roberta-base-sentiment", "lxyuan/distilbert-base-multilingual-cased-sentiments-student"]

# use the model chosen in the form above
model_id = select_model
print("Selected Model: ", model_id)


The code below prepares a sentiment analysis pipeline.


from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification
)

# load the model from huggingface
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=3
)

# load the tokenizer from huggingface
tokenizer = AutoTokenizer.from_pretrained(model_id)

# load the tokenizer and model into a sentiment analysis pipeline
nlp = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    device=device
)



The RoBERTa sentiment analysis model returns LABEL_0 for negative, LABEL_1 for neutral and LABEL_2 for positive. We can put these in a dictionary that maps each raw label to its readable name, which makes the results easier to read.
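
A minimal sketch of such a dictionary, assuming the LABEL_0/LABEL_1/LABEL_2 convention described above. The sample_output list is hand-written, illustrative data in the shape the pipeline returns (a list of dicts with "label" and "score"); it is not real model output.

```python
# Map the raw pipeline labels to readable names
# (assumes the LABEL_0/1/2 convention described above).
labels = {
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}

# Illustrative, hand-written values in the shape the pipeline returns.
sample_output = [
    {"label": "LABEL_2", "score": 0.98},
    {"label": "LABEL_0", "score": 0.91},
]

# Look up the readable name for each prediction.
readable = [labels[item["label"]] for item in sample_output]
print(readable)
```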



Now that the sentiment analysis pipeline is created, the next step is to initialize the retriever.


A retriever model embeds both passages and queries, producing vectors such that queries and passages with similar meanings end up close together in the vector space.
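
To make "close in the vector space" concrete, here is a small sketch using cosine similarity, one common measure of closeness. The three-dimensional vectors are made up for illustration; real embeddings from a retriever have far more dimensions and come from the model, not from hand-picked numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors standing in for real embeddings.
query = [0.9, 0.1, 0.0]
passage_similar = [0.8, 0.2, 0.1]    # points in nearly the same direction
passage_unrelated = [0.0, 0.1, 0.9]  # points in a very different direction

sim_close = cosine_similarity(query, passage_similar)
sim_far = cosine_similarity(query, passage_unrelated)

# The passage with a similar meaning scores higher than the unrelated one.
print(sim_close > sim_far)
```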


from sentence_transformers import SentenceTransformer

# load the model from huggingface
retriever = SentenceTransformer(
    'sentence-transformers/all-MiniLM-L6-v2',
    device=device
)


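The next step is to generate embeddings and insert them into LanceDB for querying. The code below connects to a local LanceDB database and defines a helper that returns the sentiment labels and scores for a batch of reviews.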


import lancedb

db = lancedb.connect("./.lancedb")



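def get_sentiment(reviews):
    # pass the reviews through the sentiment analysis pipeline
    sentiments = nlp(reviews)
    # `labels` is the dictionary that maps raw labels (e.g. LABEL_2) to readable names;
    # a label that is not in the dictionary is kept unchanged
    sentiment_labels = [labels.get(x["label"], x["label"]) for x in sentiments]
    # keep the confidence score of each prediction
    scores = [x["score"] for x in sentiments]
    return sentiment_labels, scores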
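
The post stops before the embeddings and sentiments are actually combined, so here is a hedged sketch of how that step could look. The names batched, build_records, fake_embed and fake_classify are hypothetical helpers introduced only for this illustration, and the "vector", "review", "label" and "score" field names are assumptions. The stand-in functions replace the real models so the sketch runs on its own.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_records(reviews, embed, classify, batch_size=64):
    """Combine each review with its embedding, sentiment label, and score.

    embed(batch)    -> one vector (list of floats) per review
    classify(batch) -> (labels, scores), in the same shape get_sentiment returns
    """
    records = []
    for batch in batched(reviews, batch_size):
        vectors = embed(batch)
        batch_labels, batch_scores = classify(batch)
        for text, vector, label, score in zip(batch, vectors, batch_labels, batch_scores):
            records.append({
                "vector": list(vector),
                "review": text,
                "label": label,
                "score": float(score),
            })
    return records


# Stand-in functions so the sketch runs without loading any model (hypothetical).
def fake_embed(batch):
    return [[float(len(text)), 0.0] for text in batch]


def fake_classify(batch):
    return ["positive"] * len(batch), [0.9] * len(batch)


records = build_records(
    ["clean room", "noisy street", "ok"],
    fake_embed,
    fake_classify,
    batch_size=2,
)
print(len(records), records[0]["label"])
```

With the real objects, embed could be something like lambda batch: retriever.encode(batch).tolist() and classify could be get_sentiment. A list of dictionaries in this shape can be passed as data to db.create_table, and the embedding of a query such as "cleanliness" can then be passed to the table's search method to retrieve the closest reviews along with their sentiment; the table name and field names are up to you.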


references:

https://blog.lancedb.com/sentiment-analysis-using-lancedb-2da3cb1e3fa6

