Semantic Analysis Using LLMs - Classification with BERT

30 Sept

Introduction

In this articles we will explore how to use Large Language Models (LLMs) for Semantic Analysis.

In recent years, Large Language Models (LLMs) have revolutionised the field of natural language processing (NLP), venturing into the new frontiers of what machines can achieve in understanding and generating human language. These cutting edge models, powered by deep learning algorithms and hundreds of gigabytes of data, have shown remarkable capabilities in tasks such as text generation, translation, summarisation, and semantic analysis. The rise of LLMs was partially fuelled by the increasing textual data generation by a whole host of different sources. From social media posts to internet articles, there is a humongous repository of textual data to analyse.

Semantic Analysis

Semantic analysis is a process for understanding the meaning behind text rather than just it's syntactic features (word frequencies, sentence structure, etc.). For examples given the sentence "Bob the builder", semantic analysis would understand that "Bob", a proper noun referring to some person, has the occupation of "builder". Another important aspect of semantic analysis is the context in which words are used. For example, the word "glass" can refer to the material glass used to make windows or the an object glass like in "a glass of water"; this is known as polysemy. Rather than just analysing the text, this process identifies the relationship between the components of the text and how they work together to form the overall picture.

Applications

Semantic analysis is a very helpful technique with a wide variety of applications such as

Sentiment analysis - This field explores identifying and categorising the opinions expressed in some text. Getting a contextual understanding of text can help disambiguate the sentiment.
Search Engine Optimisation - Understanding the query allows search engines to better position their results.
Chatbots - Understanding the meaning of the queries help chatbots have more coherent converations. We will explore semantic analysis in the context of sentiment analysis further in this article.

Baseline Model

The baseline model is a simple but common NLP model used to benchmark results. The model used in the exercise is a Support Vector Machine (SVM). A support vector machine is a supervised ML technique (requires labelled training data). It classifies data by finding the optimal hyperplane (a flat surface like a line in 2D), maximising the distance between the points of each class and the hyperplane. It is a widely used ML technique in NLP. SVMs require numeric data so the TF-IDF (Term Frequency - Inverse Document Frequency) technique was used to vectorise the textual data. TF is the number of occurences of the word in a document. The more common the word is, the more important it might be. The IDF measures word rarity across multiple documents (corpus) so the less common the word is, the greater the IDF. This is to prevent words such as 'the' from being seen as the most important in the text. The TF-IDF for a word is the TF * IDF.

Large Language Models

Most people have probably used an LLM in the past couple of years especially since the advent of OpenAIs ChatGPT-3 in 2022. In just over a month the service had amassed over a 100million users becoming the fastest growing consumer software. LLMs have provided an alternate method for querying the knowledge stored on the internet through a more engaging medium than a search engine. However there is more to LLMs than just using them as chatbots. The attention mechanism in LLMs plays a large part in making them great semantic analysis engines.

Attention is a method that allows the model to weight segments of the input data individually, which allows the model to better determine the words contributing to the meaning of the input. The specific attention mechanism used in transformer models is called self attention. It computes a weighted sum of the words in the input. The weights represent the relevance of words to the current words being processed. This is taken one step further by using multi-headed attention. This technique performs different linear transformations on the same input to identify distinct dependencies in the text, such as short and long range.

The model used in this exercise is Facebook's Bart Large LLM. This model has 409 million parameters and is quite a small LLM, which can be easily run locally on most laptops. The image below visualises the attention weights for the LLM. The bart model has 16 attention heads and this is just one of them. The bolder the line between words, the stronger the identified connection between them. The input to the model was "Bob the builder drinks a glass of water". The mechanism correctly identifies that 'builder', 'drinks' and 'water' are related to the 'Bob'.

As seen in the image below, the mechanism also identifies that glass strongly relates to the word water.

These two examples illustrate how an LLM performs semantic analysis to understand the meaning of the sentence rather than just summarise statistics. The images were created using the bertviz library.

Experiment

As mentioned earlier, semantic analysis is investigated in the context of sentiment analysis. We explore an Amazon Reviews dataset (https://huggingface.co/datasets/yassiracharki/Amazon_Reviews_Binary_for_Sentiment_Analysis) with the goal of identifying if a review is positive or negative. We load this dataset using the load_dataset module from the Huggingface datasets library.

ds = load_dataset("yassiracharki/Amazon_Reviews_Binary_for_Sentiment_Analysis")

The dataset is composed of 3 fields, the review title, review text and review sentiment. Only the first 1000 review were used and there are roughly equal numbers of positive : negative reviews in these (46% : 54%). This is then split into Train and Test set in the ratio 7:3. It was decided that the classification would happen on the review text since these were less ambiguous than the titles.

Baseline Model

First we create the TF_IDF vectors using sci-kit learn TfidfVectorizer module for the SVM. We are going to vectorise using in bigrams (two words instead of 1) to introduce more contextual awareness in the model. We are also filtering stop words such as 'a' and 'the' since these are insignificant.

tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')  # Using bigrams for context
X_raw = df_train.head(n=1000)['review_text']
X = tfidf.fit_transform(X_raw)
y = df_train.head(n=1000)['class_index']

Next we predict the review sentiment using the SVM. We are using the sci-kit learn SVM classifier with a linear kernel.

from sklearn.svm import SVC
# Train an SVM classifier
svm = SVC(kernel='linear', random_state=234)
svm.fit(X_train_t, y_train_t)

# Evaluate the model
y_pred = svm.predict(X_test_t)

This model achieves an accuracy of 78% on the test set.

LLM

Now let's predict using the LLM to see if we can improve upon this accuracy. The image below shows the distribution of characters in the train set. The LLM we are using has a context length of 512 tokens so some of these inputs may need to be truncated.

We are using the LLM in a Huggingface transformers pipeline as this streamlines the sentiment analysis task. The max_length argument refers to the maximum number of tokens passed to the LLM at a time. Tokens can be words or parts of words. The candidate_labels argument indicates the expected output for the LLM; the LLM will predict either positive or negative sentiments in this case.

from transformers import pipeline
classifier = pipeline(model="facebook/bart-large-mnli")
llm_results = classifier(X_test.to_list(), truncation=True, max_length=512, candidate_labels=['negative', 'positive'])

We'll need to convert the LLM outputs to a numerical representation for comparison with the test set labels.

llm_results_bin = [(x['labels'][0]=='positive')+1 for x in llm_results]

The LLM achieves an accuracy of 90% on the test set. This is a 12% improvement despite using an off-the-shelf method.

Analysis

An example of a review the SVM predicted incorrectly is:

"Do not waste your time! I love hokie, I love cheese when it comes to horror. This just plain old sucked."

This review is fairly complex since it uses multiple sentences. It starts with a negative sentence using the word "not" and then a positive sentence using the word "love" twice. The final sentence is negative though it may be difficult for the SVM to interpret the word "sucked". The key to correctly classifying this review is understanding that only the first and second sentences are referring to the reviewed item while the second sentence is context for the review. On the other hand, the attention mechanism of them LLM broke down the review into the different sentences as seen in the image below. It shows how the word I is found to be most related to the all the other words in the same sentence.

The extensive corpus the LLM was trained allowed it to understand the meaning of each sentence and then consider the overall sentiment.

Conclusion

In this article we've explored semantic analysis and how it can be used in the context of sentiment analysis. Using the Sci-Kit learn package, an SVM achieved 78% accuracy whereas the off-the-shelf LLM achieved 90% through zero shot prompting. The LLM was able to correctly classify more reviews than the SVM because of its superior ability to comprehend the overall meaning of text.

Please find the related notebook here: https://github.com/yy1920/SemanticAnalysisBlog/tree/main

Data Science

Yash Yeola

Semantic Analysis Using LLMs - Classification with BERT

Introduction

Semantic Analysis

Applications

Baseline Model

Large Language Models

Experiment

Baseline Model

LLM

Analysis

Conclusion

Why Do We All Talk About Data Like It’s Water?

SAS Made Simple: ANYCNTRL & NOTCNTRL Regex Alternatives for SAS