Beginners Guide to Natural Language Processing (NLP) in Python — Sentiment Analysis

Maaz Khan
4 min read · Jul 8, 2021

Natural language processing has undergone quite the renaissance over the past decade. We have seen the likes of Google and Facebook create their own NLP models, which have generated fervor among the data science community. Just recently, we’ve seen GPT-3, from OpenAI, morph into something profound and almost unbelievable. But what about some of the simpler, more intuitive NLP models?

In this high-level overview tutorial, we will look at how quickly and easily we can set up these NLP pipelines in Python and begin extracting value from our text data. We will be leveraging the PyTorch library and a BERT transformer from Hugging Face, which predicts a sentiment score between 1 and 5.

With the power of web-scraping, we can obtain our source data from truly anywhere our hearts desire. Today, we will be scraping our data from Rate My Professor. To get started, these are the dependencies we will need to install.

Python version 3.8.8 was used in this tutorial.

Step 1: Install libraries

1.) PyTorch

Depending on your system/preferences, the PyTorch website (pytorch.org) will generate the command needed to install the library

2.) Transformers

conda install transformers

3.) Requests

conda install requests

4.) BeautifulSoup

conda install beautifulsoup4

Using the Anaconda package installer will prove to be a more seamless experience compared to using “pip install”, which can cause dependency issues. We will also be utilizing the pandas and NumPy libraries, which should already be installed.

Step 2: Instantiate model

Let’s first set up our NLP model and instantiate it. This is where any model from a hub like Hugging Face can be used. In our case, we are going to use the BERT sentiment analysis transformer. You can read details about the model here. When doing this process for the first time, the model download could take ~10 minutes.
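The article’s original snippet isn’t reproduced here, but a minimal sketch of the instantiation might look like the following. Note that the checkpoint name is an assumption: the article doesn’t name one, and `nlptown/bert-base-multilingual-uncased-sentiment` is a widely used BERT model that matches the 1–5 scoring described.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint: a widely used BERT model that outputs
# 1-5 star sentiment scores, matching the behavior described here.
MODEL_NAME = "nlptown/bert-base-multilingual-uncased-sentiment"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
```

The first call downloads the model weights, which is where the ~10 minute wait comes from; subsequent runs load from the local cache.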

Now that our model has been created, we can begin testing its functionality by inputting any example we would like. The following code allows our model to take in text data and return a sentiment score from 1 to 5 (1 being the lowest sentiment and 5 the highest). Let’s see what happens when our text data is “this was the best class ever. i wish i could take this class again”.
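A sketch of that scoring step is below. The setup lines are repeated so the snippet stands alone, and the checkpoint name is the same assumption as in Step 2.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Setup repeated so this snippet runs on its own (checkpoint assumed).
MODEL_NAME = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

text = "this was the best class ever. i wish i could take this class again"
tokens = tokenizer.encode(text, return_tensors="pt")
with torch.no_grad():
    result = model(tokens)

print(result.logits)                          # one raw score per class
score = int(torch.argmax(result.logits)) + 1  # shift 0-based index to 1-5
print(score)
```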

Here are the results of the above code snippet.

At first glance, it may be daunting to understand the model’s output. However, with some context, it’s not difficult to decipher. The model is telling us that the text we input has a very high sentiment, hence the score of 5. Each index in the output tensor corresponds to one of the five sentiment classes. That’s why index 4, the highest at 3.6471, translates to a score of 5 (we add 1 because Python indexing starts at 0). The lowest value, -2.4941, would correspond to a sentiment score of 2, making that the least likely rating.

Step 3: Create custom function

Now that we intuitively understand how our model works, let’s create a custom function so we can easily extract a sentiment score from our text data without having to run that code every time.
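A sketch of such a helper, again assuming the checkpoint from Step 2, could look like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "nlptown/bert-base-multilingual-uncased-sentiment"  # assumed
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def sentiment_score(review: str) -> int:
    """Return a 1-5 sentiment score for a piece of text."""
    # Truncate to BERT's 512-token limit so long reviews don't error out.
    tokens = tokenizer.encode(
        review, return_tensors="pt", truncation=True, max_length=512
    )
    with torch.no_grad():
        result = model(tokens)
    return int(torch.argmax(result.logits)) + 1
```

With this in place, `sentiment_score("loved this course")` returns a single integer between 1 and 5.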

Step 4: Scraping data

I’ll be choosing a random professor to scrape reviews from. Do keep in mind that Rate My Professor actually has a complicated HTML structure in terms of CSS classes, so some exception handling needs to be built in if you create a custom scraping function. In this example, I manually deleted duplicate reviews and also added in some reviews that weren’t scraped properly.
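As a rough sketch of the scraping step: Rate My Professor’s CSS class names are auto-generated and change over time, so the `"Comments"` pattern below is a guess — inspect the live page and adjust it before relying on this. The URL in the comment is a placeholder, not a real professor page.

```python
import re
import requests
from bs4 import BeautifulSoup

def extract_reviews(html: str) -> list[str]:
    """Pull review text out of a Rate My Professor page's HTML.

    The "Comments" pattern is a guess at the site's auto-generated
    CSS class names -- inspect the live page and adjust as needed.
    """
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", {"class": re.compile(r"Comments")})
    return [d.get_text(strip=True) for d in divs]

# Fetching a page (placeholder URL -- substitute the professor you pick):
# html = requests.get("https://www.ratemyprofessors.com/professor/<id>").text
# reviews = extract_reviews(html)
```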

Step 5: Putting it all together

In our final step, let’s use the function we created earlier to iterate through all the rows in the review column. Given that we already have information about the difficulty and quality of the class, it is not hard to imagine that the review sentiment would follow a similar structure; however, our sentiment score is still useful. We can check whether there is a correlation between positive sentiment and high quality scores, and vice versa. I’ve performed a brief analysis on this very subject and will link the notebook here.
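This step can be sketched with `pandas.Series.apply`. The dataframe below is a tiny hypothetical stand-in for the scraped data, and the helper from Step 3 is repeated (with the same assumed checkpoint) so the snippet runs on its own.

```python
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Step 3 helper repeated so this snippet stands alone (checkpoint assumed).
MODEL_NAME = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def sentiment_score(review: str) -> int:
    tokens = tokenizer.encode(
        review, return_tensors="pt", truncation=True, max_length=512
    )
    with torch.no_grad():
        result = model(tokens)
    return int(torch.argmax(result.logits)) + 1

# Hypothetical scraped data standing in for the real dataframe.
df = pd.DataFrame({
    "review": ["Best class I have ever taken.", "Awful lectures, avoid."],
    "quality": [5.0, 1.5],
    "difficulty": [2.0, 4.0],
})
df["sentiment"] = df["review"].apply(sentiment_score)
print(df)
```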

Here we can see our final dataframe, which includes a sentiment score.

The sentiment column adds a new facet to our data. We can potentially use this as an input to predict something like quality or difficulty (although there may be some multicollinearity issues there).

Here we see a correlation plot, which yields the results we intuitively expected beforehand.
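The correlation check itself is a one-liner with `DataFrame.corr`. The numbers below are purely illustrative stand-ins for the scraped results; the real analysis uses the full dataframe from the previous step.

```python
import pandas as pd

# Hypothetical values, just to illustrate the correlation check.
df = pd.DataFrame({
    "quality":    [5.0, 4.5, 2.0, 1.0],
    "difficulty": [2.0, 3.0, 4.0, 4.5],
    "sentiment":  [5, 4, 2, 1],
})
print(df.corr())  # sentiment tracks quality, moves against difficulty
```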

Summary:

Hopefully you learned something during this tutorial. NLP is fascinating to me, and I would love to see how you apply this foundation to other areas of work. Applying this model to Twitter data, for example, could yield very interesting results.
