Alexandra Mendes

Vítor Bernardes

Rui Melo

November 27, 2023

How to analyse customer reviews with NLP: a case study

This report analyses the customer reviews of Britannia International Hotel Canary Wharf. The analysis was performed using Natural Language Processing techniques, and the results were used to identify which aspects of the hotel's service needed to be improved.

Apart from the hospitality industry, this analysis can benefit any other sector with access to customer feedback, like e-commerce, food services, or the entertainment industry.

Problem

One of the most critical aspects of running a business is understanding its strengths and weaknesses. Knowing why it is thriving, or why it is not, is key to its longevity. Hotels are no strangers to this scenario.

As a business owner, it is essential to understand why some customers might not return to the hotel, what put them off, and what positively stood out to them.

To perform this research, we gathered a dataset of hotel reviews and focused our attention on a specific hotel: Britannia International Hotel Canary Wharf.

‍

Britannia International Hotel Canary Wharf.

‍

The dataset, gathered from the Kaggle platform, contains over 515,000 customer reviews and scores for 1,493 luxury hotels across Europe.

Solution

Motivation and Objectives

To gain insights into the hotel reviews and understand customers' feelings and feedback more accurately, we first needed to explore customer opinions and segmentation using the data available in our dataset.

Additionally, the large corpus of customer feedback makes it time-consuming to review manually and capture customers' preferences and pain points. Therefore, we also analyzed the review texts with Natural Language Processing techniques to understand the feelings and emotions behind the reviews and recognize which aspects of the hotel required improvement.

While we applied this process to the hospitality industry, this type of analysis can be readily implemented for any other industry that captures customer feedback, or even applied to customer comments collected from social media posts.

Overview

We started by evaluating the available data, with particular attention to the format and soundness of each field. As is typical when dealing with datasets, especially ones that involve user-generated data, some data needed cleaning. This is an important step in every data analysis process to ensure that the data we work with and use as a foundation for insights is sound and therefore leads to reasonable and representative conclusions.

‍

In the specific case of this dataset, the actual review text needed some minor cleaning to remove redundant whitespace. However, we also noticed a significant issue: all punctuation was missing from the reviews. Therefore, a pre-processing step was necessary. We recovered some of the structure provided by that punctuation to ensure we could use Natural Language Processing techniques and obtain relevant results. A simple yet effective method was to approximate that structure by adding a period before each word beginning with a capital letter.

The effectiveness of that method also stemmed from our additional processing, where we filtered known acronyms and named entities, so we would not add unnecessary periods. To achieve that, we employed automatic named entity recognition, a process that attempts to identify named entities in a given piece of text automatically. In the NLP context, named entities are real-world objects that can be identified with a proper name, including cities, individuals, organizations, etc.
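The period-recovery step described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the `PROTECTED` set is a hypothetical stand-in for the acronyms and named entities that an automatic named entity recognition step would return, and the function name is our own.

```python
import re

# Hypothetical stand-in for the named entities and acronyms that an automatic
# NER step would return; these example values are assumptions.
PROTECTED = {"Canary", "Wharf", "London", "Britannia", "TV", "I"}

def restore_periods(text, protected=PROTECTED):
    """Insert a period before each capitalised word that is not protected."""
    # Collapse redundant whitespace first, as in the cleaning step.
    words = re.sub(r"\s+", " ", text).strip().split(" ")
    out = []
    for i, word in enumerate(words):
        starts_sentence = (
            i > 0
            and word[:1].isupper()
            and word not in protected
            and not word.isupper()  # skip all-caps acronyms such as "AC"
        )
        if starts_sentence and not out[-1].endswith("."):
            out[-1] += "."
        out.append(word)
    return " ".join(out)

print(restore_periods("The room was clean  Staff were friendly near Canary Wharf The bed was hard"))
```

The heuristic is deliberately simple: any capitalised word that is neither protected nor an all-caps acronym is treated as the start of a new sentence, so its quality depends on how complete the protected list is.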

‍

Analysis

Data profiling

The next step was to create our working dataset by filtering the full dataset down to the reviews of our specific hotel.

The dataset contains the review date and the score given to that stay. It also includes the reviewer's nationality and tags that describe the characteristics of the visit, such as whether the room was a double or a single and how long the stay was. In addition, it holds separate negative and positive review texts for each stay.

To bring the available data closer to a real scenario, we randomly merged the negative and positive reviews into a single column to analyze later.
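As a rough sketch of the filtering and merging steps, assuming each row is a dictionary with the columns `Hotel_Name`, `Negative_Review`, `Positive_Review`, and `Reviewer_Score`, and that empty reviews are marked with the placeholders `No Negative` and `No Positive`. These names are assumptions about the file layout, and pooling each review as its own shuffled entry is only one possible reading of the merge.

```python
import random

# Assumed column names and placeholder strings; adjust them to the actual file.
HOTEL = "Britannia International Hotel Canary Wharf"
PLACEHOLDERS = {"No Negative", "No Positive", ""}

def merge_reviews(rows, hotel=HOTEL, seed=0):
    """Keep one hotel's rows and pool both review columns into a single list."""
    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    pooled = []
    for row in rows:
        if row["Hotel_Name"] != hotel:
            continue
        for column, label in (("Negative_Review", "negative"),
                              ("Positive_Review", "positive")):
            text = row[column].strip()
            if text not in PLACEHOLDERS:
                pooled.append({"review": text,
                               "label": label,
                               "score": row["Reviewer_Score"]})
    rng.shuffle(pooled)  # mix negative and positive reviews in one column
    return pooled
```

Keeping the original label next to each pooled review makes it possible to check later how well an automatic sentiment classifier recovers it.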

‍

Distribution Analysis

The first task was to look at review ratings by date, which could reveal periods with poorer ratings. Such periods could stem from a seasonal factor, such as not having air conditioning in the summer, or from the impact of a specific employee.

This approach was not fruitful, but the same logic can be applied to the tags and nationalities. Through the tags, we could identify, for instance, whether customers who stayed in an Executive Double Room tended to leave bad reviews. We visualized this with boxplots. We analyzed all the different tags and found that most of them showed similar distributions, which prevented us from obtaining relevant insights.
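The numbers behind such boxplots can be computed without any plotting library. The sketch below assumes each review is a dictionary with a `tags` list and a numeric `score`; both field names are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles

def score_summary_by_tag(reviews):
    """Return the quartiles of the reviewer score for every tag."""
    by_tag = defaultdict(list)
    for review in reviews:
        for tag in review["tags"]:
            by_tag[tag].append(review["score"])
    summary = {}
    for tag, scores in by_tag.items():
        if len(scores) < 2:
            continue  # quantiles() needs at least two data points
        q1, median, q3 = quantiles(scores, n=4, method="inclusive")
        summary[tag] = {"q1": q1, "median": median, "q3": q3, "n": len(scores)}
    return summary
```

If the quartiles of most tags overlap heavily, as they did here, the tags are unlikely to explain differences in the scores.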

‍

Boxplots with reviewer score for different hotel accommodations.

Regarding nationalities, it was essential to analyze the distribution of our customers, which could provide insights into the marketing team's effectiveness in some markets. Excluding UK customers, who represent 80% of all customers, we get the following world map overview, where darker shades indicate a higher number of reviewers of that nationality:

World map overview indicating reviewers' nationality.

Sentiment Analysis

To further understand the feeling behind the reviews, we used a language model hosted on the HuggingFace platform to determine whether each review was positive or negative. The multilingual XLM-RoBERTa-base model was trained on ~198M tweets and fine-tuned for sentiment analysis in 8 languages.
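A sketch of how such a model could be called is shown below. The checkpoint name `cardiffnlp/twitter-xlm-roberta-base-sentiment` is our assumption of a model matching this description, and folding its three classes (negative, neutral, positive) into two by comparing only the positive and negative scores is one possible choice, not necessarily the one made in the project.

```python
MODEL_ID = "cardiffnlp/twitter-xlm-roberta-base-sentiment"  # assumed checkpoint

def to_binary(label_scores):
    """Collapse negative/neutral/positive scores into a positive/negative call."""
    scores = {item["label"].lower(): item["score"] for item in label_scores}
    if scores.get("positive", 0.0) >= scores.get("negative", 0.0):
        return "positive"
    return "negative"

def classify(reviews, model_id=MODEL_ID):
    """Run the hosted model over the reviews (needs transformers and torch)."""
    from transformers import pipeline  # imported lazily: heavy optional dependency
    classifier = pipeline("sentiment-analysis", model=model_id,
                          tokenizer=model_id, top_k=None)
    # With top_k=None the pipeline returns every class score for each review.
    return [to_binary(scores) for scores in classifier(list(reviews), truncation=True)]
```

Calling `classify(["The room was clean and spacious"])` downloads the model on first use, so it needs network access and the model's tokenizer dependencies.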

‍

With the ability to split the reviews into positive and negative with a reasonable confidence level (0.76 accuracy in our dataset), we tried to analyze patterns within those reviews. A straightforward way to visualize the words is through word clouds. Following is the word cloud for Negative and Positive Reviews.

There is much information to be gained from analyzing the dynamics between positive and negative customer reviews. Customers surely want to have their say, as demonstrated by our dataset, where negative reviews are, on average, over twice as long as positive reviews. Additionally, by looking at how the average number of reviews evolves over time, we can see a potential slight upward trend in the number of negative reviews, which the business should be attentive to.
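Both figures mentioned above, the classifier accuracy and the average review length per sentiment, are simple to compute once each review carries a label. A minimal sketch, assuming reviews are dictionaries with hypothetical `review` and `label` fields:

```python
from statistics import mean

def accuracy(predicted, expected):
    """Fraction of reviews whose predicted label matches the known label."""
    pairs = list(zip(predicted, expected))
    return sum(p == e for p, e in pairs) / len(pairs)

def mean_length_by_label(reviews):
    """Average number of words per review, grouped by sentiment label."""
    lengths = {}
    for review in reviews:
        lengths.setdefault(review["label"], []).append(len(review["review"].split()))
    return {label: mean(values) for label, values in lengths.items()}
```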

‍

‍

Emotion Analysis

Besides identifying the sentiment behind a text, another NLP technique is to identify the emotion behind it. To achieve this, we used the NRCLex library, which recognizes emotions in texts, such as fear, anger, or surprise. This analysis allows us to understand more accurately how customers feel about a specific service or product.

Similarly to sentiment visualization, we can visualize a word cloud for each emotion within the positive or negative reviews by identifying the different emotions associated. For example, the word cloud generated from the trust emotion within the positive reviews is as follows:

‍

Word cloud generated from trust emotion within positive reviews

‍

This process gives us some idea of what triggers each customer emotion.
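Lexicon-based emotion detection boils down to looking up each word in a word-to-emotion table. The sketch below illustrates the idea with a tiny invented lexicon in place of the full one shipped with an emotion library; the entries and the function name are assumptions for illustration only.

```python
from collections import Counter

# Toy word-to-emotion lexicon, invented for illustration only; a real run
# would rely on the full lexicon shipped with the emotion library.
TOY_LEXICON = {
    "friendly": {"trust", "joy"},
    "helpful": {"trust"},
    "clean": {"trust", "joy"},
    "rude": {"anger", "disgust"},
    "noisy": {"anger"},
    "dirty": {"disgust"},
}

def emotion_words(reviews, lexicon=TOY_LEXICON):
    """Map each emotion to a Counter of the words that triggered it."""
    triggers = {}
    for review in reviews:
        for word in review.lower().split():
            word = word.strip(".,!?")
            for emotion in lexicon.get(word, ()):
                triggers.setdefault(emotion, Counter())[word] += 1
    return triggers
```

The counter for a single emotion, for example `emotion_words(reviews)["trust"]`, holds the word frequencies a word cloud for that emotion would be drawn from.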

‍

Keyword Analysis

To further analyze the reviews, we wanted to identify the main objects of customer comments in their reviews. To achieve that, we extracted relevant keywords from the set of positive and negative reviews using YAKE, an unsupervised automatic keyword extraction method.

‍

This method computes statistical features for each candidate term in a review, including word case, position, frequency, and context, and weights each term according to these features.

Finally, a score is computed indicating the significance of each term as a potential keyword. This is a powerful yet lightweight method that, due to its fully unsupervised nature, can be employed in different domains and even with other languages.

‍

Additionally, we employed a pure frequency-based approach to uncover the most common objects mentioned in reviews. The results were similar to our keyword analysis, reaffirming its validity and reliability.
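Both approaches can be sketched in a few lines. The frequency baseline below uses only the standard library with a deliberately small, illustrative stop-word list (an assumption), while `yake_keywords` wraps the `yake` package with parameter values chosen for illustration rather than taken from the project.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real runs need a fuller one).
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "in", "of", "to",
             "very", "but", "it", "not"}

def most_common_terms(reviews, top=5):
    """Pure frequency baseline: count non-stop-word tokens across all reviews."""
    counts = Counter()
    for review in reviews:
        tokens = re.findall(r"[a-z][a-z-]*", review.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top)

def yake_keywords(text, top=10):
    """Single-word YAKE keywords (needs the yake package); lower score = more relevant."""
    import yake  # imported lazily: optional dependency
    extractor = yake.KeywordExtractor(lan="en", n=1, top=top)
    return extractor.extract_keywords(text)  # list of (keyword, score) pairs
```

Comparing the two lists for the same set of reviews is a quick sanity check: terms that rank highly in both are reasonably safe to treat as the main objects of customer comments.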

‍

These were the keywords identified for positive and negative reviews:

‍

Positive: hotel, location, staff, view, room, breakfast
Negative: hotel, staff, room, breakfast, window, bed, Wi-Fi

‍

As expected, the identified keywords are common points addressed in the hospitality industry reviews. They already constitute a good indicator of adequate service or potential areas of improvement for the hotel.

‍

However, we wanted to go deeper into the analysis and uncover exactly what it was about these objects that was – or was not – working as expected by customers. For example, why were windows such a prominent aspect of negative reviews?

To that end, we used another technique from Natural Language Processing: syntactic dependency parsing. We employed spaCy, a fast, comprehensive, and production-ready NLP library for Python, to create a syntactic dependency tree, which connects all terms in the input text according to their syntactic relation. Then, we queried this tree to pinpoint precisely what it was about a given keyword (for example, "room" or "location") that customers did or did not especially like.
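The tree query described above could look roughly like this. It is a sketch under assumptions: the function names are ours, the dependency labels `amod`, `nsubj`, and `acomp` follow spaCy's English models, and `en_core_web_sm` is just one model that could be used.

```python
from collections import Counter

def collect_modifiers(doc, keyword):
    """Count adjectives attached to `keyword` in a parsed spaCy Doc."""
    found = Counter()
    for token in doc:
        if token.lemma_.lower() != keyword:
            continue
        # Attributive use: "a spacious room" (adjectival modifier of the noun).
        found.update(c.lemma_.lower() for c in token.children if c.dep_ == "amod")
        # Predicative use: "the room was spacious" (complement of the verb).
        if token.dep_ == "nsubj":
            found.update(c.lemma_.lower() for c in token.head.children
                         if c.dep_ == "acomp")
    return found

def modifiers_in_reviews(reviews, keyword, model="en_core_web_sm"):
    """Parse each review and merge the modifier counts (needs spaCy and the model)."""
    import spacy  # imported lazily: optional dependency
    nlp = spacy.load(model)
    total = Counter()
    for doc in nlp.pipe(reviews):
        total += collect_modifiers(doc, keyword)
    return total
```

For example, `modifiers_in_reviews(negative_reviews, "bed")`, where `negative_reviews` is a hypothetical list of review strings, would return a counter of adjectives describing the bed, which can feed a word cloud directly.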

‍

‍

The result was a list of modifiers for each keyword. For example, we could learn that customers might consider a "room" to be "spacious" or the "location" to be "convenient." This resulting list of modifiers enabled us to create word clouds to visualize the frequency of each modifier for the given keyword, such as the word cloud below, for the keyword "room":

‍

‍

Analyzing these frequent modifiers for each keyword, along with their relevance and weight, separately for positive and negative reviews gave us deeper insight into what customers like best, and what they like less. We present the results below.

Outcomes

Upon analyzing the data set as described above, we were able to identify some positive aspects of the business, as well as essential areas for improvement.

‍

One noticeable comment from customers, which frequently appears in both positive and negative reviews, is that some consider the hotel dated. The three main modifiers used to describe the hotel in negative reviews pertain to that quality. This suggests the business may want to look into renovation to address those pain points.

Modifiers for hotel keyword in negative reviews

Modifiers for hotel keyword in positive reviews.

‍

The keyword analysis reveals customers' most common points when posting their reviews. As one would expect, the room features prominently in both negative and positive reviews. While it is mentioned regularly in negative reviews throughout the period we analyzed, in approximately the last six months, there was a surge in room mentions in positive reviews, a potentially favorable trend the business should be aware of. In positive reviews, the most common comments refer to rooms as clean and spacious. There are also references to being overall comfortable and cheap.
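Trends like this one can be tracked by computing, for each month, the share of reviews that mention a keyword. The sketch below assumes each review is a dictionary with a `review` text and an ISO-formatted `date` string; both field names and the date format are assumptions.

```python
from collections import defaultdict

def monthly_mention_share(reviews, keyword):
    """Share of reviews per month whose text mentions `keyword`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for review in reviews:
        month = review["date"][:7]  # assumes ISO dates, e.g. "2017-03-14"
        totals[month] += 1
        if keyword in review["review"].lower().split():
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}
```

Running it separately on positive and negative reviews gives two series per keyword that can be compared over time.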

‍

The beds were also frequently mentioned, with some users considering them stiff and uncomfortable. The prevalence of this comment also suggests an immediate area for improvement. On that note, some customers also pointed out that they found the hotel noisy.

‍

Top modifiers for negative reviews for bed.

‍

In addition, another major issue reported by customers is the hotel's heating, ventilation, and air conditioning system: "hot" and "cold" were the main concerns from customers regarding their rooms. One particular pain point was the room window, which was mentioned frequently enough to be identified as one of our keywords, especially since opening the windows in some rooms required staff assistance.

Word cloud with main concerns from customers.

‍

The staff were also frequently brought up in both positive and negative reviews, with some customers considering them rude. However, more often than not, they were considered friendly and helpful, although one particular point of interest is that many customers thought the hotel was understaffed. Finally, mentions of the staff in reviews remain relatively constant over time.

The hotel location was another prominent factor in positive reviews. It was predominantly perceived as a positive aspect, with many general compliments and descriptions of it as convenient and centrally located. However, one crucial trend the business should be aware of is that, over time, location has been mentioned less frequently in positive reviews while increasingly referred to in negative reviews. While this may relate to the surrounding area and, therefore, to factors outside the hotel's immediate control, it is a potential trend worth monitoring.

Finally, it is worth mentioning that a significant number of negative reviews commented upon the hotel's Wi-Fi, mainly due to it being paid and not free.

‍

Applications

Business intelligence and sentiment analysis projects such as this can bring value to many use cases.

‍

E-commerce

Nowadays, a significant portion of shopping is done online. E-commerce represents a growing trend of nearly unlimited access to resources, markets, and products in real-time from anywhere on the planet. Understanding the reach of marketing in terms of customer segmentation is very important for a business to adjust its efforts to reach the desired target audience.

Almost every e-commerce platform contains a reviews section where customers can comment on the products they bought. This comment section represents a valuable data source that can bring value to the business.

‍

Through NLP techniques, it is possible to acquire insights into what customers like or dislike about the products. These insights can help identify flaws in, or possible improvements to, the product and the platform. We can identify key aspects that cause insecurity or other emotions in customers, so we can act on them.

It also becomes possible to see the evolution of the user sentiment on the product over time and measure how changes affected the customers' overall opinion.

‍

Hospitality Industry

The hospitality industry is a very competitive sector where little details can prove to be essential edges over competitors.

‍

Booking, Trivago, Google, and other platforms often list establishments. The common aspect between these platforms is that customers often use them to leave reviews. By analyzing the review scores and comments, it is possible to gather insights into customers' opinions on key aspects of the businesses.

‍

This data allows us to interpret which aspects of the business need changing or attention, what parts customers value, and possibly foresee some adjustments we should consider.

‍

Food services industry

Restaurants, coffee shops, and bars increasingly rely on their online presence to attract customers. This involves being listed on several platforms like Yelp, Google, Zomato, and Tripadvisor, which allow users to leave ratings and written reviews. Often, clients choose which new places to try based solely on these reviews, making them a key to understanding how the business is performing.

‍

It is in these establishments' best interest to use all this feedback to find ways to get an edge over their competitors. Analyzing possible customer pain points helps invest in worthwhile improvements, and tracking consumer sentiment over time ensures that the investments are paying off.

‍

Any establishment that grows beyond a certain size must rely on Data Science techniques to analyze the many reviews it may get on different platforms. This process can be automated, providing quick feedback and a broad vision of what is attracting or disenchanting customers. This will help managers take their food services to the next level.

Entertainment Industry

The entertainment industry is broad, including everything from movies, TV shows, and YouTube channels to amusement parks and circus acts. Common to all of these businesses, especially in the digital age, is that they are subject to reviews and comments, both from critics and spectators.

As the business grows, the number of reviews might become unmanageable, making it difficult to understand the overall sentiment of the audience. This is where NLP techniques should come into play, allowing many comments to be parsed and analyzed to extract valuable and actionable insights.

Endnotes

In summary, we analyzed customer feedback about their stay in a hotel using Natural Language Processing techniques and uncovered actionable insights that can directly impact business decision-making. This analysis and the underlying processes can be used for many other applications, bringing value to businesses across many sectors.

‍

This project was completed in 3 days with a team of 2 Imaginary Cloud Data Scientists. Imaginary Cloud provides Data Science and AI development services, focusing on bringing the highest value to its clients through tailored solutions and an agile process.

‍

Alexandra Mendes

Content writer with a big curiosity about the impact of technology on society. Always surrounded by books and music.

Vítor Bernardes

Data scientist passionate about data science and watchful of its ethical implications. Besides work, I love nerding out on music and reading a good story.

Rui Melo

Data Scientist who loves exploring problems. In my free time, I teach basketball to kids and enjoy going to the beach.
