Author: Palash Jain

“To understand humans, you must understand text. From a business perspective, if you want to understand your customers and how they use your product/service, you have to be able to analyze text to a very thorough degree.” Paul Hoffman, CTO, Space-Time Insight

In this post, I introduce a hypothetical project that kicks off a series on Natural Language Processing (NLP), starting with some basic text pre-processing techniques. Let’s assume that you are a manager at a luxury hotel in Delhi. The hotel’s ratings have been stuck at just above average for quite some time. To understand why the ratings are stagnant, it is important to analyze the customers’ written reviews and comments.

For this project, approximately 1950 reviews of a 5-star hotel in New Delhi, India, were scraped from TripAdvisor. (There will be a separate post about web scraping at the end of this series.) The hotel’s rating at the time of writing was 3.5 stars out of 5. These reviews therefore suit our project well, and we will apply text analytics and NLP to learn more about the situation.

## [1] “The Piccadily New Delhi is a spacious, well-appointed 5-star hotel with iconic rooms and Suites, 24- hour restaurant, Business Centre, an outdoor pool, a spa and Fitness by Precor®zone, catering to discerning business and leisure travellers in the bustling West Delhi region. This Hotel is located in the west of New Delhi at Janakpuri, in the heart of the local trading and public sector community and Indira Gandhi International Airport and benefits from easy access to the main commercial, business and entertainment Malls. There are Two Delhi Metro lines connectivity from the hotel (Blue and Magenta Metro lines). Hotel offers choice of Banquet Venues, Thematic Menus, Ample basement valet parking and panel of wedding & event planners and decorators to cater social gatherings.”

## [2] “I am coming to this hotel from July 2011 when this hotel used to be Hilton then it became Piccadily and I still enjoy the hospitality of the hotel. To begin with the location is very convenient for me. The staff is extremely helpful and have always fulfilled my needs. In so many years I have never faced any kind of dissatisfaction in the hotel. Though with time this hotel needs upgradation and I can see things happening here. Mitali at reception is great.…”       

## [3] “I stayed fir short while but the hotel ambience is good. location is convinient. Rooms are neat and clean. Room service is promt and efficient. staff is helful and very polite. Mitali is nice at reception. will aurey come back again.”

## [4] “I was a part of my friends group and would like to say that the hotel is very well located. It had a beautiful lobby , lovely decor. The food is nice. Staff is attentive and helpful. Will surely come back again.”                              

## [5] “We are group of friends who had booked piccadily on line. The rooms are well maintained and large as compare to other hotels in city. The staff os proactive to help the guest needs. Food is good. Will recommend to our other friends to must visit hotel on their delhi trip”

The total number of reviews currently scraped is 1961. This unclean, unstructured data needs to be converted into a structured format such as a data frame. To do this, we will follow the basic steps of text pre-processing. The first step is to create a corpus, which simply means a collection of documents. Here, each review is treated as an individual document.

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1961

As seen above, our corpus now consists of 1961 documents (equal to the number of scraped reviews). For demonstration purposes, a random review is selected and the preprocessing steps are applied to it.

## [1] “In one word fantastic – I came 2 weeks Back to this hotel piccadily where I have done the secret plan that I have requested to the hotel for the towel art ,So the moment we both checked in into the room we both were completely surprised the effort which housekeeping people took…”
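The corpus-building and sampling step can be sketched roughly as follows. The printed output in this post appears to come from R's `tm` package, so this is only a standard-library Python stand-in, not the code that produced it. The three reviews are shortened placeholders for the scraped data, and the names `reviews`, `corpus`, and `sample_id` are assumptions made for illustration.

```python
import random

# Hypothetical stand-ins for the scraped reviews (the real data set has 1961).
reviews = [
    "The staff is extremely helpful. Mitali at reception is great.",
    "Rooms are neat and clean. Room service is efficient.",
    "The food is nice. Staff is attentive and helpful.",
]

# Treat each review as one document: the corpus maps a document id to its text.
corpus = {doc_id: text for doc_id, text in enumerate(reviews, start=1)}
print(f"documents: {len(corpus)}")

# Pick one review at random for the step-by-step demonstration.
rng = random.Random(42)  # fixed seed so the same review is picked on every run
sample_id = rng.choice(sorted(corpus))
print(sample_id, corpus[sample_id])
```

Keeping one entry per review means every later cleaning step can be applied document by document.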

The case of a letter is significant for a human reader, but it matters much less for most text analyses. Languages like R and Python compare strings case-sensitively and thus do not treat ‘Dad’ as equal to ‘dad’. So, to start our preprocessing, we will convert the chosen review to lower case for standardization.

## [1] “in one word fantastic – i came 2 weeks back to this hotel piccadily where i have done the secret plan that i have requested to the hotel for the towel art ,so the moment we both checked in into the room we both were completely surprised the effort which housekeeping people took…”

As seen above, the entire review has been converted to lowercase.
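A minimal Python sketch of the lowercasing step, assuming the review is held in a plain string. The variable names are illustrative, and the string is a shortened excerpt of the review above.

```python
# String comparison is case sensitive, so these two are different tokens.
print("Dad" == "dad")  # False

review = "In one word fantastic – I came 2 weeks Back to this hotel piccadily"

# Lowercasing makes 'Back' and 'back' count as the same word later on.
lowered = review.lower()
print(lowered)
```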

While things like punctuation, symbols, numbers and hyphens make a lot of sense for humans, to computers they are like background noise. The key information in the text is often not found in any of these.

## [1] “in one word fantastic  i came 2 weeks back to this hotel piccadily where i have done the secret plan that i have requested to the hotel for the towel art so the moment we both checked in into the room we both were completely surprised the effort which housekeeping people took…”

As seen above, the punctuation in the review has been removed. Where a mark stood between two spaces (such as the dash after ‘fantastic’), a double space is left behind.
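One way to sketch punctuation removal in Python is a regular expression that deletes every character that is neither a letter, a digit, nor white space. This is an assumption about how the step could be done, not the function used to produce the output above, and the input is a shortened excerpt of the sample review.

```python
import re

text = ("in one word fantastic – i came 2 weeks back to this hotel "
        "for the towel art ,so the moment we both checked in")

# [^\w\s] matches anything that is not a word character or white space;
# the underscore counts as a word character, so it is listed separately.
no_punct = re.sub(r"[^\w\s]|_", "", text)
print(no_punct)
```

Because the characters are deleted rather than replaced, the spaces on either side of the dash remain and show up as a double space, which a later step cleans up.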

## [1] “in one word fantastic  i came weeks back to this hotel piccadily where i have done the secret plan that i have requested to the hotel for the towel art so the moment we both checked in into the room we both were completely surprised the effort which housekeeping people took…”

As seen above, any numbers present in the review have been removed.
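Number removal can be sketched the same way, here with a regular expression that deletes runs of digits. This is an illustrative Python snippet with assumed variable names.

```python
import re

text = "i came 2 weeks back to this hotel"

# \d+ matches one or more consecutive digits; each run is deleted.
no_numbers = re.sub(r"\d+", "", text)
print(repr(no_numbers))
```

Note that deleting "2" leaves two adjacent spaces between "came" and "weeks", which is one reason a white-space clean-up step comes later.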

The next step in the text preprocessing pipeline is the removal of ‘stop words’. These are common function words such as ‘is’, ‘are’, and ‘and’ that make text easier for humans to read but carry little information for the analysis.

## [1] ” one word fantastic   came weeks back hotel piccadily    done secret plan requested hotel   towel art moment checked room completely surprised  effort housekeeping people took…”

As seen above, words like ‘in’, ‘i’, ‘this’, ‘to’, and ‘have’ amongst others have been removed. The remaining information all seems to be relevant and descriptive.
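A rough Python sketch of stop-word removal follows. The `STOP_WORDS` set is a tiny hand-made list for illustration only. Real stop-word lists are much longer, and the one used for the output above is not shown in this post.

```python
# Hypothetical, deliberately small stop-word list.
STOP_WORDS = {
    "i", "in", "to", "this", "have", "the", "is", "are", "and",
    "that", "for", "we", "both", "were", "which", "where", "into", "so",
}

text = ("in one word fantastic i came weeks back to this hotel piccadily "
        "where i have done the secret plan")

# Split on white space and keep only the tokens that are not stop words.
kept = [token for token in text.split() if token not in STOP_WORDS]
filtered = " ".join(kept)
print(filtered)
# one word fantastic came weeks back hotel piccadily done secret plan
```

Unlike the output above, joining the kept tokens with single spaces leaves no gaps where the stop words used to be.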

Any unnecessary white space left behind by the previous steps needs to be removed.

## [1] ” one word fantastic came weeks back hotel piccadily done secret plan requested hotel towel art moment checked room completely surprised effort housekeeping people took…”
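White-space clean-up can be sketched in Python as collapsing every run of white space into a single space. The final `strip()` call is an extra step added here for illustration: the output above still starts with a space, so the original pipeline apparently did not trim the ends.

```python
import re

text = " one word fantastic   came weeks back hotel piccadily    done secret plan"

# Collapse any run of spaces, tabs, or newlines into one space.
collapsed = re.sub(r"\s+", " ", text)
print(repr(collapsed))

# Optionally also drop leading and trailing white space.
trimmed = collapsed.strip()
print(repr(trimmed))
```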

Another key step in text preprocessing is dealing with words that share a common origin. For example, the words ‘coming’ and ‘came’ both originate from the root word ‘come.’ Depending on the problem at hand, it might be beneficial to transform all such words to their shared root. This process is generally called stemming. Strictly speaking, mapping an irregular form such as ‘came’ to its dictionary root is lemmatization, while stemming in the narrow sense only trims suffixes.

## [1] “one word fantastic come week back hotel piccadily do secret plan request hotel towel art moment check room completely surprise effort housekeeping people take …”
## [2] “”

As seen above, words like ‘came’ and ‘weeks’ have been transformed into their respective root words ‘come’ and ‘week.’ While stemming can be quite beneficial in many cases, it can also lead to information loss, so it needs careful consideration. In this case, I chose to skip stemming.
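To make the idea concrete, here is a deliberately naive Python sketch that combines a lookup table for irregular forms with crude suffix stripping. It is a toy illustration, not the algorithm behind the output above: `IRREGULAR`, `SUFFIXES`, and `to_root` are hypothetical names, and real stemmers and lemmatizers use far more elaborate rules and dictionaries.

```python
# Hypothetical lookup for irregular forms (dictionary-style, like a lemmatizer).
IRREGULAR = {"came": "come", "done": "do", "took": "take"}

# Suffixes tried in order (rule-style, like a very crude stemmer).
SUFFIXES = ("ing", "ed", "s")

def to_root(word):
    """Return a rough root form for a lowercase word."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = "came weeks back hotel requested checked rooms surprised".split()
roots = [to_root(token) for token in tokens]
print(roots)
# ['come', 'week', 'back', 'hotel', 'request', 'check', 'room', 'surpris']
```

The last result, ‘surpris’, is not a real word. This kind of side effect is one reason root reduction needs careful consideration before it is applied to a whole corpus.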

Let’s inspect a few other reviews from our corpus to ensure that the transformations we have done have been applied to the entire corpus.

## [1] ” terrible hotel dont waste money worst hospitality ever nobody welcomes staff rude seriously specially reception staff telephones dont work everytime go reception fix everything worst experience ever”

## [1] ” stayed piccadily hotel janakpuri march experience never like repeat reached hotel pmthe check time pm requested early checkin con call starting half hourafter wait …”

## [1] “lobby huge hotel good location front desk give good room new room floor nice think network issues foods tastes good good staff reception counter guide us well breakfast location timing”

As seen above, all the inspected reviews are in lower case and contain no punctuation, numbers, or stop words.
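Putting the steps together, the whole clean-up can be sketched as one Python function applied to every document, followed by the kind of spot check described above. The reviews, the `STOP_WORDS` set, and the function name `clean` are illustrative assumptions rather than the data or code behind this post.

```python
import re

# Hypothetical, deliberately small stop-word list.
STOP_WORDS = {"a", "and", "at", "in", "is", "it", "the", "this", "to", "was", "we"}

def clean(text):
    """Lowercase, drop punctuation and digits, remove stop words, tidy spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]|_", "", text)  # punctuation and symbols
    text = re.sub(r"\d+", "", text)        # numbers
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)                # single spaces, no padding

# Made-up example reviews standing in for the scraped corpus.
reviews = [
    "Terrible hotel, don't waste money!",
    "We reached the hotel at 11 PM; check-in was slow.",
    "Lobby is huge and the location is good.",
]
cleaned = [clean(review) for review in reviews]

# Spot check: every document satisfies the properties listed above.
for doc in cleaned:
    assert doc == doc.lower()
    assert not re.search(r"[^\w\s]|\d", doc)
    assert not (set(doc.split()) & STOP_WORDS)
    print(doc)
```

Checking the properties programmatically scales better than reading a handful of reviews by eye, although reading a few is still a useful sanity check.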

Our corpus is now ready to be transformed into a structured format for analysis. Other things in the corpus may need cleaning on a case-by-case basis. However, these are the most important and fundamental text cleaning steps to keep in mind.

The next post in this series will deal with transforming the corpus into a structured object and mining some basic insights from it.
