# More exploratory analysis (NLP) Part 3

###### Wilame

For today’s challenge, we will continue to navigate through our data using exploratory techniques. Textual data is so rich that it is worth this kind of in-depth analysis.

Over the next few days, I will explore TensorFlow for at least one hour per day and post the notebooks, data, and models to this repository.

Today’s notebook is available here.

## Exploring our data

Today, let’s apply other exploratory techniques to better understand our corpus.

As usual, we will start by importing the packages, excluding stop-words and the extra sentence we saw yesterday.

``````# imports
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

import numpy as np
from numpy import mean
from numpy import std
from numpy import percentile
import pandas as pd
import scipy

# get data
!wget --no-check-certificate \

# define get_data function
def get_data(path):
    return data

#get the data
data = get_data('/tmp/sentiment.csv')

# clone package repository
!git clone https://github.com/vallantin/atalaia.git

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package
!python setup.py install

# import it
from atalaia.atalaia import Atalaia

# get a list with all the texts
texts = data.text

#start atalaia
atalaia = Atalaia('en')

# get the number of tokens in each sentence
lens = [len(atalaia.tokenize(t)) for t in texts]
data['lengths'] = lens

#delete outliers
data = data.drop(index = [1228])

# lower everything
data['text'] = [atalaia.lower_remove_white(t) for t in data['text']]

# exclude stopwords
data['text'] = [atalaia.remove_stopwords(t) for t in data['text']]
``````
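The hard-coded index `1228` above comes from yesterday's analysis. As a rough sketch of how such a length outlier could be located without hard-coding it, the snippet below applies Tukey's IQR fences to a list of token counts. The helper name `find_length_outliers`, the multiplier `k=1.5`, and the toy data are assumptions for illustration; they are not part of the original notebook.

```python
from statistics import quantiles

def find_length_outliers(lengths, k=1.5):
    """Return the positions whose length falls outside the Tukey fences."""
    # quartiles of the observed lengths
    q1, _, q3 = quantiles(lengths, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, n in enumerate(lengths) if n < low or n > high]

# toy token counts: one sentence is far longer than the rest
toy_lengths = [5, 7, 6, 8, 5, 9, 7, 6, 120, 8]
outliers = find_length_outliers(toy_lengths)
print(outliers)  # [8]
```

The function returns list positions, so they only match the DataFrame index when the frame still has its default integer index.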

Let’s try to get the most representative words for our positive and for our negative corpus.

``````def plot_representative_words_for_sentiment(sentences, data):
    # start atalaia
    atalaia = Atalaia('en')

    # transform into corpus
    sentences = atalaia.create_corpus(sentences)

    # get the representative words for 80% of the corpus
    token_data = atalaia.representative_tokens(0.8,
                                               sentences,
                                               reverse=False)

    full_token_data                     = token_data.items()
    full_token_data_tokens, full_counts = zip(*full_token_data)

    # keep only the first 10 tokens for the plot
    token_data                          = list(full_token_data)[:10]
    tokens, counts                      = zip(*token_data)

    # plot
    plt.figure(figsize=(20,10))
    plt.bar(tokens,
            counts,
            color='b')
    plt.xlabel('Tokens')
    plt.ylabel('Counts')

    return full_token_data_tokens``````
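The internals of `representative_tokens` are not shown here. One plausible reading of "representative words for 80% of the corpus" is: take tokens from most to least frequent until their counts add up to 80% of all token occurrences. The sketch below implements that reading with `collections.Counter`. The function name `tokens_covering` and the whitespace tokenization are assumptions, and the library may compute this differently.

```python
from collections import Counter

def tokens_covering(sentences, coverage=0.8):
    """Most frequent tokens whose counts add up to `coverage` of all tokens."""
    # count every whitespace-separated token in the corpus
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())

    selected, running = {}, 0
    for token, count in counts.most_common():
        # stop once the selected tokens cover the requested share
        if running >= coverage * total:
            break
        selected[token] = count
        running += count
    return selected

toy = ["great phone great battery", "great food", "bad phone"]
result = tokens_covering(toy, coverage=0.8)
print(result)
```

On the toy corpus, "great" and "phone" are picked first, and rarer tokens are added only until the 80% threshold is reached.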

Plot positive first.

``````# get sentences
positive        = list(data[data.sentiment  == 1]['text'])
positive_tokens = plot_representative_words_for_sentiment(positive,data)``````

Now, plot negative.

``````# get sentences
negative        = list(data[data.sentiment  == 0]['text'])
negative_tokens = plot_representative_words_for_sentiment(negative,data)``````

We saw that some tokens are common to both groups of sentences. “Phone”, for instance, appears a lot in both positive and negative sentences. The same goes for “place” and “food”.
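A shared token can still lean toward one sentiment. As a minimal sketch, assuming plain whitespace tokenization instead of `atalaia.tokenize`, the hypothetical helper `sentiment_lean` below compares each shared token's relative frequency in the two groups. A positive value means the token takes up a larger share of the positive sentences.

```python
from collections import Counter

def relative_freq(sentences):
    """Share of all token occurrences taken by each token."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def sentiment_lean(positive, negative):
    """For tokens in both groups: positive share minus negative share."""
    pos, neg = relative_freq(positive), relative_freq(negative)
    shared = pos.keys() & neg.keys()
    return {tok: pos[tok] - neg[tok] for tok in shared}

toy_pos = ["great phone", "great food"]
toy_neg = ["bad phone", "bad food bad service"]
lean = sentiment_lean(toy_pos, toy_neg)
print(sorted(lean))  # ['food', 'phone']
```

With the notebook's variables, the call would be `sentiment_lean(positive, negative)`, since both are lists of cleaned sentences.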

Which tokens are among the representative tokens (those covering 80% of the corpus) for both sets of sentences?

A few things to think about:

• Should we filter these words from our corpus?
• Should we exclude punctuation?
• What other analyses could we run on this corpus?

``````#get the intersection of negative and positive tokens
intersection = list(set(positive_tokens) & set(negative_tokens))
intersection``````
``````>>> ['tell',
'want',
'people',
'burger',
'definitely',
'end',
'worked',
'others',
'tasty',
'easy',
'expected',
'touch',
'said',
'good',
'-',
'wait',
'server',
'much',
'came',
'problems',
'staff',
'ever',
'always',
'right',
'thats',
'seems',
're',
'every',
'3',
'comfortable',
'full',
'ago',
'day',
'shrimp',
'vegas',
'motorola',
'5',
'life',
'&',
'belt',
'come',
'stars',
'hit',
'someone',
'voice',
'even',
'nothing',
'dining',
'company',
'need',
'say',
'dinner',
'priced',
'place',
'bluetooth',
'area',
'back',
'clear',
'work',
'oh',
'loud',
'time',
'night',
'provided',
'love',
'quality',
'feature',
'absolutely',
'!',
'super',
'party',
'no',
'hands',
'amazon',
'lg',
'samsung',
'phone',
'bars',
'chicken',
'%',
'first',
'best',
'highly',
'though',
'20',
'several',
'call',
'mobile',
'service',
'new',
'wrong',
'fresh',
'hard',
'"',
')',
'disappointed',
'today',
'hold',
'not',
'?',
'worth',
'doesn',
'customer',
'two',
'used',
'10',
'plug',
'sushi',
'huge',
'found',
'side',
'4',
'anyone',
'look',
'black',
'dishes',
'completely',
'line',
'away',
'room',
'color',
'minutes',
'needed',
'steak',
'warm',
'set',
'large',
'months',
'car',
'way',
'hour',
'cable',
'different',
'table',
'expect',
'talk',
'wall',
'see',
'razr',
'put',
'connection',
'wear',
'plastic',
'experience',
'think',
'sides',
'1',
'small',
'happy',
'left',
'lunch',
'internet',
'things',
'dropped',
'impressed',
'actually',
'gave',
'going',
'screen',
'low',
'verizon',
'fit',
'extremely',
'battery',
'audio',
'light',
'cannot',
'waitress',
'case',
'camera',
'thing',
'charging',
'last',
'problem',
'part',
'dish',
'thought',
'restaurant',
'(',
'another',
'years',
'meal',
'sure',
'taste',
'high',
'item',
'pizza',
'd',
'piece',
'went',
'like',
'house',
'volume',
'device',
'since',
'places',
'long',
'great',
'ears',
'give',
'included',
'price',
'2',
'got',
'fries',
'makes',
'pho',
'home',
'bar',
'arrived',
'using',
'pretty',
'reception',
'looking',
'simple',
'next',
'times',
'find',
'cheap',
'yet',
'without',
'phones',
'use',
'getting',
'real',
'friendly',
'especially',
'food',
'get',
'year',
'keep',
'big',
'almost',
'purchase',
'hot',
'days',
'check',
'wanted',
'around',
'performance',
'know',
'pasta',
'nokia',
'thin',
'may',
'better',
'sound',
'hear',
'kind',
'decent',
'done',
'seriously',
'll',
'job',
'one',
'everything',
'cool',
'mic',
'feels',
'well',
'make',
'try',
'feel',
'cell',
'outside',
'quite',
'strong',
'quickly',
'something',
'recommend',
'looks',
'deal',
'ordered',
'needs',
'weeks',
'buffet',
'won',
'less',
'kept',
'everyone',
'couple',
'product',
'coming',
'didn',
'important',
'design',
'calls',
'go',
've',
'ear',
'working',
'flavor',
'least',
'beef',
'tried',
'care',
'gets',
'still',
'charged',
'must',
'overall',
'order',
'probably',
'rare',
'extra',
'/',
'chips',
'seafood',
'enjoy',
'eat',
'special',
'charger',
'.',
'wife',
'little',
'red',
'us',
'7',
'far',
'nice',
'treo',
'replace',
'sandwich',
'lacking',
'many',
'works',
'lot',
'take',
'finally',
'never',
'really',
'three',
'charge',
'bought',
'software',
'couldn',
'buttons',
':',
'enough']``````
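The intersection mixes real words with punctuation marks (`'-'`, `'&'`, `'!'`) and bare numbers (`'3'`, `'20'`). As a first step toward answering the punctuation question, here is a small sketch that separates the two groups. The helper name `split_noise` is hypothetical, and the sample is a handful of tokens taken from the output above.

```python
import string

def split_noise(tokens):
    """Separate word-like tokens from punctuation-only or digit-only tokens."""
    words, noise = [], []
    for tok in tokens:
        # a token counts as punctuation when every character is a punctuation mark
        is_punct = all(ch in string.punctuation for ch in tok)
        if is_punct or tok.isdigit():
            noise.append(tok)
        else:
            words.append(tok)
    return words, noise

sample = ['tell', '-', 'phone', '3', '&', 'food', '!', '20', 're']
words, noise = split_noise(sample)
print(words)  # ['tell', 'phone', 'food', 're']
print(noise)  # ['-', '3', '&', '!', '20']
```

Fragments such as `'re'`, `'ll'` and `'d'` survive this filter. They look like leftovers from split contractions and would need a separate rule.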