Lima Vallantin
Marketing Data scientist and Master's student interested in everything concerning Data, Text Mining, and Natural Language Processing. Currently speaking Brazilian Portuguese, French, English, and a tiiiiiiiiny bit of German. Want to connect? Send me a message. Want to know more? Visit the about page.


Last week, after personal reflection and some feedback received, I decided to change the format of this challenge. I came to the conclusion that it would be more enjoyable to have real examples of the application of data manipulation, rather than just exemplifying the capabilities of some tools.

So, today, let me explore how I have used data summarization to build a business plan for a new company project.

That being said… Over the next few days, I will explore data for at least one hour per day and post the notebooks, data, and models, when they are available, to this repository.

Today’s notebook is available here.

The problem

A month ago, a couple of friends asked me what I thought about a startup idea they had just had. The idea seemed great, but no document had yet been written to give the venture more background.

I offered to help them write the business plan. To do that, we needed to answer a few questions about the market situation, future projects, what competitors are doing, and what customers are saying about the services already available.

This is a very long and complex step. You have to replace “I think that…” with “according to …, this may happen”.

Most people will start by looking for this information on the internet. They will look for reports and newspaper articles related to the industry they want to enter.

But, you know, there’s a lot of junk out there. Since people started paying more attention to SEO, the internet has been flooded with low-quality articles that repeat the information you just saw on another site, only written in a different way.

To avoid this, you have to find reliable sources of information for your problem. For this purpose, let’s assume that you will use data from your local chamber of commerce or from big, well-established media organisations. Try to stay away from the blog of company X that is “doing super great in this business”.


To summarize texts, we will use Gensim’s summarizer. The algorithm is explained in more detail in this paper; what matters here is that it performs extractive summarization rather than generative (also called abstractive) summarization. The extractive method selects sentences from your corpus and uses them exactly as they are, while the generative method writes new sentences and is harder to get right. For a simple problem like ours, the extractive method works just fine.
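To make the extractive idea concrete, here is a minimal sketch that scores each sentence by how frequent its words are in the whole text and returns the best sentences unchanged. This is not Gensim’s algorithm: the function name `extractive_summary` and the frequency-based scoring rule are assumptions chosen only for illustration.

```python
from collections import Counter
import re


def extractive_summary(sentences, n=1):
    """Return the n highest-scoring sentences, verbatim and in document order."""
    # split each sentence into lowercase words
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    # word frequencies over the whole document
    freq = Counter(word for words in tokenized for word in words)
    # score of a sentence = average frequency of its words
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    # pick the n best indices, then restore the original order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]


doc = [
    "Delivery is growing fast.",
    "Ghost kitchens serve delivery orders only.",
    "The weather was nice.",
]
print(extractive_summary(doc, n=1))
```

Whatever the scoring rule, the output is always made of sentences copied from the input, which is what makes the method extractive.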

Before using Gensim, I implemented my own algorithm, which uses cosine distance to measure how similar two sentences in the corpus are. This method worked well, but it took too long to produce results.

Gensim’s implementation works well too, and it is very quick. So why not use it?

To illustrate our problem, let’s get a summary of the article “Coronavirus: Our Ghost-Kitchen Future”. Let’s say we are wondering if it’s a good idea to open a restaurant this year. The ideas in this article could be used in our business plan.
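For reference, the similarity measure behind that first attempt can be sketched as follows. This is not the original implementation, which is not shown in this post: it assumes plain word-count vectors and whitespace tokenization, and `cosine_similarity` is a hypothetical helper name. Cosine distance is simply one minus this value.

```python
from collections import Counter
import math


def cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity between two sentences using word-count vectors."""
    a = Counter(sentence_a.lower().split())
    b = Counter(sentence_b.lower().split())
    # dot product over the words the two sentences share
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    # an empty sentence has no direction, so treat it as not similar
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity("restaurants deliver food", "restaurants deliver pizza"))
```

Comparing every sentence with every other sentence means the number of comparisons grows quadratically with the size of the corpus, which may be part of why a naive version becomes slow.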

As usual, let’s start by importing our packages.

# general imports
import gensim
import nltk
from nltk.tokenize import sent_tokenize
from google.colab import files

# download the Punkt models used by sent_tokenize
nltk.download('punkt')

# atalaia import
# clone package repository
!git clone

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package
!python install

# import it
from atalaia.atalaia import Atalaia
from atalaia.explore import Explore
from atalaia.files import get_corpus, save_file

Next, we load the file with the article. You can find it in the repository, in the data directory.

# define language
lg = 'en'

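# start atalaia
atalaia = Atalaia(lg)

# load corpus
# files.upload() returns a dict that maps each uploaded filename to its content
# (this assumes a single file is uploaded)
uploaded = files.upload()
path = next(iter(uploaded))
corpus = get_corpus(path, ispandas=False)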

# replace newlines
corpus = [atalaia.replace_newline(sentence) for sentence in corpus] 

# keep unique

Our file will be imported as one long string containing the whole article. We need to break it into sentences using the NLTK sentence tokenizer.

Although it is a good way to get sentences out of a paragraph, the NLTK sentence tokenizer is not always precise.

# tokenize and get sentences
sentences = [sentence for text in corpus for sentence in sent_tokenize(str(text)) if sentence != '.' and sentence !=''] # flatten list
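To see why sentence splitting is harder than it looks, here is a deliberately naive splitter that cuts after every period, exclamation mark, or question mark followed by whitespace. It is not what NLTK does, and `naive_sent_split` is a hypothetical helper written only for this illustration.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(naive_sent_split("Dr. Smith opened a kitchen. It delivers food."))
# → ['Dr.', 'Smith opened a kitchen.', 'It delivers food.']
```

The abbreviation "Dr." is wrongly treated as the end of a sentence. NLTK’s Punkt tokenizer is trained to handle many cases like this one, but similar mistakes can still slip through, so it is worth inspecting the output.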

Time to explore

It’s always good practice to explore your corpus.

This article contains very short sentences, like the date it was written or the author’s name. This kind of information is irrelevant at this stage of the problem and needs to be removed.

This is the info for our corpus as it is:

  • Shortest sentence has length of: 2.0.
  • Longest sentence has length of: 106.0.
  • Average sentence has length of: 25.
  • Percentiles: (2.0, 15.0, 25.0, 38.0, 106.0).

# Start Explore
explore = Explore(lg)

# plot histogram
sentences_sizes, shortest, longest, average, percentiles = explore.plot_sentences_size_histogram(sentences, bins=10)

# plot boxplot

# plot most representative words

print('Shortest sentence has length of: {}.'.format(shortest))
print('Longest sentence has length of:  {}.'.format(longest))
print('Average sentence has length of:  {}.'.format(int(average)))
print('Percentiles: {}.'.format(str(percentiles)))
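If you want to reproduce these figures without atalaia, the sketch below computes similar statistics with the standard library only. It is an assumption about what such a helper could compute, not the code behind `plot_sentences_size_histogram`: it counts whitespace-separated tokens, reports the percentiles as (min, 25%, 50%, 75%, max), and `sentence_length_stats` is a hypothetical name.

```python
import statistics


def sentence_length_stats(sentences):
    """Return (lengths, shortest, longest, average, percentiles) in tokens."""
    # length of each sentence in whitespace-separated tokens
    lengths = [len(s.split()) for s in sentences]
    # quartiles; needs at least two sentences
    q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    percentiles = (min(lengths), q1, median, q3, max(lengths))
    return lengths, min(lengths), max(lengths), statistics.mean(lengths), percentiles


sample = ["a", "a b", "a b c", "a b c d", "a b c d e"]
lengths, shortest, longest, average, percentiles = sentence_length_stats(sample)
print(shortest, longest, average, percentiles)
```

The exact numbers may differ from the ones above, because a different tokenizer or percentile method will shift the values slightly.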

Let’s preprocess the corpus to remove stop words and sentences that are too short or too long.

# exclude short sentences
# get sentences longer than 80% of the average size
# include a maxcap to avoid outliers
min_cap   = int(average - (average*0.2))
max_cap   = 70
sentences = [sentence.strip() for sentence in sentences if len(atalaia.tokenize(sentence)) > min_cap and len(atalaia.tokenize(sentence)) < max_cap]

# keep unique
sentences = list(set(sentences))

# clean the sentences
clean_sentences = atalaia.preprocess_list(sentences,


The result:

>>> ['four storefront one pick packag fedex up dhl post offic',
 'first book uncanni valley memoir time tech industri publish januari',
 'implicit class thing blue collar worker margin go softwar develop ventur compani',
 'though reef focuss food prepar test case proof concept sort applic might make sens later futur time',
 'unlik neighborhood restaur ten year leas digit brand not necessarili angl timeless longev']
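The length filter and the de-duplication applied before cleaning can also be sketched without atalaia. The version below is an illustration under assumptions: it tokenizes on whitespace instead of calling `atalaia.tokenize`, and `filter_by_length` is a hypothetical helper name. It tokenizes each sentence only once and keeps the original order of the sentences, which `list(set(...))` does not guarantee.

```python
def filter_by_length(sentences, average, max_cap=70):
    """Keep unique sentences longer than 80% of the average and shorter than max_cap."""
    min_cap = int(average - average * 0.2)
    kept = []
    for sentence in sentences:
        n_tokens = len(sentence.split())  # tokenize once per sentence
        if min_cap < n_tokens < max_cap:
            kept.append(sentence.strip())
    # dict.fromkeys drops duplicates while preserving the first-seen order
    return list(dict.fromkeys(kept))


raw = [
    "By Anna",
    "one two three four five",
    "one two three four five",
    "a b c d e f g h i j k l",
]
print(filter_by_length(raw, average=5, max_cap=10))
```

Keeping the sentences in document order is not required by the summarizer, but it makes the intermediate results easier to inspect.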

Let’s replot and see the changes. Current info:

  • Shortest sentence has length of: 9.0.
  • Longest sentence has length of: 38.0.
  • Average sentence has length of: 17.
  • Percentiles: (9.0, 13.0, 17.0, 22.0, 38.0).

After replotting, we see that we have fewer outliers and that sentences such as the article date, “written by…”, etc. have disappeared. Now, we can see that words like “restaurant” and “delivery” are understood as being important in this corpus.

We can now use Gensim to summarize it.

Gensim’s summarizer takes the corpus as a single string. The documentation says that newlines are used to indicate sentence boundaries.

Our processed corpus has no punctuation, so we need a way to mark the ends of our sentences when we transform it into a long textual blob.

# add a newline to the end of the processed sentences
clean_sentences_punct = [s + '\n' for s in clean_sentences]

Now, we create a dictionary that will retain the relation “processed -> original” sentence. We will need a way, later, to map the processed sentences chosen by Gensim to the original ones.

# create a dict to hold the processed and the original sentences
processed_original_sentences = dict(zip(clean_sentences_punct, sentences))

# transform into full text
text = atalaia.create_corpus(processed_original_sentences.keys())

# get summary using gensim
# note: the gensim.summarization module was removed in Gensim 4.0, so this needs gensim < 4.0
gensim_summary = gensim.summarization.summarize(text, ratio=0.1, split=True)


I asked Gensim to return the most important 10% of the sentences as a list. Now, I have to map these sentences back to a form we can read.

# get original sentences
final_summary = [processed_original_sentences[s+'\n'] for s in gensim_summary]
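The round trip from processed sentences back to the original ones can be tried without Gensim. In the sketch below, `longest_line` is a hypothetical stand-in for the summarizer and `summarize_with_mapping` is an illustrative helper, so only the mapping logic mirrors the steps above.

```python
def summarize_with_mapping(originals, clean, pick):
    """Summarize the cleaned sentences and return the matching originals."""
    # relation "processed -> original"; if two sentences clean to the
    # same text, the later one overwrites the earlier one
    processed_to_original = dict(zip(clean, originals))
    # one processed sentence per line
    text = "\n".join(processed_to_original)
    chosen = pick(text)
    return [processed_to_original[s] for s in chosen]


def longest_line(text):
    """Stand-in summarizer: return the longest line as a one-item list."""
    return [max(text.split("\n"), key=len)]


originals = ["Ghost kitchens are growing!", "It rained."]
clean = ["ghost kitchen grow", "rain"]
print(summarize_with_mapping(originals, clean, longest_line))
# → ['Ghost kitchens are growing!']
```

One caveat of a dictionary keyed on the processed text is that two different original sentences that clean to the same string will collapse into a single entry.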


This is the summary you will get:

“I do see an opportunity for a restaurant like Souvla, that has an established brand, that has all of these physical assets, to reëmerge as this little collection of mini Greek-food factories that can produce and distribute the same dishes that people have been eating over the years.

Recently, the owners of DOSA, a fifteen-year-old upscale Indian restaurant with locations in Oakland and San Francisco, told Eater they planned to move to a virtual model: a central commissary kitchen will supply food to a network of twenty delivery-only kitchens, where it can be reheated and delivered.

In response to the coronavirus, the company has been working on converting some parking lots—four in Suffolk County, New York, as of this writing—into drive-through covid-19 testing sites, and dispatching couriers to deliver food to hospital workers in San Francisco and Dallas.

It has also launched a new initiative in Miami, Calling All Restaurants, to help existing brick-and-mortar locations set up delivery-only restaurants in Reef kitchens.

In February, when the New York City Council held an oversight hearing on the impact of ghost kitchens on local businesses, Matt Newberg, an entrepreneur and independent journalist, testified that he had visited a CloudKitchens commissary in Los Angeles where twenty-seven kitchens, occupying eleven thousand square feet, operated a hundred and fifteen restaurants on delivery platforms.

In recent years, Souvla, a rabidly popular chain of fast-casual Greek restaurants in San Francisco, saw thirty per cent of its business come from delivery.

CloudKitchens, the new venture run by Travis Kalanick’s City Storage Systems, buys real estate, brings in kitchen facilities, and leases them to chefs and small-business owners, most of whom do not have other brick-and-mortar spaces.

During recent protests against racial injustice and police brutality, restaurants across the country opened their doors to demonstrators, offering drinks, snacks, and restrooms; meanwhile, in cities with curfews, delivery apps continued to operate after-hours, placing their couriers at risk.

At the same time, most restaurants rely on a sort of performance, and the basic ghost-kitchen model has more or less existed for decades, deliberately—for years, Domino’s has operated kitchens for only takeout and delivery, including one up the street from the Reef trailer in the Mission—as well as inadvertently.

Frjtz, a San Francisco restaurant beloved for its Belgian-style fries, closed its brick-and-mortar location in the Mission in 2019, after nearly twenty years in business, and now operates—delivery only—via CloudKitchens.

Using data from in-app searches, Uber Eats identifies opportunities for certain cuisines in various neighborhoods, then approaches existing brick-and-mortar restaurateurs to pitch them the idea of launching a virtual restaurant.

Douglas, an upscale corner store and café in San Francisco’s Noe Valley neighborhood, has emerged as a sort of pickup window for a handful of higher-end restaurants, which offer meal kits and prepackaged food—a homegrown variation on Reef’s centralized vision, with the shop as a broker, rather than a venture-funded startup.

Good ways to have insights

Summarization can be used for a lot of things: getting new insights, analysing what competitors are doing, and creating intelligence reports.

What’s your creative, real-life way of using it?
