How to Use Data Science to Find the Best Seat in the Cinema (Part I)

Written by noel-mathew-isaac | Published 2020/08/05
Tech Story Tags: data-visualization | web-scraping | selenium | python | movies | json | coding | data-science | web-monetization

TLDR We created PopcornData — a website to get a glimpse of Singapore’s Movie trends — by scraping data. The data was collected from Shaw Theatres, one of the biggest cinema chains in Singapore. We used Selenium, a web browser automation tool that first renders the website with its dynamic content before getting the HTML. We then collected the seat data for every available movie session on a given day, combined it with the session data, and stored it in a MongoDB database.

From the most popular seats to the most popular viewing times, we wanted to find out more about the movie trends in Singapore. So we created PopcornData — a website to get a glimpse of Singapore’s Movie trends — by scraping data, finding interesting insights, and visualizing them.
Ever felt the crippling disappointment of finding out your favorite seat at the theater has been booked?
How popular really is your favorite seat?
On the website, you can see how people watched different movies at different halls, theaters, and timings! Some unique aspects include heat maps showing the most popular seats and animations showing the order in which seats were bought. This 2-part article explains how we obtained the data for the website and our analysis of the data.

Scraping the Data

To implement our idea, the first and maybe even most crucial step was to collect the data. We decided to scrape the website of Shaw Theatres, one of the biggest cinema chains in Singapore.
Starting out with basic knowledge of scraping in Python, we initially tried using Python’s requests library to get the site’s HTML and the BeautifulSoup library to parse it, but quickly realized that the data we required was not present in the HTML we requested. This was because the website was dynamic — it requests the data from an external source using JavaScript and renders the HTML dynamically. When we requested the HTML directly, the dynamic part of the website was never rendered, hence the missing data.
To fix this issue, we used Selenium — a web browser automation tool that can first render the website with the dynamic content before getting the HTML.
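As a rough illustration, the render-then-parse flow looks something like this (a minimal sketch; the URL and the crude wait are simplifications for illustration, not our production scraper):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

# Launch a real browser so the page's JavaScript can run and render the dynamic content
driver = webdriver.Chrome()
driver.get("https://www.shaw.sg")   # hypothetical entry point, not the exact page we scraped
time.sleep(5)                       # crude wait for the dynamic content to finish loading

# Grab the fully rendered HTML and hand it to BeautifulSoup for parsing
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title.text)

driver.quit()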
Issues With Selenium
Getting the Selenium driver to work and fixing its minor issues involved a steep learning curve. After countless StackOverflow searches and ‘giving up’ multiple times, we managed to scrape through (pun intended) and get it to work.
The main issues we faced were:
  1. Scrolling to a specific portion of the page to click a button so that the data would be loaded into the HTML
  2. Figuring out how to run headless Selenium on the cloud.
  3. After deploying the script on Heroku, some of the data was not being scraped, even though the script worked properly on the local machine. After racking our brains, we figured out that some pages loaded by Selenium were defaulting to the mobile version of the page. We fixed it by explicitly setting the screen size (see the sketch after this list).
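For reference, a rough sketch of how the headless and screen-size fixes fit together (the Chrome flags are standard; the button selector is a hypothetical placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")               # run without a display, e.g. on Heroku
options.add_argument("--window-size=1920,1080")  # force the desktop layout instead of the mobile page

driver = webdriver.Chrome(options=options)
driver.get("https://www.shaw.sg")                # hypothetical entry point

# Scroll the button into view before clicking so the showtimes data is loaded into the HTML
button = driver.find_element(By.CSS_SELECTOR, ".showtimes-button")  # placeholder selector
driver.execute_script("arguments[0].scrollIntoView();", button)
button.click()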
With Selenium and BeautifulSoup, we were finally able to get the data for all the available movie sessions for a particular day!
Sample movie session data:
{
   "theatre":"Nex",
   "hall":"nex Hall 5",
   "movie":"Jumanji: The Next Level",
   "date":"18 Jan 2020",
   "time":"1:00 PM+",
   "session_code":"P00000000000000000200104"
}
We were halfway there! Now we needed to collect the seat data for each movie slot to see which seats were occupied and when they were bought. After going through the Network tab of the website in the Developer Tools, we found that the seat data was being requested from Shaw’s API.
The data could be obtained by requesting the URL
https://www.shaw.sg/api/SeatingStatuses?recordcode=<session_code> 
where session_code is the unique code for each movie session, which we had scraped earlier.
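In code, fetching a session’s seat data could look roughly like this (a sketch assuming the endpoint returns JSON directly; error handling omitted):

import requests

session_code = "P00000000000000000200104"   # scraped earlier for each movie session
url = f"https://www.shaw.sg/api/SeatingStatuses?recordcode={session_code}"

response = requests.get(url)
seats = response.json()                     # list of seat objects for this session
print(len(seats), "seats returned")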
The data we got was in JSON format. We parsed it and reordered the seats in ascending order of seat buy time to obtain an array of JSON objects, where each object contained information about a single seat in the movie hall, including seat_number, seat_buy_time, and seat_status.
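Reordering the parsed seats by purchase time is a one-liner (a sketch; the field names follow the sample data below):

# Sort seats in ascending order of buy time; unsold seats carry the
# placeholder "1900-01-01T00:00:00" and therefore sort to the front
seats_sorted = sorted(seats, key=lambda seat: seat["seat_buy_time"])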

Sample seat data:

[
  {   
     "seat_status":"AV",
     "last_update_time":"2020-01-20 14:34:53.704117",
     "seat_buy_time":"1900-01-01T00:00:00",
     "seat_number":"I15",
     "seat_sold_by":""
  },
   ...,
  {  
     "seat_status":"SO",
     "last_update_time":"2020-01-20 14:34:53.705116",
     "seat_buy_time":"2020-01-18T13:12:34.193",
     "seat_number":"F6",
     "seat_sold_by":""
  }
]
  • seat_number: Unique identifier for a seat in a hall
  • seat_status: Indicates the availability of a seat (SO: seat occupied, AV: available)
  • seat_buy_time: Time the seat was purchased by the customer
  • last_update_time: Time the seat data was last scraped
Halls have anywhere between 28 and 502 seats, and each seat corresponds to a JSON object in the array. Add to this the fact that there are upwards of 350 movie sessions in a single day, and the amount of data generated is pretty big: storing data for a single day took about 10 MB. The movie session data was combined with the seat data and stored in a MongoDB database.
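Storing one combined document with pymongo might look something like this (a sketch; the connection string, database, and collection names are assumptions):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
collection = client["popcorn_data"]["sessions"]     # hypothetical database/collection names

# Embed the seat array (sorted by buy time, as above) inside the movie session document
session = {
    "theatre": "Nex",
    "hall": "nex Hall 5",
    "movie": "Jumanji: The Next Level",
    "date": "18 Jan 2020",
    "time": "1:00 PM+",
    "session_code": "P00000000000000000200104",
    "seats": seats_sorted,
}
collection.insert_one(session)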
We managed to scrape all the movie data from Shaw for January 2020.

A single document in the database

{
   "theatre":"Nex",
   "hall":"nex Hall 5",
   "movie":"Jumanji: The Next Level",
   "date":"18 Jan 2020",
   "time":"1:00 PM+",
   "session_code":"P00000000000000000200104"
   "seats":[
   {   
     "seat_status":"AV",
     "last_update_time":"2020-01-20 14:34:53.704117",
     "seat_buy_time":"1900-01-01T00:00:00",
     "seat_number":"I15",
     "seat_sold_by":""
   },
   ...,
   {  
     "seat_status":"SO",
     "last_update_time":"2020-01-20 14:34:53.705116",
     "seat_buy_time":"2020-01-18T13:12:34.193",
     "seat_number":"F6",
     "seat_sold_by":""
   }
 ]
}
To view the full document click here.
The complete raw data collected can be downloaded here:

Time to Get Our Hands Dirty

It was now time to get our hands dirty by cleaning the data and pulling out relevant information. Using pandas, we parsed the JSON, cleaned it, and built a DataFrame with the data to improve readability and make it easy to filter.
Since the seat data took up a lot of memory, we could not include all of it in the DataFrame. Instead, we aggregated the seat data using Python to obtain the following (a sketch of these aggregations appears after the list):
1. Total Seats: Total number of seats available for a movie session
2. Sold Seats: Number of seats sold for a movie session
3. Seat Buy Order: A 2-dimensional array showing the order in which seats were bought
[['A_10', 'A_11'], ['A_12'], ['B_4', 'B_7', 'B_6', 'B_5'], ['C_8', 'C_10', 'C_9'], ['B_1', 'B_2'], ['C_6', 'C_7'], ['C_5', 'C_4'], ['B_8', 'B_10', 'B_9'], ['D_8'], ['A_15', 'A_14', 'A_13']]
Each element in the array represents the seats bought at the same time and the order of elements represents the order in which the seats were purchased.
4. Seat Distribution: Dictionary showing the number of seats that were bought together (in groups of 1, 2, 3 or more)
{
   'Groups of 1': 8,
   'Groups of 2': 30,
   'Groups of 3': 9,
   'Groups of 4': 3,
   'Groups of 5': 1
}
5. Seat Frequency: Dictionary showing the number of times each seat in a hall was bought over the course of the month
{'E_7': 4, 'E_6': 5, 'E_4': 11, 'E_5': 9, 'E_2': 2, 'E_1': 2, 'E_3': 7, 'D_7': 15, 'D_6': 17, 'C_1': 33, 'D_2': 15, 'D_1': 14, 'B_H2': 0, 'B_H1': 0, 'D_4': 45, 'D_5': 36, 'D_3': 32, 'C_3': 95, 'C_4': 94, 'A_2': 70, 'A_1': 70, 'B_2': 50, 'B_1': 47, 'C_2': 37, 'C_6': 53, 'C_5': 61, 'B_4': 35, 'B_3': 40}
6. Rate of Buying: Two dictionaries keyed by movie: the first gives the time left before the showing (in days) at each purchase, and the second gives the corresponding cumulative number of tickets bought
{"1917": [4.1084606481481485..., 2.566423611111111, 2.245578703703704, 2.0319560185185184, 1.9269907407407407, 1.8979513888888888....],
...}
{"1917": [1, 3, 8, 10, 11, ...],
...}
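A rough sketch of how the per-session aggregates above can be computed from one session’s seat array (the function and variable names are ours for illustration, not the exact code):

from collections import Counter
from itertools import groupby

def aggregate_session(seats):
    sold = [s for s in seats if s["seat_status"] == "SO"]

    # 1 & 2: total seats in the hall and number of seats sold
    total_seats = len(seats)
    sold_seats = len(sold)

    # 3: seat buy order - group seats that share the same buy timestamp,
    # in ascending order of purchase time
    sold.sort(key=lambda s: s["seat_buy_time"])
    seat_buy_order = [
        [s["seat_number"] for s in group]
        for _, group in groupby(sold, key=lambda s: s["seat_buy_time"])
    ]

    # 4: seat distribution - how many purchases were made in groups of 1, 2, 3, ...
    seat_distribution = dict(Counter(f"Groups of {len(group)}" for group in seat_buy_order))

    return total_seats, sold_seats, seat_buy_order, seat_distribution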
The cleaned data can be viewed here:

Finally we were done! (with 20% of the work)

With our data scraped and cleaned, we could now get to the fun part — analyzing the data to find patterns amongst the popcorn litter.
In Part II of this article, we analyze the data, visualize it, and build a website for our findings. Check it out to learn about our analysis and the interesting patterns we found.
The code for the scraper can be found in this GitHub repo.
Don’t forget to check out our website at http://popcorn-data.herokuapp.com!
This article was written by Noel Mathew Isaac and Vanshiqa Agrawal

Published by HackerNoon on 2020/08/05