Build Dummy Data with Relationships

We’ve all been there. You’re building a Rails app that has a large dataset, and you plan for the app to run analytics on it. If you’re following Test-Driven Development (TDD), you’re going to need to test your analytics methods. The problem is that your dataset is too large to load in your test suite. You could always copy the first 500 or so lines into a test fixture, but that tends to produce weird results. What you really need is a little more control over what that test dataset looks like. Let’s start with an example:

I’m building an app to analyze data from a bike sharing network. I have a network of Stations that all have a name and a dock_count, and each Station has many Trips, either as the start_station or as the end_station. Trips have a duration, start_station_id, and end_station_id.

Station attributes: name, dock_count
Trip attributes: duration, start_station_id, end_station_id

I know that I want each station to have some trips associated with it, but how can we get that set up? First, let’s start with building some dummy stations in Ruby. I could create 5 stations doing this:

Station.create(name: "station 1", dock_count: 5)
Station.create(name: "station 2", dock_count: 6)
Station.create(name: "station 3", dock_count: 7)
Station.create(name: "station 4", dock_count: 8)
Station.create(name: "station 5", dock_count: 9)

Or, I could build them with a times loop and save myself some repetition:

5.times do |time|
  # time runs from 0 to 4, so add 1 to get names "station 1" through "station 5"
  Station.create(name: "station #{time + 1}", dock_count: time + 5)
end

Cool, now I have my stations covered. I will always have 5 stations and they will be consistently named (I’m assuming that the database is being cleaned between runs, so the IDs are always 1–5). Next, let’s move on to building out trips for these stations. We should probably go over what analytics I want to pull from these. To keep this simple, I want to calculate:

average count of trips started at each station
average count of trips ended at each station
average duration of the trips
standard deviation of the duration of the trips.
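
To make those four numbers concrete, here is a minimal plain-Ruby sketch of how they could be computed. It is not the app’s actual analytics code: DemoStation and DemoTrip are hypothetical Struct stand-ins for the models (no database involved), the helper names are made up for illustration, and the standard deviation shown is the population version (dividing by N), which is an assumption.

```ruby
# Hypothetical stand-ins for the Station and Trip models (no database needed).
DemoStation = Struct.new(:id, :name, :dock_count)
DemoTrip = Struct.new(:duration, :start_station_id, :end_station_id)

# Arithmetic mean of a list of numbers.
def mean(values)
  values.sum.to_f / values.size
end

# Population standard deviation (divides by N, not N - 1).
def standard_deviation(values)
  avg = mean(values)
  Math.sqrt(values.sum { |v| (v - avg)**2 } / values.size)
end

# Average number of trips per station, counted by the given station-id
# attribute (:start_station_id or :end_station_id). Stations with no
# trips count as zero.
def average_trips_per_station(stations, trips, attribute)
  counts = Hash.new(0)
  trips.each { |trip| counts[trip[attribute]] += 1 }
  mean(stations.map { |station| counts[station.id] })
end

stations = [DemoStation.new(1, "station 1", 5), DemoStation.new(2, "station 2", 6)]
trips = [
  DemoTrip.new(50, 1, 2),
  DemoTrip.new(60, 1, 1),
  DemoTrip.new(70, 2, 1),
  DemoTrip.new(60, 1, 2)
]

puts average_trips_per_station(stations, trips, :start_station_id) # 2.0
puts average_trips_per_station(stations, trips, :end_station_id)   # 2.0
puts mean(trips.map(&:duration))                                   # 60.0
puts standard_deviation(trips.map(&:duration)).round(2)            # 7.07
```

Both averages come out to 2.0 here because 4 trips spread over 2 stations always average 2 per station, however the trips are assigned.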

OK, so how do we do it? I want to dictate what the average value is, then build out a dataset that reflects that average. Let’s say that each station should average 5 trips. Average trips per station is simply:

total_trips / total_stations = average_trips_per_station

And, therefore

total_trips = average_trips_per_station * total_stations
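
As a quick sanity check on that arithmetic, the sketch below assigns trip starts to stations at random and confirms that the per-station average depends only on the totals. It is plain Ruby with no models involved; the method name average_starts_per_station is hypothetical, and the station IDs 1 through 5 are assumed.

```ruby
# Randomly assigns `total_trips` trip starts to the given station IDs and
# returns the resulting average number of starts per station.
def average_starts_per_station(station_ids, total_trips, rng = Random.new)
  starts = Array.new(total_trips) { station_ids.sample(random: rng) }
  counts = station_ids.map { |id| starts.count(id) }
  counts.sum.to_f / station_ids.size
end

station_ids = (1..5).to_a
total_trips = 5 * station_ids.size # average_trips_per_station * total_stations

3.times do
  puts average_starts_per_station(station_ids, total_trips) # always 5.0
end
```

Individual stations may end up with 2 trips or 9, but the counts always sum to 25, so the average stays at 5.0.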

No matter what, if we have 5 stations and want to average 5 trips per station, we need to build 25 trips; which stations those trips belong to doesn’t matter. Building it out, we can use another loop:

25.times do
  # Trips reference stations through start_station_id and end_station_id
  Trip.create(duration: 1, start_station_id: rand(1..5), end_station_id: rand(1..5))
end

Why did I use rand(1..5) there? Because I know that I only have stations with IDs of 1–5 and I don’t really care which ones get assigned to each trip. This will work just fine, but do we really want our duration to always be 1? It is a bit boring and I think we can do something a bit fancier here. What if we set up our trips so that the duration follows a normal distribution (bell curve)? A little googling turned up this handy formula (it is the Box-Muller transform), which builds an array of a specified length (desired_count) with a given average (avg) and standard deviation (stdev):

Array.new(desired_count) {avg + stdev * Math.sqrt(-2 * Math.log(rand)) * Math.cos(2 * Math::PI * rand)}
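
For readability, the same formula can be wrapped in a small method. This is only a sketch: the name normal_samples and the optional rng parameter are made up for illustration. It also shifts the first random number from [0, 1) to (0, 1], since rand can return exactly 0.0 and Math.log(0) is -Infinity.

```ruby
# Draws `count` normally distributed values with the given average and
# standard deviation, using the Box-Muller transform.
# `rng` is any object that responds to #rand (hypothetical parameter).
def normal_samples(count, avg, stdev, rng = Random.new)
  Array.new(count) do
    u1 = 1.0 - rng.rand # shift [0, 1) to (0, 1] so Math.log never sees 0
    u2 = rng.rand
    avg + stdev * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math::PI * u2)
  end
end

durations = normal_samples(25, 60, 5)
puts durations.size             # 25
puts durations.first(3).inspect # three floats scattered around 60
```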

How do we use this in our data generator? Simple, we need to build data for 25 trips and let’s say our desired average duration is 60 with a standard deviation of 5. The code will look like this:

trip_durations = Array.new(25) {60 + 5 * Math.sqrt(-2 * Math.log(rand)) * Math.cos(2 * Math::PI * rand)}

Plugging this into IRB and using the descriptive_statistics gem, which gives me the #mean and #standard_deviation methods, I get the result below:

That is pretty close. Since the dataset is being built using the rand function, the numbers will be a little off for a small dataset like this. I imagine that as the array gets larger, the mean and standard deviation get closer to what we want. Of course, this level of variation isn’t acceptable if the data is regenerated randomly at the beginning of every test, so we would probably want to run the code in IRB.
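
One possible way to keep the numbers repeatable, offered here as an assumption rather than as part of the workflow described above, is to seed the random number generator so the same durations come out on every run. In this sketch, SEED, seeded_durations, and the two statistics helpers are hypothetical names, and the standard deviation is the population version.

```ruby
SEED = 42 # arbitrary, hypothetical seed; any fixed integer works

# Same Box-Muller formula as above, but driven by a seeded generator so
# the output is identical on every run.
def seeded_durations(count, avg, stdev, seed)
  rng = Random.new(seed)
  Array.new(count) do
    avg + stdev * Math.sqrt(-2 * Math.log(1.0 - rng.rand)) * Math.cos(2 * Math::PI * rng.rand)
  end
end

# Arithmetic mean of a list of numbers.
def sample_mean(values)
  values.sum / values.size
end

# Population standard deviation (divides by N).
def sample_stdev(values)
  avg = sample_mean(values)
  Math.sqrt(values.sum { |v| (v - avg)**2 } / values.size)
end

first_run = seeded_durations(25, 60, 5, SEED)
second_run = seeded_durations(25, 60, 5, SEED)

puts first_run == second_run # true
puts sample_mean(first_run).round(2)
puts sample_stdev(first_run).round(2)
```

With a fixed seed, the mean and standard deviation are still only approximately 60 and 5, but they are the same approximate values every time, so a test can assert against them.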

Anyway, we can implement it like this:

5.times do |time|
  # time runs from 0 to 4, so add 1 to get names "Station 1" through "Station 5"
  Station.create(name: "Station #{time + 1}", dock_count: time + 5)
end

trip_durations = Array.new(25) {60 + 5 * Math.sqrt(-2 * Math.log(rand)) * Math.cos(2 * Math::PI * rand)}

25.times do
  # pop removes one pre-generated duration per trip, so each value is used once
  Trip.create(duration: trip_durations.pop, start_station_id: rand(1..5), end_station_id: rand(1..5))
end

That’s it! Now I’ve got a set of code that will build 5 stations and 25 trips. Plus, I know exactly what I’m getting (for the most part)!