Big Data Query Example – How to Efficiently Query Massive Datasets With More Than 50 Million Rows

Big data query example: How does one effectively query a massive dataset containing tens of millions of rows? Can we leverage AWS to do this? – We queried a 14 GB data file with 56 million rows!

How to query datasets with more than 50 million rows

When presented with a relational database table or a delimited file, what is the first thing you do? Query it?

Fairly rhetorical, right? One does not have to be a data guru to know that data mining 101 starts with the ability to query a data set.

To understand the story behind the data, one has to query it.

To illustrate this, we used a publicly available healthcare dataset from healthdata.gov as a big data query example. We looked at the 2014 Medicaid/Medicare data set, which contains the list of medical providers in the United States along with attributes such as NPI (National Provider Identifier), name, address, number of beneficiaries, Medicare payments submitted, and so on. A plethora of information!

With massive data sets, we typically start with a sample.

We are all familiar with opening a comma-delimited data file (.csv) in Microsoft Excel, but Excel is limited to roughly a million rows, and even at that size the application slows down once you start filtering the data. This can be very frustrating.

In descriptive analytics, we try to make sense of something that has already happened, and we know precisely what data we want to retrieve. In this use case, for instance, we might be interested only in providers in Florida and would download just that subset. That would cut the data set down to roughly 1/50th of its size (assuming the data is distributed evenly across the 50 US states) and make it a bit more manageable.

But what happens if we are in the exploratory stage of our data journey?

We want to query the entire data set of providers across all US states, restricting the provider type to diagnostic radiology centers with at least 50 male beneficiaries over the age of 70 and an average beneficiary risk score of 2.0 or higher.

All we have is a 14 GB file with 56 million rows. There is no convenient pre-filtered subset already available that we can open in Excel. If we go a step further and load this into a MySQL database, loading the data into a MySQL table is the simple part; querying it is an entirely different ball game.

We are all guilty of firing off queries into oblivion and waiting eons for the results to return. Granted, indexes may help to a certain extent, but once the record count reaches several million they tend to be ineffective at improving query performance. Often, after ten minutes, the user gets frustrated and terminates the query.

So how does one explore such huge datasets?

The good news is there are multiple ways to do this, and we will take a look at some of these options:

Use Apache Spark and write a program that executes the above healthcare query. With the right-sized infrastructure, Spark will chew through the data and churn out the results very quickly. Great! But what happens if you want to change the risk score threshold to 3.0? You have to rerun the Spark job with the new value, assuming the program accepts the risk score as a parameter. So yes, we could explore the data this way if we know how to write and run Spark jobs. As a client user, though, what if I still want to stick with a more familiar and more flexible SQL interface so I can tweak the filter values or add more filters?
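For concreteness, here is a minimal PySpark sketch of the Spark-job approach described above. The S3 path and the column names (provider_type, male_bene_gt70, avg_bene_risk_score) are hypothetical placeholders, not the actual schema of the CMS file.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("provider-exploration").getOrCreate()

    # Read the full 56-million-row provider file; the S3 path and column names
    # below are placeholders, not the real CMS schema.
    providers = spark.read.csv("s3://my-exploration-bucket/providers/",
                               header=True, inferSchema=True)

    # The exploratory filter described above, parameterized on the risk score.
    min_risk = 2.0
    result = (
        providers
        .filter(F.col("provider_type") == "Diagnostic Radiology")
        .filter(F.col("male_bene_gt70") >= 50)
        .filter(F.col("avg_bene_risk_score") >= min_risk)
    )

    result.show(50, truncate=False)
    spark.stop()

Changing the risk-score threshold still means editing or resubmitting this job, which is exactly the inflexibility that motivates a SQL-based alternative. Enter our next option.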

Use Amazon Web Services’ S3 and Redshift. Loading the data into Amazon Redshift tables allows the end user to query them with PostgreSQL syntax from any DB client tool, such as DbVisualizer. If the tables are defined optimally, query responses come back in under a minute in most cases. The performance gains with this approach make exploratory data mining very feasible. At a very high level, the steps are as follows:

  • Zip the data files locally using gzip and upload them into an S3 folder. Smaller files are cheaper to store and faster to transfer, so make sure to gzip those huge files! Keep the data for each destination table in its own S3 folder.
  • Provision a Redshift cluster with SSD nodes instead of HDD. SSD is faster for this kind of workload but more expensive, so it is suggested that the cluster be alive only while performing exploratory data mining, to save on running costs.
  • Create the Redshift tables. The key to the performance magic is the SORTKEY specified at table creation. It is essential that the fields we want to filter on are included in the SORTKEY. Prefer INTERLEAVED over COMPOUND so that queries can use all or just some of the SORTKEY columns. Another parameter to pay attention to is DISTSTYLE, which controls how rows are distributed to the slices in each node. Smaller lookup tables that participate in JOINs with the larger tables can use ALL. For larger tables, however, ALL would require storage equivalent to the full table size on each node, leading to excessive cloud usage costs! In that case, use EVEN, or choose the KEY distribution style with a DISTKEY column so that rows are distributed across the nodes based on that key.
  • Once the tables are created, use the COPY command to load the data from the corresponding S3 folder into them. To give you an idea: on a dc1.large SSD cluster with one leader and one worker node, loading approximately 4 GB of .gz files into a Redshift table took around 18 minutes. In our case, we chose a DISTSTYLE of ALL for a table with nearly 56 million rows and an unzipped size of 14 GB, and, as mentioned earlier, we used INTERLEAVED SORTKEYs (Redshift allows a maximum of eight columns in an INTERLEAVED SORTKEY).
  • Write your query and execute it! Once the load is done, as long as the columns in the query’s WHERE clause are part of the table’s SORTKEY, results come back in seconds rather than minutes or hours. Queries can run against a single table or JOIN multiple tables. A sketch of the full create-load-query flow follows this list.
  • Repeat the above as needed. With this approach, revisiting the SORTKEYs simply means creating a new table and reloading it from the data already in S3. All of these steps can be executed by a DevOps engineer or a user trained to do just this, so they can successfully query their dataset.
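Below is a minimal sketch of the create-load-query flow described in the steps above, driven from Python with boto3 and psycopg2 (Redshift speaks the PostgreSQL wire protocol). The bucket, IAM role, cluster endpoint, table name, and column names are all hypothetical placeholders rather than the actual dataset schema.

    import os

    import boto3
    import psycopg2

    # Step 1: upload the gzipped provider file into an S3 folder
    # (one folder per destination table); names are placeholders.
    s3 = boto3.client("s3")
    s3.upload_file("providers_2014.csv.gz", "my-exploration-bucket",
                   "providers/providers_2014.csv.gz")

    # Step 2: connect to the Redshift cluster; endpoint and credentials are placeholders.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="dev", user="admin",
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    conn.autocommit = True
    cur = conn.cursor()

    # Step 3: create the table; the columns we filter on go into an INTERLEAVED SORTKEY.
    cur.execute("""
        CREATE TABLE provider_utilization (
            npi                 BIGINT,
            provider_name       VARCHAR(200),
            provider_state      CHAR(2),
            provider_type       VARCHAR(100),
            male_bene_gt70      INTEGER,
            avg_bene_risk_score DECIMAL(4,2),
            medicare_payments   DECIMAL(12,2)
        )
        DISTSTYLE EVEN
        INTERLEAVED SORTKEY (provider_type, male_bene_gt70, avg_bene_risk_score);
    """)

    # Step 4: bulk-load the gzipped files straight from the S3 folder with COPY.
    cur.execute("""
        COPY provider_utilization
        FROM 's3://my-exploration-bucket/providers/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
        CSV GZIP IGNOREHEADER 1;
    """)

    # Step 5: the exploratory query; every WHERE column is part of the sort key.
    cur.execute("""
        SELECT npi, provider_name, provider_state, avg_bene_risk_score
        FROM provider_utilization
        WHERE provider_type = 'Diagnostic Radiology'
          AND male_bene_gt70 >= 50
          AND avg_bene_risk_score >= 2.0
        ORDER BY avg_bene_risk_score DESC;
    """)
    for row in cur.fetchmany(50):
        print(row)

    cur.close()
    conn.close()

Tweaking a filter value is now just a matter of editing the SELECT statement in a SQL client and rerunning it, with no job resubmission required.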

The above example illustrates just one way of achieving the fast query performance that is imperative for exploratory data mining. Furthermore, once the data is in Redshift and the queries are tuned this way, it does not take much to expose the data via a REST API for custom applications or visualization tools to consume, or even to run Spark jobs against the already loaded data for other processing needs.
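As an illustration of that last point, here is a minimal sketch of a REST endpoint over the tuned table, reusing the hypothetical connection details and schema from the sketch above. It uses Flask and is not meant as a production-ready service.

    import os

    import psycopg2
    from flask import Flask, jsonify, request

    app = Flask(__name__)


    def get_conn():
        # Redshift accepts standard PostgreSQL connections; details are placeholders.
        return psycopg2.connect(
            host=os.environ["REDSHIFT_HOST"], port=5439, dbname="dev",
            user="admin", password=os.environ["REDSHIFT_PASSWORD"],
        )


    @app.route("/providers")
    def providers():
        # Let the caller tweak the risk-score threshold, e.g. /providers?min_risk=3.0
        min_risk = float(request.args.get("min_risk", 2.0))
        with get_conn() as conn, conn.cursor() as cur:
            cur.execute(
                """
                SELECT npi, provider_name, provider_state,
                       avg_bene_risk_score::FLOAT AS avg_bene_risk_score
                FROM provider_utilization
                WHERE provider_type = 'Diagnostic Radiology'
                  AND male_bene_gt70 >= 50
                  AND avg_bene_risk_score >= %s
                """,
                (min_risk,),
            )
            rows = cur.fetchall()
        return jsonify({"providers": rows})


    if __name__ == "__main__":
        app.run(port=8080)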

Read more at http://docs.aws.amazon.com/redshift/latest/dg/welcome.html

Data used for this use case is available at http://www.healthdata.gov and https://www.cms.gov