Big Data Processing & Insights – Cross-Querying of Unstructured with Structured Data

In typical big data processing environments, it is common to combine a diverse set of data sources, whether CSV files, JSON documents, or relational database tables, into a single source of truth that can be mined for insights.

Processing Big Data using Apache Drill

Often, we have a diverse set of data sources: say, a CSV or JSON document containing data related to domain entities that live in relational databases.

In such cases, a bit of preliminary analysis is useful before moving all of the data to a shared location as a precursor to processing. Can we run queries against these same data sources without transforming their current state or changing their location?

For instance, in standard big data processing environments, we would export the data to HDFS, then write and run Spark or Hadoop MapReduce jobs. What if there were a way to keep the .csv file where it is and the domain data in its relational or NoSQL database, and run cross queries for some preliminary analysis?

What we need in this scenario is a flexible SQL interface for cross-querying unstructured and structured data!

Apache Drill is an open source query engine that allows an end user familiar with plain SQL to cross-query data sets held in different types of data sources. To illustrate this, we will use publicly available healthcare and census data sets. We will look at the 2014 Medicaid/Medicare data set, which contains the list of medical providers in the United States along with attributes such as the NPI (National Provider Identifier), name, address, number of beneficiaries, Medicare payments submitted, and so on. In this example, we want to analyze the relationship between providers and census demographic data. Say we want to understand how Medicare provider locations are distributed in neighborhoods with a relatively older population.

In this case, assume we already have a relational database table with the provider data in it. What we really want is to join the data in the database table with the CSV file downloaded from the census website on the zip code, a field common to both the database table and the CSV file.

Using Apache Drill for interactive, ad hoc cross-querying of data sources

With Apache Drill, each data source is specified using a “storage plugin”. Storage plugins come in several types: RDBMS (relational databases), file systems (including HDFS and NFS, that is, Hadoop or network file systems), Amazon S3, the MongoDB NoSQL datastore, and so on. The data files themselves can be delimited (CSV, PSV, TSV), JSON, Parquet, and so on. The list of supported storage plugins and formats is documented on the official Drill website. For database plugins, the JDBC driver jar needs to be copied into the third-party jars folder within the Drill installation.

  • Download and launch the Drill process – Once Apache Drill is downloaded and set up in local (embedded) mode, we can start it interactively (for example, via the bin/drill-embedded script) and run queries directly through a SQL line editor in a terminal window. In distributed mode, we would use ZooKeeper, with “drillbit” daemons running on each node in the cluster. Either option gives us access to the Drill configuration UI, where we can configure the storage plugins for our use, run queries, explain query plans, and view details such as query history.
  • Configure the storage plugins – Coming back to our use case, in the Drill configuration UI we specify our database connection details by creating a new RDBMS storage plugin, based on the template for relational database (JDBC) plugins. Let us name it “mydb”. A file system plugin will be specified for the census data, which is in a comma-delimited (.csv) file format. Let us name it “myfs”. As soon as these two storage plugins are saved, we can start running queries that join the data from these plugins (see the configuration sketches after this list).
  • Start “drilling” aka begin querying – Assume “mydb” points to a MySQL database with various schemas. The schema we are interested in is called “medicare”, and the table we need is called “providers”. In the case of the “myfs” file system plugin, assume it points to my local machine’s “Downloads” folder, “/Users/gramaswamy/Downloads”. This folder contains the “census.csv” file from the census website and a “zipcodes.csv” file listing all US zip codes along with their latitude and longitude details. To complete the file system plugin configuration, I create a ‘workspace’ entry named ‘downloads’ within this configuration that points to the “/Users/gramaswamy/Downloads” local folder.
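For reference, the two plugin configurations might look roughly like the following. This is a minimal sketch: the MySQL host, port, and credentials are placeholders, and the exact fields depend on your environment and Drill version. The “mydb” JDBC plugin:

    {
      "type": "jdbc",
      "driver": "com.mysql.jdbc.Driver",
      "url": "jdbc:mysql://localhost:3306",
      "username": "medicare_user",
      "password": "***",
      "enabled": true
    }

And the “myfs” file system plugin, with the ‘downloads’ workspace and a csv format:

    {
      "type": "file",
      "connection": "file:///",
      "workspaces": {
        "downloads": {
          "location": "/Users/gramaswamy/Downloads",
          "writable": false
        }
      },
      "formats": {
        "csv": {
          "type": "text",
          "extensions": ["csv"],
          "delimiter": ","
        }
      }
    }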

A sample query joining the data from these two plugins would look like the following:

    SELECT a.npi, a.firstname, a.lastname, a.address, a.city, a.state, a.zip,
           b.columns[4] AS medianage, c.columns[1] AS latitude, c.columns[2] AS longitude
    FROM mydb.medicare.providers a, myfs.downloads.`census.csv` b, myfs.downloads.`zipcodes.csv` c
    WHERE a.zip = b.columns[3] AND b.columns[4] > 45 AND b.columns[3] = c.columns[0]
    ORDER BY a.zip;

In the above query, in the census.csv file, the 4th column, denoted by columns[3], holds the zip code, and the 5th column, denoted by columns[4], holds the median age within that zip code as recorded by the Census Bureau. For the zipcodes.csv file, the first three columns contain the data we are interested in: a zip code, its latitude, and its longitude. That is, for myfs.downloads.`zipcodes.csv`, columns[0] contains the zip code, columns[1] the latitude, and columns[2] the longitude values.
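One caveat worth noting: Drill reads delimited text columns as VARCHAR, so the median-age comparison above relies on implicit casting. An explicit cast makes the filter unambiguous; as a sketch, the WHERE clause could be written as:

    WHERE a.zip = b.columns[3]
      AND CAST(b.columns[4] AS DOUBLE) > 45
      AND b.columns[3] = c.columns[0]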

The results of the above can be plotted on a browser-based map (using the leaflet.js JavaScript library, for example) to depict provider locations for various median age ranges. This lets us see visually how provider locations are spread across demographics with older populations.
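As one illustration, here is a minimal sketch of such a plot using folium, a Python wrapper around Leaflet. It assumes, purely for the example, that the query results have been exported to a providers.csv file containing the latitude, longitude, and medianage columns from the query above.

    import csv
    import folium

    # Center the map on the continental US
    m = folium.Map(location=[39.8, -98.6], zoom_start=4)

    with open("providers.csv") as f:
        for row in csv.DictReader(f):
            # One small marker per provider location, labeled with the median age
            folium.CircleMarker(
                location=[float(row["latitude"]), float(row["longitude"])],
                radius=3,
                popup="median age: {}".format(row["medianage"]),
            ).add_to(m)

    m.save("providers_map.html")  # open in a browser to view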

Another bonus? A built-in REST interface

On top of the flexibility of interactively cross-querying different types and formats of data sources, another bonus of Drill is its built-in REST interface, which lets you expose the sqlline shell through your own custom-built ad hoc querying dashboards.

In other words, once the Drill processes are running, a REST API is available by default at http://<ip_address>:8047/query.json. It accepts POST requests whose JSON body contains the SQL query string, and it returns the “sqlline” query results in JSON format.
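For example, here is a minimal sketch of calling this API from Python with the requests library, assuming a drillbit running locally on the default port:

    import requests

    payload = {
        "queryType": "SQL",
        "query": "SELECT npi, firstname, lastname FROM mydb.medicare.providers LIMIT 10",
    }
    # POST the query to the built-in REST endpoint; results come back as JSON
    resp = requests.post("http://localhost:8047/query.json", json=payload)
    print(resp.json())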

Given the above, we can easily create a custom search UI form that lets an end user specify the criteria for ad hoc queries. When the user submits the form, the server-side code only needs to build the Drill query from the input criteria and forward the query string to the built-in REST API, which runs the query and returns the results in JSON format to its caller.
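A minimal sketch of such server-side code follows, assuming a Flask backend and a hypothetical min_age form field; both are illustrative choices, not part of Drill.

    from flask import Flask, jsonify, request
    import requests

    app = Flask(__name__)
    DRILL_URL = "http://localhost:8047/query.json"  # assumes a local drillbit

    @app.route("/providers/search")
    def search_providers():
        # Hypothetical form field; int() ensures only a number reaches the query string
        min_age = int(request.args.get("min_age", 45))
        query = (
            "SELECT a.npi, a.firstname, a.lastname, a.zip, b.columns[4] AS medianage "
            "FROM mydb.medicare.providers a, myfs.downloads.`census.csv` b "
            "WHERE a.zip = b.columns[3] AND CAST(b.columns[4] AS DOUBLE) > {}".format(min_age)
        )
        # Forward the formulated query to Drill's built-in REST API
        resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": query})
        return jsonify(resp.json())

    if __name__ == "__main__":
        app.run(port=5000)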

This is a big advantage for the development team, as they do not have to bother with creating custom or new REST APIs to expose the data from the underlying data sources.

Read more at https://drill.apache.org/docs/

Data for this use case is available at http://www.healthdata.gov and http://www.census.gov/