Can Machine Learning Be Used For Preventive Healthcare?

Can the health industry use machine learning for preventive healthcare, for example to predict the outcome of a given health problem such as the incidence of heart disease in a patient?

Machine learning used for preventive healthcare

This article discusses how machine learning can be used in healthcare to predict the likely outcome of a patient’s current symptoms, so that medical practitioners can provide preventive care.

Let’s assume we want to assess whether a specific patient has heart disease and, if so, how severe it is.

Before we begin, let us first assess why we would even need machine learning in this case.

In the medical domain, as in others, we know that some outcomes depend on whether a single influencing factor is above or below a threshold. For instance, for diabetes, if the fasting glucose level is > 110 mg/dl, then the patient is flagged as “diabetic”. Here, the glucose level is the one factor that most strongly influences the outcome that the patient has diabetes.
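A single-factor rule of this kind is easy to write down directly. The sketch below is a minimal illustration that uses the 110 mg/dl cutoff from the example above; the function name and return labels are hypothetical, and the cutoff is illustrative rather than clinical guidance.

```python
# Illustrative cutoff taken from the example in the text (not clinical guidance).
FASTING_GLUCOSE_CUTOFF_MG_DL = 110


def flag_by_fasting_glucose(glucose_mg_dl: float) -> str:
    """Return a label by comparing one factor against one threshold."""
    if glucose_mg_dl > FASTING_GLUCOSE_CUTOFF_MG_DL:
        return "diabetic"
    return "not flagged"
```

With only one factor, no learning is needed: the rule is the model.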

But what happens when a disease has not just one or two factors influencing the outcome, but a dozen or so? Before machine learning was applied to this problem, medical professionals would draw up a flow chart or a decision tree. This approach has two caveats:

  1. It assumes the medical professional is knowledgeable enough about the problem domain to construct the best possible tree.
  2. As the saying goes, “change is the only constant in life”. As new data is collected, they have to check that the decision tree still computes the outcome accurately, so manual tweaks are required to keep the model in top shape!

Now, what if the healthcare provider observes a newly emerged condition, and there is no textbook knowledge about what exactly contributes to the symptoms? This is common in medical research, where practitioners face health conditions that have not been seen before or study how complicated conditions can be cured.

In these cases, the patient is often required to take several medical tests, which produce some 30 to 50 test result values, and it is up to the medical provider to figure out what caused the condition. The point is that the practitioner does not yet know the answers needed to construct a decision tree.

It is in situations like these that machine learning plays an instrumental role.

Machine learning can analyze a data set and construct a model (e.g., a decision tree) that can be used to compute the outcome.

In our example above, the machine learning algorithm can be fed the data for 50 patients, with some 30 to 50 test result values for each patient. Using math and computation behind the scenes, the algorithm constructs the decision tree that best fits this data and persists it as a model. This is referred to as “training the model”.
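To make “training” concrete, here is a minimal sketch of the simplest possible tree: a single split (a “decision stump”) learned from data. It is a toy under stated assumptions, not the implementation used in this article. The function names and the tiny data set are made up for illustration, and a real decision tree algorithm repeats this kind of search recursively on each branch.

```python
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels):
    """Try every (feature, threshold) pair and keep the split that
    misclassifies the fewest training rows."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            left_label, right_label = majority(left), majority(right)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best["errors"]:
                best = {"feature": feature, "threshold": threshold,
                        "left": left_label, "right": right_label,
                        "errors": errors}
    return best


def predict_stump(model, row):
    """Apply the learned split to one new row."""
    if row[model["feature"]] <= model["threshold"]:
        return model["left"]
    return model["right"]


# Made-up data: [test_value_a, test_value_b] -> outcome (0 = absent, 1 = present)
rows = [[1.0, 50], [2.0, 60], [3.0, 55], [7.0, 52], [8.0, 58], [9.0, 61]]
labels = [0, 0, 0, 1, 1, 1]
model = train_stump(rows, labels)
```

In this toy data only the first value separates the two outcomes, so the search settles on a threshold for that feature without being told which feature matters.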

A common practice is to train the model on about 75% of the available data set and hold back the remaining 25% to test whether the model predicts the correct outcome. This is referred to as “model testing”.
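The 75/25 split and the test step can be sketched as follows. This is a minimal illustration in plain Python; the helper names, the made-up data, and the stand-in model are all assumptions for the sake of a runnable example.

```python
import random


def train_test_split(rows, labels, test_fraction=0.25, seed=42):
    """Shuffle the row indices, then hold out `test_fraction` of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(rows[i], labels[i]) for i in train_idx]
    test = [(rows[i], labels[i]) for i in test_idx]
    return train, test


def accuracy(predict, examples):
    """Fraction of (row, label) pairs for which predict(row) equals the label."""
    correct = sum(predict(row) == label for row, label in examples)
    return correct / len(examples)


# Made-up data: one test value per patient, outcome 1 when the value is 4 or more.
rows = [[v] for v in range(8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
train, test = train_test_split(rows, labels)


# A stand-in for a trained model, so the example runs on its own.
def stand_in_model(row):
    return 1 if row[0] >= 4 else 0


test_accuracy = accuracy(stand_in_model, test)
```

The important property is that the test rows are never shown to the training step, so the accuracy measured on them reflects how the model handles patients it has not seen.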

For any new patient that comes in, their test result values are fed into the trained model, and the model processes them to produce the predicted disease outcome. Another perk is that the model can be retrained periodically (daily, for example) with the new data it has processed, so it “adapts” to changes in the data. This alleviates the need to manually tweak the model as new data arrives.

The more diverse the data used to train the model, the more scenarios the model is aware of. In other words, the model tends to be less “biased” towards one particular data set if it is trained on different types of data.

However, it is imperative to remember the model is only as good or as accurate as the data fed to it.

Some benefits of using machine learning models

  • Machine learning models alleviate the need to rely solely on a subject matter expert (SME) to manually construct the decision tree. This is not to say we do not need the subject matter expert at all; rather, the model can complement subject matter expertise and help the SME do their job even better.
  • The constant manual update of the solution logic is avoided, as the model itself can be updated periodically with new incoming data.

Approach for predicting the occurrence of heart disease

In our implementation, we used the heart disease data set from the University of California, Irvine (UCI) Machine Learning Repository. This data set has an outcome that indicates the presence or absence of heart disease based on 13 factors. A value of 0 means no presence, whereas values from 1 to 4 indicate the presence of heart disease, with 1 being the lowest level of severity and 4 being the highest.
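When only presence or absence matters, the 0 to 4 outcome described above can be collapsed into a yes/no label. A minimal sketch follows; the function name is hypothetical and the article itself does not say whether this step was applied.

```python
def has_heart_disease(severity: int) -> bool:
    """Map the data set's outcome (0 = absent, 1-4 = present, by severity)
    to a simple presence flag."""
    if severity not in range(5):
        raise ValueError(f"severity must be 0-4, got {severity!r}")
    return severity > 0
```

Keeping the original 0 to 4 value instead makes this a multi-class problem, which is what allows the model to report a severity level as well as presence.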

Our approach was to train the model using the random forest algorithm, which constructs many different decision trees from random subsets of the training data and predictor variables and gets each “tree” to vote towards the outcome. This algorithm is widely used in medical diagnosis.
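The voting step can be sketched in a few lines. This example shows only how the trees' answers are combined, not how the trees are built. The three “trees” are hand-written stand-ins, each looking at a different predictor; the column names follow the UCI data set, but the thresholds are invented for illustration only.

```python
from collections import Counter


def forest_predict(trees, row):
    """Ask every tree for a prediction and return the most common answer."""
    votes = [tree(row) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-in trees with invented thresholds (illustration only).
def tree_a(row):
    return 1 if row["age"] > 55 else 0


def tree_b(row):
    return 1 if row["chol"] > 240 else 0


def tree_c(row):
    return 1 if row["thalach"] < 120 else 0


trees = [tree_a, tree_b, tree_c]
patient = {"age": 60, "chol": 250, "thalach": 150}
prediction = forest_predict(trees, patient)  # votes are 1, 1, 0
```

Because each tree sees a different slice of the data, a mistake made by one tree tends to be outvoted by the others.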

Technical details

We implemented this with the machine learning library in Apache Spark (MLlib) and persisted the model to a network file location.

Next, we implemented the front-end piece of the application, which allows the end user to input patient data values. For this, we used a Node.js server, which posts the values entered by the user to a Kafka broker. A Spark Streaming application listens to this broker and, when data is received, runs the input values through the persisted machine learning model and saves the result to a table in a MySQL database.

The front-end app queries the MySQL table once the result is available and displays it on the screen. The screenshot below shows the front-end piece of this use case implementation and the result for the data entered on the screen.

Menerva software predicts heart disease outcomes using machine learning
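The message flow described in this section can be sketched with standard-library stand-ins so that the example runs on its own. Here a `queue.Queue` takes the place of the Kafka broker, an in-memory SQLite database takes the place of MySQL, and a one-line rule takes the place of the persisted model. The message layout, table name, and column names are assumptions made for illustration, not the actual application's schema.

```python
import json
import queue
import sqlite3


def stand_in_model(features):
    """Placeholder for the persisted model (invented rule, illustration only)."""
    return 1 if features["age"] > 55 else 0


def process_messages(broker, db, model):
    """Drain the queue, score each message, and store the result."""
    while not broker.empty():
        message = json.loads(broker.get())
        severity = model(message["features"])
        db.execute(
            "INSERT INTO predictions (request_id, severity) VALUES (?, ?)",
            (message["request_id"], severity),
        )
    db.commit()


broker = queue.Queue()            # stands in for the Kafka broker
db = sqlite3.connect(":memory:")  # stands in for the MySQL database
db.execute(
    "CREATE TABLE predictions (request_id TEXT PRIMARY KEY, severity INTEGER)"
)

# 1. The front end posts the values the user entered.
broker.put(json.dumps({"request_id": "req-1",
                       "features": {"age": 63, "chol": 233}}))

# 2. The streaming job scores the message and saves the result.
process_messages(broker, db, stand_in_model)

# 3. The front end looks up the result for its request.
row = db.execute(
    "SELECT severity FROM predictions WHERE request_id = ?", ("req-1",)
).fetchone()
```

Passing a request identifier through the whole flow is one simple way for the front end to find its own result in the table. The article does not say how the original application matched results to requests.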


Next Steps

The above is an example of a “classification” problem. A useful extension of this use case would be to predict the condition early, so that the patient can be put on preventive therapy. One approach is to take a given patient’s historical data and use linear regression to predict the future values of the above data points, say, over the next couple of years. Once the future values are available, we would run the above machine learning model on them to assess whether the patient is likely to develop heart disease in the future.
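The projection step can be sketched with an ordinary least-squares line fitted to one measurement over time. The yearly cholesterol values below are made up for illustration, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up history for one patient: cholesterol measured once a year.
years = [2015, 2016, 2017, 2018]
chol = [200.0, 210.0, 220.0, 230.0]

slope, intercept = fit_line(years, chol)
projected_2020 = slope * 2020 + intercept  # value two years past the last reading
```

Repeating this for each predictor gives a projected set of values that can be passed to the classifier in place of current readings. A straight line is a strong assumption, so projections far beyond the observed years should be treated with caution.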

Data used for this use case is obtained from and is available at