Machine Learning in Health Care
Knowledge Graph Construction over Heterogeneous Sources
As a first step towards building our healthcare knowledge graph (KG), we use the “prediagnosis”: a preliminary diagnosis recorded before an admission starts, and therefore available from the very beginning of that admission. We aggregate the prediagnoses from our internal source, in this case the hospital database. Because this free-text information can be noisy, we first normalize it with the following steps:
Convert to lower-case
Remove non-alphabetic characters
Collapse repeated spaces and remove leading/trailing ones
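The three cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the function name `normalize_term` is hypothetical, and replacing non-alphabetic characters with a space (rather than deleting them outright) is an assumption made so that hyphenated words do not get glued together.

```python
import re

def normalize_term(text: str) -> str:
    """Clean a free-text term (hypothetical helper, illustrative only)."""
    text = text.lower()                       # 1. convert to lower-case
    text = re.sub(r"[^a-z ]", " ", text)      # 2. drop non-alphabetic characters (assumption: replace by a space)
    text = re.sub(r"\s+", " ", text).strip()  # 3. collapse repeated spaces, trim both ends
    return text

cleaned = normalize_term("  Coronary-Heart  Disease (CHD)!! ")
```

Applying the same function to prediagnoses, diseases and symptoms keeps all three vocabularies comparable.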
On top of the prediagnosis information, we also leverage external information from a recent paper that maps diseases to symptoms. This external knowledge graph is built from Electronic Health Records (EHR) and links each disease to its symptoms, together with a symptom relevance weight between 0 and 1. An example entry from this external source is: “Migraine: Headache (w=0.384), nausea (w=0.316), sensitivity to light (w=0.223), …”, where the weight indicates how important each symptom is for the disease. These diseases and symptoms are cleaned in exactly the same way as the prediagnoses and then assembled into our dictionary.
We then link the internal admissions with the external disease-symptom information to create an enriched knowledge graph. To do so, we link the prediagnosis of each admission with symptoms and diseases, while diseases and symptoms are already linked to each other by the external information (cf. the Migraine example above). We apply a second normalization at this stage to cope with spelling mistakes and with near-identical prediagnoses and diseases/symptoms (e.g. “Coronary heart disease” vs. “Arteries heart disease”).
Our strategy is to build a list of character n-grams (where the n-gram size has to be tuned) from the cleaned dictionary entries (prediagnoses, diseases, and symptoms). From this n-gram list, we create TF-IDF vectors with a minimum document frequency of 1, i.e. every n-gram is kept. Finally, we match prediagnoses with diseases and symptoms based on the cosine similarity between their n-gram TF-IDF vectors. Namely, we link a prediagnosis with a disease or a symptom if their cosine similarity is above a given threshold (chosen manually to filter out noise), and we link diseases and symptoms based on the external information, while also applying a threshold on the symptom relevance weight provided out-of-the-box by the external source.
The final knowledge graph is represented in the next figure, including both internal and external information (as shown in our previous post).
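To make the matching step more tangible, here is a small self-contained sketch of character n-gram TF-IDF matching with cosine similarity. It is written in plain Python rather than with any particular library, so the exact weighting (a smoothed IDF, padding each term with a space) is an assumption and may differ from what a library vectorizer would produce. The function names, the trigram size and the threshold value are all illustrative.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams of a term, padded with one space on each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def fit_idf(corpus, n=3):
    """Smoothed inverse document frequency; every n-gram is kept (min df = 1)."""
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(char_ngrams(doc, n)))
    n_docs = len(corpus)
    return {g: math.log((1 + n_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()}

def tfidf_vector(text, idf, n=3):
    """L2-normalised sparse TF-IDF vector, stored as {n-gram: weight}."""
    counts = Counter(g for g in char_ngrams(text, n) if g in idf)
    vec = {g: tf * idf[g] for g, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}

def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())

def link(prediagnoses, dictionary, threshold, n=3):
    """Return (prediagnosis, dictionary entry, score) for pairs above the threshold."""
    idf = fit_idf(prediagnoses + dictionary, n)
    dict_vecs = {entry: tfidf_vector(entry, idf, n) for entry in dictionary}
    edges = []
    for p in prediagnoses:
        p_vec = tfidf_vector(p, idf, n)
        for entry, e_vec in dict_vecs.items():
            score = cosine(p_vec, e_vec)
            if score >= threshold:
                edges.append((p, entry, round(score, 3)))
    return edges

# Illustrative threshold; in practice it has to be tuned on real data.
edges = link(["coronary heart disease"],
             ["arteries heart disease", "migraine"],
             threshold=0.3)
```

With these toy inputs, the prediagnosis is linked to “arteries heart disease”, with which it shares most of its trigrams, and not to “migraine”, with which it shares none.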
Once we have constructed the weighted healthcare KG, our next objective is to extract the relevant neighboring entities that enrich each input entity with additional information. For this purpose, one can employ any graph sampling technique, in either a weighted or an unweighted manner. Careful sampling is critical: poorly chosen neighbors introduce noise into the information that is fed to the ML model. Some common sampling techniques are random sampling on 1-hop neighbors, importance sampling, snowball sampling, and forest fire sampling. For our use case, we employ importance sampling based on weighted Personalized PageRank (also employed in other graph-based ML approaches like GraphSAGE or PinSage), as implemented in the Oracle PGX package.
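For intuition, weighted Personalized PageRank can be sketched as a simple power iteration in which the random walk restarts at the input admission. The code below is a plain-Python illustration and not the Oracle PGX API: the function name, the dictionary-of-dictionaries graph format, the damping factor and the toy node names are all assumptions.

```python
def personalized_pagerank(graph, source, damping=0.85, iterations=50):
    """Weighted Personalized PageRank by power iteration (illustrative sketch).

    graph: {node: {neighbor: edge_weight}}
    Returns {node: score}; the scores sum to 1.
    """
    nodes = set(graph)
    for neighbors in graph.values():
        nodes.update(neighbors)

    scores = {node: 0.0 for node in nodes}
    scores[source] = 1.0  # all probability mass starts at the source

    for _ in range(iterations):
        nxt = {node: 0.0 for node in nodes}
        for node, score in scores.items():
            neighbors = graph.get(node, {})
            total = sum(neighbors.values())
            if total == 0:
                # Dangling node: send its mass back to the source (assumption).
                nxt[source] += damping * score
                continue
            for nbr, weight in neighbors.items():
                # Follow an edge with probability proportional to its weight.
                nxt[nbr] += damping * score * weight / total
        nxt[source] += 1.0 - damping  # restart at the source
        scores = nxt
    return scores

# Toy KG: admissions linked through diseases/symptoms (made-up weights).
toy_graph = {
    "adm1": {"flu": 0.9, "cough": 0.2},
    "flu": {"adm1": 0.9, "adm2": 0.8},
    "cough": {"adm1": 0.2, "adm3": 0.1},
    "adm2": {"flu": 0.8},
    "adm3": {"cough": 0.1},
}
ppr = personalized_pagerank(toy_graph, source="adm1")
```

In this toy graph, `adm2` is reached through a strongly weighted path and ends up with a higher score than `adm3`, which is exactly the ranking signal the sampling step relies on.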
Lastly, as stated previously, the top-K entities (i.e., admissions) are extracted for each input admission, subject to a minimum score threshold that can be tuned. The threshold allows for more flexibility and filters out noisy neighbors that could hurt the machine learning model. Note that we do not sample neighbors from the testing or validation set while training, to ensure a fair evaluation. Here is a sample visualization of the extracted neighbors.
Since our objective is to predict the diagnoses for an admission, let us have a quick look at a sample admission from the MIMIC-III dataset.
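The selection rule itself is short. The following sketch assumes the sampling step has produced a score per node; the function name, the values of K and of the threshold, and the toy scores are hypothetical.

```python
def top_k_neighbors(scores, source, train_ids, k=3, min_score=0.01):
    """Keep the K best-scoring training admissions above a minimum score.

    scores:    {node: importance score} for one input admission
    train_ids: set of admission ids that belong to the training split
    """
    candidates = [
        (node, score)
        for node, score in scores.items()
        if node != source          # never return the input admission itself
        and node in train_ids      # skip validation/test admissions and non-admission nodes
        and score >= min_score     # drop weakly related (noisy) neighbors
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:k]

# Made-up scores: "adm4" is a test admission, "flu" is a disease node.
toy_scores = {"adm1": 0.40, "adm2": 0.20, "adm3": 0.05, "adm4": 0.30, "flu": 0.05}
neighbors = top_k_neighbors(toy_scores, "adm1", train_ids={"adm2", "adm3"},
                            k=2, min_score=0.1)
```

Here only `adm2` survives: `adm4` is excluded because it is not a training admission, and `adm3` falls below the threshold.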
In the above admission, there are four types of events: “Lab tests”, “Fluids into patient”, “Fluids out of patient” and “Prescribed Drugs”. This is the timeline of one single admission, which can last from a few hours to a few weeks depending on the treatment and associated diagnoses. In our approach, we model the prediction problem as a temporal sequence problem: we split each admission into N-hour intervals and then train a Recurrent Neural Network (RNN) to predict the diagnoses (explained below). Note that to feed the admissions to an RNN, we also need to zero-pad them so that all admissions have a uniform duration. We also apply some data normalizations, such as converting time-range events to one-time events (e.g., 6ml/h during 5h → 5 one-time events), converting units (e.g., mg/dl → mg/l), and aligning measurements that are reported at different intervals (e.g., mmHg/h for blood pressure and mg/30min for aspirin). At the end of these normalizations, we obtain the following normalized timeline per admission, which is ready to be ingested by an RNN.
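As a rough illustration of these preprocessing steps, the sketch below expands a time-range event into one-time events and then bins events into fixed N-hour intervals with zero-padding. The function names and the event tuple layout are hypothetical, and two choices are assumptions: values falling in the same interval are summed, and events beyond the fixed horizon are dropped.

```python
def expand_rate_event(start_hour, rate_per_hour, duration_hours):
    """Turn a rate over a time range into hourly one-time events.

    Example: 6 ml/h during 5 h -> five events of 6 ml, one per hour.
    """
    return [(start_hour + h, rate_per_hour) for h in range(duration_hours)]

def bin_events(events, interval_hours, n_intervals, n_features):
    """Build a zero-padded timeline of shape n_intervals x n_features.

    events: iterable of (hour_since_admission, feature_index, value)
    Intervals without any event stay at zero, which also pads short admissions.
    """
    timeline = [[0.0] * n_features for _ in range(n_intervals)]
    for hour, feature, value in events:
        step = int(hour // interval_hours)
        if 0 <= step < n_intervals:  # assumption: drop events past the horizon
            timeline[step][feature] += value  # assumption: sum within an interval
    return timeline

# Feature 0: a lab test value, feature 1: fluids into patient (toy data).
fluids = expand_rate_event(start_hour=0, rate_per_hour=6.0, duration_hours=5)
events = [(hour, 1, value) for hour, value in fluids] + [(1.5, 0, 7.2)]
timeline = bin_events(events, interval_hours=2, n_intervals=4, n_features=4)
```

For lab tests, a mean or last-value aggregation may be more appropriate than a sum; the right choice depends on the event type.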
The above admission is represented as chunks of the four types of events and is then fed to an encoder whose objective is to learn a representation (a.k.a. embedding) of the admission. The next step shows how this encoding is used in the final prediction.
The prediction module combines information from the neighboring admissions with the input admission, and relies on the admission encoder. As stated above, the input admission is first encoded using the encoder module shown above, and the resulting admission embedding (i.e., vector representation) is fed to the prediction module. The main objective of the prediction module, described below, is to blend the information from the extracted neighbors with the input admission vector.
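To give a feel for what such an encoder computes, here is a deliberately tiny recurrent encoder in plain Python. It is a sketch under strong assumptions: a basic tanh recurrence stands in for whichever RNN variant is actually used, the weights are random and untrained, and all names and sizes are hypothetical. In practice one would rely on a deep learning framework.

```python
import math
import random

def init_rnn(n_features, hidden_size, seed=0):
    """Random (untrained) parameters for a basic tanh RNN cell."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_in": matrix(hidden_size, n_features),   # input -> hidden
        "W_h": matrix(hidden_size, hidden_size),   # hidden -> hidden
        "b": [0.0] * hidden_size,
    }

def encode_admission(timeline, params):
    """Run the recurrence over all intervals and return the last hidden state.

    timeline: list of per-interval feature vectors (n_intervals x n_features)
    """
    hidden = [0.0] * len(params["b"])
    for x in timeline:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in_row, x))
                + sum(u * hj for u, hj in zip(w_h_row, hidden))
                + b
            )
            for w_in_row, w_h_row, b in zip(params["W_in"], params["W_h"], params["b"])
        ]
    return hidden

params = init_rnn(n_features=4, hidden_size=8)
toy_timeline = [[7.2, 12.0, 0.0, 0.0], [0.0, 12.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
embedding = encode_admission(toy_timeline, params)
```

The returned vector has one entry per hidden unit and plays the role of the admission embedding.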
More concretely, the extracted neighbors provide static information (e.g., diagnoses, patient details), unlike the dynamic temporal information of the events in a given admission. The static information of each neighbor is fed through a fully-connected layer to obtain an individual neighbor encoding. Then, the neighbor encodings are fed through an aggregator (e.g., sum, mean or max) whose purpose is to reduce them along the neighbor dimension into a single fixed-size vector that summarizes the neighborhood. The final diagnosis predictions are made from the concatenation of the aggregated neighbors and the input entity encoding, which is finally fed into a fully-connected layer mapping to our target diagnoses (classes). A schematic visualization of the prediction module is as follows.
In this blog post, we demonstrated how Knowledge Graphs can be combined with Graph Machine Learning to enhance healthcare services. Our approach is generic, however, and could be applied to other domains as well, such as Financial or Retail Services.
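For readers who want something concrete, the prediction module described earlier can be sketched in plain Python as follows. This is an illustration under assumptions, not our actual model: the ReLU activation on the neighbor encodings, the sigmoid output (treating diagnoses as independent labels), the zero vector used when no neighbor is available, and all names and toy weights are hypothetical.

```python
import math

def dense(x, weights, bias):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

def aggregate(encodings, how="mean"):
    """Reduce a list of neighbor encodings to one vector, dimension by dimension."""
    columns = list(zip(*encodings))
    if how == "sum":
        return [sum(col) for col in columns]
    if how == "max":
        return [max(col) for col in columns]
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown aggregator: {how}")

def predict(admission_embedding, neighbor_features, params, how="mean"):
    """Blend neighbor information with the admission embedding and score diagnoses."""
    encodings = [relu(dense(f, params["W_nbr"], params["b_nbr"]))
                 for f in neighbor_features]
    if encodings:
        pooled = aggregate(encodings, how)
    else:
        pooled = [0.0] * len(params["b_nbr"])  # assumption: zeros when no neighbor
    joint = list(admission_embedding) + pooled  # concatenation
    logits = dense(joint, params["W_out"], params["b_out"])
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]  # one probability per diagnosis

# Hand-picked toy weights so the arithmetic is easy to follow.
toy_params = {
    "W_nbr": [[1.0, 0.0], [0.0, 1.0]], "b_nbr": [0.0, 0.0],
    "W_out": [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]], "b_out": [0.0, 0.0],
}
probabilities = predict([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]], toy_params)
```

Swapping `how` between `"sum"`, `"mean"` and `"max"` changes only the pooling step, which makes the aggregator easy to treat as a hyperparameter.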