Medical Records Machine Learning

Strategy, Data Science, EMR Platform Development

How Rootstrap Used AI to Build a Scalable & Isolated Architecture for Preprocessing Medical Records in the Healthcare Industry

What this Research Showed Us

“Having a view that summarizes main points and trends of current hospital stay, as well as points important to my specialty.” – Surveyed Physician

  • Physicians don’t have a lot of time to review historical data during appointments.
  • They usually have to ask patients for this history, and patients often don’t have complete or accurate information.
  • Historical medical data can often contain errors.
  • It’s difficult for Physicians to summarize data.

A survey covering two popular EMR platforms, Epic and Cerner, which together account for 56% of the EHR market and generate over $8 billion in revenue, showed a 58% satisfaction rate with both systems.

With this data, Rootstrap’s engineers wanted to leverage AI to give healthcare professionals a more efficient and robust architecture that instantly provides an up-to-date summary of a patient’s medical data within a requested date range.


Main Objective

The key objective of this project was to help medical professionals process medical records at scale. This would allow them to save time, reduce the human element, and provide accurate and consistent data in a language understood across the healthcare industry.

Prior Experience

Rootstrap has experience developing applications for this industry, and after being approached by numerous potential clients on this topic, their Data Science team conducted extensive research that involved surveying over 100 Medical Doctors and Nurses.

The results further highlighted the need for a robust and effective EMR (electronic medical records) platform within the healthcare industry.

“I spend more time dealing with my EMR than attending my patient” – Surveyed Physician



Dealing with this type of data can present complex challenges when attempting to clean, organize, and make sense of it all. Here’s why:

  • A lot of this data is unstructured and written in plain English, but with medical lexicon and abbreviations.
  • Each patient has decades of history, potentially gigabytes of data if we include medical imaging.
  • Analyzing these different types of data effectively is highly time-consuming.
  • DevOps and environment(s) configuration requires a lot of time and work.

The biggest challenge is determining and summarizing what data is actually relevant.

Medical Data Issues:

  • Non-structured (we can’t distinguish easily what’s important and what isn’t important)
  • OCR Errors (data was imported from older EMRs, or from handwritten notes)
  • Different medical conventions for naming and abbreviations
  • Contradictory and duplicated information.
  • A vast amount of data.
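
To make two of these issues concrete (inconsistent abbreviations and duplicated information), here is a minimal Python sketch. It is an illustration only, not Rootstrap's implementation: the abbreviation map, the function names, and the sample notes are all hypothetical, and a real system would draw on a curated medical lexicon rather than a three-entry dictionary.

```python
import re

# Hypothetical abbreviation map, for illustration only.
ABBREVIATIONS = {
    "htn": "hypertension",
    "dm2": "type 2 diabetes mellitus",
    "sob": "shortness of breath",
}

def normalize_note(text: str) -> str:
    """Lowercase, strip punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)

def deduplicate_notes(notes: list[str]) -> list[str]:
    """Keep the first occurrence of each note, comparing normalized forms."""
    seen = set()
    unique = []
    for note in notes:
        key = normalize_note(note)
        if key not in seen:
            seen.add(key)
            unique.append(note)
    return unique

# Made-up sample notes: the first two differ only in case, spacing, and punctuation.
notes = [
    "Pt has HTN and DM2.",
    "pt has htn and  dm2",
    "Reports SOB on exertion.",
]
cleaned = deduplicate_notes(notes)
```

Comparing normalized forms rather than raw strings is what lets the second note be recognized as a duplicate of the first. Contradictory information and OCR errors need more than this kind of string cleanup.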

Natural Language Processing:

  • Converting plain English into something manageable and semantic.
  • Many vocabularies and medical terms, including CPT, ICD-10-CM, LOINC, MeSH, RxNorm, & SNOMED CT.
  • Many Hierarchies, definitions, and other relationships and attributes.
  • Typing errors that are difficult for AI to understand.
  • Training the model to understand these errors.
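
As a rough illustration of the typing-error problem, the sketch below maps a misspelled token to the closest entry in a small vocabulary using fuzzy string matching from Python's standard library. This is a deliberately simpler stand-in for training a model to understand such errors, which is the approach the list above describes. The vocabulary, the function name, and the 0.8 similarity cutoff are assumptions made for the example.

```python
from difflib import get_close_matches

# Hypothetical mini-vocabulary; real vocabularies such as RxNorm or SNOMED CT
# contain a far larger number of terms.
VOCABULARY = ["hypertension", "diabetes", "pneumonia", "metformin"]

def correct_term(token: str, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary term, or the token unchanged if none is close enough."""
    matches = get_close_matches(token.lower(), VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else token

corrected = [correct_term(tok) for tok in ["hypertenson", "metformn", "aspirin"]]
```

Here "hypertenson" and "metformn" resolve to their vocabulary entries, while "aspirin" is left alone because nothing in the vocabulary is similar enough. A cutoff that is too low would silently rewrite valid terms, which is one reason a learned approach may be preferable on real records.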


Rootstrap’s Data Science team manually analyzed medical records and identified the different types of problems they contained. This allowed them to create a task in the machine learning model for each problem detected. They used Natural Language Processing (AI that lets machines read and understand language) to extract key information and convert clean medical records into a semantic network, following the UMLS (Unified Medical Language System) standards.

Because transforming plain text into a semantic network involves a very large number of vocabularies, hierarchies, and definitions, developing the architecture around the ability to run individual tasks is the most efficient approach to extracting key data.



  1. Natural Language Processing – Extract necessary information
  2. Semantics – Convert plain English into something manageable
  3. Data Preparation and Processing – Cleaning & preparing data
  4. Reproducibility – Isolating tasks and improving capabilities
  5. Scalability – Process thousands/millions of medical records


  1. Data Sources – Assemble Data (from EMRs & public data)
  2. Data Preparation – Prepare the data so it is ready for cTAKES (NLP processing)
  3. cTAKES – Convert clean Medical Records to a semantic network
  4. Summarize – Process a Semantic network to recognize important data
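
The four stages above can be sketched as a chain of small, isolated tasks, each taking a record and returning a new one. This is a hypothetical outline under stated assumptions, not the actual platform: every function name and concept label is invented for the example, and the NLP stage is replaced by a trivial keyword lookup rather than a real call into the NLP engine.

```python
from typing import Callable

Record = dict
Task = Callable[[Record], Record]

# Hypothetical term-to-concept table standing in for the NLP stage.
# The labels are placeholders, not real UMLS identifiers.
HYPOTHETICAL_CONCEPTS = {
    "hypertension": "CONCEPT_HYPERTENSION",
    "diabetes": "CONCEPT_DIABETES",
}

def assemble(record: Record) -> Record:
    """Data Sources: join text gathered from several sources."""
    return {**record, "text": " ".join(record["sources"])}

def prepare(record: Record) -> Record:
    """Data Preparation: lowercase and collapse whitespace."""
    return {**record, "text": " ".join(record["text"].lower().split())}

def to_concepts(record: Record) -> Record:
    """NLP stage placeholder: keyword lookup, not a real NLP engine."""
    found = [label for term, label in HYPOTHETICAL_CONCEPTS.items()
             if term in record["text"]]
    return {**record, "concepts": found}

def summarize(record: Record) -> Record:
    """Summarize: keep a sorted, de-duplicated list of concepts."""
    return {**record, "summary": sorted(set(record["concepts"]))}

PIPELINE: list[Task] = [assemble, prepare, to_concepts, summarize]

def run_pipeline(record: Record, tasks: list[Task] = PIPELINE) -> Record:
    """Run each task in order; tasks never mutate their input."""
    for task in tasks:
        record = task(record)
    return record

result = run_pipeline(
    {"sources": ["Patient has  Hypertension.", "History of diabetes"]}
)
```

Because each task returns a new record instead of modifying its input, any single stage can be rerun, swapped out, or tested in isolation, which is the reproducibility property the design aims for.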


Rootstrap’s Machine Learning Engineers took on this time-consuming and highly complex challenge.

They developed a platform and architecture that provides a real-time summary of a patient’s clinical encounters for any given date. This AI-driven product can instantly accomplish tasks that would otherwise be time-consuming for humans. They focused their efforts on giving the architecture the ability to process data in the Unified Medical Language System, which covers over 200 vocabularies.
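
As a simplified picture of what a date-scoped summary could look like, the sketch below filters a patient's encounters to a requested date range and counts the most frequent concepts. The `Encounter` structure, the `summarize_range` function, and the sample data are assumptions made for illustration; the source does not describe how the platform ranks or presents information.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Encounter:
    """Hypothetical record of one clinical encounter."""
    day: date
    concepts: tuple[str, ...]

def summarize_range(encounters, start: date, end: date, top_n: int = 3):
    """Count concepts from encounters dated within [start, end], most frequent first."""
    counts = Counter()
    for enc in encounters:
        if start <= enc.day <= end:
            counts.update(enc.concepts)
    return counts.most_common(top_n)

# Made-up sample data.
encounters = [
    Encounter(date(2020, 1, 5), ("hypertension",)),
    Encounter(date(2020, 2, 1), ("hypertension", "diabetes")),
    Encounter(date(2021, 1, 1), ("asthma",)),
]
summary_2020 = summarize_range(encounters, date(2020, 1, 1), date(2020, 12, 31))
```

Frequency is only one possible relevance signal. Deciding what is actually important to a given specialty is the harder problem the project describes.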

By regularly running these types of complex tasks, the machine learning model continually improves its understanding of medical record terminology and complex data. As a result, it provides physicians with instant data visualization in a single location.
