A Method for Debugging Medical Device Software

 July 14, 2021
  Yujan Shrestha, MD

“Rubber ducking” is the practice of talking to inanimate objects to think through bugs.

Are you curious about how to resolve defects in regulated medical device software? Are you working on a defect resolution process for your own company? If so, this article is for you.

A key component of any ISO 13485-compliant quality management system (QMS) is a defect resolution process. While quality metrics can vary from organization to organization, we can all agree that timely and thorough defect resolution should be one of them.

This document details the SKUASH debugging methodology for investigating and documenting medical device software defects. We have used the approach for several years on client projects, including one for an industry-leading medical-device company with hundreds of installations around the world and a multi-tiered support organization. A documented procedure for evaluating complaints is not only required by the regulations; it is also good engineering practice.

A quick aside

The medical diagnostic process is similar to software debugging in many ways:

  1. Both are plagued with confounding variables that make it difficult to distinguish signal from noise.
  2. Both require different levels of urgency, ranging from emergencies to annual checkups.
  3. Both require efficient knowledge transfer between people of different disciplines and levels of experience.
  4. In both, delayed or insufficient investigation could result in the death of a patient or severe damage to an organization’s reputation.

Pop quiz: What does the “P” in HIPAA stand for? Privacy? Protection? Patient? It actually stands for “Portability”. Why all the fuss about portability? It can save lives. Physicians have been concerned about portability since well before the digital age. A patient’s medical history has a universally accepted schema that allows physicians to review it quickly. Need to know if the patient smokes? That is in the “social history” section. Need to know why the team made a decision for a critical care patient? That is in the “assessment and plan” section. The standardized schema facilitates safe handoffs and fast patient-record review. It is sort of like an API that healthcare providers use to transfer state when they speak to each other.

When it comes to organizing information, debugging patients and debugging software are similar. Why not have a schema for debugging? Wouldn’t it be nice to quickly load another engineer’s debugging state into your brain?

A thorough and well-documented defect-review procedure is required by medical-device regulations, including the US Quality System Regulation (21 CFR Part 820, Subpart I) and ISO 13485 section 8.2.2. We have investigated hundreds of medical device software defects over the years and have come to appreciate a structured mental framework for tackling difficult bugs. We use the SKUASH methodology to iterate on a solution and document the outcome. It is probably overkill for the little bugs that come up during the development loop, which are outside the purview of regulation anyway, but it is especially useful for tricky bugs that take more than a working day to solve. Overall, it is a communications tool that fits modern messaging, collaborative problem solving, and record maintenance when developing medical devices.

Now let us go into detail for each of the letters of the SKUASH acronym: situation, knowns, unexplained, assumptions, solutions, and hypotheses.
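To make the idea of a debugging schema concrete, here is a minimal sketch of the six sections as a Python data structure. The class name `SkuashRecord`, its field names, the `to_text` helper, and the sample entries are all hypothetical and chosen only for illustration; SKUASH itself does not prescribe any particular tool or format.

```python
from dataclasses import dataclass, field


@dataclass
class SkuashRecord:
    """One defect investigation, organized by the six SKUASH sections."""

    situation: str = ""
    knowns: list[str] = field(default_factory=list)
    unexplained: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    solutions: list[str] = field(default_factory=list)
    hypotheses: list[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Render the record as plain text for a ticket or chat message."""
        lines = ["Situation", f"  {self.situation}"]
        sections = [
            ("Knowns", self.knowns),
            ("Unexplained", self.unexplained),
            ("Assumptions", self.assumptions),
            ("Solutions", self.solutions),
            ("Hypotheses", self.hypotheses),
        ]
        for title, items in sections:
            lines.append(title)
            lines.extend(f"  {i}. {item}" for i, item in enumerate(items, start=1))
        return "\n".join(lines)


# Hypothetical example entries, not taken from a real investigation.
record = SkuashRecord(
    situation="User reports that the viewer freezes when opening a study.",
    knowns=["The server log shows a database timeout"],
    hypotheses=["The database connection pool is exhausted"],
)
print(record.to_text())
```

Keeping the sections in a fixed order means another engineer always knows where to look, much like the sections of a patient’s medical history.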

Situation / subjective

Take a few minutes to explain the situation in sufficient detail to replicate the issue in a controlled environment. If replication is not realistic, then describe the current state of the software and execution environment in as much detail as possible. What was the user doing when they encountered the bug? What version of the software was in use? Is there any local context about the environment needed for replication? Has the user’s IT department tampered with the code and decided it was okay to live-edit some code on the server? (This has actually happened.) Have our support engineers already tried a few troubleshooting scripts? Has the customer’s IT department recently updated a PACS version that could be related to the error? The description of the situation should be subjective and conversational. Use the customer’s own words when possible.

Knowns

List relevant facts. Initially, these are just the relevant details from the situation. While most knowns will have some degree of uncertainty, keep the threshold for inclusion high so the team can reliably act on them.

Items may be added to this list by investigative activities such as:

  1. Reading log files and documenting relevant findings
  2. Reviewing code and documenting expected behavior
  3. Documenting customer and support engineer observations

Items may be removed from this list when:

  1. You find something in the logs that contradicts an observation the customer reported. Make sure the timestamps positively coincide, then move this item to the “unexplained” bucket.
  2. You retry an experiment that was known to be failing but is now passing. This indicates the presence of a confounder: something about the conditions that you had not accounted for. Move the item to the “unexplained” bucket.
  3. You are no longer so sure about something you thought was a certainty. Move the item to the “assumptions” bucket.
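As an illustration of the timestamp check in the first rule, the sketch below demotes a known only when a contradicting log entry falls close to the time the customer reported. The function name `coincides`, the five-minute tolerance, and the sample data are assumptions made for this example; it also assumes both timestamps have already been converted to the same time zone.

```python
from datetime import datetime, timedelta


def coincides(reported_at, log_times, tolerance=timedelta(minutes=5)):
    """Return True if any log entry falls within `tolerance` of the reported time."""
    return any(abs(t - reported_at) <= tolerance for t in log_times)


# Hypothetical data: the customer says an upload succeeded at 09:30,
# but the server logged upload failures at 09:32 and 11:05.
reported_at = datetime(2021, 7, 14, 9, 30)
failure_log_times = [
    datetime(2021, 7, 14, 9, 32),
    datetime(2021, 7, 14, 11, 5),
]

claim = "Customer reports the upload succeeded at 09:30"
knowns = [claim]
unexplained = []

# Demote the claim only if the contradicting log entry is about the same event.
if coincides(reported_at, failure_log_times):
    knowns.remove(claim)
    unexplained.append(claim + " (contradicted by a failure logged at about the same time)")

print(knowns)
print(unexplained)
```

If no failure had been logged near 09:30, the contradiction would not be established and the claim would stay in the knowns for now.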

Why is this list useful?

  1. You can periodically glance at this to remind you about what you have already tried so you do not end up repeating work.
  2. You can generate new hypotheses and solutions by reviewing this list.
  3. You can quickly communicate to other engineers what you have already tried.

Unexplained

List observations that were anomalous but have no immediate explanation. Also list any questions that cannot be answered with the objective evidence at hand. A common trap for engineers, myself included, is to try to explain every anomaly to avoid the unsatisfactory feeling of leaving a stone unturned. This “depth first search” approach can waste a great deal of time chasing threads that lead to technically interesting discoveries but make no progress toward solving the problem. A “breadth first search” forces you to be a bit more scientific about which investigative threads you go down. Your time is limited and valuable. Time wasted going down rabbit holes or scratching academic itches is time taken away from shipping potentially life-saving products.

Items may be added to this list:

  1. You find an odd log statement that could be related. For example, a stream of temporally related database errors could be causative but could also be a confounder. It is worth documenting it here in case it ends up being a key puzzle piece later.
  2. A “known” gets demoted to “unexplained” because it is contradicted by new information.
  3. You are playing a game of telephone. A customer tells you something that does not make sense so you add it to this list to verify later if necessary.

Items may be removed from this list:

  1. You get on a screen share with the customer and see it with your own eyes. You make a small correction and move this to the “known” bucket.
  2. Some unexplained log anomalies look like they are related, so you dig deeper into the database logs. You discover that the temporally correlated errors are indeed related to the original issue and you move this item to the “known” bucket.

Assumptions

Should you take away just one item from this article, let it be this: Never assume you are out of assumptions. You should revisit your assumptions often to mitigate confirmation bias. Confirmation bias is a major contributing factor to medical errors and even airline disasters. It is easy to chalk up conflicting evidence to laboratory errors or faulty sensors. Software engineers are not immune to this cognitive bias. Just replace “patient” with “bug” and the comparison holds up pretty well.

There is a natural urge for engineers to jump into hypothesis generation and testing. We are problem solvers after all. I encourage you to take a step back and enumerate your assumptions first. You will be surprised by how many of the assumptions you identify change the course of your investigation before you even start. Skipping this step is like pouring a foundation without a soil test.

It is also useful to revisit your “knowns” to see if they are assumptions instead. Nothing is absolute. Every known has a non-zero probability of being an assumption in disguise. The higher you climb the investigation ladder, the more important it is to question the stability of the “knowns” that you are standing on.

Items may be added to this list:

  1. Could there be multiple root causes? A single diagnosis may explain only 80% of the findings, while two diagnoses together may explain all of them.
  2. Some things are not quite adding up during an investigation and you realize there were several assumptions you were making that you did not identify before.

Items may be removed from this list:

  1. An assumption may be moved to a “known” during the course of an investigation. If the assumption turns out to be incorrect, negate the wording before moving it over. There is value in knowing that assumptions were proven false.
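The rule about negating the wording can be sketched in a few lines. The function name `resolve_assumption`, its parameters, and the sample assumptions are hypothetical; the point is only that a disproven assumption is recorded as a known in its negated form rather than silently deleted.

```python
def resolve_assumption(assumptions, knowns, assumption, holds, negated=None):
    """Move a tested assumption to the knowns, negating the wording if it was false."""
    if not holds and negated is None:
        raise ValueError("a disproven assumption needs negated wording")
    assumptions.remove(assumption)  # raises ValueError if it was never listed
    knowns.append(assumption if holds else negated)


# Hypothetical example entries.
assumptions = [
    "The customer is running the latest release",
    "Only one root cause is involved",
]
knowns = []

# The assumption turned out to be false, so record the negated statement.
resolve_assumption(
    assumptions,
    knowns,
    "The customer is running the latest release",
    holds=False,
    negated="The customer is NOT running the latest release",
)

print(assumptions)
print(knowns)
```

Requiring the negated wording up front is a design choice of this sketch: it keeps the person closing the assumption from dropping the finding.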

Solutions

It is worthwhile to come up with potential solutions to the problem even before you have begun investigation. There are a few reasons for this:

  1. Keep the goals of the organization in mind. While sometimes a detailed explanation of the problem is necessary to know we’ve fully solved the problem, having a solution to the problem is the ultimate goal.
  2. Coming up with a few preliminary solutions helps explain the problem and your thought process.
  3. Sometimes the cure is the diagnosis. If a patient gets significantly better after a thiamine infusion, then the diagnosis of thiamine deficiency can reasonably be made. For a drug like thiamine, which has minimal side effects, this is often preferred over waiting for a blood test. If you identify a simple solution that could speed up the investigation, try it and see if the issue is resolved. This is a delicate balance, however. Stopping an investigation too early risks missing the root cause. Similarly, discharging the patient after correcting the thiamine deficiency, without asking what caused it, is treating the symptom but not the disease.

Items may be added to this list:

  1. Add potential solutions as you confirm hypotheses. Often you may have a short term solution to unblock a customer followed up with a more permanent solution rolled out later on.

Hypotheses

Here you list potential hypotheses that could explain the root cause of the bug. Your investigative efforts should focus on confirming or rejecting these hypotheses. If you run out of hypotheses, brainstorm some more by reviewing the other SKUASH elements. Failing that, you may need to do some untargeted investigation to prospect for more information, with the goal of generating more hypotheses.

Items may be added to the list by:

  1. Brainstorming new hypotheses by reviewing other elements of SKUASH.
  2. Consulting a specialist for more ideas.
  3. Explaining the situation to a colleague from scratch. How many times have you consulted a colleague only to figure out the solution midway through?
  4. Consulting a rubber duck.

Items may be removed from the list by:

  1. Investigative efforts to confirm or reject a hypothesis. In either case you should reword the hypothesis and move it to the “knowns.”
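The movements between lists described in the sections above can be collected into a small transition table. The sketch below is one hypothetical way to encode them: the names `ALLOWED_MOVES` and `move` are invented for illustration, and rejecting every other move is an assumption of this sketch rather than a rule stated by the methodology.

```python
# Moves between SKUASH lists that the text describes.
ALLOWED_MOVES = {
    ("knowns", "unexplained"),   # contradicted by new information
    ("knowns", "assumptions"),   # no longer certain
    ("unexplained", "knowns"),   # verified first-hand or by digging deeper
    ("assumptions", "knowns"),   # confirmed, or negated and recorded
    ("hypotheses", "knowns"),    # confirmed or rejected, then reworded
}


def move(buckets, item, src, dst, reworded=None):
    """Move `item` from one list to another, optionally rewording it."""
    if (src, dst) not in ALLOWED_MOVES:
        raise ValueError(f"move from {src} to {dst} is not in the table")
    buckets[src].remove(item)
    buckets[dst].append(reworded or item)


# Hypothetical example: a hypothesis is rejected and recorded as a known.
buckets = {
    "knowns": [],
    "unexplained": [],
    "assumptions": [],
    "hypotheses": ["The database connection pool is exhausted"],
}

move(
    buckets,
    "The database connection pool is exhausted",
    "hypotheses",
    "knowns",
    reworded="The database connection pool was NOT exhausted during the incident",
)
print(buckets["knowns"])
```

Every arrow in the table ends in a list that someone will read again, so a rejected hypothesis is preserved as a finding instead of being lost.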

Conclusion

Software engineering as a whole is both an art and a science. Debugging, however, is more of a science than an art. So why not use the scientific method and develop a shared framework for communication? In summary, the SKUASH method facilitates efficient problem solving and communication among engineers and customers alike.

