In April, the US Food & Drug Administration (FDA) published a request for comments on new proposals for the regulation of medical devices which employ artificial intelligence (AI) and machine learning (ML) components. This continues a trend over recent years of the FDA recognizing that regulation of software as a medical device (SaMD) must differ from how regulation has been applied in the past.
Historically, medical device innovation was inherently slow and incremental. The software was usually restricted to a minority component embedded in a more extensive self-contained hardware system and was infrequently updated. It was therefore relatively easy to control from a regulatory standpoint. With increasingly complex software becoming a core part of all aspects of medical devices (and indeed, the medical device itself), the FDA is working to modernize regulation to deal with the opportunities and risks of such systems.
We believe the FDA’s AI/ML regulation proposals are an excellent start, uniting well-established risk management principles and best-practice guidance while also recognizing the challenges and opportunities of developing and managing complex AI/ML products. The FDA has also coherently integrated existing SaMD regulatory principles into its new proposals. More broadly, the FDA’s proposals encapsulate many established industry and academic best practices in AI/ML implementation and management, so they will be familiar to anyone who has worked on such systems in other sectors.
Probably the most enlightened area of change is the recognition of the training and validation data feedback loop established when software-based ML systems are used in real clinical settings. In particular, the new ML proposals recognize the benefit of incorporating real-world evidence (RWE) data into algorithms more rapidly and iteratively to improve their real-world performance, without the requirement to make a further pre-market submission to the FDA.
We particularly like the meta-process approach being taken by the FDA: approving the process to change the models, rather than the models themselves. We believe that, overall, these proposals will, when implemented, allow device developers to respond quickly to feedback and evolve their algorithms to bring more benefits to patient care through the wider, successful and safe application of AI/ML.
For Current Health’s remote monitoring solutions, AI/ML allows us to monitor more signals automatically, more reliably, day-and-night for as many people as required. We share the FDA’s aim of safe and practical implementation of AI/ML solutions to complex and potentially high-risk medical problems. Based on our work in AI/ML for everything from low-level waveform analysis to adverse event and outcome prediction, alongside our extensive experience applying AI/ML outside of healthcare, in the following sections we outline a few of our thoughts to help reinforce the FDA’s regulation proposal.
Framing the problem
In the guidelines, the FDA proposes establishing a SaMD pre-specification (SPS) that would describe the anticipated future modifications to an AI/ML product. While we agree that it is critical these are considered upfront, we also believe the SPS should deeply consider the fundamental problem that is being solved, to help reason about the impact of these changes.
Any successful AI/ML project relies on a well-formulated problem, question, modeling and evaluation approach. The issue needs to be framed appropriately. What is the question that the algorithm is intended to solve, and what is the target it will learn? Appropriate framing helps to define the method of the solution better. For example, are you utilizing supervised learning? Or unsupervised or semi-supervised learning? Is this a regression problem? What was the rationale for this? Available data, complexity, labels and expected clinical output will all play a role in this decision. There will be many constraints and trade-offs. Will the rationale of the algorithm change in the future as more, or more varied, data becomes available?
As a hypothetical example, let’s consider applying AI to a CT scan to determine if a tumor is present. That sounds like a simple problem statement. In reality, there are multiple ways of framing this problem. For example, the question could be a simple yes/no “Does this person have a lung lesion?” It could also be a multi-class yes/no/maybe scale. However, consider that the question could also be framed as “How many lesions are present?” The output of that question would be an integer, where a non-zero answer would yield the same answer as the yes/no/maybe question above. Alternatively, the question could be “Which areas of the lung appear visually abnormal compared to other ‘normal’ lungs?” The output of this question would be a classification of image regions through a range of different image recognition models. However, the presence of these areas can also be reduced to a yes/no/maybe answer, as per the original question. Note the presence of a maybe output – what about framing uncertainty to the clinician?
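To make the point concrete, here is a minimal Python sketch of how three of the framings above can all be collapsed to the same yes/no/maybe answer, even though each one would demand different labels, models and evaluation. The function names and the 0.2 and 0.8 cut-offs are our own illustrative assumptions, not clinical values.

```python
from typing import List


def triage_from_probability(p: float, low: float = 0.2, high: float = 0.8) -> str:
    """Framing 1: a single lesion probability, reduced with two hypothetical cut-offs."""
    if p >= high:
        return "yes"
    if p <= low:
        return "no"
    return "maybe"


def triage_from_count(lesion_count: int) -> str:
    """Framing 2: a lesion count; any non-zero count answers 'yes'."""
    return "yes" if lesion_count > 0 else "no"


def triage_from_regions(region_scores: List[float],
                        low: float = 0.2, high: float = 0.8) -> str:
    """Framing 3: per-region abnormality scores; the scan is judged by its worst region."""
    if not region_scores:
        return "no"
    return triage_from_probability(max(region_scores), low, high)
```

Note that the count framing, as written, has no natural "maybe": how uncertainty is surfaced to the clinician is itself a consequence of the framing chosen.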
When remotely monitoring a patient’s health, predicting a health deterioration can similarly be framed in many ways. For example, do we predict which of the patient’s vital signs will deteriorate generally? And, over what time frame? Or do we predict that a patient will have a specific disease exacerbation? Again, the framing of this problem yields essential considerations around the dataset required, the labels needed and the evaluation and on-going monitoring to establish sensitivity and specificity.
The correct framing of the problem, therefore, has a significant impact on the technical complexity of implementation, on data collection, on evaluation bias, on on-going monitoring, and on dataset requirements. Consequently, we believe the FDA should incorporate requirements for precise framing of the problem into the pre-specification, along with the risks that this implies.
Changing how the problem is framed should likely require a change to the indications for use and therefore a further pre-market submission. However, the precise definition of this rationale up front will help better characterize and justify the risks during pre-approved model and feature changes.
The algorithm change protocol (ACP) presumes offline learning. That is, while the FDA’s proposals consider the inclusion of real-world data into algorithmic improvements, these are still considered in a relatively ‘mini-batch’ incremental fashion, where updates are controlled and documented. It is crucial that the FDA consider online learning, such as active learning, incremental learning or reinforcement learning, when finalizing guidelines.
Online learning is inherently non-deterministic, and our current belief is that this kind of non-determinism is very risky in a critical healthcare setting without appropriate safeguards, such as the development of algorithms to supervise other algorithms. However, the applicability of online learning should be considered and clear guidance provided.
As an example, in remote patient monitoring, alarm fatigue can be a significant problem. By utilizing reinforcement learning, healthcare professionals can give feedback on the accuracy of an alarm. This feedback could, however, be incorporated non-deterministically into an AI/ML model or a user-guided threshold. Alternatively, labels received through the feedback loop could be used for feature engineering and offline learning, updating the model deterministically but with manual oversight. We believe the latter approach is more appropriate at this point for the healthcare environment, and more in the spirit of the FDA’s proposals, but the industry will benefit from clear guidance from the FDA on this topic.
Building mechanisms for explicability: aka Explainable Artificial Intelligence
A significant body of on-going research is developing novel explicable AI/ML methods. The motivation behind this is to help the user understand why an algorithm has, or conversely has not, made a decision. We strongly believe that explicability, or “explainability”, is critical to the development of trust in AI/ML in healthcare, and therefore to widespread and successful adoption. The battle will be won not on when it works, but on understanding why it works and, conversely, both when and why it doesn’t.
The US Defense Advanced Research Projects Agency (DARPA) has a significant program examining explainable AI. They describe this as:
(Reproduced with minor alterations for legibility from DARPA Explainable AI)
In healthcare, this means designing both the algorithm and the wider medical device product to deliver supplementary information that contextualizes any diagnostic or predictive output of the model. This context can be very important in allowing the healthcare professional to comprehend and reason about the output and ultimately make a reliable medical decision.
Explicability can come from the framing of model inputs and outputs, such as the percentage likelihood of an event, or from the classification of underlying model mechanics, such as the most salient features used in the prediction. It can also come from characterisation of input and output data in different ways, such as descriptive statistics, visualisations, comparisons to similar instances/outcomes etc. This implies both algorithmic and user-interface components that need to be outlined and risk-managed through a regulatory process.
Just this past week, a team from Google Brain released the paper ‘Human-Centered Tools for Coping with Imperfect Algorithms During Medical Decision-Making’ at the world’s premier human-factors in computing science conference. This paper examines an algorithmically-powered user interface for managing the uncertainty of medical image classification, to guide better medical decisions with uncertain predictions. We are sure these types of systems will become increasingly commonplace as AI/ML systems move from the lab to the hospital.
In another recent paper, ‘Deep learning predicts hip fracture using confounding patient and healthcare variables’, the authors proposed a deep learning approach for predicting hip fracture risk. Deep learning is notoriously difficult to understand, but incredibly effective (deep learning is a ‘black box model’ since its construction and computation of predictions are not easily understood). Crucially, the approach presented proposes fusing both image classification and demographic factors to improve classification, but also to help express the algorithm output in a way that improves human decision-making. We’re sure this type of algorithmic cooperation is going to become commonplace in successful AI/ML products.
When it comes to remote patient monitoring, we’ve found that rather than simply stating that this patient is going to deteriorate, it is much more helpful to accompany a prediction with information about how the patient’s baseline has changed, and provide context to the healthcare professional based on what the model has seen before. Displaying to the user the major determinants of the prediction helps to assess any unexpected anomalies, and build trust and understanding.
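As a rough illustration of what "displaying the major determinants" could look like, the sketch below assumes a simple linear risk score, where each vital sign contributes its weight multiplied by its change from the patient's baseline. The linear form, the weights and the feature names are all hypothetical and chosen for readability; this is not a description of any production model.

```python
from typing import Dict, List, Tuple


def top_determinants(weights: Dict[str, float],
                     baseline: Dict[str, float],
                     current: Dict[str, float],
                     k: int = 2) -> List[Tuple[str, float]]:
    """Rank features by the size of their contribution to a linear risk score.

    contribution = weight * (current value - patient baseline)
    """
    contributions = {
        name: weights[name] * (current[name] - baseline[name])
        for name in weights
    }
    # Largest absolute contribution first, so the clinician sees the main drivers.
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Hypothetical weights and readings, for illustration only.
example_weights = {"respiration_rate": 0.5, "pulse_rate": 0.1, "spo2": -0.4}
example_baseline = {"respiration_rate": 16.0, "pulse_rate": 70.0, "spo2": 97.0}
example_current = {"respiration_rate": 22.0, "pulse_rate": 75.0, "spo2": 93.0}

for name, value in top_determinants(example_weights, example_baseline, example_current):
    print(f"{name}: {value:+.2f}")
```

For more complex models the contributions would have to come from a dedicated attribution method, but the user-facing idea is the same: show what changed relative to baseline and how much it moved the prediction.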
We expect mechanisms of explainability to be among the most requested features for anyone building AI/ML systems used in the real world. AI/ML product developers should factor this into their product specifications at an early stage, and expect iterations. Those iterations may have effects on human factors, which is something conventional medical devices have long had to consider.
The FDA specifically considers the importance of transparency within SaMD development but does not explicitly discuss the topic of explicability. We feel that building explainable ML models is a significant step forward towards model transparency and this should, therefore, be incorporated into the FDA’s ML guidelines in the future, along with guidelines on how such mechanisms may evolve.
Workflow and the impact of the real world
The FDA proposals may underestimate some of the data challenges which we think will be common in the real world: namely, data artefacts caused by the variability of human behavior and of workflow. The highly influential Google research paper “Machine Learning: The High Interest Credit Card of Technical Debt” explored the challenges of managing complex ML models in a vast, fast-moving organization such as Google. This paper is highly relevant, as modern healthcare delivery is similarly complex, vast and fast-moving, and involves many interoperating actors and systems.
A machine learning model is inherently dependent on external data. Changing products, human processes, or indeed changing other models which are providing, either explicitly or implicitly, input to other models creates a huge entanglement of dependencies which will be very difficult to truly understand.
For many AI/ML models, input data will come from other proprietary devices and from variable (and potentially biased) human and clinical processes. The output of other models may be responsible for some inputs, e.g. a readmission score based on a questionnaire, or a patient-generated symptom diary. These too can be changed at any time or may not be available. Clinical workflow cannot be controlled and guaranteed, and it certainly cannot be assumed ‘locked’. Some models may be very sensitive to changes in subtle, nuanced and unexpected ways. Risks can be mitigated, but third-party external dependencies can never be completely assured.
For example, at Google, how people interact with the search results page will depend on how the results are presented to them. If you alter the presentation of the search results page, whether visually or algorithmically, people’s behaviour may also change via a hidden feedback loop. Predictive models based on the old interface may no longer work. It is for this same reason that training an algorithm to trade securities based solely on patterns in past trading history will not make you a millionaire the next day: the world changes.
As a good example of unexpected input, take a look at the frequency distributions of respiration rate and pulse rate below. These were collected by healthcare staff in a study comparing manually collected observations to those from Current Health’s platform. It is easy to spot the human influence.
Respiration Rate Frequency Distribution:
See the ‘hedgehog-like’ spikiness of this distribution – this is caused by a human preference for even RR observations during manual observations. It can also be caused by underlying mathematics and rounding in some respiration rate algorithms, which vary between devices.
Heart Rate Frequency Distribution:
In this similar example, note how observations are snapped to multiples of 10 up to around 100 bpm. Humans like to round to easy-to-remember numbers like 60, 70 and 80.
Subtle human and algorithmic factors in the system are skewing the distribution. If now used as training data, this will have ramifications on the output machine learning model. And it will vary across sites, based on their staff training and based on the selection of third-party devices used.
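One simple way to quantify this kind of skew before a dataset is used for training is to compare how often values land on "preferred" numbers against how often they would if terminal digits were uniformly distributed. The sketch below is a minimal illustration under that uniformity assumption; the function name and the sample readings are hypothetical, not study data.

```python
from typing import Sequence


def digit_preference_ratio(values: Sequence[int], base: int) -> float:
    """Observed share of multiples of `base`, divided by the expected share (1 / base).

    A ratio near 1.0 suggests no preference; a ratio well above 1.0 suggests
    that observations are being rounded to multiples of `base`.
    Assumes integer observations and uniformly distributed remainders.
    """
    if not values:
        raise ValueError("no observations supplied")
    hits = sum(1 for v in values if v % base == 0)
    return hits * base / len(values)


# Hypothetical pulse-rate readings (bpm), for illustration only.
manual_readings = [60, 70, 70, 80, 72, 80, 90, 60, 65, 70]
device_readings = [61, 67, 72, 74, 78, 80, 83, 85, 89, 93]

print(digit_preference_ratio(manual_readings, base=10))  # rounding to tens
print(digit_preference_ratio(device_readings, base=10))
```

The same function with `base=2` gives a crude check for the even-number preference seen in manually recorded respiration rates.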
Consider, for example, an ML model that receives input from multiple third-party systems, including electronic medical records (EMR), healthcare staff, and ICU monitors. Each of these systems may be implemented subtly differently depending on brand, and there is inherent bias and subjectivity in human reporting. And yet, interoperability throughout the healthcare stack is critical to successful healthcare delivery.
The FDA gives considerable importance to the operator’s manual and the warnings and guidance contained within that manual. We believe the FDA should insist on a ‘technical’ instructions for use (IFU) document, providing a clear input/output specification and any associated requirements, warnings, and considerations. It should be a requirement that this is shared with other systems and vendors that act as inputs to, or consumers of outputs from, the AI/ML model.
For critical systems, we feel there should be provisions for explicit input data validation. Expected constraints on input data can then be explicitly asserted for correctness (namely, some form of online data verification). This is an ongoing process: in essence, engineering a built-in safeguard, where appropriate, to deal with uncontrollable data risks when a model is running online at arm’s length. This approach is standard practice, for example, in avionics when managing sensor readings, where everything is bounded and guarded. This is a crucial aspect we feel should be part of the SPS. Many complex models will happily accept junk; they will typically just run and give junk output, and that is dangerous. Garbage in, garbage out.
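A minimal sketch of such a guard is shown below: every input is checked against an explicit bound before the model is allowed to run, and the model is skipped entirely when a check fails. The field names, the numeric limits and the `guarded_predict` helper are hypothetical and purely illustrative; real limits would come from the device's input specification and are not clinical guidance.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Bound:
    low: float
    high: float


# Illustrative limits only (assumptions, not clinical guidance).
BOUNDS: Dict[str, Bound] = {
    "respiration_rate": Bound(4, 60),   # breaths per minute
    "pulse_rate": Bound(20, 250),       # beats per minute
    "spo2": Bound(50, 100),             # percent
}

Sample = Dict[str, Optional[float]]


def validate_inputs(sample: Sample) -> List[str]:
    """Return a list of problems; an empty list means the sample passed."""
    problems = []
    for name, bound in BOUNDS.items():
        value = sample.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        # NaN fails every comparison, so it is rejected here as well.
        elif not (bound.low <= value <= bound.high):
            problems.append(f"{name}: {value} outside [{bound.low}, {bound.high}]")
    return problems


def guarded_predict(model: Callable[[Sample], float],
                    sample: Sample) -> Tuple[Optional[float], List[str]]:
    """Run the model only on validated input; otherwise return no score."""
    problems = validate_inputs(sample)
    if problems:
        return None, problems
    return model(sample), []
```

Returning "no score, and here is why" rather than a number forces the surrounding product to handle the failure explicitly, which is the bounded-and-guarded behaviour described above.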
Consideration for closed loop effects
In a similar vein to the previous section, deploying any model will cause data drift over time through a closed feedback loop, driven by the influence of the system itself and of other systems interacting around it. For example, if an AI/ML model is highly sensitive and specific at predicting readmission, this will likely produce a change in clinical perception and decision making. This creates a feedback loop which may invalidate or otherwise change the model in difficult-to-predict ways across sites.
Consideration should be built into the SPS and ACP for how this will be robustly monitored. This may require consideration beyond typical AI/ML evaluation metrics, and could involve publishing guidance to those charged with managing the model in a clinical setting (e.g. healthcare IT and clinical directors).
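One common way to monitor an input for drift is to compare its distribution in live use against the distribution seen at training time. The sketch below uses the population stability index (PSI) for this; choosing PSI, the bin edges and the sample values are our own illustrative assumptions, and any alerting threshold would need to be justified per model and per site.

```python
import math
from typing import List, Sequence


def histogram_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values in each bin [edges[i], edges[i+1]); the last bin is closed.

    Values outside the edges are ignored, so edges should span the plausible range.
    """
    counts = [0] * (len(edges) - 1)
    last = len(edges) - 2
    for v in values:
        for i in range(len(edges) - 1):
            in_bin = edges[i] <= v < edges[i + 1]
            if in_bin or (i == last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [c / total for c in counts]


def population_stability_index(reference: Sequence[float],
                               live: Sequence[float],
                               edges: Sequence[float],
                               eps: float = 1e-6) -> float:
    """PSI = sum over bins of (live - ref) * ln(live / ref); 0 means identical shares."""
    ref_shares = histogram_shares(reference, edges)
    live_shares = histogram_shares(live, edges)
    psi = 0.0
    for r, c in zip(ref_shares, live_shares):
        r, c = max(r, eps), max(c, eps)  # avoid log(0) for empty bins
        psi += (c - r) * math.log(c / r)
    return psi
```

A check like this only sees the inputs. Detecting that clinicians have changed their behaviour in response to the model would also require tracking outcomes and interventions over time.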
We believe there should be a consideration for planned obsolescence of models in the SPS or pre-market as part of the total product life cycle approach the FDA is taking. Issues such as an ML algorithm manufacturer being sold or entering bankruptcy can and will happen. And the risks of inputs changing over time will increase. Given the criticality of on-going monitoring, how will this be handled in the event of obsolescence? What if that model is now embedded within a larger system?
These issues should be discussed up front and required in the SPS.
Evaluation and validation
Metrics based on validation sets offer only a narrow view. The closed feedback loop discussed previously will introduce many latent empirical and temporal effects. To appropriately characterize these, it is our view that the SPS should consider multiple validation datasets, with a best and worst case defect analysis to represent the population and time-based differences.
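As a sketch of what this might look like in practice, the snippet below computes sensitivity and specificity separately for several validation sets (for example, one per site or time period) and reports the best and worst case rather than a single pooled figure. The dataset names, labels and the choice to rank by sensitivity are hypothetical assumptions for illustration.

```python
from typing import Dict, List, Tuple

Dataset = Tuple[List[int], List[int]]  # (true labels, predicted labels), 1 = event


def sensitivity_specificity(labels: List[int], predictions: List[int]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("each dataset needs at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


def best_and_worst(datasets: Dict[str, Dataset]):
    """Evaluate every dataset and name the best and worst by sensitivity."""
    results = {name: sensitivity_specificity(y, p) for name, (y, p) in datasets.items()}
    best = max(results, key=lambda name: results[name][0])
    worst = min(results, key=lambda name: results[name][0])
    return results, best, worst


# Hypothetical validation sets, for illustration only.
validation_sets = {
    "site_a_2018": ([1, 1, 0, 0], [1, 1, 0, 0]),
    "site_b_2019": ([1, 1, 0, 0], [1, 0, 0, 1]),
}
results, best, worst = best_and_worst(validation_sets)
print(results, best, worst)
```

Reporting the spread makes it harder for a strong pooled metric to hide a site or period where the model performs poorly.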
The FDA’s proposals are an outstanding first step. Moreover, we appreciate the FDA providing them to the wider AI/ML industry at a time when they are malleable and can be appropriately influenced. Ultimately, we are all optimizing to improve healthcare delivery and patient care. AI/ML offers significant benefits to patients and to the working lives of healthcare professionals. We believe the points raised here are important for the successful, pragmatic implementation of AI/ML within healthcare and medicine.