Friday, April 17, 2015

Freestyle Libre: questions, getting it to fail and possible answers.

Some CGM behavior characteristics are often a bit tricky for newcomers. The main one can be summarized as "CGM measures interstitial fluid glucose values and as such will trail blood glucose values by 15 minutes." New users are told not to trust CGM values when glucose levels are changing rapidly. They are told not to calibrate their devices when glucose levels aren't stable.

The scenario is always the same: a new user pops up somewhere and complains about their CGM's accuracy. Dozens of helpful users show up and explain the above point. A year ago, I would probably have been among the helpers.

There is indeed some very solid evidence to back those assertions. The scientific literature has consistently reported a delay. The field of BG <> IG exchanges and equilibrium has been studied extensively. A good summary can be found in A Tale of Two Compartments: Interstitial Versus Blood Glucose Monitoring. This article is a must-read if you are interested in what really goes on beneath your skin in terms of G exchanges. Please note that there is some mild, polite disagreement on the exact length of the delay, which has been reported to be anywhere from as little as 6 minutes to as much as 15 minutes. Several mathematical models have been developed to model the exchanges and the correlation between IG and BG under different circumstances. There is also, believe it or not, a delay of around 1.5 minutes between the concentration of Interstitial Glucose and its concentration in the sensor (because the sensor is protected by a selective membrane).

About a year ago, I visualized the process as two reservoirs: the blood reservoir, receiving a certain input of G (food, gluconeogenesis), connected by a pipe that would allow a certain flow to the interstitial reservoir, which in turn would leak G (for tissue consumption). I wasn't that far from the truth, except that my model was very basic. It is of course a bit more complicated in reality as explained in [SORRY WRONG LINK - WILL FIX] Estimating Plasma Glucose from Interstitial Glucose: The Issue of Calibration Algorithms in Commercial Continuous Glucose Monitoring Devices (which is probably not the most complex paper on the issue...).
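To make the two-reservoir picture concrete, here is a minimal sketch in Python. It is a toy, not any manufacturer's model: it assumes the interstitial reservoir simply chases the blood reservoir with a single time constant, dIG/dt = (BG - IG) / tau. The function name `simulate_interstitial`, the 10-minute `tau_min` and the 2 mg/dL per minute ramp are all my own illustrative choices.

```python
def simulate_interstitial(bg_series, tau_min=10.0, dt_min=1.0):
    """Toy two-reservoir model: Euler integration of dIG/dt = (BG - IG) / tau.

    bg_series: blood glucose values (mg/dL), one every dt_min minutes.
    tau_min:   assumed time constant of the BG -> IG exchange (hypothetical value).
    Returns a list of interstitial glucose values of the same length.
    """
    ig = [bg_series[0]]  # assume both reservoirs start in equilibrium
    for bg in bg_series[:-1]:
        ig.append(ig[-1] + dt_min * (bg - ig[-1]) / tau_min)
    return ig


# BG is flat at 100 mg/dL for 10 minutes, then rises 2 mg/dL per minute for an hour.
bg = [100.0] * 10 + [100.0 + 2.0 * t for t in range(1, 61)]
ig = simulate_interstitial(bg)

# On a steady ramp, the gap between the reservoirs settles near slope * tau.
gap = bg[-1] - ig[-1]
print(f"BG {bg[-1]:.0f} mg/dL, IG {ig[-1]:.1f} mg/dL, gap {gap:.1f} mg/dL")
```

In this toy, IG is not wrong, it is late: on a steady rise the gap settles at roughly the slope multiplied by the time constant, which is why a fixed "trails by N minutes" rule only holds while the slope is constant.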

Now, the interesting thing about having a model is that you can make predictions. We've all heard about the global warming models and their different predictions. Keep that in mind...

Questions, questions...

... and allow me to backtrack a minute. Over the last year or so, a few things kept bothering me.

First, while the small Dexcom calibration analysis I carried out on user-submitted files clearly showed a couple of things such as the relative lack of accuracy of the Dexcom in low ranges and the consequences of calibrating in that range, I failed to identify an impact of the rate of change on subsequent accuracy. Why could I not find the obvious everyone was talking about? In a way, that was understandable since users were generally following the rule not to calibrate when G is changing. But even when I intentionally cherry-picked the cases where they ignored the rule, I could not find an impact nearly as significant as that of low range calibration.

Then, when Dexcom released its AP 505 algorithm, it appeared that the receivers suddenly started to trust the values they were receiving more often. That was a bit puzzling: how could simply upgrading a receiver suddenly transform data from an unchanged sensor/transmitter into "better" data? I speculated that the sensors had possibly been improved and that the transmitter hardware and firmware had been upgraded and that Dexcom simply expected the changes to percolate as new sensors were being sent to users.

Finally, the Libre behavior was puzzling at times. It would track wonderfully and suddenly jump the gun and display a value that was much higher than expected and higher than what the BG meter would say. Then, after a while, it would go back to extremely good tracking. Leaving aside the temperature issue, I noticed - thanks to our tennis sessions - that the Libre tracked standard G increases (meal not followed by physical activity) better than carb loading followed by exercise. The mystery deepened as soon as I gained a fair understanding of the raw data format. In some cases, the Libre would report spot check values that seemed to ignore what its own raw data said. Then, after a while, if those spot checks did not actually materialize, the historical data would simply act as if they had never existed.


The Libre constantly rewrites history! (but for a good cause from a clinical point of view). At that point, after having collected several typical situations, I became convinced there was a predictive part in the Libre's behavior and tried to develop a decent (but most certainly highly simplified) model of its behavior that led to decent pseudo CGM runs.

 One question lingered in my mind though: was applying a predictive model justified? Or was it simply because I missed a simpler explanation that I was forced to "cheat"?



I decided to run some tests again, starting from a very well behaved Libre sensor. As you can see on the left, that sensor performed almost flawlessly compared to our BG meter when we used it for scheduled tests. Data point 7 shows a hint of trigger happiness. Data point 11 was a compression event.

This extremely well behaved sensor (I can assure you that I have not cherry picked the best results: these are the only results we had for the first five days starting 16 hours after a pre-insertion) seemed to be an ideal candidate for reliable experiments.


Here's the result of an experiment intended to mislead the Libre, based on our understanding of its behavior. The blue line is my simple/direct interpretation of the raw Libre data: it is possibly inaccurate, but it has worked reliably in stable conditions for the first five days of the life of the sensor. Max eats a bit and starts moving. We start from a stable and accurate situation where the Libre and the BG meter still match. However, 15 minutes later, the Libre spot check gives 163 mg/dL while BG is stuck at 120 mg/dL. Has our perfectly matching Libre suddenly lost its marbles? Not really: it seems to have predicted the BG meter value based on the data it had roughly 10 minutes earlier. Twenty-two minutes later, BG has started to rise again and the Libre seems to be back on track based on the data it saw roughly ten minutes earlier. That could, of course, be a coincidence. However, keep in mind that it was a verification experiment designed explicitly to reproduce what we noticed before... Also worth noting is the fact that simple linear extrapolation doesn't work too well in general, but we'll get back to this below.


Going back to the roots to get answers.

If you've been reading this blog for a while, you've probably understood that I am more interested in the process of investigation than in reaching some kind of goal. The fun is to ask questions and to try to answer them. But sometimes you hit stumbling blocks, and that is when I start looking at the literature. That's what I did when it turned out that my pseudo-CGM model wasn't working too well in all cases.

I had several questions in mind, seemingly somewhat unrelated.

How could I explain that the Dexcom signal quality suddenly increased without obvious sampling hardware changes?

How could I explain that the Libre seemed to predict BG uncannily at times but also overshoot badly in some usage scenarios?


How could I explain, when I looked at the Dexcom calibration, that the actual BG slopes at calibration did not matter as much as they should?

One very interesting paper to start with was FreeStyle Navigator Continuous Glucose Monitoring System with TRUstart Algorithm, a 1-Hour Warm-Up Time where we learn that in the Freestyle Navigator...  

TRUstart corrects for the effect of interstitial glucose lag, and the window for calibration has been opened to rates up to ±3.5 mg/dl/min. Also, the acceptable glucose range for calibration was increased from 60–300 to 60–400 mg/dl, because data collected after the initial product was introduced have demonstrated sufficiently accurate calibration in the range of 300–400 mg/dl. 
and that
To obtain accurate calibration during times of glucose change, a first-order linear ordinary differential equation is used to describe the difference between blood and interstitial glucose.3 Using this model, the sensor current for sensitivity calculation is corrected for an average time lag of 10 min. The model requires an estimate of the rate of interstitial glucose change, which is calculated from the 1 min measurements ±7 min from the time of the BG calibration test.

What I learned from this article was
  1. slopes of +/- 35 mg/dL in ten minutes are officially considered acceptable for time corrected calibrations in the Freestyle Navigator. That's a significant slope! Since the Dexcom users who shared their files generally tried to calibrate when the trend was stable, that also explained why I couldn't spot any impact of the trend on the calibration accuracy (OK, while this article doesn't describe the Dexcom, there's no reason to assume Dexcom hasn't thought about correcting for slope as well).
  2. calibrations below 60 mg/dL are rejected. That information also confirmed what my calibration analysis showed about the impact of range (there are many other concerns in the literature about that low range anyway).
  3. that the notion of predicting blood glucose based on a model of the IG-BG exchanges was perfectly acceptable. We can, I believe, legitimately speculate that, if the Freestyle Navigator uses a predicted BG value to put its calibration value in the correct time frame, the Freestyle Libre could very well be using that predicted BG for its spot check values when conditions are changing. The algorithm could be augmented by safety rules that would, for example, not predict values if the model projected impossibly low BGs.
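The idea of predicting BG from IG can be sketched in a few lines. This is only my reading of the quoted description, not Abbott's implementation: I assume the first-order model tau * dIG/dt = BG - IG, so that BG can be estimated as IG + tau * dIG/dt, and I estimate the rate with an ordinary least-squares slope over the 1-minute readings within ±7 minutes. The function names, the least-squares choice and the sample numbers are mine. Note also that the paper applies the correction to the sensor current used for the sensitivity calculation, while this sketch works directly in glucose units for readability.

```python
def least_squares_slope(values, dt_min=1.0):
    """Slope (units per minute) of the best-fit line through evenly spaced values."""
    n = len(values)
    times = [i * dt_min for i in range(n)]
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den


def lag_corrected_estimate(window, tau_min=10.0, dt_min=1.0):
    """Estimate BG at the centre of an odd-length window of IG readings.

    Assumes tau * dIG/dt = BG - IG, hence BG = IG + tau * dIG/dt.
    tau_min = 10 is the average lag mentioned in the quoted paper.
    """
    centre = window[len(window) // 2]
    return centre + tau_min * least_squares_slope(window, dt_min)


# 15 one-minute readings (+/- 7 min) rising 2 mg/dL per minute, centred on 140 mg/dL.
window = [126.0 + 2.0 * i for i in range(15)]
estimate = lag_corrected_estimate(window)
print(f"IG now: {window[7]:.0f} mg/dL, estimated BG: {estimate:.0f} mg/dL")
```

With a 10-minute lag and a slope of 2 mg/dL per minute, the estimate sits 20 mg/dL above the current IG reading. That is also what a spot check would look like if it reported the predicted BG rather than the measured IG, and why it would overshoot whenever the rise stops sooner than the slope suggests.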
And finally, the reference to the first-order linear differential equation leads us to the notion of derivatives. Derivatives are easy if you are working with schoolbook equations (symbolic differentiation), but they become much harder if you are dealing with a function specified by the data you are measuring. Leaving the approximation of that function aside, we hit the problem of noise. Measurements typically contain a certain amount of noise and, unfortunately, that noise will be amplified by differentiation if the data isn't smoothed. The reference in the second quote links to this 1999 paper Subcutaneous glucose predicts plasma glucose independent of insulin: implications for continuous monitoring

which states that
Although the correction is essentially perfect in the absence of noise, the addition of even a small amount of noise (0.75% noise) dramatically degrades the  calibrated sensor signal
(which will not be a surprise to anyone with a background in signal analysis)

and then proposes to correct the problem by using a three-point moving average, not of the values, but of the derivative terms. Three values? Sensitivity to noise? That is also reminiscent of the Dexcom's behaviour and could explain why a different algorithm, less sensitive to noise, could suddenly consider data to be usable and "clean" when it wasn't before. That may therefore be the reason why the Dexcom AP algorithm treats more data as clean than the previous version did: it could use either a totally different IG-BG model, or a more noise-robust differentiation algorithm.
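Here is a small Python sketch of why differentiation hates noise and what a three-point moving average of the derivative terms buys you. It is an illustration under my own assumptions, not the scheme from the 1999 paper: the clean signal is a plain 1 mg/dL per minute ramp, the noise is Gaussian at 0.75% of the value (the figure from the quote), and the derivative is a simple one-minute finite difference.

```python
import random
import statistics


def finite_differences(values, dt_min=1.0):
    """Point-to-point rate of change (units per minute)."""
    return [(b - a) / dt_min for a, b in zip(values, values[1:])]


def moving_average_3(values):
    """Centred three-point moving average (drops the first and last points)."""
    return [sum(values[i - 1:i + 2]) / 3.0 for i in range(1, len(values) - 1)]


rng = random.Random(0)  # fixed seed so the run is repeatable
true_slope = 1.0        # mg/dL per minute, hypothetical clean signal
clean = [100.0 + true_slope * t for t in range(200)]
noisy = [v * (1.0 + rng.gauss(0.0, 0.0075)) for v in clean]  # 0.75% noise

raw_rates = finite_differences(noisy)
smoothed_rates = moving_average_3(raw_rates)

raw_spread = statistics.pstdev(raw_rates)
smoothed_spread = statistics.pstdev(smoothed_rates)
print(f"true slope: {true_slope} mg/dL/min")
print(f"spread of raw derivative terms:      {raw_spread:.2f}")
print(f"spread of smoothed derivative terms: {smoothed_spread:.2f}")
```

A noise level that looks harmless on the glucose values themselves produces derivative terms that scatter far more widely than the true slope, and multiplying those by a 10-minute lag makes it worse. Averaging three consecutive derivative terms narrows the scatter considerably, at the price of reacting a little later to real changes.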

In a way, that was both bad and good news. While I was reassured that the core idea of my own interpretative algorithm is sound, I became painfully aware that it was a bit simplistic. Exploring a labyrinth in darkness is fun, but I probably look like a hopelessly naive idiot in the eyes of the real guys who are developing this thing. But that paper offered plausible explanations or validation for all the above questions!

Ah, one last question. What do I know about the noise? Very little actually. And data seems extremely hard to get. As the Roche researchers note in this paper
However, the glucose sensor raw data are not usually included in manufacturer publications, and therefore no quantitative statement can be made about comparing noise levels between different CGM systems.
This being said, the design of the Libre sensor on the TI platform is straightforward enough that it could give some insight...

Consequences

For standard users: as CGMs become more and more complex black boxes, common wisdom might be out of date. It may apply under some circumstances and simply be extremely wrong in others. The underlying model, if there is one, may decide not to kick in, might deliver too-good-to-be-true results, might overshoot, etc. There is no need to repeat old truths endlessly when the reality evolves rapidly.

As far as Artificial Pancreas teams are concerned, they would ideally need access to actual raw data and noise data, as close to the source and as unprocessed as possible. Running a secondary model on top of a possible black-box predictive model, such as the one apparently exposed in the spot checks of the Libre, is hopeless. An extrapolative model built upon another extrapolative model is asking for big trouble. Business-wise, that could mean that a deep collaboration with a CGM manufacturer is required and that ultimately those guys will get to control who releases what.

