Link Search Menu Expand Document

Working cache invalidation

  1. Designing a working solution
    1. Working cache invalidation
  2. Modeling
  3. Verification
  4. Retrospective

Designing a working solution

We need to account for inflight cache fill requests. Let’s remind ourselves what we know about them:

  • They are triggered in response to a read request.
  • They happen synchronously or at least near realtime so a response can be sent back to the client.

From a design perspective, then, we can set certain parameters:

  • No key corresponding to an inflight request will be evicted, as it is needed to respond to an active request. Therefore the key belongs in the cache.
  • We should process invalidation messages for keys corresponding to inflight requests. We know the key will be added to the cache soon, so it may be relevant.

Working cache invalidation

WriterWriterDatabaseDatabaseCache Invalidation QueueCache Invalidation QueueCacheCacheUpdates dataAdds updated key andversioned data to queuePolls queueReturns invalidation itemalt[Key not in Cache]Does nothingalt[Key in cache fill requests]Adds key and data frominvalidation message to cachealt[Key in Cache]Does nothingalt[Cache data olderthan invalidation]Replaces old versionwith data from message


We do have one modeling consideration. We mentioned that keys could be evicted at any time, which accounted for server crashes as well as caching policy. We can model a server crash as a cache fill failure followed by an eviction. Therefore, we can already account for this scenario without needing to extend our current model.

Show Code Show LaTex
----------------------------- MODULE cacheinvalidationv3 -----------------------------

CacheHandleInvalidationMessage == 
    /\ \E message \in invalidationQueue: 
        /\ \/ /\ cache[message.key] \in CacheHit
              \* Message needs to be newer than the cache
              /\ cache[message.key].version < message.version
           \* Or not in the cache, but with a pending fill request
           \/ /\ cache[message.key] \in CacheMiss
              /\ cacheFillStates[message.key].state # "inactive"
        \* Update item in cache
        /\ cache' = [cache EXCEPT 
                        ![message.key] = [
                            type |-> "hit",
                            \* Update to version in invalidation message
                            version |-> message.version
        \* Remove message from queue because handled
        /\ invalidationQueue' = invalidationQueue \ {message}

    /\ UNCHANGED <<cacheFillStates, database, counter>>

CacheIgnoreInvalidationMessage == 
    /\ \E message \in invalidationQueue: \* Dequeue invalidation queue in any order
        \* Ignore invalidation messages for messages not in cache
        /\ \/ /\ cache[message.key] \in CacheMiss
              \* and a fill is not occurring
              /\ cacheFillStates[message.key].state = "inactive"
           \* Or when the cache already has the same or larger version
           \/ /\ cache[message.key] \notin CacheMiss
              /\ cache[message.key].version >= message.version
        \* Remove message from queue to ignore
        /\ invalidationQueue' = invalidationQueue \ {message}
    /\ counter' = counter + 1
    /\ UNCHANGED <<cacheFillStates, database, cache>>

CacheEvict(k) ==
    /\ cache[k] \in CacheHit
    \* A key with a pending request will not be evicted
    /\ cacheFillStates[k].state = "inactive"
    /\ cache' = [cache EXCEPT ![k] = [type |-> "miss"]]
    /\ counter' = counter + 1
    /\ UNCHANGED <<database, cacheFillStates, 

\* Modification History
\* Last modified Wed Jun 15 13:08:32 MST 2022 by elliotswart
\* Created Tue Jun 14 20:36:02 MST 2022 by elliotswart


We run a larger test on this model.


    KEYS = {k1, k2}




And it passes:

State Name Total States Distinct States
Init 1 1
DatabaseUpdate 50885754 4832447
CacheStartReadThroughFill 5265822 2651080
DatabaseRespondToCacheFill 11230605 5633443
CacheCompleteFill 13332315 4578998
CacheIgnoreFill 11931301 0
CacheHandleInvalidationMessage 73569738 5147961
CacheIgnoreInvalidationMessage 53779966 1995678
CacheEvict 10224889 1057390

We trust this result more because we have seen earlier versions of the system fail in rational ways against the same invariants and properties.


There are a couple of key modeling takeaways from this series:

  • Implementing another module: We used the exact same cacherequirements module for the 4 iterations of our caches. Not only does this follow do not repeat yourself (DRY) principals, but it also shows how we can maintain requirement consistency across many implementations. While this was a simple example, it is possible to import much more subtle requirement modules such as “linearizable” and map them into your specific module. They can then provide Invariants and Properties that let you know if your solution is working correctly.
  • Compactness of specifications: The entire set of requirements and common data properties was 50 lines long with extensive comments. The final working cache specification was 220 lines long. This modeled and tested reasonably sophisticated cache invalidation logic. This is much less verbose and time-consuming than implementing the logic and tests in a standard programming languages. This is part of why formal modeling is a great next step after whiteboarding or diagramming. The output is small enough to be read, critiqued and tested in one sitting.
  • Fairness: While it can be tempting just to write WF_vars(Next) as part of your specification, you often need to break it up further. If we were to write that, evictions would HAVE to occur, or worse Database updates would HAVE to happen, meaning that we would not necessarily be able to catch temporal property violations. Without fairness, the cache could simply never update, which would break eventual consistency. That circumstance is something we have to prevent in code and/or with monitoring solutions. We can represent that effort as WF_vars(CacheFairness), providing fairness to all operations that MUST eventually occur.

So where do we go from here? There are two options. The next (and final) page in this series is about replicating the bug described by the Facebook paper.