Reproducing Facebook’s bug
Introduction
The Facebook blog post that inspired this series ended with a bug caught by Polaris, Facebook’s cache inconsistency observability tool. As with any bug found in production, it would have been better to catch it during the design process. In this post, we’ll try to build a model with enough detail to catch this bug.
This reconstruction is best thought of as historical fiction, the “what if the British won the Revolutionary War” version of Facebook’s cache behavior. Take it with a bag of salt.
Detective work
The bug report
Taken directly from the Facebook blog post.
Precondition: The cache filled metadata.
1. The cache tried to fill the metadata with version.
2. In the first round, the cache first filled the old metadata.
3. Next, a write transaction updated both the metadata table
and the version table atomically.
4. In the second round, the cache filled the new version data.
Here, the cache fill operation interleaved with the database
transaction. It happens very rarely because the racing window
is tiny. You might be thinking, “This is the bug.”. No. Actually,
so far everything worked as expected because cache invalidation is
supposed to bring the cache to a consistent state.
5. Later, cache invalidation came during an attempt to update the
cache entry to both the new metadata and the new version. This almost
always works, but this time it didn’t.
6. The cache invalidation ran into a rare transient error on the cache
host, which triggered the error handling code.
7. The error handler dropped the item in cache. The pseudocode looks like this:
drop_cache(key, version);
It says drop the item in cache, if its version is less than specified.
However, the inconsistent cache item contained the latest version. So
this code did nothing, leaving stale metadata in cache indefinitely.
This is the bug. We simplified the example quite a bit here. The actual
bug has even more intricacy, with database replication and cross region
communication involved. The bug gets triggered only when all steps above
occur and happen specifically in this sequence. The inconsistency gets
triggered very rarely. The bug hides in the error handling code behind
interleaving operations and transient errors.
Deriving Assumptions
| Assumption | Evidence |
|---|---|
| Database updates metadata and version transactionally | Step 3 |
| The metadata stored in cache, and the version of that metadata, are filled separately (non-transactionally) | Steps 1 and 4 |
| The metadata stored in cache, and the version of that metadata, are filled with separate database requests | Steps 1 and 4 |
| The cache fill of metadata or version can fail | Step 5 |
| Cache invalidation messages contain both the version and the metadata | Step 5 |
| Cache invalidation message processing can fail in a way that is not automatically resumed | Step 6 |
| When cache invalidation fails, an error handler is called that conditionally drops keys lower than the version on the message | Step 7 |
Additionally, we can gather information from other sources.
| Assumption | Evidence |
|---|---|
| The cache invalidation messages are guaranteed to be eventually delivered | TAO design paper |
The main question this raises: how exactly does Facebook use versions in the cache if they are allowed to get out of sync with the metadata? The answer seems to be: by leaning heavily on their cache invalidation solution (Steps 4 and 5). Because the cache will receive every invalidation message, no matter how far the version and metadata are out of sync, the LATEST cache invalidation message should fix it. From that we must infer:
Cache invalidation replaces cached metadata whose version is equal to or less than the version on the message.
The fundamental bug, then, is that the error handler for failed cache invalidations drops only strictly lesser versions, while the cache invalidation contract requires replacing equal or lesser versions.
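Previewing the model below, the whole difference fits in one comparison operator. A minimal sketch, with hypothetical operator names (they appear neither in Facebook’s post nor in the model):

\* What the invalidation contract requires: replace the cached entry whenever
\* the stored version is equal to OR less than the version on the message.
ContractRequiresReplace(storedVersion, messageVersion) ==
    storedVersion <= messageVersion

\* What the error handler actually implemented: drop the cached entry only
\* when the stored version is STRICTLY less than the version on the message.
ErrorHandlerDrops(storedVersion, messageVersion) ==
    storedVersion < messageVersion

\* The gap is the equal-version case: a stale metadata value that already
\* carries the latest version number is never dropped.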
Let’s see if we can catch this with a model.
The model
All the descriptions and assumptions are at a simplified level, as in the blog post, so we will model at that level of detail. Note: simplification in and of itself does not make a model invalid or even less helpful. Models should be constructed to catch bugs at a particular level of abstraction. A system may need multiple models, at different levels of abstraction, to describe it. Sometimes you can even connect them together to find bugs in their interactions, though this is very advanced.
As we create a model in an attempt to recreate a system, our biggest risk is overfitting—that is, tailoring our implementation too exactly to get our expected result. We can mitigate this somewhat with Occam’s Razor: trying to find the simplest model that fits the assumptions.
The other assumption we make is that the reported bug is the only bug in Facebook’s logic at this level of abstraction. Where data is not available from Facebook’s caching papers, we may substitute a fix from the prior articles in this series.
We get a state machine that we can express in the model below.
One thing to notice in the model is that it imports the same cacherequirements module we have used for the entire series. Those requirements should be sufficient to help us find the bug.
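That module is not reproduced in this post. As a rough idea of its shape, inferred only from the identifiers the model uses (DataVersion, CacheMiss, CacheHit) and the temporal consistency requirement being checked, it might look something like the sketch below; the real module from earlier in the series may differ.

--------------------------- MODULE cacherequirements ---------------------------
\* Illustrative reconstruction only; the actual module from the series may differ.
EXTENDS Naturals
CONSTANTS KEYS
VARIABLES database, cache

DataVersion == Nat
CacheMiss == [type: {"miss"}]
CacheHit == [type: {"hit"}, version: DataVersion]

\* A key is consistent when it is either absent from the cache or cached at
\* the current database version.
DatabaseAndCacheConsistent ==
    \A k \in KEYS:
        \/ cache[k] \in CacheMiss
        \/ /\ cache[k] \in CacheHit
           /\ cache[k].version = database[k]

\* The temporal requirement: the cache always eventually returns to a
\* consistent state.
AlwaysEventuallyConsistent == []<>DatabaseAndCacheConsistent
=================================================================================

With that context, the full model follows.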
----------------------------- MODULE facebookcacheinvalidation -----------------------------
EXTENDS Naturals
CONSTANTS
KEYS
VARIABLES
database,
\* Represents the metadata stored in the cache, modeled by its version number
cache,
\* Represents the version stored in the cache. This is what is used for comparisons,
\* to allow our model to decouple the ACTUAL metadata version from the STORED version
cacheVersions,
cacheFillStates,
invalidationQueue,
counter
\* We can still test with the same cache requirements we've been using this whole time
INSTANCE cacherequirements
vars == <<database, cache, cacheFillStates,
invalidationQueue, counter, cacheVersions>>
InvalidationMessage == [key: KEYS, version: DataVersion]
CacheFillState == [
state: {
"inactive",
"startfillmetadata",
"respondedtometadata", \* Next: CacheFillMetadata
"startfillversion",
"respondedtoversion" \* Next: CacheFillVersion
},
version: DataVersion]
CacheValue == CacheMiss \union CacheHit
TypeOk ==
/\ database \in [KEYS -> DataVersion]
/\ cache \in [KEYS -> CacheValue]
\* Cache versions are typed identically to cache
/\ cacheVersions \in [KEYS -> CacheValue]
/\ cacheFillStates \in [KEYS -> CacheFillState]
/\ invalidationQueue \in SUBSET InvalidationMessage
/\ counter \in Nat
Init ==
/\ database = [k \in KEYS |-> 0]
\* cache (metadata) and cacheVersions start empty together
/\ cache = [k \in KEYS |-> [type |-> "miss"]]
/\ cacheVersions = [k \in KEYS |-> [type |-> "miss"]]
/\ cacheFillStates = [k \in KEYS |-> [
state |-> "inactive",
version |-> 0]
]
/\ invalidationQueue = {}
/\ counter = 0
DatabaseUpdate(k) ==
LET updatedVersion == database[k] + 1 IN
/\ database' = [database EXCEPT
![k] = updatedVersion]
/\ invalidationQueue' =
invalidationQueue \union
{[key |-> k, version |-> updatedVersion]}
/\ UNCHANGED <<cache, cacheVersions, cacheFillStates, counter>>
CacheStartFillMetadata(k) ==
\* Fill only occurs if the cache is unset for that value
/\ cache[k] \in CacheMiss
/\ cacheFillStates[k].state = "inactive"
/\ cacheFillStates' = [cacheFillStates EXCEPT ![k].state = "startfillmetadata"]
/\ UNCHANGED <<database, cache, cacheVersions,
invalidationQueue, counter>>
DatabaseRespondWithMetadata(k) ==
/\ cacheFillStates[k].state = "startfillmetadata"
/\ cacheFillStates' = [cacheFillStates EXCEPT
![k].state = "respondedtometadata",
![k].version = database[k]
]
/\ UNCHANGED <<database, cache, cacheVersions,
invalidationQueue, counter>>
\* Metadata updated in cache
CacheFillMetadata(k) ==
/\ cacheFillStates[k].state = "respondedtometadata"
/\ cacheFillStates' = [cacheFillStates EXCEPT
![k].state = "inactive"
]
\* Represents cache metadata being updated
\* Does not check version
/\ cache' = [cache EXCEPT
![k] = [
type |-> "hit",
version |-> cacheFillStates[k].version
]
]
/\ UNCHANGED <<database, cacheVersions, invalidationQueue, counter>>
CacheStartFillVersion(k) ==
\* Fill only occurs if the cacheVersion is unset for that value
/\ cacheVersions[k] \in CacheMiss
/\ cacheFillStates[k].state = "inactive"
/\ cacheFillStates' = [cacheFillStates EXCEPT ![k].state = "startfillversion"]
/\ UNCHANGED <<database, cache, cacheVersions,
invalidationQueue, counter>>
DatabaseRespondWithVersion(k) ==
/\ cacheFillStates[k].state = "startfillversion"
/\ cacheFillStates' = [cacheFillStates EXCEPT
![k].state = "respondedtoversion",
![k].version = database[k]
]
/\ UNCHANGED <<database, cache, cacheVersions,
invalidationQueue, counter>>
\* Version updated in cache
CacheFillVersion(k) ==
/\ cacheFillStates[k].state = "respondedtoversion"
\* Fill empty versions
/\ \/ cacheVersions[k] \in CacheMiss
\* or newer versions
\/ /\ cacheVersions[k] \in CacheHit
/\ cacheVersions[k].version < cacheFillStates[k].version
/\ cacheFillStates' = [cacheFillStates EXCEPT
![k].state = "inactive"
]
\* Represents cache versions being updated
/\ cacheVersions' = [cacheVersions EXCEPT
![k] = [
type |-> "hit",
version |-> cacheFillStates[k].version
]
]
/\ UNCHANGED <<database, invalidationQueue, cache, counter>>
CacheIgnoreFillVersion(k) ==
/\ cacheFillStates[k].state = "respondedtoversion"
\* If we have a newer version in cache, ignore fill
/\ /\ cacheVersions[k] \in CacheHit
/\ cacheVersions[k].version >= cacheFillStates[k].version
/\ cacheFillStates' = [cacheFillStates EXCEPT \* Reset to 0
![k].state = "inactive",
![k].version = 0
]
/\ counter' = counter + 1
/\ UNCHANGED <<cache, cacheVersions,
database, invalidationQueue>>
CacheFailFill(k) ==
/\ cacheFillStates[k].state \in {"respondedtometadata", "respondedtoversion"}
/\ cacheFillStates' = [cacheFillStates EXCEPT
![k].state = "inactive",
![k].version = 0
]
/\ counter' = counter + 1
/\ UNCHANGED <<database, cache, cacheVersions,
invalidationQueue>>
CacheEvict(k) ==
/\ cache[k] \in CacheHit
/\ cacheFillStates[k].state = "inactive"
/\ cache' = [cache EXCEPT ![k] = [type |-> "miss"]]
/\ cacheVersions' = [cacheVersions EXCEPT ![k] = [type |-> "miss"]]
/\ counter' = counter + 1
/\ UNCHANGED <<database, cacheFillStates,
invalidationQueue>>
(***************************************************************************)
(* Invalidation message handling *)
(***************************************************************************)
UpdateFromInvalidationMessage ==
\E message \in invalidationQueue:
\* Can update with no version
/\ \/ /\ cache[message.key] \in CacheHit
/\ cacheVersions[message.key] \in CacheMiss
\* or with greater or equal version
\/ /\ cacheVersions[message.key] \in CacheHit
/\ cacheVersions[message.key].version <= message.version
\* Only applies when no fill is pending for this key
/\ cacheFillStates[message.key].state = "inactive"
(***********************************************************************)
(* Unlike fills from the database, the invalidation message contains *)
(* both version and metadata. *)
(***********************************************************************)
/\ cache' = [cache EXCEPT
![message.key] = [
type |-> "hit",
version |-> message.version
]]
/\ cacheVersions' = [cacheVersions EXCEPT
![message.key] = [
type |-> "hit",
version |-> message.version
]]
/\ invalidationQueue' = invalidationQueue \ {message}
/\ UNCHANGED <<cacheFillStates, database, counter>>
FailUpdateInvalidationMessageEvictKey ==
\E message \in invalidationQueue:
\* Can update with no version
/\ \/ /\ cache[message.key] \in CacheHit
/\ cacheVersions[message.key] \in CacheMiss
\* or with greater version
\/ /\ cacheVersions[message.key] \in CacheHit
/\ cacheVersions[message.key].version < message.version
\* Only applies when no fill is pending for this key
/\ cacheFillStates[message.key].state = "inactive"
\* Key is evicted from cache, to allow fresh cache fill
/\ cache' =
[cache EXCEPT
![message.key] = [type |-> "miss"]]
/\ cacheVersions' =
[cacheVersions EXCEPT
![message.key] = [type |-> "miss"]]
/\ invalidationQueue' = invalidationQueue \ {message}
/\ UNCHANGED <<cacheFillStates, database, counter>>
FailUpdateInvalidationMessageIgnore ==
\E message \in invalidationQueue:
\* If message version is lower or equal than cache version, do nothing
/\ cacheVersions[message.key] \in CacheHit
/\ cacheVersions[message.key].version >= message.version
/\ counter' = counter + 1
/\ invalidationQueue' = invalidationQueue \ {message}
/\ UNCHANGED <<cacheFillStates, database,
cache, cacheVersions>>
IgnoreInvalidationMessage ==
\E message \in invalidationQueue:
\* Ignore invalidation messages if a key is not in cache
/\ \/ /\ cache[message.key] \in CacheMiss
\* and a fill is not occurring
/\ cacheFillStates[message.key].state = "inactive"
\* or when the cache already has a larger version
\/ /\ cacheVersions[message.key] \in CacheHit
/\ cacheVersions[message.key].version > message.version
/\ invalidationQueue' = invalidationQueue \ {message}
/\ counter' = counter + 1
\* Don't update cache
/\ UNCHANGED <<cacheFillStates, database, cache, cacheVersions>>
CacheFairness ==
\/ \E k \in KEYS:
\/ CacheStartFillMetadata(k)
\/ DatabaseRespondWithMetadata(k)
\/ CacheFillMetadata(k)
\/ CacheStartFillVersion(k)
\/ DatabaseRespondWithVersion(k)
\/ CacheFillVersion(k)
\/ CacheIgnoreFillVersion(k)
\/ UpdateFromInvalidationMessage
\/ FailUpdateInvalidationMessageEvictKey
\/ FailUpdateInvalidationMessageIgnore
\/ IgnoreInvalidationMessage
(***************************************************************************)
(* Specification *)
(***************************************************************************)
Next ==
\/ \E k \in KEYS:
\* Database states
\/ DatabaseUpdate(k)
\* Cache states
\/ CacheStartFillMetadata(k)
\/ DatabaseRespondWithMetadata(k)
\/ CacheFillMetadata(k)
\/ CacheStartFillVersion(k)
\/ DatabaseRespondWithVersion(k)
\/ CacheFillVersion(k)
\/ CacheIgnoreFillVersion(k)
\/ CacheEvict(k)
\/ CacheFailFill(k)
\/ UpdateFromInvalidationMessage
\/ FailUpdateInvalidationMessageEvictKey
\/ FailUpdateInvalidationMessageIgnore
\/ IgnoreInvalidationMessage
Spec == Init /\ [][Next]_vars /\ WF_vars(CacheFairness)
CounterBound == counter =< 2
=============================================================================
\* Modification History
\* Last modified Thu Jun 16 16:19:54 MST 2022 by elliotswart
\* Created Tue Jun 14 20:36:02 MST 2022 by elliotswart
Note how the entire Facebook model fits in 320 lines of code. Also, note how similar it looks to the models we’ve been using this whole series. It feels like the whole model is at the same level of abstraction; we haven’t had to disproportionately model certain pieces just to model our assumptions.
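To run it with TLC, a configuration along these lines would work. The temporal property name is an assumption here, since it comes from the cacherequirements module that this post does not reproduce (it matches the sketch given earlier); in practice you would also add a state constraint bounding database versions to keep the state space finite.

\* facebookcacheinvalidation.cfg (illustrative; the property name is assumed)
SPECIFICATION Spec
\* A single key is enough to reproduce the trace below
CONSTANT KEYS = {k1}
\* Bound the number of failure/eviction actions explored
CONSTRAINT CounterBound
\* The eventual-consistency property from cacherequirements (name assumed)
PROPERTY AlwaysEventuallyConsistent
INVARIANT TypeOk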
Can we find the bug?
And when we run it we get a bug report that we can actually map, item by item, to the bug reported by Facebook.
Temporal properties were violated.
- 1. Initial predicate: Database at v0
  - cache: k1 = [type: miss]
  - cacheFillStates: k1 = [state: inactive, version: 0]
  - cacheVersions: k1 = [type: miss]
  - counter: 0
  - database: k1 = 0
  - invalidationQueue: {}
- 2. CacheStartFillMetadata: First round start
  - cache: k1 = [type: miss]
  - cacheFillStates: k1 = [state: startfillmetadata, version: 0]
  - cacheVersions: k1 = [type: miss]
  - counter: 0
  - database: k1 = 0
  - invalidationQueue: {}
- 3. DatabaseRespondWithMetadata
  - cache: k1 = [type: miss]
  - cacheFillStates: k1 = [state: respondedtometadata, version: 0]
  - cacheVersions: k1 = [type: miss]
  - counter: 0
  - database: k1 = 0
  - invalidationQueue: {}
- 4. CacheFillMetadata: “In the first round, the cache first filled the old metadata”
  - cache: k1 = [type: hit, version: 0]
  - cacheFillStates: k1 = [state: inactive, version: 0]
  - cacheVersions: k1 = [type: miss]
  - counter: 0
  - database: k1 = 0
  - invalidationQueue: {}
- 5. DatabaseUpdate: “Next, a write transaction updated both the metadata table and the version table atomically”
  - cache: k1 = [type: hit, version: 0]
  - cacheFillStates: k1 = [state: inactive, version: 0]
  - cacheVersions: k1 = [type: miss]
  - counter: 0
  - database: k1 = 1
  - invalidationQueue: {[key: k1, version: 1]}
- 6. CacheStartFillVersion: Second round started
  - cache: k1 = [type: hit, version: 0]
  - cacheFillStates: k1 = [state: startfillversion, version: 0]
  - cacheVersions: k1 = [type: miss]
  - counter: 0
  - database: k1 = 1
  - invalidationQueue: {[key: k1, version: 1]}
- 7. DatabaseRespondWithVersion
  - cache: k1 = [type: hit, version: 0]
  - cacheFillStates: k1 = [state: respondedtoversion, version: 1]
  - cacheVersions: k1 = [type: miss]
  - counter: 0
  - database: k1 = 1
  - invalidationQueue: {[key: k1, version: 1]}
- 8. CacheFillVersion: “In the second round, the cache filled the new version data.”
  - cache: k1 = [type: hit, version: 0]
  - cacheFillStates: k1 = [state: inactive, version: 1]
  - cacheVersions: k1 = [type: hit, version: 1]
  - counter: 0
  - database: k1 = 1
  - invalidationQueue: {[key: k1, version: 1]}
- 9. FailUpdateInvalidationMessageIgnore: “The cache invalidation ran into a rare transient error on the cache host, which triggered the error handling code. However, the inconsistent cache item contained the latest version. So this code did nothing”
  - cache: k1 = [type: hit, version: 0]
  - cacheFillStates: k1 = [state: inactive, version: 1]
  - cacheVersions: k1 = [type: hit, version: 1]
  - counter: 1
  - database: k1 = 1
  - invalidationQueue: {}
- Stuttering: “leaving stale metadata in cache indefinitely”
This trace (or some version of it) is the first thing that comes up when you run the model above.
Retrospective
There are several key takeaways:
- Implementing another module: Again we used the exact same cacherequirements module for all iterations of our caches. This shows the power of writing requirements in TLA+ and having other modules implement them.
- Compactness of specifications: The entire set of requirements and implementation was 370 lines of code with plentiful comments. The detail and level of abstraction remained relatively consistent, supporting the representativeness of the spec.
- Potential to catch bugs before production: The specification is detailed enough that the author of the error handler could have ensured (and even unit tested) conformance to the spec; one way to write that conformance condition down is sketched after this list.
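To make that concrete, here is a hedged sketch of corrected guards for the error-handling path, written against the model above; the operator names are hypothetical. Compare FailUpdateInvalidationMessageEvictKey, which uses a strict <, and FailUpdateInvalidationMessageIgnore, which uses >=.

\* Hypothetical corrected guards (not part of the original model). On a failed
\* invalidation, the cached entry must be dropped when the stored version is
\* equal to OR less than the version on the message...
MustEvictOnFailedInvalidation(message) ==
    \/ /\ cache[message.key] \in CacheHit
       /\ cacheVersions[message.key] \in CacheMiss
    \/ /\ cacheVersions[message.key] \in CacheHit
       /\ cacheVersions[message.key].version <= message.version  \* was <

\* ...and it may only be left alone when the cached version is strictly newer.
MayIgnoreFailedInvalidation(message) ==
    /\ cacheVersions[message.key] \in CacheHit
    /\ cacheVersions[message.key].version > message.version      \* was >=

With the equal-version case moved to the evict side, the stale metadata in the trace above would be dropped and refilled rather than left in the cache indefinitely.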
While this is obviously a simplified model of the actual behavior, there are two potential conclusions to draw:
- Modeling at this level of abstraction is useful, and other verification methods could be used to account for the additional complexity, such as regions, network partitions, database replication, etc.
- The model could be enhanced to a higher level of detail that reflects those complexities, modularized to allow for testing at different levels of abstraction, and then holistically tested (likely over days on a high-end machine).
A final reminder that this is still historical fiction, and likely deviates from the Facebook implementation (even at this level of abstraction) in a number of ways.