Link Search Menu Expand Document

(Verifying Correctness) A working solution

  1. Designing the working solution
    1. Write Profile
    2. Read Profile
    3. Assumptions
  2. Modeling the working solution
  3. Verifying the solution
    1. Single Server
    2. Two Servers
    3. Final large test
    4. Summary

Designing the working solution

The problem we faced in the last solution was that the images were overwriting each other in the blob store. To perform a write, the blob was overwritten, leaving the system in an inconsistent state that would be returned by the read. The solution to this is simple: We generate a UUID for image, store it, and then associate it with the database record. That way the database record and image change together in a transactional manner.

Write Profile

ClientClientServerServerDatabaseDatabaseBlob StoreBlob StoreSubmits requestGenerates UUID for imageWrites blob, with UUID as keyReturns statusWrites record with userId as keyRecord contains metadata andUUID of imageReturns statusalt[any failure]Returns fail / timeoutReturns success

Read Profile

ClientClientServerServerDatabaseDatabaseBlob StoreBlob StoreSubmits requestReads database record with userId as keyReturns statusalt[metadata not present in database]Returns empty recordReads blob, with stored UUID as keyReturns statusalt[any unhandled failure]Returns fail / timeoutReturns success

Assumptions

We assume the UUID is always unique. This isn’t strictly true. The likelihood of each UUID being non-unique is 1/(2.71 quintillion). However, for the purpose of our model, we treat the probability as 0 and move on.

Note: Deciding on what assumptions to make will be harder when the probability of failure is more likely. For example, a 1/(1 billon) event would happen to Facebook many times a day. It is critical to track your assumptions and back them up with observability/monitoring or other techniques.

Modeling the working solution

The state machine is unchanged, but the state definitions will be updated.

Show Code Show LaTex
---------------------------- MODULE working ----------------------------
EXTENDS Naturals, Sequences

CONSTANTS 
    USERIDS,
    SERVERS,
    METADATAS,
    IMAGES,
    \* This constant is added to allow us to assign UUIDs as blob store keys
    UUIDS 

VARIABLES
    databaseState,
    blobStoreState,
    serverStates,

    operations 


vars == <<databaseState, blobStoreState, serverStates, operations>> 

(***************************************************************************)
(* Strong Typing                                                           *)
(***************************************************************************)

UserIdVal == USERIDS \union {"UNSET"}
MetadataVal == METADATAS \union {"UNSET"}
ImageVal == IMAGES \union {"UNSET"}
UUIDVal == UUIDS \union {"UNSET"} \* added UUID type

(***************************************************************************)
(* Describes a database record. We need this now that it has to keep track *)
(* of which image UUID it is associated with.                              *)
(***************************************************************************)

DatabaseRecord == [
    metadata: MetadataVal,
    imageId: UUIDVal
]

(***************************************************************************)
(* Describes all possible states a server can be in. Unchanged since last  *)
(* example.                                                                *)
(***************************************************************************)
ServerStateVal == 
    [
        state: {
            \* current: 
            "waiting", \* next: StartWrite or StartRead
            \* after: StartWrite
            "started_write", \* next: WriteBlob or FailWrite
            \* after: WriteBlob
            "wrote_blob", \* next: WriteMetadataAndReturn or FailWrite
            \* after: StartRead
            "started_read", \* next: ReadMetadata
            \* after: ReadMetadata, ReadMetadataAndReturnEmpty
            "read_metadata" \* next: ReadBlobAndReturn
        }, 
        userId: UserIdVal,
        metadata:  MetadataVal,
        imageId: UUIDVal, \* Need to track imageId to perform a lookup
        image: ImageVal
    ]

\* This is an observability value, and we are still measuring the same thing
\* No changes are needed
OperationValue == [type: {"READ", "WRITE"}, 
                   userId: UserIdVal, 
                   metadata: MetadataVal, 
                   image:ImageVal]
                   

TypeOk ==
    \* Database state modified to hold database records
    /\ databaseState \in [USERIDS -> DatabaseRecord]
    \* Blob store uses UUIDs as keys rather than userIds
    /\ blobStoreState \in [UUIDS -> ImageVal]
    /\ serverStates \in [SERVERS -> ServerStateVal]
    /\ operations \in Seq(OperationValue)


Init ==
    \* Database record needs to be initialized differently
    /\ databaseState = 
            [u \in USERIDS |-> [metadata |-> "UNSET", imageId |-> "UNSET"]]
    \* Blob store is initialized with UUIDS
    /\ blobStoreState = [u \in UUIDS |->  "UNSET"]
    /\ serverStates = [s \in SERVERS |->  [state |-> "waiting",
                                           userId |-> "UNSET",
                                           metadata |-> "UNSET",
                                           imageId |-> "UNSET",
                                           image |-> "UNSET"
                                          ]]
    /\ operations = <<>>
    
(***************************************************************************)
(* State Machine: All of the states are functions of s (server), because   *)
(* the only actively modeled actors in this system are our servers, but    *)
(* there can be multiple working simultaneously.                           *)
(***************************************************************************)

(***************************************************************************)
(* Writes                                                                  *)
(***************************************************************************)

StartWrite(s) ==
    /\ serverStates[s].state = "waiting"
    /\ \E u \in USERIDS, m \in METADATAS, i \in IMAGES:
        /\ serverStates' = [serverStates EXCEPT
                                ![s].state ="started_write",
                                \* Set values for the upcoming write
                                ![s].userId = u, 
                                ![s].metadata = m,
                                ![s].image = i]
        \* Record the write for observability
        /\ operations' = Append(operations, 
                                    [ 
                                       type |-> "WRITE",
                                       userId |-> u,
                                       metadata |-> m,
                                       image |-> i
                                    ])
    /\ UNCHANGED <<databaseState, blobStoreState>> 


WriteBlob(s) ==
    LET currentState == serverStates[s] 
    IN
    /\ currentState.state = "started_write"
    /\ \E id \in UUIDS:
        (*******************************************************************)
        (* Guarantees a unique Id to simulate UUID. Note: If we run out of *)
        (* unset UUIDs, our system will just stop writing. We need to look *)
        (* out for this and ensure the set of UUIDs is large enough.       *)
        (*******************************************************************)
        /\ blobStoreState[id] = "UNSET" \* 
        /\ blobStoreState' = [blobStoreState EXCEPT 
                                ![id] = currentState.image ]
        \* Track Id to write to database
        /\ serverStates' = [serverStates EXCEPT
                                ![s].state ="wrote_blob",
                                ![s].imageId = id]
    /\ UNCHANGED <<databaseState, operations>>

\* Writing the database is now the last part of a write operation
WriteMetadataAndReturn(s) ==
    LET currentState == serverStates[s] 
    IN
        /\ currentState.state = "wrote_blob"
        /\ databaseState' = [databaseState EXCEPT 
                                ![currentState.userId] = [
                                    metadata |-> currentState.metadata, 
                                    \* Store imageId in database for read
                                    imageId |-> currentState.imageId] ]

        /\ serverStates' = [serverStates EXCEPT
                                    ![s].state ="waiting",
                                    ![s].userId ="UNSET",
                                    ![s].metadata = "UNSET",
                                    ![s].image = "UNSET",
                                    ![s].imageId = "UNSET"]
        /\ UNCHANGED <<blobStoreState, operations>>

FailWrite(s) ==
    /\ serverStates[s].state \in {"started_write", "wrote_blob"}
    /\ serverStates' = [serverStates EXCEPT
                                ![s].state ="waiting",
                                ![s].userId ="UNSET",
                                ![s].metadata = "UNSET",
                                ![s].image = "UNSET",
                                ![s].imageId = "UNSET"]
    /\ UNCHANGED <<databaseState, blobStoreState, operations>> 


(***************************************************************************)
(* Reads                                                                   *)
(***************************************************************************)

StartRead(s) ==
    \* Reading only starts when a server is waiting
    /\ serverStates[s].state = "waiting"
    /\ \E u \in USERIDS:
                serverStates' = [serverStates EXCEPT
                                    ![s].state ="started_read",
                                    ![s].userId =u]
   
    /\ UNCHANGED <<databaseState, blobStoreState>> 
    /\ UNCHANGED operations

\* If database record is present
ReadMetadata(s) ==
    LET currentState == serverStates[s] 
    IN
    /\ currentState.state = "started_read"
    \* Represents reading the metadata while the database record is set
    /\ databaseState[currentState.userId].metadata # "UNSET"
    /\ serverStates' = 
        [serverStates EXCEPT
            ![s].state ="read_metadata",
            ![s].metadata = databaseState[currentState.userId].metadata,
            \* Reads imageId from database
            ![s].imageId = databaseState[currentState.userId].imageId]
    /\ UNCHANGED <<databaseState, blobStoreState>>
    /\ UNCHANGED operations

\* If database record is not present
ReadMetadataAndReturnEmpty(s) ==
    LET currentState == serverStates[s] 
    IN
    /\ currentState.state = "started_read"
    \* Represents reading the metadata while the database record is unset
    /\ databaseState[currentState.userId].metadata = "UNSET"
    /\ serverStates' = [serverStates EXCEPT
                                ![s].state ="waiting",
                                ![s].userId ="UNSET",
                                ![s].metadata = "UNSET",
                                ![s].image = "UNSET",
                                ![s].imageId = "UNSET"]
                                            
    /\ operations' = Append(operations,
                            (***********************************************)
                            (* Returns an empty record                     *)
                            (***********************************************)
                            [ 
                               type |-> "READ",
                               userId |-> currentState.userId,
                               metadata |-> "UNSET",
                               image |-> "UNSET"
                            ])
    /\ UNCHANGED <<databaseState, blobStoreState>>
    
ReadBlobAndReturn(s) ==
    LET currentState == serverStates[s] 
    IN
    /\ currentState.state = "read_metadata"                               
    /\ operations' = Append(operations,
                            [ 
                               type |-> "READ",
                               userId |-> currentState.userId,
                               metadata |-> currentState.metadata,
                               \* Looks up image by imageId
                               image |-> blobStoreState[currentState.imageId]
                            ])
    /\ serverStates' = [serverStates EXCEPT
                            ![s].state ="waiting",
                            ![s].userId ="UNSET",
                            ![s].metadata = "UNSET",
                            ![s].image = "UNSET",
                            ![s].imageId = "UNSET"]
    /\ UNCHANGED <<databaseState, blobStoreState>>



(***************************************************************************)
(* Specification / Next                                                    *)
(***************************************************************************)
Next ==
    \* For every step, pick a server and have it advance one state
    \E s \in SERVERS: 
        \/ StartWrite(s)
        \/ WriteBlob(s) \* New step
        \/ WriteMetadataAndReturn(s) \* New step
        \/ FailWrite(s)
        \/ StartRead(s)
        \/ ReadMetadata(s) \* New step
        \/ ReadMetadataAndReturnEmpty(s) \* New step
        \/ ReadBlobAndReturn(s)


Spec == Init /\ [][Next]_vars


(***************************************************************************)
(* Invariants                                                              *)
(***************************************************************************)

\* Note that the success criteria hasn't changed this whole time

ConsistentReads ==
    \* If there are no operations, they are consistent
    \/ operations = <<>>
    \/ \A i \in 1..Len(operations): \* For every read operation
        LET readOp == operations[i] IN
        \/  /\ readOp.type = "READ"
            \* There must exist a write operation
            /\  \/ \E j \in 1..i: 
                    LET writeOp == operations[j] IN
                    /\ writeOp.type = "WRITE"
                    \* With the same data
                    /\ readOp.userId = writeOp.userId
                    /\ readOp.metadata = writeOp.metadata
                    /\ readOp.image = writeOp.image
                \/ \* Ignore unset reads
                    /\ readOp.metadata = "UNSET"
                    /\ readOp.image = "UNSET"
        \/ readOp.type = "WRITE" \* Ignore writes
         
(***************************************************************************)
(* This is used for model checker configuration so the simulation doesn't  *)
(* go on forever.                                                          *)
(***************************************************************************)

StopAfter3Operations ==
    Len(operations) <= 3

StopAfter5Operations ==
    Len(operations) <= 5

=============================================================================

Verifying the solution

Single Server

We’re pretty confident this solution is going to work, but we’ve been confident before. We’ll start our test using just a single server, s1.

SPECIFICATION Spec

CONSTANTS
    SERVERS = {s1}
    METADATAS = {m1, m2}
    USERIDS = {u1}
    IMAGES = {i1, i2}
    UUIDS = {ui1, ui2, ui3, ui4, ui5}

CONSTRAINT
    StopAfter3Operations

INVARIANT
    TypeOk
    ConsistentReads

We also limit our total operations (Read or Write) to 3. That might not sound like much, but it’s caught all our errors so far.

But this time, instead of errors, we get back a state space:

State Name Total States Distinct States
Init 1 1
StartWrite 81904 4844
WriteBlob 16300 16300
WriteMetadataAndReturn 16300 5040
FailWrite 21144 14224
StartRead 20476 20476
ReadMetadata 15000 15000
ReadMetadataAndReturnEmpty 5476 451
ReadBlobAndReturn 15000 760

This is all the states that were tested by the checker. And it worked! We’re not done yet, though; this is just a single server. Let’s see what happens when we throw a second one in the mix.

Two Servers

We are now testing with two servers, s1, and s2. We are again stopping after 3 operations.

SPECIFICATION Spec

CONSTANTS
    SERVERS = {s1, s2}
    METADATAS = {m1, m2}
    USERIDS = {u1}
    IMAGES = {i1, i2}
    UUIDS = {ui1, ui2, ui3, ui4, ui5}

CONSTRAINT
    StopAfter3Operations

This test passes too!

State Name Total States Distinct States
Init 1 1
StartWrite 1064288 29892
WriteBlob 287040 166620
WriteMetadataAndReturn 287040 59500
FailWrite 373296 42664
StartRead 266072 184372
ReadMetadata 206560 146800
ReadMetadataAndReturnEmpty 59512 1571
ReadBlobAndReturn 365600 4100

Note how the number of states has expanded drastically. This is to be expected; two servers can interact in a lot more ways than one server. So the two tests that have been guiding our entire development have passed! Are we done now? Not quite.

Final large test

Before we can be really confident, we have to run a larger test. We’re going with 4 servers this time, and way more types of images and metadata. We’re also going up to 10 operations. This is overall a much more representative test. Why didn’t we start with it?

CONSTANTS
    SERVERS = {s1, s2, s3}
    METADATAS = {m1, m2, m3}
    USERIDS = {u1, u2, u3, u4, u5}
    IMAGES = {i1, i2, i3}
    UUIDS = {ui1, ui2}

CONSTRAINT
    StopAfter10Operations

The state space grows exponentially on the number of constants. The previous tests completed in seconds. This test completed in hours. It’s easy to make a test that will take days. The recommended technique is to start small and work your way up. Finally, do a large test on a powerful machine. The TLA+ toolbox can even spin up fast cloud workers. If you get failures with small tests, a large test isn’t going to be any better. But if all your small tests are passing, it’s time to really stress test the solution. In general, running the biggest test you can afford to wait for is the right answer (unless you know particular details about your system that would make expanding the test unnecessary).

Here is the final state space. Look how large it is:

State Name Total States Distinct States
All States 2281234 728303

Precise state profiling needed to be turned off just so this model would run.

Summary

Now we can be confident in our solution. And this is how it will remain forever, perfect and untouched! I bet no one is going to want to add features. Oh wait…