
    Non-Blocking State Transfer

     

    What we have now

     

    Right now, JBoss Cache uses the JGroups FLUSH protocol to make sure any in-flight messages are received, and to prevent any further messages from being broadcast, so that a stable state can be transferred.

     

    While this provides a high level of data consistency, it is a brute force method that will not scale when there is a large amount of state to be transferred, since it effectively freezes the cluster during the state transfer period.

     

    An alternative - Non-Blocking State Transfer

     

    The alternative discussed here attempts to:

     

    • provide state to a new node joining a cluster or a partition quickly and effectively

    • provide consistency guarantees

    • not hold up the cluster at all so the rest of the cluster can proceed as usual even while state is being transferred

     

    This new approach would need MVCC to be implemented first, since non-blocking reads are necessary.  It also assumes that cache updates are idempotent, provided they are applied in the correct order.

     

    Assumptions

    1. Non-blocking reads are available (MVCC)
    2. Modifications are idempotent
    3. Streaming state transfer is present in the JGroups stack (to provide an open stream between the two instances)

    Approach

    Assume a 3-instance cluster, containing instances A, B, and C.  Instance D joins the cluster.

     

    All nodes track pending prepares from startup.  This additional overhead means that whenever a transaction enters its prepare phase it is recorded in a concurrent collection, and when the transaction commits or rolls back it is removed from that collection.
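
    A minimal sketch of what this tracking might look like follows; it assumes a hypothetical TransactionLogger component (not an existing JBoss Cache class) alongside the existing GlobalTransaction and WriteCommand types:

       // Illustrative sketch only: TransactionLogger is an assumed name.
       // Requires java.util.List and java.util.concurrent.ConcurrentHashMap/ConcurrentMap.
       public class TransactionLogger
       {
          // pending prepares, keyed by the transaction that issued them
          private final ConcurrentMap<GlobalTransaction, List<WriteCommand>> pendingPrepares =
                new ConcurrentHashMap<GlobalTransaction, List<WriteCommand>>();

          // called whenever a transaction enters its prepare phase
          public void prepareReceived(GlobalTransaction gtx, List<WriteCommand> modifications)
          {
             pendingPrepares.put(gtx, modifications);
          }

          // called when the transaction commits or rolls back
          public void transactionCompleted(GlobalTransaction gtx)
          {
             pendingPrepares.remove(gtx);
          }
       }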

     

    1. D asks A for state, and starts responding positively to all one-phase and two-phase prepares/commits, but does not log any transactions.

    2. A starts logging all transactions and non-transactional writes.

    3. A starts sending transient and persistent state to D.  This does not block on anything.

    4. D applies state.

    5. A starts sending the transaction log to D.
    6. A continues to stream the transaction log to D until the log is either empty or progress is no longer being made.
    7. Lack of progress occurs when repeated writes fail to reduce the size of the log.
    8. A waits for pending incoming and outgoing requests to complete and suspends new ones.
    9. A sends a marker indicating the need to stop all modifications on A and D.
    10. D receives the marker and unicasts a StateTransferControlCommand to A.
      1. On receipt of this command, A closes a latch that prevents its RPC dispatcher from sending or receiving any commands.
      2. D closes a similar latch on its own RPC dispatcher.
      3. Note that this latch does NOT block StateTransferControlCommands.
    11. These latches guarantee that other transactions originating at B or C will block in their communications to A or D until the latches are released (a sketch of such a latch appears after this list).

    12. D retrieves and applies the final transaction log, which should no longer be growing.

    13. D retrieves and applies all non-committed prepares.

    14. A sends a marker indicating transmission is complete.
    15. A resumes processing of incoming/outgoing requests.
    16. D unicasts another StateTransferControlCommand to A.

      1. This releases the latches on A.
      2. D also releases the similar latches on D.
    17. D sets its cache status to STARTED.
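
    The latch mentioned in steps 10 and 16 could be sketched roughly as below; the class and method names are assumptions, not existing JBoss Cache APIs, and only StateTransferControlCommands (from the steps above) pass through while the gate is closed:

       // Illustrative sketch only: a gate the RPC dispatcher consults before sending
       // or processing any command.  StateTransferControlCommands are never blocked.
       public class CommandGate
       {
          private boolean isOpen = true;

          public synchronized void await(Object command) throws InterruptedException
          {
             if (command instanceof StateTransferControlCommand) return;
             while (!isOpen) wait();   // block until the gate is reopened
          }

          public synchronized void close()
          {
             isOpen = false;
          }

          public synchronized void open()
          {
             isOpen = true;
             notifyAll();   // release any commands blocked in await()
          }
       }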

    Transaction Log

     

    This is a data structure that will contain an ordered list of:

     

       public static class LogEntry
       {
          // the transaction whose writes this entry records
          private final GlobalTransaction transaction;

          // the ordered modifications made by that transaction
          private final List<WriteCommand> modifications;
       }
    

     

    The receiving node will apply this log by starting a transaction using the given gtx, applying the modifications, and committing the transaction.
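
    A rough sketch of that replay loop is shown below; beginTransaction(), invoke() and commitTransaction() are hypothetical helpers, and LogEntry is assumed to expose simple getters, so only the order of operations matters here:

       // Sketch: replay received log entries strictly in the order they arrive.
       public void applyTransactionLog(List<LogEntry> entries) throws Exception
       {
          for (LogEntry entry : entries)
          {
             GlobalTransaction gtx = entry.getTransaction();
             beginTransaction(gtx);                     // start a tx under the given gtx
             for (WriteCommand modification : entry.getModifications())
             {
                invoke(modification);                   // apply each write in order
             }
             commitTransaction(gtx);
          }
       }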

     

    Capturing the transaction log

     

    It is imperative that the transaction log is captured in the order in which locks are acquired and transactions complete.  As such, in the Synchronization on the state sender (A), the transaction is added to the log in afterCompletion.  In addition, all non-committed prepares must be kept in a table indexed by gtx.  Once the gtx has completed, it is removed from the table.
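
    A sketch of how the sender might do this, assuming a javax.transaction.Synchronization registered per transaction, the TransactionLogger sketched earlier (with an assumed append() method) and a matching LogEntry constructor:

       // Sketch only: entries are appended in afterCompletion, so the log order
       // matches the order in which transactions actually completed on A.
       public class LoggingSynchronization implements javax.transaction.Synchronization
       {
          private final GlobalTransaction gtx;
          private final List<WriteCommand> modifications;
          private final TransactionLogger logger;

          public LoggingSynchronization(GlobalTransaction gtx,
                                        List<WriteCommand> modifications,
                                        TransactionLogger logger)
          {
             this.gtx = gtx;
             this.modifications = modifications;
             this.logger = logger;
          }

          public void beforeCompletion()
          {
             // nothing to do; the outcome is not yet known
          }

          public void afterCompletion(int status)
          {
             if (status == javax.transaction.Status.STATUS_COMMITTED)
             {
                logger.append(new LogEntry(gtx, modifications));
             }
             logger.transactionCompleted(gtx);   // drop from the pending-prepares table
          }
       }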

     

    Idempotency

     

    Idempotency is a requirement since the state read may or may not include a given update.  As such, all transactions recorded during the state generation process will have to be re-applied.  Still, this isn't a problem - even with node deletion, creation or moves - provided the transaction log is replayed in exactly the same order as it was applied on the node generating the state.
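
    A trivial illustration with plain java.util maps (not cache APIs): whether or not the streamed state already contains a given write, replaying the logged writes in their original order converges on the same result.

       Map<String, String> stateWithoutUpdates = new HashMap<String, String>();
       Map<String, String> stateWithUpdates = new HashMap<String, String>();
       stateWithUpdates.put("k", "v2");   // this snapshot already reflects the writes

       for (Map<String, String> state : Arrays.asList(stateWithoutUpdates, stateWithUpdates))
       {
          state.put("k", "v1");           // replayed log entry 1
          state.put("k", "v2");           // replayed log entry 2
       }
       // both maps now hold k=v2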

     

    Benefits

     

    • The cluster continues operating and is not held up

     

    Drawbacks

     

    • D may take longer to join as it would need to replay a transaction log after acquiring state

     

    Assumptions/Requirements

     

    • MVCC is in place to provide efficient non-blocking reads on the state provider

    • Cache updates are idempotent