Non-Blocking State Transfer
Please use this forum thread to discuss ideas/enhancements and provide comments and feedback.
JIRA: JBCACHE-1236
What we have now
Right now, JBoss Cache uses FLUSH in JGroups to make sure any in-flight messages are received and prevent any more messages from being broadcast so that a stable state can be transferred.
While this provides a high level of data consistency, it is a brute force method that will not scale when there is a large amount of state to be transferred, since it effectively freezes the cluster during the state transfer period.
An alternative - Non-Blocking State Transfer
The alternative discussed here attempts to:
provide state to a new node joining a cluster or a partition quickly and effectively
provide consistency guarantees
not hold up the cluster at all so the rest of the cluster can proceed as usual even while state is being transferred
This new approach would need MVCC locking to be implemented first, since non-blocking reads are necessary. It also assumes that cache updates are idempotent, provided they are applied in the correct order.
Assumptions
- Non-blocking reads are available (MVCC)
- Modifications are idempotent
- Streaming state transfer is present in the JGroups stack (to provide an open stream between the two instances)
Approach
Assume a 3-instance cluster, containing instances A, B, and C. Instance D joins the cluster.
All nodes track pending prepares since startup. The additional overhead is that whenever a transaction enters its prepare phase, it is recorded in a concurrent collection, and when the transaction commits or rolls back, it is removed from that collection.
D asks A for state and starts responding positively to all one-phase and two-phase prepares/commits, but doesn't log any transactions.
A starts logging all transactions and non-transactional writes.
A starts sending transient and persistent state to D. This does not block on anything.
D applies state.
- A starts sending the transaction log to D
- A continues to write the transaction log until the log is either empty, or progress is no longer being made.
- Lack of progress occurs when the log size is repeatedly not reduced after writing
- A waits for pending incoming and outgoing requests to complete and suspends new ones
- A sends a marker indicating the need to stop all modifications on A and D.
- D receives the marker and unicasts a StateTransferControlCommand to A.
- On receipt of this command, A closes a latch that prevents its RPC dispatcher from sending or receiving any commands.
- D also closes a similar latch on its RPC dispatcher
- Note that this latch does NOT block StateTransferControlCommands
These latches guarantee that other transactions originating at B or C will block in their communications to A or D until the latches are released.
D retrieves and applies the final transaction log, which should no longer be growing
D retrieves and applies all non-committed prepares
- A sends a marker indicating transmission is complete
- A resumes processing of incoming / outgoing requests
D unicasts another StateTransferControlCommand to A.
- This releases latches on A
- D also releases similar latches on D
- D sets its cache status to STARTED.
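The latch described in the steps above could look roughly like the following. This is a minimal sketch, not JBoss Cache code: DispatcherLatch and LatchedCommand are hypothetical names, and the boolean flag stands in for "this is a StateTransferControlCommand". A timeout is added only so the sketch can be exercised without hanging a thread.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a replicated command; not a JBoss Cache type. */
class LatchedCommand {
    final String name;
    final boolean stateTransferControl; // true for the control command only

    LatchedCommand(String name, boolean stateTransferControl) {
        this.name = name;
        this.stateTransferControl = stateTransferControl;
    }
}

/** Hypothetical latch placed in front of an RPC dispatcher. */
class DispatcherLatch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();
    private boolean closed;

    /** Closes the latch: ordinary commands will now wait. */
    void close() {
        lock.lock();
        try {
            closed = true;
        } finally {
            lock.unlock();
        }
    }

    /** Releases the latch and wakes every waiting command. */
    void release() {
        lock.lock();
        try {
            closed = false;
            released.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Returns true if the command may be sent or received, false if the
     * latch stayed closed for the whole timeout (or the thread was
     * interrupted). Control commands never wait.
     */
    boolean tryPass(LatchedCommand cmd, long timeoutMillis) {
        if (cmd.stateTransferControl) {
            return true;
        }
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (closed) {
                if (nanos <= 0L) {
                    return false;
                }
                nanos = released.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch, commands from B or C that reach A or D while the latch is closed simply wait in tryPass, which matches the blocking behaviour described above; the control command bypasses the check so that D can still tell A to release.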
Transaction Log
This is a data structure that will contain an ordered list of:
public static class LogEntry {
    private final GlobalTransaction transaction;
    private final List<WriteCommand> modifications;
}
The receiving node will apply this log by starting transactions using the given gtx, applying the modifications, and committing the transaction.
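To make the replay step concrete, here is a minimal sketch of a receiver applying such a log. It deliberately avoids the real GlobalTransaction and WriteCommand classes: ReplayWrite, ReplayLogEntry and LogReplayer are hypothetical stand-ins, the cache is modelled as a plain Map, and "begin" and "commit" are approximated by staging changes and then swapping them in.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a WriteCommand: a put, or a remove when value is null. */
class ReplayWrite {
    final String key;
    final String value;

    ReplayWrite(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

/** Mirrors the LogEntry shape: one transaction id plus its ordered modifications. */
class ReplayLogEntry {
    final String gtxId;
    final List<ReplayWrite> modifications;

    ReplayLogEntry(String gtxId, List<ReplayWrite> modifications) {
        this.gtxId = gtxId;
        this.modifications = Collections.unmodifiableList(new ArrayList<>(modifications));
    }
}

class LogReplayer {
    /**
     * Applies the entries in log order. Each entry is staged on a copy of the
     * cache and then swapped in as a unit. Returns the ids of the replayed
     * transactions, in the order they were committed.
     */
    static List<String> replay(List<ReplayLogEntry> log, Map<String, String> cache) {
        List<String> committed = new ArrayList<>();
        for (ReplayLogEntry entry : log) {
            // "Begin" a transaction under entry.gtxId: work on a private copy.
            Map<String, String> staged = new LinkedHashMap<>(cache);
            for (ReplayWrite write : entry.modifications) {
                if (write.value == null) {
                    staged.remove(write.key);
                } else {
                    staged.put(write.key, write.value);
                }
            }
            // "Commit": make the staged changes visible in one step.
            cache.clear();
            cache.putAll(staged);
            committed.add(entry.gtxId);
        }
        return committed;
    }
}
```

The important property is only that entries are applied strictly in list order; how a real receiver begins and commits each transaction is outside the scope of this sketch.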
Capturing the transaction log
It is imperative that the transaction log is captured in the order in which locks are acquired/transactions completed. As such, in the Synchronization on the state sender (A), the transaction is added to the log in afterCompletion. In addition, all non-committed prepares must be kept in a table indexed by gtx. Once the gtx has completed, it is removed from the table.
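A rough sketch of this bookkeeping on the state sender is shown below. TransactionLogCapture and its method names are hypothetical, transaction ids and modifications are simplified to strings, and the sketch assumes that only committed transactions are written to the log.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical capture helper for the state sender (A). */
class TransactionLogCapture {
    // Non-committed prepares, indexed by transaction id (the "gtx" table).
    private final Map<String, List<String>> pendingPrepares = new ConcurrentHashMap<>();
    // Completed transactions, in the order in which afterCompletion ran.
    private final ConcurrentLinkedQueue<Map.Entry<String, List<String>>> log =
            new ConcurrentLinkedQueue<>();
    private volatile boolean logging;

    /** Called when a state transfer begins and A must start logging. */
    void startLogging() {
        logging = true;
    }

    /** Called when a transaction enters its prepare phase. */
    void onPrepare(String gtxId, List<String> modifications) {
        pendingPrepares.put(gtxId, new ArrayList<>(modifications));
    }

    /** Called from the Synchronization once the transaction has finished. */
    void afterCompletion(String gtxId, boolean committed) {
        List<String> modifications = pendingPrepares.remove(gtxId);
        if (committed && logging && modifications != null) {
            log.add(new AbstractMap.SimpleImmutableEntry<>(gtxId, modifications));
        }
    }

    /** Ids of logged transactions, in completion order. */
    List<String> loggedIds() {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : log) {
            ids.add(entry.getKey());
        }
        return ids;
    }

    /** Ids of transactions that have prepared but not yet completed. */
    Set<String> pendingIds() {
        return new TreeSet<>(pendingPrepares.keySet());
    }
}
```

Note that the log order here is the completion order, not the prepare order: a transaction that prepares first but commits second is logged second.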
Idempotency
Idempotency is a requirement since the state that is read may or may not include a given update. As such, all transactions recorded during the state generation process will have to be re-applied. Still, this isn't a problem - even with node deletions, creation or moving - provided the transaction log is replayed in exactly the same order as it was applied on the node generating state.
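The following small sketch illustrates why this works when writes are absolute (put a value, remove a key) rather than relative (for example, increment). IdempotentReplay is a hypothetical helper and the cache is modelled as a flat map of strings: whichever subset of the logged updates the snapshot already contains, replaying the full log in order leads to the same final state.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IdempotentReplay {
    /**
     * Replays absolute writes in log order on top of a snapshot.
     * Each write is {key, value}; a null value means "remove the key".
     * The snapshot itself is left untouched.
     */
    static Map<String, String> replay(Map<String, String> snapshot, List<String[]> log) {
        Map<String, String> state = new TreeMap<>(snapshot);
        for (String[] write : log) {
            if (write[1] == null) {
                state.remove(write[0]);
            } else {
                state.put(write[0], write[1]);
            }
        }
        return state;
    }
}
```

For example, with the log "put a=1, put a=2, remove b, put c=3", a snapshot taken before any of these writes and a snapshot taken after the first two both end up as {a=2, c=3} after replay. Replaying the same log in a different order would not give this guarantee.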
Benefits
Cluster continues operations, and is not held up
Drawbacks
D may take longer to join as it would need to replay a transaction log after acquiring state
Assumptions/Requirements
MVCC is in place to provide efficient non-blocking READ on the state provider
Cache updates are idempotent