raft/rafttest: introduce datadriven testing

It has often been tedious to test the interactions between multi-member Raft groups, especially when many steps were required to reach a certain scenario. Often, this boilerplate was as boring as it is hard to write and hard to maintain, making it attractive to resort to shortcuts whenever possible, which in turn tended to undercut how meaningful and maintainable the tests ended up being - that is, if the tests were even written, which sometimes they weren't. This change introduces a datadriven framework specifically for testing deterministically the interaction between multiple members of a raft group with the goal of reducing the friction for writing these tests to near zero. In the near term, this will be used to add thorough testing for joint consensus (which is already available today, but wildly undertested), but just converting an existing test into this framework has shown that the concise representation and built-in inspection of log messages highlights unexpected behavior much more readily than the previous unit tests did (the test in question is `snapshot_succeed_via_app_resp`; the reader is invited to compare the old and new version of it). The main building block is `InteractionEnv`, which holds on to the state of the whole system and exposes various relevant methods for manipulating it, including but not limited to adding nodes, delivering and dropping messages, and proposing configuration changes. All of this is extensible so that in the future I hope to use it to explore the phenomena discussed in https://github.com/etcd-io/etcd/issues/7625#issuecomment-488798263 which requires injecting appropriate "crash points" in the Ready handling loop. Discussions of the "what if X happened in state Y" can quickly be made concrete by "scripting up an interaction test". Additionally, this framework is intentionally not kept internal to the raft package.. Though this is in its infancy, a goal is that it should be possible for a suite of interaction tests to allow applications to validate that their Storage implementation behaves accordingly, simply by running a raft-provided interaction suite against their Storage.
2024-09-27 06:25:44 +00:00 · 2019-08-12 11:13:51 +02:00
parent f57c16c271
commit e8090e57a2
25 changed files with 1613 additions and 132 deletions
--- a/raft/testdata/campaign.txt
+++ b/raft/testdata/campaign.txt
@@ -0,0 +1,117 @@
+log-level info
+----
+ok
+
+add-nodes 3 voters=(1,2,3) index=2
+----
+INFO 1 switched to configuration voters=(1 2 3)
+INFO 1 became follower at term 0
+INFO newRaft 1 [peers: [1,2,3], term: 0, commit: 2, applied: 2, lastindex: 2, lastterm: 1]
+INFO 2 switched to configuration voters=(1 2 3)
+INFO 2 became follower at term 0
+INFO newRaft 2 [peers: [1,2,3], term: 0, commit: 2, applied: 2, lastindex: 2, lastterm: 1]
+INFO 3 switched to configuration voters=(1 2 3)
+INFO 3 became follower at term 0
+INFO newRaft 3 [peers: [1,2,3], term: 0, commit: 2, applied: 2, lastindex: 2, lastterm: 1]
+
+campaign 1
+----
+INFO 1 is starting a new election at term 0
+INFO 1 became candidate at term 1
+INFO 1 received MsgVoteResp from 1 at term 1
+INFO 1 [logterm: 1, index: 2] sent MsgVote request to 2 at term 1
+INFO 1 [logterm: 1, index: 2] sent MsgVote request to 3 at term 1
+
+stabilize
+----
+> 1 handling Ready
+  Ready MustSync=true:
+  Lead:0 State:StateCandidate
+  HardState Term:1 Vote:1 Commit:2
+  Messages:
+  1->2 MsgVote Term:1 Log:1/2
+  1->3 MsgVote Term:1 Log:1/2
+> delivering messages
+  1->2 MsgVote Term:1 Log:1/2
+  INFO 2 [term: 0] received a MsgVote message with higher term from 1 [term: 1]
+  INFO 2 became follower at term 1
+  INFO 2 [logterm: 1, index: 2, vote: 0] cast MsgVote for 1 [logterm: 1, index: 2] at term 1
+> delivering messages
+  1->3 MsgVote Term:1 Log:1/2
+  INFO 3 [term: 0] received a MsgVote message with higher term from 1 [term: 1]
+  INFO 3 became follower at term 1
+  INFO 3 [logterm: 1, index: 2, vote: 0] cast MsgVote for 1 [logterm: 1, index: 2] at term 1
+> 2 handling Ready
+  Ready MustSync=true:
+  HardState Term:1 Vote:1 Commit:2
+  Messages:
+  2->1 MsgVoteResp Term:1 Log:0/0
+> 3 handling Ready
+  Ready MustSync=true:
+  HardState Term:1 Vote:1 Commit:2
+  Messages:
+  3->1 MsgVoteResp Term:1 Log:0/0
+> delivering messages
+  2->1 MsgVoteResp Term:1 Log:0/0
+  INFO 1 received MsgVoteResp from 2 at term 1
+  INFO 1 has received 2 MsgVoteResp votes and 0 vote rejections
+  INFO 1 became leader at term 1
+  3->1 MsgVoteResp Term:1 Log:0/0
+> 1 handling Ready
+  Ready MustSync=true:
+  Lead:1 State:StateLeader
+  Entries:
+  1/3 EntryNormal ""
+  Messages:
+  1->2 MsgApp Term:1 Log:1/2 Commit:2 Entries:[1/3 EntryNormal ""]
+  1->3 MsgApp Term:1 Log:1/2 Commit:2 Entries:[1/3 EntryNormal ""]
+> delivering messages
+  1->2 MsgApp Term:1 Log:1/2 Commit:2 Entries:[1/3 EntryNormal ""]
+> delivering messages
+  1->3 MsgApp Term:1 Log:1/2 Commit:2 Entries:[1/3 EntryNormal ""]
+> 2 handling Ready
+  Ready MustSync=true:
+  Lead:1 State:StateFollower
+  Entries:
+  1/3 EntryNormal ""
+  Messages:
+  2->1 MsgAppResp Term:1 Log:0/3
+> 3 handling Ready
+  Ready MustSync=true:
+  Lead:1 State:StateFollower
+  Entries:
+  1/3 EntryNormal ""
+  Messages:
+  3->1 MsgAppResp Term:1 Log:0/3
+> delivering messages
+  2->1 MsgAppResp Term:1 Log:0/3
+  3->1 MsgAppResp Term:1 Log:0/3
+> 1 handling Ready
+  Ready MustSync=false:
+  HardState Term:1 Vote:1 Commit:3
+  CommittedEntries:
+  1/3 EntryNormal ""
+  Messages:
+  1->2 MsgApp Term:1 Log:1/3 Commit:3
+  1->3 MsgApp Term:1 Log:1/3 Commit:3
+> delivering messages
+  1->2 MsgApp Term:1 Log:1/3 Commit:3
+> delivering messages
+  1->3 MsgApp Term:1 Log:1/3 Commit:3
+> 2 handling Ready
+  Ready MustSync=false:
+  HardState Term:1 Vote:1 Commit:3
+  CommittedEntries:
+  1/3 EntryNormal ""
+  Messages:
+  2->1 MsgAppResp Term:1 Log:0/3
+> 3 handling Ready
+  Ready MustSync=false:
+  HardState Term:1 Vote:1 Commit:3
+  CommittedEntries:
+  1/3 EntryNormal ""
+  Messages:
+  3->1 MsgAppResp Term:1 Log:0/3
+> delivering messages
+  2->1 MsgAppResp Term:1 Log:0/3
+  3->1 MsgAppResp Term:1 Log:0/3
--- a/raft/testdata/confchange_v1.txt
+++ b/raft/testdata/confchange_v1.txt
@@ -0,0 +1,78 @@
+add-nodes 1 voters=(1) index=2
+----
+INFO 1 switched to configuration voters=(1)
+INFO 1 became follower at term 0
+INFO newRaft 1 [peers: [1], term: 0, commit: 2, applied: 2, lastindex: 2, lastterm: 1]
+
+campaign 1
+----
+INFO 1 is starting a new election at term 0
+INFO 1 became candidate at term 1
+INFO 1 received MsgVoteResp from 1 at term 1
+INFO 1 became leader at term 1
+
+propose-conf-change 1
+v2 v3
+----
+ok
+
+add-nodes 2
+
+process-ready 1
+----
+INFO 2 switched to configuration voters=()
+INFO 2 became follower at term 0
+INFO newRaft 2 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
+INFO 3 switched to configuration voters=()
+INFO 3 became follower at term 0
+INFO newRaft 3 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
+
+stabilize 1
+----
+> 1 handling Ready
+  INFO 1 switched to configuration voters=(1 2 3)&&(1) autoleave
+  INFO initiating automatic transition out of joint configuration voters=(1 2 3)&&(1) autoleave
+  Ready MustSync=true:
+  Lead:1 State:StateLeader
+  HardState Term:1 Vote:1 Commit:4
+  Entries:
+  1/3 EntryNormal ""
+  1/4 EntryConfChangeV2 v2 v3
+  CommittedEntries:
+  1/3 EntryNormal ""
+  1/4 EntryConfChangeV2 v2 v3
+> 1 handling Ready
+  Ready MustSync=true:
+  Entries:
+  1/5 EntryConfChangeV2
+
+# NB: this test is broken from here on because the leader doesn't propagate the
+# commit index proactively, see the buglet #11002.
+
+stabilize 2
+----
+ok
+
+stabilize 1
+----
+ok
+
+stabilize 2
+----
+ok
+
+stabilize 1
+----
+ok
+
+stabilize 2
+----
+ok
+
+stabilize 1
+----
+ok
+
+stabilize 2
+----
+ok
--- a/raft/testdata/snapshot_succeed_via_app_resp.txt
+++ b/raft/testdata/snapshot_succeed_via_app_resp.txt
@@ -0,0 +1,141 @@
+# TestSnapshotSucceedViaAppResp regression tests the situation in which a snap-
+# shot is sent to a follower at the most recent index (i.e. the snapshot index
+# is the leader's last index is the committed index). In that situation, a bug
+# in the past left the follower in probing status until the next log entry was
+# committed.
+#
+# See https://github.com/etcd-io/etcd/pull/10308 for additional background.
+
+# Turn off output during the setup of the test.
+log-level none
+----
+ok
+
+# Start with two nodes, but the config already has a third.
+add-nodes 2 voters=(1,2,3) index=10
+----
+ok
+
+campaign 1
+----
+ok
+
+# Fully replicate everything, including the leader's empty index.
+stabilize
+----
+ok (quiet)
+
+compact 1 11
+----
+ok (quiet)
+
+# Drop inflight messages to n3.
+deliver-msgs 3
+----
+ok (quiet)
+
+# Show the Raft log messages from now on.
+log-level debug
+----
+ok
+
+status 1
+----
+1: StateReplicate match=11 next=12 inactive
+2: StateReplicate match=11 next=12
+3: StateProbe match=0 next=11 paused inactive
+
+# Add the node that will receive a snapshot (it has no state at all, does not
+# even have a config).
+add-nodes 1
+----
+INFO 3 switched to configuration voters=()
+INFO 3 became follower at term 0
+INFO newRaft 3 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
+
+# Time passes on the leader so that it will try the previously missing follower
+# again.
+tick-heartbeat 1
+----
+ok
+
+process-ready 1
+----
+Ready MustSync=false:
+Messages:
+1->2 MsgHeartbeat Term:1 Log:0/0 Commit:11
+1->3 MsgHeartbeat Term:1 Log:0/0
+
+# Iterate until no more work is done by the new peer. It receives the heartbeat
+# and responds.
+stabilize 3
+----
+> delivering messages
+  1->3 MsgHeartbeat Term:1 Log:0/0
+  INFO 3 [term: 0] received a MsgHeartbeat message with higher term from 1 [term: 1]
+  INFO 3 became follower at term 1
+> 3 handling Ready
+  Ready MustSync=true:
+  Lead:1 State:StateFollower
+  HardState Term:1 Commit:0
+  Messages:
+  3->1 MsgHeartbeatResp Term:1 Log:0/0
+
+# The leader in turn will realize that n3 needs a snapshot, which it initiates.
+stabilize 1
+----
+> delivering messages
+  3->1 MsgHeartbeatResp Term:1 Log:0/0
+  DEBUG 1 [firstindex: 12, commit: 11] sent snapshot[index: 11, term: 1] to 3 [StateProbe match=0 next=11]
+  DEBUG 1 paused sending replication messages to 3 [StateSnapshot match=0 next=11 paused pendingSnap=11]
+> 1 handling Ready
+  Ready MustSync=false:
+  Messages:
+  1->3 MsgSnap Term:1 Log:0/0 Snapshot: Index:11 Term:1 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[]
+
+status 1
+----
+1: StateReplicate match=11 next=12 inactive
+2: StateReplicate match=11 next=12
+3: StateSnapshot match=0 next=11 paused pendingSnap=11
+
+# Follower applies the snapshot. Note how it reacts with a MsgAppResp upon completion.
+# The snapshot fully catches the follower up (i.e. there are no more log entries it
+# needs to apply after). The bug was that the leader failed to realize that the follower
+# was now fully caught up.
+stabilize 3
+----
+> delivering messages
+  1->3 MsgSnap Term:1 Log:0/0 Snapshot: Index:11 Term:1 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[]
+  INFO log [committed=0, applied=0, unstable.offset=1, len(unstable.Entries)=0] starts to restore snapshot [index: 11, term: 1]
+  INFO 3 switched to configuration voters=(1 2 3)
+  INFO 3 [commit: 11, lastindex: 11, lastterm: 1] restored snapshot [index: 11, term: 1]
+  INFO 3 [commit: 11] restored snapshot [index: 11, term: 1]
+> 3 handling Ready
+  Ready MustSync=false:
+  HardState Term:1 Commit:11
+  Snapshot Index:11 Term:1 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[]
+  Messages:
+  3->1 MsgAppResp Term:1 Log:0/11
+
+# The MsgAppResp lets the leader move the follower back to replicating state.
+# Leader sends another MsgAppResp, to communicate the updated commit index.
+stabilize 1
+----
+> delivering messages
+  3->1 MsgAppResp Term:1 Log:0/11
+  DEBUG 1 recovered from needing snapshot, resumed sending replication messages to 3 [StateSnapshot match=11 next=12 paused pendingSnap=11]
+> 1 handling Ready
+  Ready MustSync=false:
+  Messages:
+  1->3 MsgApp Term:1 Log:1/11 Commit:11
+
+status 1
+----
+1: StateReplicate match=11 next=12 inactive
+2: StateReplicate match=11 next=12
+3: StateReplicate match=11 next=12
+
+stabilize
+----
+ok