raft/rafttest: introduce datadriven testing

It has often been tedious to test the interactions between multi-member
Raft groups, especially when many steps were required to reach a certain
scenario. Often, this boilerplate was as boring as it is hard to write
and hard to maintain, making it attractive to resort to shortcuts
whenever possible, which in turn tended to undercut how meaningful and
maintainable the tests ended up being - that is, if the tests were even
written, which sometimes they weren't.

This change introduces a datadriven framework specifically for testing
deterministically the interaction between multiple members of a raft group
with the goal of reducing the friction for writing these tests to near
zero.

In the near term, this will be used to add thorough testing for joint
consensus (which is already available today, but wildly undertested),
but just converting an existing test into this framework has shown that
the concise representation and built-in inspection of log messages
highlights unexpected behavior much more readily than the previous unit
tests did (the test in question is `snapshot_succeed_via_app_resp`; the
reader is invited to compare the old and new version of it).

The main building block is `InteractionEnv`, which holds on to the state
of the whole system and exposes various relevant methods for
manipulating it, including but not limited to adding nodes, delivering
and dropping messages, and proposing configuration changes. All of this
is extensible so that in the future I hope to use it to explore the
phenomena discussed in

https://github.com/etcd-io/etcd/issues/7625#issuecomment-488798263

which requires injecting appropriate "crash points" in the Ready
handling loop. Discussions of the "what if X happened in state Y"
can quickly be made concrete by "scripting up an interaction test".

Additionally, this framework is intentionally not kept internal to the
raft package.. Though this is in its infancy, a goal is that it should
be possible for a suite of interaction tests to allow applications to
validate that their Storage implementation behaves accordingly, simply
by running a raft-provided interaction suite against their Storage.
This commit is contained in:
Tobias Schottdorf
2019-08-12 11:13:51 +02:00
parent f57c16c271
commit e8090e57a2
25 changed files with 1613 additions and 132 deletions

117
raft/testdata/campaign.txt vendored Normal file
View File

@@ -0,0 +1,117 @@
log-level info
----
ok
add-nodes 3 voters=(1,2,3) index=2
----
INFO 1 switched to configuration voters=(1 2 3)
INFO 1 became follower at term 0
INFO newRaft 1 [peers: [1,2,3], term: 0, commit: 2, applied: 2, lastindex: 2, lastterm: 1]
INFO 2 switched to configuration voters=(1 2 3)
INFO 2 became follower at term 0
INFO newRaft 2 [peers: [1,2,3], term: 0, commit: 2, applied: 2, lastindex: 2, lastterm: 1]
INFO 3 switched to configuration voters=(1 2 3)
INFO 3 became follower at term 0
INFO newRaft 3 [peers: [1,2,3], term: 0, commit: 2, applied: 2, lastindex: 2, lastterm: 1]
campaign 1
----
INFO 1 is starting a new election at term 0
INFO 1 became candidate at term 1
INFO 1 received MsgVoteResp from 1 at term 1
INFO 1 [logterm: 1, index: 2] sent MsgVote request to 2 at term 1
INFO 1 [logterm: 1, index: 2] sent MsgVote request to 3 at term 1
stabilize
----
> 1 handling Ready
Ready MustSync=true:
Lead:0 State:StateCandidate
HardState Term:1 Vote:1 Commit:2
Messages:
1->2 MsgVote Term:1 Log:1/2
1->3 MsgVote Term:1 Log:1/2
> delivering messages
1->2 MsgVote Term:1 Log:1/2
INFO 2 [term: 0] received a MsgVote message with higher term from 1 [term: 1]
INFO 2 became follower at term 1
INFO 2 [logterm: 1, index: 2, vote: 0] cast MsgVote for 1 [logterm: 1, index: 2] at term 1
> delivering messages
1->3 MsgVote Term:1 Log:1/2
INFO 3 [term: 0] received a MsgVote message with higher term from 1 [term: 1]
INFO 3 became follower at term 1
INFO 3 [logterm: 1, index: 2, vote: 0] cast MsgVote for 1 [logterm: 1, index: 2] at term 1
> 2 handling Ready
Ready MustSync=true:
HardState Term:1 Vote:1 Commit:2
Messages:
2->1 MsgVoteResp Term:1 Log:0/0
> 3 handling Ready
Ready MustSync=true:
HardState Term:1 Vote:1 Commit:2
Messages:
3->1 MsgVoteResp Term:1 Log:0/0
> delivering messages
2->1 MsgVoteResp Term:1 Log:0/0
INFO 1 received MsgVoteResp from 2 at term 1
INFO 1 has received 2 MsgVoteResp votes and 0 vote rejections
INFO 1 became leader at term 1
3->1 MsgVoteResp Term:1 Log:0/0
> 1 handling Ready
Ready MustSync=true:
Lead:1 State:StateLeader
Entries:
1/3 EntryNormal ""
Messages:
1->2 MsgApp Term:1 Log:1/2 Commit:2 Entries:[1/3 EntryNormal ""]
1->3 MsgApp Term:1 Log:1/2 Commit:2 Entries:[1/3 EntryNormal ""]
> delivering messages
1->2 MsgApp Term:1 Log:1/2 Commit:2 Entries:[1/3 EntryNormal ""]
> delivering messages
1->3 MsgApp Term:1 Log:1/2 Commit:2 Entries:[1/3 EntryNormal ""]
> 2 handling Ready
Ready MustSync=true:
Lead:1 State:StateFollower
Entries:
1/3 EntryNormal ""
Messages:
2->1 MsgAppResp Term:1 Log:0/3
> 3 handling Ready
Ready MustSync=true:
Lead:1 State:StateFollower
Entries:
1/3 EntryNormal ""
Messages:
3->1 MsgAppResp Term:1 Log:0/3
> delivering messages
2->1 MsgAppResp Term:1 Log:0/3
3->1 MsgAppResp Term:1 Log:0/3
> 1 handling Ready
Ready MustSync=false:
HardState Term:1 Vote:1 Commit:3
CommittedEntries:
1/3 EntryNormal ""
Messages:
1->2 MsgApp Term:1 Log:1/3 Commit:3
1->3 MsgApp Term:1 Log:1/3 Commit:3
> delivering messages
1->2 MsgApp Term:1 Log:1/3 Commit:3
> delivering messages
1->3 MsgApp Term:1 Log:1/3 Commit:3
> 2 handling Ready
Ready MustSync=false:
HardState Term:1 Vote:1 Commit:3
CommittedEntries:
1/3 EntryNormal ""
Messages:
2->1 MsgAppResp Term:1 Log:0/3
> 3 handling Ready
Ready MustSync=false:
HardState Term:1 Vote:1 Commit:3
CommittedEntries:
1/3 EntryNormal ""
Messages:
3->1 MsgAppResp Term:1 Log:0/3
> delivering messages
2->1 MsgAppResp Term:1 Log:0/3
3->1 MsgAppResp Term:1 Log:0/3

78
raft/testdata/confchange_v1.txt vendored Normal file
View File

@@ -0,0 +1,78 @@
add-nodes 1 voters=(1) index=2
----
INFO 1 switched to configuration voters=(1)
INFO 1 became follower at term 0
INFO newRaft 1 [peers: [1], term: 0, commit: 2, applied: 2, lastindex: 2, lastterm: 1]
campaign 1
----
INFO 1 is starting a new election at term 0
INFO 1 became candidate at term 1
INFO 1 received MsgVoteResp from 1 at term 1
INFO 1 became leader at term 1
propose-conf-change 1
v2 v3
----
ok
add-nodes 2
process-ready 1
----
INFO 2 switched to configuration voters=()
INFO 2 became follower at term 0
INFO newRaft 2 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
INFO 3 switched to configuration voters=()
INFO 3 became follower at term 0
INFO newRaft 3 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
stabilize 1
----
> 1 handling Ready
INFO 1 switched to configuration voters=(1 2 3)&&(1) autoleave
INFO initiating automatic transition out of joint configuration voters=(1 2 3)&&(1) autoleave
Ready MustSync=true:
Lead:1 State:StateLeader
HardState Term:1 Vote:1 Commit:4
Entries:
1/3 EntryNormal ""
1/4 EntryConfChangeV2 v2 v3
CommittedEntries:
1/3 EntryNormal ""
1/4 EntryConfChangeV2 v2 v3
> 1 handling Ready
Ready MustSync=true:
Entries:
1/5 EntryConfChangeV2
# NB: this test is broken from here on because the leader doesn't propagate the
# commit index proactively, see the buglet #11002.
stabilize 2
----
ok
stabilize 1
----
ok
stabilize 2
----
ok
stabilize 1
----
ok
stabilize 2
----
ok
stabilize 1
----
ok
stabilize 2
----
ok

View File

@@ -0,0 +1,141 @@
# TestSnapshotSucceedViaAppResp regression tests the situation in which a snap-
# shot is sent to a follower at the most recent index (i.e. the snapshot index
# is the leader's last index is the committed index). In that situation, a bug
# in the past left the follower in probing status until the next log entry was
# committed.
#
# See https://github.com/etcd-io/etcd/pull/10308 for additional background.
# Turn off output during the setup of the test.
log-level none
----
ok
# Start with two nodes, but the config already has a third.
add-nodes 2 voters=(1,2,3) index=10
----
ok
campaign 1
----
ok
# Fully replicate everything, including the leader's empty index.
stabilize
----
ok (quiet)
compact 1 11
----
ok (quiet)
# Drop inflight messages to n3.
deliver-msgs 3
----
ok (quiet)
# Show the Raft log messages from now on.
log-level debug
----
ok
status 1
----
1: StateReplicate match=11 next=12 inactive
2: StateReplicate match=11 next=12
3: StateProbe match=0 next=11 paused inactive
# Add the node that will receive a snapshot (it has no state at all, does not
# even have a config).
add-nodes 1
----
INFO 3 switched to configuration voters=()
INFO 3 became follower at term 0
INFO newRaft 3 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
# Time passes on the leader so that it will try the previously missing follower
# again.
tick-heartbeat 1
----
ok
process-ready 1
----
Ready MustSync=false:
Messages:
1->2 MsgHeartbeat Term:1 Log:0/0 Commit:11
1->3 MsgHeartbeat Term:1 Log:0/0
# Iterate until no more work is done by the new peer. It receives the heartbeat
# and responds.
stabilize 3
----
> delivering messages
1->3 MsgHeartbeat Term:1 Log:0/0
INFO 3 [term: 0] received a MsgHeartbeat message with higher term from 1 [term: 1]
INFO 3 became follower at term 1
> 3 handling Ready
Ready MustSync=true:
Lead:1 State:StateFollower
HardState Term:1 Commit:0
Messages:
3->1 MsgHeartbeatResp Term:1 Log:0/0
# The leader in turn will realize that n3 needs a snapshot, which it initiates.
stabilize 1
----
> delivering messages
3->1 MsgHeartbeatResp Term:1 Log:0/0
DEBUG 1 [firstindex: 12, commit: 11] sent snapshot[index: 11, term: 1] to 3 [StateProbe match=0 next=11]
DEBUG 1 paused sending replication messages to 3 [StateSnapshot match=0 next=11 paused pendingSnap=11]
> 1 handling Ready
Ready MustSync=false:
Messages:
1->3 MsgSnap Term:1 Log:0/0 Snapshot: Index:11 Term:1 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[]
status 1
----
1: StateReplicate match=11 next=12 inactive
2: StateReplicate match=11 next=12
3: StateSnapshot match=0 next=11 paused pendingSnap=11
# Follower applies the snapshot. Note how it reacts with a MsgAppResp upon completion.
# The snapshot fully catches the follower up (i.e. there are no more log entries it
# needs to apply after). The bug was that the leader failed to realize that the follower
# was now fully caught up.
stabilize 3
----
> delivering messages
1->3 MsgSnap Term:1 Log:0/0 Snapshot: Index:11 Term:1 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[]
INFO log [committed=0, applied=0, unstable.offset=1, len(unstable.Entries)=0] starts to restore snapshot [index: 11, term: 1]
INFO 3 switched to configuration voters=(1 2 3)
INFO 3 [commit: 11, lastindex: 11, lastterm: 1] restored snapshot [index: 11, term: 1]
INFO 3 [commit: 11] restored snapshot [index: 11, term: 1]
> 3 handling Ready
Ready MustSync=false:
HardState Term:1 Commit:11
Snapshot Index:11 Term:1 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[]
Messages:
3->1 MsgAppResp Term:1 Log:0/11
# The MsgAppResp lets the leader move the follower back to replicating state.
# Leader sends another MsgAppResp, to communicate the updated commit index.
stabilize 1
----
> delivering messages
3->1 MsgAppResp Term:1 Log:0/11
DEBUG 1 recovered from needing snapshot, resumed sending replication messages to 3 [StateSnapshot match=11 next=12 paused pendingSnap=11]
> 1 handling Ready
Ready MustSync=false:
Messages:
1->3 MsgApp Term:1 Log:1/11 Commit:11
status 1
----
1: StateReplicate match=11 next=12 inactive
2: StateReplicate match=11 next=12
3: StateReplicate match=11 next=12
stabilize
----
ok