
Extend the timeout from 1s to defaultRequestTimeout (5s). A 1s timeout can put an unwanted burden on the target member. While a member is busy recovering, it has limited bandwidth for client requests. With a short timeout, the client retries quickly while its earlier connections stay open. etcd then queues a large number of requests and connections and takes a long time to clear them, which eventually causes the member health check to time out. This is an instance of a general problem: how etcd should handle a large number of simultaneous requests gracefully. We do not plan to address it at this stage.
etcd functional test suite
The etcd functional test suite tests the functionality of an etcd cluster, with a focus on failure resistance under high pressure. It sets up an etcd cluster and injects failures into it by killing a process or isolating the process's network. It expects the cluster to recover within a short amount of time after the fault is fixed.
The etcd functional test suite has two components: etcd-agent and etcd-tester. etcd-agent runs on every test machine, and etcd-tester is the single controller of the test. etcd-tester directs all the etcd-agents to start etcd clusters and simulate various failure cases.
requirements
The cluster environment must be stable enough that the etcd test suite can assume most failures are generated by the suite itself.
etcd agent
etcd agent is a daemon that runs on each machine. It can start, stop, restart, isolate, and terminate an etcd process. The agent exposes this functionality via HTTP RPC.
etcd tester
The etcd functional tester controls the progress of the functional tests. It calls the RPCs of the etcd agents to simulate various test cases. For example, it can start a three-member cluster by sending three start RPC calls to three different etcd agents. It can make one of the members fail by sending a stop RPC call to one etcd agent.
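The two examples above (starting a three-member cluster, then failing one member) can be sketched as follows. The `agent` interface, `fakeAgent`, `startCluster`, and `failMember` are hypothetical names; the fake agents keep state in memory, whereas a real tester would send RPC calls to remote agents.

```go
package main

import (
	"errors"
	"fmt"
)

// agent is the subset of agent operations this sketch needs.
type agent interface {
	Start() error
	Stop() error
}

// fakeAgent records state in memory in place of a remote etcd agent.
type fakeAgent struct{ running bool }

func (f *fakeAgent) Start() error {
	if f.running {
		return errors.New("already running")
	}
	f.running = true
	return nil
}

func (f *fakeAgent) Stop() error {
	if !f.running {
		return errors.New("not running")
	}
	f.running = false
	return nil
}

// startCluster sends one start call to every agent and stops at the first error.
func startCluster(agents []agent) error {
	for i, a := range agents {
		if err := a.Start(); err != nil {
			return fmt.Errorf("agent %d: %w", i, err)
		}
	}
	return nil
}

// failMember simulates the failure of member i by stopping its agent.
func failMember(agents []agent, i int) error {
	if i < 0 || i >= len(agents) {
		return fmt.Errorf("no agent %d", i)
	}
	return agents[i].Stop()
}

func main() {
	fakes := []*fakeAgent{{}, {}, {}}
	agents := make([]agent, len(fakes))
	for i, f := range fakes {
		agents[i] = f
	}

	if err := startCluster(agents); err != nil {
		panic(err)
	}
	if err := failMember(agents, 1); err != nil {
		panic(err)
	}
	for i, f := range fakes {
		fmt.Printf("member %d running: %v\n", i, f.running)
	}
}
```

Hiding the transport behind a small interface keeps the test-case logic independent of how the calls reach each agent.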