This method uses raft status exposed at /debug/varz to determine the
health of the cluster. It uses whether commit index increases to
determine the cluster health, and uses whether match index increases to
determine the member health.
This could fix the bug #2711 that fails to detect follower is unhealthy
because it doesn't rely on whether message in long-polling connection is sent.
This health check is stricter than the old one, and reflects the
situation that whether followers are healthy in the view of the leader. One
example is that if the follower is receiving the snapshot, it will turns
out to be unhealthy because it doesn't move forward.
`etcdctl cluster-health` will reflect the healthy view in the raft level,
while connectivity checks reflects the healthy view in transport level.
Before this PR, the timeout caused by leader election returns:
```
14:45:37 etcd2 | 2015-08-12 14:45:37.786349 E | etcdhttp: got unexpected
response error (etcdserver: request timed out)
```
After this PR:
```
15:52:54 etcd1 | 2015-08-12 15:52:54.389523 E | etcdhttp: etcdserver:
request timed out, possibly due to leader down
```
Conflicts:
etcdserver/raft.go
Follow the simple rule in the atomic package:
"On both ARM and x86-32, it is the caller's responsibility to arrange
for 64-bit alignment of 64-bit words accessed atomically. The first word
in a global variable or in an allocated struct or slice can be relied
upon to be 64-bit aligned."
Tested on a system with /proc/cpuinfo reporting:
processor : 0
model name : ARMv7 Processor rev 1 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3
tls vfpv4 idiva idivt vfpd32 lpae evtstrm
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc0d
CPU revision : 1
etcd always returns 500/503 response when it may have no leader.
So we should log the other 50x response in a normal way.
This helps to log correctly when discovery meets 504 error. Before this
PR, it logs like this:
```
18:31:58 etcd2 | 2015/08/4 18:31:58 discovery: error #0: client: etcd
member https://discovery.etcd.io has no leader
18:31:58 etcd2 | 2015/08/4 18:31:58 discovery: waiting for other nodes:
error connecting to https://discovery.etcd.io, retrying in 4s
```
After this PR:
```
22:20:25 etcd2 | 2015/08/4 22:20:25 discovery: error #0: client: etcd
member https://discovery.etcd.io returns server error [Gateway Timeout]
22:20:25 etcd2 | 2015/08/4 22:20:25 discovery: waiting for other nodes:
error connecting to https://discovery.etcd.io, retrying in 4s
```
Conflicts:
client/client.go
Add a new ClusterError type. It contians all encountered errors and
return ClusterNotAvailable as the error string.
Conflicts:
client/client.go
discovery/discovery.go
At present it prints the whole usage and flags, which cause the exact
error message is hidden two screens above.
Fixes#3141
Signed-off-by: Guohua Ouyang <gouyang@redhat.com>
If the user sets TLS info, this implies that he wants to listen on TLS.
If etcd finds that urls to listen is still HTTP schema, it prints out
warning to notify user about possible wrong setting.
If TLS config is empty, etcd downgrades keepalive listener from HTTPS to
HTTP without warning. This results in HTTPS downgrade bug for client urls.
The commit returns error if it cannot listen on TLS.
Reasons for the notice:
1. No users have reported about their feedback about auth feature so
far.
2. We haven't used it internally.
3. This is the first release that includes auth feature, so it is good
to be more cautious.
The discovery service doesn't return HTTP header early when watch
starts. This may trigger ResponseHeaderTimeout and cause the watch
request failed.
The fix on discovery service may take some time. Remove the
ResponseHeaderTimeout first so it behaves as before.
@xiang90 pointed out my earlier commit duplicated a lot of things that
were mentioned earlier in the doc.
This time around I tried just making some gotchas more explicit in the
existing docs instead of just tacking new stuff onto the end.