mirror of
				https://github.com/etcd-io/etcd.git
				synced 2024-09-27 06:25:44 +00:00 
			
		
		
		
	
		
			
				
	
	
		
			26 lines
		
	
	
		
			3.4 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			26 lines
		
	
	
		
			3.4 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
| # Data model
 | ||
| 
 | ||
| etcd is designed to reliably store infrequently updated data and provide reliable watch queries. etcd exposes previous versions of key-value pairs to support inexpensive snapshots and watch history events (“time travel queries”). A persistent, multi-version, concurrency-control data model is a good fit for these use cases.
 | ||
| 
 | ||
| etcd stores data in a multiversion [persistent][persistent-ds] key-value store. The persistent key-value store preserves the previous version of a key-value pair when its value is superseded with new data. The key-value store is effectively immutable; its operations do not update the structure in-place, but instead always generate a new updated structure. All past versions of keys are still accessible and watchable after modification. To prevent the data store from growing indefinitely over time and from maintaining old versions, the store may be compacted to shed the oldest versions of superseded data.
 | ||
| 
 | ||
| ### Logical view
 | ||
| 
 | ||
| The store’s logical view is a flat binary key space. The key space has a lexically sorted index on byte string keys so range queries are inexpensive.
 | ||
| 
 | ||
| The key space maintains multiple revisions. Each atomic mutative operation (e.g., a transaction operation may contain multiple operations) creates a new revision on the key space. All data held by previous revisions remains unchanged. Old versions of key can still be accessed through previous revisions. Likewise, revisions are indexed as well; ranging over revisions with watchers is efficient. If the store is compacted to save space, revisions before the compact revision will be removed.
 | ||
| 
 | ||
| A key’s lifetime spans a generation, denoted by its version. Each key may have one or multiple generations. Creating a key increments the version of that key, starting at 1 if the key never existed. Deleting a key generates a key tombstone, concluding the key’s current generation by resetting its version. Each modification of a key increments its version. Once a compaction happens, any version ended before the given revision will be removed and values set before the compaction revision except the latest one will be removed.
 | ||
| 
 | ||
| ### Physical view
 | ||
| 
 | ||
| etcd stores the physical data as key-value pairs in a persistent [b+tree][b+tree]. Each revision of the store’s state only contains the delta from its previous revision to be efficient. A single revision may correspond to multiple keys in the tree.
 | ||
| 
 | ||
| The key of key-value pair is a 3-tuple (major, sub, type). Major is the store revision holding the key. Sub differentiates among  keys within the same revision. Type is an optional suffix for special value (e.g., `t` if the value contains a tombstone). The value of the key-value pair contains the modification from previous revision, thus one delta from previous revision. The b+tree is ordered by key in lexical byte-order. Ranged lookups over revision deltas are fast; this enables quickly finding modifications from one specific revision to another. Compaction removes out-of-date keys-value pairs.
 | ||
| 
 | ||
| etcd also keeps a secondary in-memory [btree][btree] index to speed up range queries over keys. The keys in the btree index are the keys of the store exposed to user. The value is a pointer to the modification of the persistent b+tree. Compaction removes dead pointers.
 | ||
| 
 | ||
| [persistent-ds]: https://en.wikipedia.org/wiki/Persistent_data_structure
 | ||
| [btree]: https://en.wikipedia.org/wiki/B-tree
 | ||
| [b+tree]: https://en.wikipedia.org/wiki/B%2B_tree
 | 
