Docs: revisions to storage/memory reqs/setup

2024-10-13 13:34:05 +00:00 · 2016-07-05 09:13:55 +02:00 · 2016-07-05 09:13:55 +02:00 · d22a5e9ee1
commit d22a5e9ee1
parent 6a686765cf
2 changed files with 46 additions and 34 deletions
--- a/docs/source/nodes/node-requirements.md
+++ b/docs/source/nodes/node-requirements.md
@@ -18,36 +18,31 @@ We don't test BigchainDB on Windows or Mac OS X, but you can try.
 * If you have Mac OS X and want to experiment with BigchainDB, then you could do that [using Docker](run-with-docker.html).


-## Memory Requirements
+## Storage Requirements

-Every OS has memory requirements; check the memory requirements of your OS.
+When it comes to storage for RethinkDB, there are many things that are nice to have (e.g. SSDs, high input/output operations per second [IOPS], replication, reliability, scalability, pay-for-what-you-use), but there are few _requirements_ other than:

-There is [documentation about RethinkDB's memory requirements](https://rethinkdb.com/docs/memory-usage/). In particular: "RethinkDB requires data structures in RAM on each server proportional to the size of the data on that server’s disk, usually around 1% of the size of the total data set." ([source](https://rethinkdb.com/limitations/))
+1. have enough storage to store all your data (and its replicas), and
+2. make sure your storage solution (hardware and interconnects) can handle your expected read & write rates.
+
+For RethinkDB's failover mechanisms to work, [every RethinkDB table must have at least three replicas](https://rethinkdb.com/docs/failover/) (i.e. a primary replica and two others). For example, if you want to store 10 GB of unique data, then you need at least 30 GB of storage. (Indexes and internal metadata are stored in RAM.)
+
+As for the read & write rates, what do you expect those to be for your situation? It's not enough for the storage system alone to handle those rates: the interconnects between the nodes must also be able to handle them.
+
+
+## Memory (RAM) Requirements
+
+The [RethinkDB FAQ](https://rethinkdb.com/faq/) recommends that "RethinkDB servers have at least 2GB of RAM... RethinkDB has a custom caching engine and can run on low-memory nodes with large amounts of on-disk data..."
+
+In particular: "RethinkDB requires data structures in RAM on each server proportional to the size of the data on that server’s disk, usually around 1% of the size of the total data set." ([source](https://rethinkdb.com/limitations/))

 Also, "The storage engine is used in conjunction with a custom, B-Tree-aware caching engine which allows file sizes many orders of magnitude greater than the amount of available memory. RethinkDB can operate on a terabyte of data with about ten gigabytes of free RAM." ([source](https://www.rethinkdb.com/docs/architecture/))

-
-## Storage Requirements
-
-The RethinkDB storage engine has a number of SSD optimizations, so you can benefit from using SSDs. ([source](https://www.rethinkdb.com/docs/architecture/))
-
-If you want a RethinkDB cluster to store an amount of data D, with a replication factor of R (on every table), and the cluster has N nodes, then each node will need to be able to store R×D/N data plus the storage required for the OS and various other software (RethinkDB, Python, etc.). The secondary indexes also require some storage.
-
-For failover to work, [every RethinkDB table must have at least three replicas](https://rethinkdb.com/docs/failover/), i.e. R ≥ 3.
-
-Also, RethinkDB tables can have [at most 64 shards](https://rethinkdb.com/limitations/). For example, if you have only one table and more than 64 nodes, some nodes won't have the primary of any shard, i.e. they will have replicas only. In other words, once you pass 64 nodes, adding more nodes won't provide storage space for new data; it will only add more space for shard replicas. If the biggest single-node storage available is d, then the most you can store in a RethinkDB cluster is < 64×d: accomplished by putting one primary shard in each of 64 nodes, with all replica shards on other nodes. (This is assuming one table. If there are T tables, then the most you can store is < 64×d×T.)
+RethinkDB has [documentation about its memory requirements](https://rethinkdb.com/docs/memory-usage/). You can use that page to get a better estimate of how much memory you'll need.


-## Compatible File Systems
+## Filesystem Requirements

-RethinkDB "supports most commonly used file systems." ([source](https://www.rethinkdb.com/docs/architecture/))
-
-It has [issues with BTRFS](https://github.com/rethinkdb/rethinkdb/issues/2781) (B-tree file system).
-
-It's best to have a file system that supports direct I/O, because that will improve RethinkDB performance (if you tell RethinkDB to use direct I/O). Many compressed or encrypted file systems don't support direct I/O.
-
-
-## CPU Requirements
-
-Most servers will have enough CPUs (or vCPUs) to run a BigchainDB node. The more you have, the higher throughput will be.
+RethinkDB "supports most commonly used file systems" ([source](https://www.rethinkdb.com/docs/architecture/)) but it has [issues with BTRFS](https://github.com/rethinkdb/rethinkdb/issues/2781) (B-tree file system).

+It's best to use a filesystem that supports direct I/O, because that will improve RethinkDB performance (if you tell RethinkDB to use direct I/O). Many compressed or encrypted filesystems don't support direct I/O.
--- a/docs/source/nodes/setup-run-node.md
+++ b/docs/source/nodes/setup-run-node.md
@@ -28,17 +28,36 @@ NTP is a standard protocol. There are many NTP daemons implementing it. We don't
 Please see the [notes on NTP daemon setup in the Appendices](../appendices/ntp-notes.html).


-## Set Up the File System for RethinkDB
+## Set Up Storage for RethinkDB Data

-Ideally, use a file system that supports direct I/O (Input/Output), a feature whereby file reads and writes go directly from RethinkDB to the storage device, bypassing the operating system read and write caches.
+Below are some things to consider when setting up storage for the RethinkDB data. The appendices have a [section with concrete examples](../appendices/example-rethinkdb-storage-setups.html).

-TODO: What file systems support direct I/O? How can you check? How do you enable it, if necessary?
+We suggest you set up a separate storage "device" (partition, RAID array, or logical volume) to store the RethinkDB data. Here are some questions to ask:

-See `def install_rethinkdb()` in `deploy-cluster-aws/fabfile.py` for an example of configuring a file system on an AWS instance running Ubuntu.
+* How easy will it be to add storage in the future? Will I have to shut down my server?
+* How big can the storage get? (Remember that [RAID](https://en.wikipedia.org/wiki/RAID) can be used to make several physical drives look like one.)
+* How fast can it read & write data? How many input/output operations per second (IOPS)?
+* How does IOPS scale as more physical hard drives are added?
+* What's the latency?
+* What's the reliability? Is there replication?
+* What's in the Service Level Agreement (SLA), if applicable?
+* What's the cost?

-Mount the partition for RethinkDB on `/data`: we will tell RethinkDB to store its data there.
+There are many options and tradeoffs. Don't forget to look into Amazon Elastic Block Store (EBS) and Amazon Elastic File System (EFS), or their equivalents from other providers.

-TODO: This section needs more elaboration
+**Storage Notes Specific to RethinkDB**
+
+* The RethinkDB storage engine has a number of SSD optimizations, so you _can_ benefit from using SSDs. ([source](https://www.rethinkdb.com/docs/architecture/))
+
+* If you want a RethinkDB cluster to store an amount of data D, with a replication factor of R (on every table), and the cluster has N nodes, then each node will need to be able to store R×D/N data.
+
+* RethinkDB tables can have [at most 64 shards](https://rethinkdb.com/limitations/). For example, if you have only one table and more than 64 nodes, some nodes won't have the primary of any shard, i.e. they will have replicas only. In other words, once you pass 64 nodes, adding more nodes won't provide more storage space for new data. If the biggest single-node storage available is d, then the most you can store in a RethinkDB cluster is < 64×d: accomplished by putting one primary shard in each of 64 nodes, with all replica shards on other nodes. (This is assuming one table. If there are T tables, then the most you can store is < 64×d×T.)
+
+* When you set up storage for your RethinkDB data, you may have to select a filesystem. (Sometimes, the filesystem is already decided by the choice of storage.) We recommend using a filesystem that supports direct I/O (Input/Output). Many compressed or encrypted filesystems don't support direct I/O. The ext4 filesystem supports direct I/O (but be careful: if you enable the data=journal mode, then direct I/O support will be disabled; the default is data=ordered). If your chosen filesystem supports direct I/O and you're using Linux, then you don't need to do anything to request or enable direct I/O. RethinkDB does that.
+
+<p style="background-color: lightgrey;">What is direct I/O? It allows RethinkDB to write directly to the storage device (or use its own in-memory caching mechanisms), rather than relying on the operating system's file read and write caching mechanisms. (If you're using Linux, a write-to-file normally writes to the in-memory Page Cache first; only later does that Page Cache get flushed to disk. The Page Cache is also used when reading files.)</p>
+
+* RethinkDB stores its data in a specific directory. You can tell RethinkDB _which_ directory using the RethinkDB config file, as explained below. In this documentation, we assume the directory is `/data`. If you set up a separate device (partition, RAID array, or logical volume) to store the RethinkDB data, then mount that device on `/data`.


 ## Install RethinkDB Server
@@ -50,7 +69,6 @@ If you don't already have RethinkDB Server installed, you must install it. The R

 Create a RethinkDB configuration file (text file) named `instance1.conf` with the following contents (explained below):
 ```text
-server-tag=original
 directory=/data
 bind=all
 direct-io
@@ -61,10 +79,9 @@ join=node2_hostname:29015
 # continue until there's a join= line for each node in the federation
 ```

-* `server-tag=original` is an optional line, but you'll be glad you included it later if you decide to create a set of backup-only servers as described in [the section on continuous backup](../clusters-feds/backup.html#incremental-or-continuous-backup).
-* `directory=/data` tells the RethinkDB server process to store its share of the database data in `/data`.
+* `directory=/data` tells the RethinkDB node to store its share of the database data in `/data`.
 * `bind=all` binds RethinkDB to all local network interfaces (e.g. loopback, Ethernet, wireless, whatever is available), so it can communicate with the outside world. (The default is to bind only to loopback addresses.)
-* `direct-io` tells RethinkDB to use direct I/O (explained earlier).
+* `direct-io` tells RethinkDB to use direct I/O (explained earlier). Only include this line if your file system supports direct I/O.
 * `join=hostname:29015` lines: A cluster node needs to find out the hostnames of all the other nodes somehow. You _could_ designate one node to be the one that every other node asks, and put that node's hostname in the config file, but that wouldn't be very decentralized. Instead, we include _every_ node in the list of nodes-to-ask.

 If you're curious about the RethinkDB config file, there's [a RethinkDB documentation page about it](https://www.rethinkdb.com/docs/config-file/). The [explanations of the RethinkDB command-line options](https://rethinkdb.com/docs/cli-options/) are another useful reference.
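
As a rough illustration of the sizing rules this commit adds to the docs (R×D/N storage per node, and the 64-shards-per-table ceiling), here is a minimal Python sketch. The function names and example numbers are hypothetical and are not part of BigchainDB or RethinkDB; only the formulas come from the text above. The results ignore the storage needed for the OS, other software and secondary indexes.

```python
# Hypothetical helpers illustrating the sizing rules quoted in the docs.
# Nothing here is a BigchainDB or RethinkDB API.

MAX_SHARDS_PER_TABLE = 64  # RethinkDB limit cited in the docs


def storage_per_node_gb(data_gb, replication_factor, num_nodes):
    """Storage each node needs for RethinkDB data: R x D / N."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return replication_factor * data_gb / num_nodes


def max_unique_data_upper_bound_gb(biggest_node_storage_gb, num_tables=1):
    """Upper bound on unique data in a cluster: 64 x d x T.

    The docs state the real maximum is strictly less than this value.
    """
    return MAX_SHARDS_PER_TABLE * biggest_node_storage_gb * num_tables


if __name__ == "__main__":
    # 10 GB of unique data, 3 replicas (the failover minimum), 3 nodes:
    # 30 GB in total, spread evenly across the nodes.
    print(storage_per_node_gb(10, 3, 3))            # 10.0
    # One table, largest single-node storage of 1000 GB.
    print(max_unique_data_upper_bound_gb(1000))     # 64000
```

With R = 3 and N = 3, every node ends up holding a full copy of the data, which is why the 10 GB example in the docs needs at least 30 GB overall.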