Mirror of https://github.com/bigchaindb/bigchaindb.git (synced 2024-10-13 13:34:05 +00:00)

Merge pull request #1919 from bigchaindb/remove-defunct-docs-pages

Removed some fully-removable pages from the Appendices

This commit (e201ba7305) changes files in docs/server/source/appendices.
@ -1,6 +1,3 @@

.. You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Appendices
==========
@ -8,12 +5,10 @@ Appendices

   :maxdepth: 1

   install-os-level-deps
   install-latest-pip
   run-with-docker
   json-serialization
   cryptography
   the-Bigchain-class
   pipelines
   backend
   commands
   aws-setup

@ -22,10 +17,7 @@ Appendices

   generate-key-pair-for-ssh
   firewall-notes
   ntp-notes
   rethinkdb-reqs
   rethinkdb-backup
   licenses
   install-with-lxd
   run-with-vagrant
   run-with-ansible
   vote-yaml
@ -1,20 +0,0 @@

# How to Install the Latest pip and setuptools

You can check the version of `pip` you're using (in your current virtualenv) by doing:

```text
pip -V
```

If it says that `pip` isn't installed, or it says `pip` is associated with a Python version less than 3.5, then you must install a `pip` version associated with Python 3.5+. In the following instructions, we call it `pip3`, but you may be able to use `pip` if that refers to the same thing. See [the `pip` installation instructions](https://pip.pypa.io/en/stable/installing/).

On Ubuntu 16.04, we found that this works:

```text
sudo apt-get install python3-pip
```

That should install a Python 3 version of `pip` named `pip3`. If that didn't work, then another way to get `pip3` is to install `python3-setuptools` and then use `easy_install3`, as shown below.
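That is (the same two commands named above, shown together; adapt as needed for your system):

```text
sudo apt-get install python3-setuptools
sudo easy_install3 pip
```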

You can upgrade `pip` (`pip3`) and `setuptools` to the latest versions using:

```text
pip3 install --upgrade pip setuptools
```
@ -1,43 +0,0 @@

# Installing BigchainDB on LXC containers using LXD

**Note: This page was contributed by an external contributor and is not actively maintained. We include it in case someone is interested.**

To install LXD, follow the instructions at [LXD Install](https://linuxcontainers.org/lxd/getting-started-cli/).

(The assumption is that you are using Ubuntu 14.04 for both the host and the container.)

Let us create an LXC container (via LXD) with the following command:

`lxc launch ubuntu:14.04 bigchaindb`

(`ubuntu:14.04` tells `lxc` to fetch the 14.04 image from the `ubuntu:` remote image server; `bigchaindb` is the name of the new container.)

Below is the `install.sh` script you will need to install BigchainDB within your container:

```
#!/bin/bash
set -ex

export DEBIAN_FRONTEND=noninteractive

# Install wget so we can fetch the RethinkDB signing key
apt-get install -y wget

# Add the RethinkDB apt repository for this Ubuntu release, along with its key
source /etc/lsb-release && echo "deb http://download.rethinkdb.com/apt $DISTRIB_CODENAME main" | sudo tee /etc/apt/sources.list.d/rethinkdb.list
wget -qO- https://download.rethinkdb.com/apt/pubkey.gpg | sudo apt-key add -

# Install RethinkDB and Python 3's pip
apt-get update
apt-get install -y rethinkdb python3-pip

# Upgrade the Python packaging tools, then install BigchainDB
# (plus ptpython, a nicer Python REPL)
pip3 install --upgrade pip wheel setuptools
pip3 install ptpython bigchaindb
```

Copy/paste the above `install.sh` into the directory you are going to execute your LXD commands from (i.e. on the host).

Make sure your container is running by typing:

`lxc list`

Now, from the host (and the directory where you saved `install.sh`), run this command:

`cat install.sh | lxc exec bigchaindb /bin/bash`

If you followed the commands correctly, you will have successfully created an LXC container (using LXD) that can get you up and running with BigchainDB in under 5 minutes (depending on how long it takes to download all the packages).
@ -1,26 +0,0 @@

#########
Pipelines
#########

Block Creation
==============

.. automodule:: bigchaindb.pipelines.block


Block Voting
============

.. automodule:: bigchaindb.pipelines.vote


Block Status
============

.. automodule:: bigchaindb.pipelines.election


Stale Transaction Monitoring
============================

.. automodule:: bigchaindb.pipelines.stale
@ -1,124 +0,0 @@

# Backing Up and Restoring Data

This page was written when BigchainDB only worked with RethinkDB, so its focus is on RethinkDB-based backup. BigchainDB now supports MongoDB as a backend database and we recommend that you use MongoDB in production. Nevertheless, some of the following backup ideas are still relevant regardless of the backend database being used, so we moved this page to the Appendices.


## RethinkDB's Replication as a Form of Backup

RethinkDB already has internal replication: every document is stored on _R_ different nodes, where _R_ is the replication factor (set using `bigchaindb set-replicas R`). Those replicas can be thought of as "live backups" because if one node goes down, the cluster will continue to work and no data will be lost.
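For example, to set the replication factor to 3, a node operator might run (3 is an illustrative value; pick the factor your consortium has agreed on):

```text
bigchaindb set-replicas 3
```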

At this point, there should be someone saying, "But replication isn't backup!"

It's true. Replication alone isn't enough, because something bad might happen _inside_ the database, and that could affect the replicas. For example, what if someone logged in as a RethinkDB admin and did a "drop table"? We currently plan for each node to be protected by a next-generation firewall (or something similar) to prevent such things from getting very far. For example, see [issue #240](https://github.com/bigchaindb/bigchaindb/issues/240).

Nevertheless, you should still consider having normal, "cold" backups, because bad things can still happen.


## Live Replication of RethinkDB Data Files

Each BigchainDB node stores its subset of the RethinkDB data in one directory. You could set up the node's file system so that directory lives on its own hard drive. Furthermore, you could make that hard drive part of a [RAID](https://en.wikipedia.org/wiki/RAID) array, so that a second hard drive would always have a copy of the original. If the original hard drive fails, then the second hard drive could take its place and the node would continue to function. Meanwhile, the original hard drive could be replaced.
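As a rough sketch of that idea on Linux (the device names `/dev/sdb` and `/dev/sdc` and the `/data` mount point are assumptions; adapt them to your hardware):

```text
# Mirror two drives (RAID 1), then put a filesystem on the array
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
sudo mkfs.ext4 /dev/md0

# Mount the array where RethinkDB will keep its data
sudo mkdir -p /data
sudo mount /dev/md0 /data
```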

That's just one possible way of setting up the file system so as to provide extra reliability.

Another way to get similar reliability would be to mount the RethinkDB data directory on an [Amazon EBS](https://aws.amazon.com/ebs/) volume. Each Amazon EBS volume is, "automatically replicated within its Availability Zone to protect you from component failure, offering high availability and durability."

As with shard replication, live file-system replication protects against many failure modes, but it doesn't protect against them all. You should still consider having normal, "cold" backups.


## rethinkdb dump (to a File)

RethinkDB can create an archive of all data in the cluster (or all data in specified tables), as a compressed file. According to [the RethinkDB blog post when that functionality became available](https://rethinkdb.com/blog/1.7-release/):

> Since the backup process is using client drivers, it automatically takes advantage of the MVCC [multiversion concurrency control] functionality built into RethinkDB. It will use some cluster resources, but will not lock out any of the clients, so you can safely run it on a live cluster.

To back up all the data in a BigchainDB cluster, the RethinkDB admin user must run a command like the following on one of the nodes:

```text
rethinkdb dump -e bigchain.bigchain -e bigchain.votes
```

That should write a file named `rethinkdb_dump_<date>_<time>.tar.gz`. The `-e` option is used to specify which tables should be exported. You probably don't need to export the backlog table, but you definitely need to export the bigchain and votes tables.

`bigchain.votes` means the `votes` table in the RethinkDB database named `bigchain`. It's possible that your database has a different name: [the database name is a BigchainDB configuration setting](../server-reference/configuration.html#database-host-database-port-database-name). The default name is `bigchain`. (Tip: you can see the values of all configuration settings using the `bigchaindb show-config` command.)
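For example, you could run the command from that tip and look for the database-related settings in its output:

```text
bigchaindb show-config
```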

There's [more information about the `rethinkdb dump` command in the RethinkDB documentation](https://www.rethinkdb.com/docs/backup/). It also explains how to restore data to a cluster from an archive file.
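As a sketch, restoring from one of the archive files produced above might look like this (the actual filename will differ):

```text
rethinkdb restore rethinkdb_dump_<date>_<time>.tar.gz
```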

**Notes**

* If the `rethinkdb dump` subcommand fails and the last line of the Traceback says "NameError: name 'file' is not defined", then you need to update your RethinkDB Python driver; do a `pip install --upgrade rethinkdb`

* It might take a long time to back up data this way. The more data, the longer it will take.

* You need enough free disk space to store the backup file.

* If a document changes after the backup starts but before it ends, then the changed document may not be in the final backup. This shouldn't be a problem for BigchainDB, because blocks and votes can't change anyway.

* `rethinkdb dump` saves data and secondary indexes, but does *not* save cluster metadata. You will need to recreate your cluster setup yourself after you run `rethinkdb restore`.

* RethinkDB also has [subcommands to import/export](https://gist.github.com/coffeemug/5894257) collections of JSON or CSV files. While one could use those for backup/restore, it wouldn't be very practical.


## Client-Side Backup

In the future, it will be possible for clients to query for the blocks containing the transactions they care about, and for the votes on those blocks. They could save a local copy of those blocks and votes.

**How could we be sure blocks and votes from a client are valid?**

All blocks and votes are signed by cluster nodes (owned and operated by consortium members). Only cluster nodes can produce valid signatures because only cluster nodes have the necessary private keys. A client can't produce a valid signature for a block or vote.

**Could we restore an entire BigchainDB database using client-saved blocks and votes?**

Yes, in principle, but it would be difficult to know if you've recovered every block and vote. Votes link to the block they're voting on and to the previous block, so one could detect some missing blocks. It would be difficult to know if you've recovered all the votes.


## Backup by Copying RethinkDB Data Files

It's _possible_ to back up a BigchainDB database by creating a point-in-time copy of the RethinkDB data files (on all nodes, at roughly the same time). It's not a very practical approach to backup: the resulting set of files will be much larger (collectively) than what one would get using `rethinkdb dump`, and there are no guarantees on how consistent that data will be, especially for recently-written data.

If you're curious about what's involved, see the [MongoDB documentation about "Backup by Copying Underlying Data Files"](https://docs.mongodb.com/manual/core/backups/#backup-with-file-copies). (Yes, that's documentation for MongoDB, but the principles are the same.)

See the last subsection of this page for a better way to use this idea.


## Incremental or Continuous Backup

**Incremental backup** is where backup happens on a regular basis (e.g. daily), and each one only records the changes since the last backup.

**Continuous backup** might mean incremental backup on a very regular basis (e.g. every ten minutes), or it might mean backup of every database operation as it happens. The latter is also called transaction logging or continuous archiving.

At the time of writing, RethinkDB didn't have a built-in incremental or continuous backup capability, but the idea was raised in RethinkDB issues [#89](https://github.com/rethinkdb/rethinkdb/issues/89) and [#5890](https://github.com/rethinkdb/rethinkdb/issues/5890). On July 5, 2016, Daniel Mewes (of RethinkDB) wrote the following comment on issue #5890: "We would like to add this feature [continuous backup], but haven't started working on it yet."

To get a sense of what continuous backup might look like for RethinkDB, one can look at the continuous backup options available for MongoDB. MongoDB, the company, offers continuous backup with [Ops Manager](https://www.mongodb.com/products/ops-manager) (self-hosted) or [Cloud Manager](https://www.mongodb.com/cloud) (fully managed). Features include:

* It "continuously maintains backups, so if your MongoDB deployment experiences a failure, the most recent backup is only moments behind..."
* It "offers point-in-time backups of replica sets and cluster-wide snapshots of sharded clusters. You can restore to precisely the moment you need, quickly and safely."
* "You can rebuild entire running clusters, just from your backups."
* It enables, "fast and seamless provisioning of new dev and test environments."

The MongoDB documentation has more [details about how Ops Manager Backup works](https://docs.opsmanager.mongodb.com/current/application/#backup).

Considerations for BigchainDB:

* We'd like the cost of backup to be low. To get a sense of the cost, MongoDB Cloud Manager backup [cost $30 / GB / year prepaid](https://www.mongodb.com/blog/post/lower-mms-backup-prices-backing-mongodb-now-easier-and-more-affordable). One thousand gigabytes backed up (i.e. about a terabyte) would cost 30 thousand US dollars per year. (That's just for the backup; there's also a cost per server per year.)
* We'd like the backup to be decentralized, with no single point of control or single point of failure. (Note: some file systems have a single point of failure. For example, HDFS has one Namenode.)
* We only care to back up blocks and votes, and once written, those never change. There are no updates or deletes, just new blocks and votes.


## Combining RethinkDB Replication with Storage Snapshots

Although it's not advertised as such, RethinkDB's built-in replication feature is similar to continuous backup, except the "backup" (i.e. the set of replica shards) is spread across all the nodes. One could take that idea a bit further by creating a set of backup-only servers with one full backup:

* Give all the original BigchainDB nodes (RethinkDB nodes) the server tag `original`.
* Set up a group of servers running RethinkDB only, and give them the server tag `backup`. The `backup` servers could be geographically separated from all the `original` nodes (or not; it's up to the consortium to decide).
* Clients shouldn't be able to read from or write to servers in the `backup` set.
* Send a RethinkDB reconfigure command to the RethinkDB cluster to make it so that the `original` set has the same number of replicas as before (or maybe one less), and the `backup` set has one replica. Also, make sure the `primary_replica_tag='original'` so that all primary shards live on the `original` nodes.

The [RethinkDB documentation on sharding and replication](https://www.rethinkdb.com/docs/sharding-and-replication/) has the details of how to set server tags and do RethinkDB reconfiguration.
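As a concrete sketch of that reconfigure step, here is what it might look like using the RethinkDB Python driver (the table name, shard count and replica counts are illustrative assumptions; `dry_run=True` makes RethinkDB report the proposed configuration without applying it):

```text
python3 - <<'EOF'
import rethinkdb as r

conn = r.connect('localhost', 28015)

# Keep all primary shards on the original nodes; give the backup set one replica.
result = r.db('bigchain').table('bigchain').reconfigure(
    shards=1,
    replicas={'original': 2, 'backup': 1},
    primary_replica_tag='original',
    dry_run=True).run(conn)
print(result)
EOF
```

You would repeat that for each table (e.g. the votes table), and drop `dry_run=True` once you're happy with the proposed configuration.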

Once you've set up a set of backup-only RethinkDB servers, you could make a point-in-time snapshot of their storage devices, as a form of backup.

You might want to disconnect the `backup` set from the `original` set first, and then wait for reads and writes in the `backup` set to stop. (The `backup` set should have only one copy of each shard, so there's no opportunity for inconsistency between shards of the `backup` set.)

You will want to re-connect the `backup` set to the `original` set as soon as possible, so it's able to catch up.

If something bad happens to the entire original BigchainDB cluster (including the `backup` set) and you need to restore it from a snapshot, you can, but before you make BigchainDB live, you should 1) delete all entries in the backlog table, 2) delete all blocks after the last voted-valid block, 3) delete all votes on the blocks deleted in part 2, and 4) rebuild the RethinkDB indexes.

**NOTE:** Sometimes snapshots are _incremental_. For example, [Amazon EBS snapshots](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSSnapshots.html) are incremental, meaning "only the blocks on the device that have changed after your most recent snapshot are saved. **This minimizes the time required to create the snapshot and saves on storage costs.**" [Emphasis added]
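For example, snapshotting an EBS volume with the AWS CLI might look like this (the volume ID is a placeholder; use the ID of the volume holding your `backup` set's data):

```text
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "backup-set snapshot"
```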
@ -1,61 +0,0 @@

# RethinkDB Requirements

[The RethinkDB documentation](https://rethinkdb.com/docs/) should be your first source of information about its requirements. This page serves mostly to document some of its more obscure requirements.

RethinkDB Server [will run on any modern OS](https://www.rethinkdb.com/docs/install/). Note that the Fedora package isn't officially supported. Also, official support for Windows is fairly recent ([April 2016](https://rethinkdb.com/blog/2.3-release/)).


## Storage Requirements

When it comes to storage for RethinkDB, there are many things that are nice to have (e.g. SSDs, high-speed input/output [IOPS], replication, reliability, scalability, pay-for-what-you-use), but there are few _requirements_ other than:

1. have enough storage to store all your data (and its replicas), and
2. make sure your storage solution (hardware and interconnects) can handle your expected read & write rates.

For RethinkDB's failover mechanisms to work, [every RethinkDB table must have at least three replicas](https://rethinkdb.com/docs/failover/) (i.e. a primary replica and two others). For example, if you want to store 10 GB of unique data, then you need at least 30 GB of storage. (Indexes and internal metadata are stored in RAM.)

As for the read & write rates, what do you expect those to be for your situation? It's not enough for the storage system alone to handle those rates: the interconnects between the nodes must also be able to handle them.

**Storage Notes Specific to RethinkDB**

* The RethinkDB storage engine has a number of SSD optimizations, so you _can_ benefit from using SSDs. ([source](https://www.rethinkdb.com/docs/architecture/))

* If you have an N-node RethinkDB cluster and 1) you want to use it to store an amount of data D (unique records, before replication), 2) you want the replication factor to be R (all tables), and 3) you want N shards (all tables), then each BigchainDB node must have storage space of at least R×D/N. (For example, with D = 1 TB, R = 3 and N = 6 nodes, each node needs at least 3×1 TB/6 = 0.5 TB.)

* RethinkDB tables can have [at most 64 shards](https://rethinkdb.com/limitations/). What does that imply? Suppose you only have one table, with 64 shards. How big could that table be? It depends on how much data can be stored in each node. If the maximum amount of data that a node can store is d, then the biggest-possible shard is d, and the biggest-possible table size is 64 times that. (All shard replicas would have to be stored on other nodes beyond the initial 64.) If there are two tables, the second table could also have 64 shards, stored on 64 other maxed-out nodes, so the total amount of unique data in the database would be (64 shards/table)×(2 tables)×d. In general, if you have T tables, the maximum amount of unique data that can be stored in the database (i.e. the amount of data before replication) is 64×T×d.

* When you set up storage for your RethinkDB data, you may have to select a filesystem. (Sometimes, the filesystem is already decided by the choice of storage.) We recommend using a filesystem that supports direct I/O (Input/Output). Many compressed or encrypted file systems don't support direct I/O. The ext4 filesystem supports direct I/O (but be careful: if you enable the data=journal mode, then direct I/O support will be disabled; the default is data=ordered). If your chosen filesystem supports direct I/O and you're using Linux, then you don't need to do anything to request or enable direct I/O. RethinkDB does that.

  > What is direct I/O? It allows RethinkDB to write directly to the storage device (or use its own in-memory caching mechanisms), rather than relying on the operating system's file read and write caching mechanisms. (If you're using Linux, a write-to-file normally writes to the in-memory Page Cache first; only later does that Page Cache get flushed to disk. The Page Cache is also used when reading files.)

* RethinkDB stores its data in a specific directory. You can tell RethinkDB _which_ directory using the RethinkDB config file, as explained below. In this documentation, we assume the directory is `/data`. If you set up a separate device (partition, RAID array, or logical volume) to store the RethinkDB data, then mount that device on `/data`.
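For instance, a sketch of the corresponding `/etc/fstab` entry (the device name is an assumption; `data=ordered` is ext4's default, which keeps direct I/O available):

```text
/dev/md0    /data    ext4    defaults,data=ordered    0    2
```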

## Memory (RAM) Requirements

In their [FAQ](https://rethinkdb.com/faq/), RethinkDB recommends that "RethinkDB servers have at least 2GB of RAM..."

In particular: "RethinkDB requires data structures in RAM on each server proportional to the size of the data on that server’s disk, usually around 1% of the size of the total data set." ([source](https://rethinkdb.com/limitations/)) We asked what they meant by "total data set" and [they said](https://github.com/rethinkdb/rethinkdb/issues/5902#issuecomment-230860607) it's "referring to only the data stored on the particular server."

Also, "The storage engine is used in conjunction with a custom, B-Tree-aware caching engine which allows file sizes many orders of magnitude greater than the amount of available memory. RethinkDB can operate on a terabyte of data with about ten gigabytes of free RAM." ([source](https://www.rethinkdb.com/docs/architecture/)) (In this case, it's the _cluster_ which has a total of one terabyte of data, and it's the _cluster_ which has a total of ten gigabytes of RAM. That is, if you add up the RethinkDB RAM on all the servers, it's ten gigabytes.)

In response to our questions about RAM requirements, @danielmewes (of RethinkDB) [wrote](https://github.com/rethinkdb/rethinkdb/issues/5902#issuecomment-230860607):

> ... If you replicate the data, the amount of data per server increases accordingly, because multiple copies of the same data will be held by different servers in the cluster.

For example, if you increase the data replication factor from 1 to 2 (i.e. the primary plus one copy), then that will double the RAM needed for metadata. Also from @danielmewes:

> **For reasonable performance, you should probably aim at something closer to 5-10% of the data size.** [Emphasis added] The 1% is the bare minimum and doesn't include any caching. If you want to run near the minimum, you'll also need to manually lower RethinkDB's cache size through the `--cache-size` parameter to free up enough RAM for the metadata overhead...

RethinkDB has [documentation about its memory requirements](https://rethinkdb.com/docs/memory-usage/). You can use that page to get a better estimate of how much memory you'll need. In particular, note that RethinkDB automatically configures the cache size limit to be about half the available memory, but it can be no lower than 100 MB. As @danielmewes noted, you can manually change the cache size limit (e.g. to free up RAM for queries, metadata, or other things).
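For example, to cap the cache at 4096 MB when starting RethinkDB from the command line (an illustrative value; the equivalent `cache-size` setting can also go in the RethinkDB config file):

```text
rethinkdb --cache-size 4096
```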

If a RethinkDB process (on a server) runs out of RAM, the operating system will start swapping RAM out to disk, slowing everything down. According to @danielmewes:

> Going into swap is usually pretty bad for RethinkDB, and RethinkDB servers that have gone into swap often become so slow that other nodes in the cluster consider them unavailable and terminate the connection to them. I recommend adjusting RethinkDB's cache size conservatively to avoid this scenario. RethinkDB will still make use of additional RAM through the operating system's block cache (though less efficiently than when it can keep data in its own cache).


## Filesystem Requirements

RethinkDB "supports most commonly used file systems" ([source](https://www.rethinkdb.com/docs/architecture/)) but it has [issues with BTRFS](https://github.com/rethinkdb/rethinkdb/issues/2781) (B-tree file system).

It's best to use a filesystem that supports direct I/O, because that will improve RethinkDB performance (if you tell RethinkDB to use direct I/O). Many compressed or encrypted filesystems don't support direct I/O.