gun/faq.txt

Is Gun really distributed if it relies on S3?
Yes. If you are running a multi-machine setup, each machine is likely in a dedicated region (North America, Europe, Asia, etc.), and each is probably persisting its data to an S3 bucket in its own region - however, this configuration is up to you. Gun persistence also works with any S3-compatible API, which lets you back up to multiple locations simultaneously. (Note: even if you did store all your data in a single S3 region, S3 itself distributes your data within its own infrastructure, so it is still much more reliable than having all your data sit on a single machine, which is probably the case if you manage your database yourself.)
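As a rough illustration of the "back up to multiple locations" point, here is a minimal sketch that mirrors each persisted record to several S3-compatible endpoints. It uses today's AWS SDK v3 purely for illustration; the bucket name, the custom endpoint, and the saveNode() wrapper are hypothetical, and Gun's actual persistence plugin is not shown.

    import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

    // One client per backup target; any S3-compatible API can be a target.
    const targets = [
      new S3Client({ region: "us-east-1" }),
      new S3Client({ region: "eu-west-1" }),
      new S3Client({ region: "us-east-1", endpoint: "https://s3.example.internal" }), // hypothetical self-hosted store
    ];

    // Write the same record to every target so each location holds a full copy.
    async function saveNode(key: string, node: object): Promise<void> {
      const body = JSON.stringify(node);
      await Promise.all(
        targets.map((client) =>
          client.send(new PutObjectCommand({ Bucket: "gun-backup", Key: key, Body: body }))
        )
      );
    }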
What queries does Gun support, and what is their runtime complexity?
One of the main inspirations for Gun's creation was keeping the working set of data in memory so that you can run fast, arbitrary searches. This stemmed from needing increasingly complicated queries in traditional databases and failing to express them, because the query language was limited or forced the developer into an overly complex query when writing one's own map function would have been easier, more maintainable, and faster. Gun hands that power back to you as a developer, letting you write your own optimized queries - and if you don't want to do that, future versions of Gun and/or other plugins will provide these for you, using the best known traversal algorithms.
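To make the "write your own map function" idea concrete, here is a minimal sketch, assuming the working set is simply an in-memory map of nodes. The Node shape and the query() helper are hypothetical stand-ins, not Gun's API.

    // The working set: an in-memory map of nodes (hypothetical Node shape).
    type Node = Record<string, unknown> & { _id: string };
    const workingSet = new Map<string, Node>();

    // A "query" is just a plain function applied over the working set - no query language.
    function query(match: (node: Node) => boolean): Node[] {
      const results: Node[] = [];
      for (const node of workingSet.values()) {
        if (match(node)) results.push(node);
      }
      return results;
    }

    // Example: every node whose "age" field is over 21 - one O(n) pass over memory.
    const adults = query((node) => typeof node.age === "number" && node.age > 21);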
How does Gun index its data?
Because Gun is a graph engine, indexing is efficient: every node sits at the top level of the graph. A node is any object that is referenced more than once, any value that sits in an array, or any node you create manually. Even in the worst case, scanning the entire database is fast because every node lives in the top level of the graph's hashtable - a full scan is linear in the size of your graph. The size of your graph is up to you and is limited only by how much memory you have, but it is recommended that you keep graphs small and use external references to link to nodes in other graphs (these work almost identically to internal references).
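A minimal sketch of that claim, assuming the graph is a plain top-level hashtable keyed by node id; the "#" reference field and the helper functions are illustrative, not Gun's exact wire format or API.

    type Ref = { "#": string };                 // a pointer to another top-level node
    type GraphNode = Record<string, unknown>;   // field values may be plain data or Refs
    type Graph = Record<string, GraphNode>;     // the top-level hashtable: id -> node

    const graph: Graph = {
      "person/alice": { name: "Alice", boss: { "#": "person/bob" } },   // internal reference
      "person/bob":   { name: "Bob", team: { "#": "graph2/team/7" } },  // external-style reference to another graph
    };

    // Following a reference is a single hashtable lookup, regardless of graph size.
    function follow(ref: Ref): GraphNode | undefined {
      return graph[ref["#"]];
    }

    // Worst case: visit every node once - linear in the number of nodes, nothing deeper to walk.
    function scan(match: (node: GraphNode) => boolean): GraphNode[] {
      return Object.values(graph).filter(match);
    }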
What happens if the persistence layer is unreachable?
Your app will continue to work as normal within the limits of the data you have in cache. How much data you have in cache depends on whether you are using the default cache (which loses its data if your process dies) or the optional Redis cache, which persists to a local disk and is limited by the size of your ephemeral drive. All other data outside your cache will be unreachable. When the network is restored, Gun will resync all the data it has in cache, handling conflict resolution for you. Data should not be lost, because the browser client that issued the updates preserves them until it gets confirmation from the server that persistence has been achieved; these edits will even survive page reloads if the client browser supports the localStorage fallback.
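A minimal sketch of that client-side durability, assuming updates are queued in localStorage until the server acknowledges persistence; the key name, Update shape, and ack protocol here are hypothetical, not Gun's actual sync protocol.

    const PENDING_KEY = "gun/pending"; // hypothetical localStorage key

    type Update = { id: string; data: unknown };

    function loadPending(): Update[] {
      return JSON.parse(localStorage.getItem(PENDING_KEY) ?? "[]");
    }

    function savePending(updates: Update[]): void {
      localStorage.setItem(PENDING_KEY, JSON.stringify(updates));
    }

    // Queue the update locally first, then try to send it to the server.
    function put(update: Update, send: (u: Update) => void): void {
      savePending([...loadPending(), update]);
      send(update);
    }

    // Only drop an update once the server confirms it has been persisted.
    function onAck(id: string): void {
      savePending(loadPending().filter((u) => u.id !== id));
    }

    // After a reload or reconnect, resend anything still unacknowledged.
    function resync(send: (u: Update) => void): void {
      loadPending().forEach(send);
    }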
Is there a leader node with Gun?
No, Gun is fully peer to peer. The only "leader" is the primary IP address(es) you have assigned to your apps, which your users and servers connect through. If those are unavailable, Gun is unaffected, because no users can reach your app either, and therefore no modifications can be triggered.
---------------------------------------------------------------
How would an arbitrary, un-indexed query on 1 billion records compare to traditional databases?
Let's assume that all 1B records can fit on a single solid state disk.
- (1) If you are managing your database yourself, and the database is on the same disk as your server application logic (this is the best-case scenario): your app logic sends the query over a socket to the database, which is fast. The database then begins its scan (preferably) on the records it has in memory and collects the result; this is decently fast. Once it runs out of the working set, it starts loading from disk, scanning whatever it loaded, and collecting it into the result. This is terribly, terribly, terribly slow - on 1B records it will take a very long time. At some point the result itself may not fit into memory, or it will exceed a maximum limit, and the database will flush its partial result back over the socket to your application while keeping a cursor of where it stopped scanning the disk. It is then up to your application logic to make another query to the database with that cursor ID, and the database resumes its scan as before. This process repeats until all 1B records are scanned. If your application logic stacks all these results on top of each other in its own memory, your application process will start hogging the entire machine's memory, forcing the database's working set to shrink smaller and smaller - causing an even crazier slowdown in the database query. It is up to your application logic to handle each intermediate result and flush it from memory; if it doesn't, your result will start getting paged to disk as swap, and each time your application receives a new batch from the database, the result gets swapped back into memory so the new batch can be appended, then paged back to disk. This causes even worse slowdowns.
That was the best-case scenario.
- (2) If you are managing your database yourself and it is on a separate machine: your app logic sends the query over a network socket, which is fast since the query is small, but already much slower than if the database were on the same machine. The database on the other machine receives the request and does what happened in (1), except its memory is not being eaten into by your application, which lets it be a little faster. But this is counteracted by the fact that each time it fills up to the maximum supported limit (or memory), it flushes the result data back over the network to you. This is significantly slow, because it then waits for your application to receive the result and send a follow-up to continue the query at the same cursor. (Note: some databases may buffer the next "follow up" so the response is immediate once the request comes in, but there is a timeout on how long that buffer is kept in memory.) Your application and the database machine continue this back-and-forth chatter until the scan is complete - but mind you, this chatter is happening over the network, so it is slow. Additionally, your application may run out of memory if it stacks the results on top of each other before flushing, and the same swap/paging process happens as before.
- (3) If you are using a Database as a Service: the same thing happens as in (2), except you'll be paying thousands of dollars every month for them to store 1B records for you, and it will still be crazy slow because you have to rely on the network chatter back and forth.
- (4) If you are using Gun: you already have your working set in memory, so you don't need to query it - you just collect the results from it. While this is happening, you issue a LIST request over the network to S3; this is fast since the request is small, but already slower than if the data were in memory. S3 replies with the metadata and you start streaming each file in, which is slow because it relies on the network. As many files as possible are buffered in, and you collect your result from each. Meanwhile, the S3 metadata (the LIST result) also has a limit, so your application needs to keep following up with a new LIST request from the new cursor point - however, this is fast because it can happen concurrently with streaming the files; it does not have to wait until the file streams finish. Files that are streamed in simply page to disk if backpressure builds up, and wait until your application can catch up. Pulling from disk, as in (1), will be terribly, terribly slow. If your result stacks up, it may need to page to disk while the next streamed file is brought into memory; then the buffered result is collected, the result swapped back into memory, appended, and the process repeats. But the advantage of this method is that your application's memory usage doesn't clog up S3, and once the scanning is complete, the final result is already in memory (or has just been swapped back in) - you don't need to rely on any final network call.
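A minimal sketch of the LIST-and-stream pattern described in (4), assuming today's AWS SDK v3 S3 client; the bucket name and the match() predicate are hypothetical, and backpressure/paging to disk is omitted. The point it shows is that the next LIST page is requested with its continuation cursor while the previous page's object streams are still in flight.

    import { S3Client, ListObjectsV2Command, GetObjectCommand } from "@aws-sdk/client-s3";

    const s3 = new S3Client({ region: "us-east-1" });
    const Bucket = "gun-records"; // hypothetical bucket holding one object per record

    async function scanAll(match: (record: unknown) => boolean): Promise<unknown[]> {
      const results: unknown[] = [];
      const pageWork: Promise<unknown>[] = [];
      let cursor: string | undefined;

      do {
        // Fetch the next page of metadata; the continuation token is the cursor.
        const page = await s3.send(
          new ListObjectsV2Command({ Bucket, ContinuationToken: cursor })
        );
        cursor = page.IsTruncated ? page.NextContinuationToken : undefined;

        // Kick off the object streams for this page but do NOT await them here:
        // the next LIST request goes out while these downloads are still running.
        pageWork.push(
          Promise.all(
            (page.Contents ?? []).map(async ({ Key }) => {
              const obj = await s3.send(new GetObjectCommand({ Bucket, Key: Key! }));
              const record = JSON.parse(await obj.Body!.transformToString());
              if (match(record)) results.push(record);
            })
          )
        );
      } while (cursor);

      await Promise.all(pageWork); // wait for every stream before returning the result
      return results;
    }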
In conclusion, (4) is terribly, terribly slow at first because of the initial network calls, but as the process gets going, all the data is simply streamed in and paged to disk, and the speed becomes essentially equivalent to (1). This is very unlike (2) and (3), where each result has to be buffered, sent across the network, received, and only then replied to. The only way to make (2) and (3) faster would be if your application could somehow predict where each cursor would end and send the follow-up requests one after another. This would force the database to queue the requests as independent of each other and fire off each response; your application would then have to handle these incoming streams and stitch them together, and it would be about as fast as (4). Which, as we are learning, becomes as fast as (1) if the streaming responses come in fast enough that they stack up and get paged to disk, or is otherwise only as fast as the network can reply - which is the same slowness as (2) and (3).
Therefore it is reasonable to assert that Gun can be just as fast as a DaaS for scanning through 1B records, but tremendously cheaper, because you're paying for raw storage rather than expensive database maintenance. On top of that, S3 is practically speaking infinitely scalable, while DaaS offerings have tiered levels (so you'd need to worry about constantly upgrading as you grew to 1B records; with S3 you wouldn't have to touch a thing). There are tons more arguments, but that would be trailing off topic. I'll end here.