Qserv Data Placement

Qserv data placement and replication strategies.

The goal is to provide a sharp requirements list for data placement. These requirements are critical in order to design an efficient and well-fitted solution.
Here’s an initial proposal:

Maximize high availability

SQL user queries must continue in real time if fewer than 2% of the worker nodes crash.

Minimize overall cost

Costs include infrastructure, software development, system administration, and maintenance.

Minimize data reconstruction time

A node that crashes must be cloned (data and application) in a minimal amount of time.

Prevent data loss

At least one instance of each chunk must be available on disk at all times, even if 5% of the storage is definitively lost.
Archival storage should also be managed by the system in order to reduce cost.
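
As a rough illustration of what this requirement implies for the replication level, here is a back-of-the-envelope sketch; the chunk count and the independent-placement assumption are illustrative, not project figures.

```python
# Back-of-the-envelope sketch (assumed, illustrative numbers): how many
# replicas are needed so that losing 5% of the storage is unlikely to make
# any chunk unavailable, assuming replicas are placed independently and
# uniformly at random.

n_chunks = 100_000        # assumed total number of chunks (illustrative)
lost_fraction = 0.05      # requirement: survive loss of 5% of the storage

for replicas in range(1, 5):
    # Probability that every replica of one chunk sits on lost storage.
    p_chunk_lost = lost_fraction ** replicas
    expected_lost = n_chunks * p_chunk_lost
    print(f"{replicas} replica(s): ~{expected_lost:,.1f} chunks expected with no surviving copy")
```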

Maximize performance

Bottlenecks must be identified and quantified for each proposed architecture. Chunk placement combinations can then be optimized for a given query set.

Very interesting and extremely challenging piece of functionality. Certainly not a short-term or trivial task.

The state of the practice in data recovery focuses on many-to-many reconstruction, to avoid secondary failures within the MTBF, using secondary channels to limit the impact on performance. Better solutions allow rebuilds to be scheduled so that they do not interfere with key production operations.

If drive size is not to be kept small or if, in general, you want to minimize rebuild times, you’ll want shard-node placement to NOT be identical between any two drives or any two systems throughout the cluster. This may be in contrast to the shared-scan model for Qserv. The placement mapping would need to be very dynamic; not a minor implementation by any means.
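
To illustrate the idea, here is a hypothetical placement sketch (not Qserv's actual placement code): each chunk gets its own pseudo-random ranking of the nodes, so no two nodes hold identical chunk sets and a failed node can be rebuilt from many peers in parallel.

```python
# Hypothetical placement sketch (not Qserv code): scatter chunk replicas so
# that no two nodes hold identical chunk sets, which lets a failed node be
# rebuilt from many peers in parallel.
import hashlib

def place_chunk(chunk_id: int, nodes: list[str], n_replicas: int = 2) -> list[str]:
    """Rank nodes by a per-chunk hash (rendezvous/HRW hashing) and keep the
    top n_replicas; different chunks get different node orderings."""
    def score(node: str) -> int:
        return int(hashlib.sha256(f"{chunk_id}:{node}".encode()).hexdigest(), 16)
    return sorted(nodes, key=score, reverse=True)[:n_replicas]

nodes = [f"worker{i:02d}" for i in range(30)]
placement = {chunk: place_chunk(chunk, nodes) for chunk in range(10_000)}

# Rebuild sources for a failed node: every node that shares a chunk with it.
failed = "worker07"
sources = {replica
           for chunk, locs in placement.items() if failed in locs
           for replica in locs if replica != failed}
print(f"{failed} can be rebuilt from {len(sources)} peer nodes")
```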

Given that Qserv is ‘bandwidth bound’ rather than ‘latency bound’, it may be advantageous to look at storage solutions that provide recovery transparently to Qserv: think iSCSI, distributed file systems, etc. These designs may/will provide greater bandwidth than current single-disk expectations, so this could further reduce cluster size as well … or not … much inspection to be done.

I think the starting point is to look at the hard numbers (kpm and cost) and compare the designs. A custom development will cost significantly more in FTEs; an infrastructure solution will cost more in hardware but still carries some development cost. Both require significant QA/QC effort.

Hi Fabrice,

This is an interesting topic to raise, which coincides nicely with some work we’re doing here in the UK to build experience with Qserv. We are considering questions related to: how new data releases will be ingested into Qserv; how different DACs will cooperate if they have different Qserv installations (e.g. different numbers of nodes); and whether different DACs will tailor their data distribution policy in response to the types of queries they most frequently see.

We have more questions than answers at present, but we do have some ongoing activities trying to address these, which may be interesting to others and would be more fruitful with input from a wider group (and, in particular, experts such as yourself).

Do you have thoughts on how to progress this topic?

Thanks again for raising,
George (LSST:UK/ Edinburgh).

Thanks so much for your answers, I’ll take them into account. It is a good idea to have a wider working group on this hot topic. Will let you know soon 🙂

A couple of considerations that should be included.

  • The maximum size for a chunk is a function of the memory installed on the workers. There are serious performance penalties if chunks are too large. Having lots of chunks increases bookkeeping costs, which is a lesser concern until the number of chunks becomes very high.
  • The number of replicas for each chunk. If there are only 2 copies of each chunk (one primary and one replica), there are issues that need to be considered. Among them, if there are significantly more chunks than worker nodes, losing 2 workers will likely cause some chunks to be unavailable.
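
To make the second point concrete, here is a back-of-the-envelope sketch; the worker and chunk counts, and the uniform-random-pair placement, are assumptions for illustration only.

```python
# Back-of-the-envelope sketch (assumed numbers): with 2 replicas per chunk,
# estimate the probability that losing 2 workers leaves some chunk with no
# surviving replica, assuming each replica pair lands on a uniformly random
# worker pair.
from math import comb

n_workers = 30        # assumption
n_chunks = 50_000     # assumption: far more chunks than worker pairs

pairs = comb(n_workers, 2)              # distinct worker pairs
p_chunk_on_failed_pair = 1 / pairs      # a given chunk lives exactly on the 2 failed workers
p_some_chunk_lost = 1 - (1 - p_chunk_on_failed_pair) ** n_chunks

print(f"worker pairs: {pairs}")
print(f"P(some chunk unavailable after losing 2 workers) ~ {p_some_chunk_lost:.4f}")
```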

Disaster recovery should be enabled by DAC and DAC-satellite management. However, I guess “Data Placement” might be involved in disaster recovery (for example, if no tape storage is used, when copying backup data to main site).

I realize “Data Placement” != “Disaster Recovery”, and not to hijack this thread, but this comment is interesting. I had thought of data placement as relevant to how Qserv would implement recover-in-place. And I see disaster recovery as a separate, likely unrelated concern (although the loading method must take into account changes in hardware configs between initial load and recovery). I think as we consider operations for Qserv instead of just the standing up of Qserv, this all becomes related.

FYI, I’ve tried to summarize our discussion in a draft DM technical note:

Furthermore, I’ve tried to compute the I/O throughput required for one Qserv node; comments are welcome!
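
For readers who want the flavour of that calculation, here is a small sketch; the data volume per node, the number of concurrent shared scans, and the scan-time target are assumed values, not the figures from the technote.

```python
# Illustrative estimate of the sequential-read throughput one Qserv worker
# needs; all inputs are assumptions, not the technote's actual figures.

data_per_node_tb = 15      # assumed data hosted by one worker (TB)
scan_time_hours = 12       # assumed target: one full shared scan every 12 h
concurrent_scans = 2       # assumed number of overlapping shared scans

bytes_per_tb = 1e12
throughput_mb_s = (data_per_node_tb * bytes_per_tb * concurrent_scans
                   / (scan_time_hours * 3600) / 1e6)
print(f"required sequential read throughput ~ {throughput_mb_s:.0f} MB/s per node")
```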

We’ve now published the technote: https://dmtn-032.lsst.io.

Hi @jalt,

FYI, we are now testing CEPH with Qserv, but we see ~20% iowait when doing a full scan of the Source tables (~1 TB on 60 disks). In the Qserv VMs, the data disk is mounted using ext4. We are around 2-3x faster than at IN2P3 (10 RAID6 disks), where we also see around 20% iowait. So this does not look so bad for CEPH, and we’ll try to reduce the number of disks.

On your side, do you have any ideas on how to optimize this setup? Should we test other distributed file systems that might be better suited for Qserv (like GlusterFS or RozoFS)? Remember that we would like to optimize massively parallel full scans of large MyISAM files (i.e. ~200 GB).

Let me paraphrase because I want to be sure I am following correctly.

@ IN2P3: docker instances access the CEPH RESTful interface, experiencing 20% iowait
@ NCSA: docker instances access ext4 on RAID 8+2 and are 2-3x faster

And in both cases, the object/file size is ~200GB. Is that an accurate summary?

Sorry I wasn’t clear enough

  • We are comparing the CC-IN2P3 infrastructure, which is bare-metal, and the Galactica infrastructure, which is based on OpenStack + CEPH.
  • On both infrastructures, our tests launch SQL queries which sequentially read large MyISAM files (~2.5 GB currently, expected to be ~200 GB in LSST DR1).
  • At CC-IN2P3: docker instances running on bare metal access 10 RAID6 disks, experiencing 20% iowait.
  • At Galactica: docker instances running on OpenStack VMs access a CEPH RADOS Block Device, also experiencing 20% iowait. It is 2-3x faster than at CC-IN2P3.
  • At Galactica, however, only the data for one worker node is read (~1.5 TB of MyISAM files), sharded across 60 disks. We aim to reduce the number of disks while improving performance, in order to design a low-granularity CEPH storage layout (n Qserv nodes for m OSDs), so that we can scale up linearly to hundreds of Qserv nodes.

Would you have any suggestions or ideas to help us achieve this goal?

If you are concerned about the 20% iowait, benchmark the storage (dd, xdd, ior) for comparison to the rates you are seeing with the application. If you get better rates during the benchmarking, then it is likely an application tuning issue. Galactica may be a noisy platform, so run the test several times for consistency. If the RAID6 has multiple readers, you may throw off performance due to contention (head seek times). Make sure the tests match the application’s access pattern.
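
As a complement to dd/xdd/ior, here is a minimal sketch of a benchmark that mimics the application's access pattern (a single stream reading a large MyISAM data file sequentially); the file path and block size are placeholders.

```python
# Minimal sequential-read benchmark sketch mimicking a MyISAM full scan.
# The path and block size are placeholders; run it several times and on
# files larger than RAM (or drop caches) so the page cache does not
# inflate the results on a shared platform.
import time

PATH = "/qserv/data/mydb/Source_1234.MYD"   # placeholder file path
BLOCK = 1 << 20                              # 1 MiB reads, roughly scan-sized

def sequential_read_mb_s(path: str, block: int = BLOCK) -> float:
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:
        while True:
            buf = f.read(block)
            if not buf:
                break
            total += len(buf)
    elapsed = time.monotonic() - start
    return total / elapsed / 1e6

if __name__ == "__main__":
    print(f"{sequential_read_mb_s(PATH):.0f} MB/s sequential read from {PATH}")
```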

Getting perfect performance from 60 disks is difficult due to bottlenecks and layout on disk. I assume these are 2.5" drives, so maybe a best case of 80 MB/s × 60 = 4.8 GB/s (ish), which exceeds your 10 Gb interfaces. The ideal configuration may be hyper-converged, to eliminate bottlenecks, and use a storage system (CEPH, Gluster, Quobyte) that can recover on failure.
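
Spelling that arithmetic out (the per-disk rate is an assumed best case for 2.5" drives):

```python
# Quick arithmetic behind the 60-disk estimate; the per-disk rate is an
# assumed best case for 2.5" drives.
per_disk_mb_s = 80
n_disks = 60
nic_gbit_s = 10

aggregate_gb_s = per_disk_mb_s * n_disks / 1000           # ~4.8 GB/s best case
nic_gb_s = nic_gbit_s / 8                                  # ~1.25 GB/s ceiling
disks_to_saturate_nic = nic_gb_s * 1000 / per_disk_mb_s    # ~16 disks

print(f"aggregate disk bandwidth ~ {aggregate_gb_s:.1f} GB/s")
print(f"10 GbE ceiling ~ {nic_gb_s:.2f} GB/s (~{disks_to_saturate_nic:.0f} disks already saturate it)")
```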