Qserv Data Placement

Qserv data placement and replication strategies.

The goal is to provide a sharp requirements list for data placement. These requirements are critical in order to design an efficient and well-fitted solution.
Here’s an initial proposal:

Maximize high availability

SQL user queries must continue in real time if fewer than 2% of the worker nodes crash.

Minimize overall cost

Costs include infrastructure, software development, system administration, and maintenance.

Minimize data reconstruction time

A node that crashes must be cloned (data and application) in a minimal amount of time.

Prevent data loss

At least one instance of each chunk must be available on disk at all times, even if 5% of the storage is definitively lost.
Archival storage should also be managed by the system in order to reduce cost.
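
As a rough illustration of what this requirement implies for the replication level, here is a back-of-the-envelope sketch; the chunk count and the independent-placement assumption are illustrative, not project figures.

```python
# Back-of-the-envelope sketch (assumed, illustrative numbers): how many
# replicas are needed so that losing 5% of the storage is unlikely to make
# any chunk unavailable, assuming replicas are placed independently and
# uniformly at random.

n_chunks = 100_000        # assumed total number of chunks (illustrative)
lost_fraction = 0.05      # requirement: survive loss of 5% of the storage

for replicas in range(1, 5):
    # Probability that every replica of one chunk sits on lost storage.
    p_chunk_lost = lost_fraction ** replicas
    expected_lost = n_chunks * p_chunk_lost
    print(f"{replicas} replica(s): ~{expected_lost:,.1f} chunks expected with no surviving copy")
```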

Maximize performance

Bottlenecks must be identified and quantified for each proposed architecture. Chunk placement combinations can then be optimized for a given query set.

Very interesting and extremely challenging piece of functionality. Certainly not a short-term or trivial task.

The state of the practice in data recovery focuses on many-to-many reconstruction, to avoid secondary failures within the MTBF, using secondary channels to limit the impact on performance. Better solutions allow rebuilds to be scheduled so that they do not interfere with key production operations.

If drive size is not to be kept small or if, in general, you want to minimize rebuild times, you’ll want shard-node placement to NOT be identical between any two drives or any two systems throughout the cluster. This may be in contrast to the shared-scan model for Qserv. The placement mapping would need to be very dynamic; not a minor implementation by any means.
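
To illustrate the idea, here is a hypothetical placement sketch (not Qserv's actual placement code): each chunk gets its own pseudo-random ranking of the nodes, so no two nodes hold identical chunk sets and a failed node can be rebuilt from many peers in parallel.

```python
# Hypothetical placement sketch (not Qserv code): scatter chunk replicas so
# that no two nodes hold identical chunk sets, which lets a failed node be
# rebuilt from many peers in parallel.
import hashlib

def place_chunk(chunk_id: int, nodes: list[str], n_replicas: int = 2) -> list[str]:
    """Rank nodes by a per-chunk hash (rendezvous/HRW hashing) and keep the
    top n_replicas; different chunks get different node orderings."""
    def score(node: str) -> int:
        return int(hashlib.sha256(f"{chunk_id}:{node}".encode()).hexdigest(), 16)
    return sorted(nodes, key=score, reverse=True)[:n_replicas]

nodes = [f"worker{i:02d}" for i in range(30)]
placement = {chunk: place_chunk(chunk, nodes) for chunk in range(10_000)}

# Rebuild sources for a failed node: every node that shares a chunk with it.
failed = "worker07"
sources = {replica
           for chunk, locs in placement.items() if failed in locs
           for replica in locs if replica != failed}
print(f"{failed} can be rebuilt from {len(sources)} peer nodes")
```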

Given that Qserv is ‘bandwidth bound’ rather than ‘latency bound’, it may be advantageous to look at storage solutions that provide recovery transparently to Qserv: think iSCSI, distributed file systems, etc. These designs may/will provide greater bandwidth than current single-disk expectations, so this could further reduce cluster size as well … or not … much inspection to be done.

I think the starting point is to look at the hard numbers (kpm and cost) and compare the designs. A custom development will cost significantly more in FTEs; an infrastructure solution will cost more in hardware but still carries some development cost. Both require significant QA/QC effort.

Hi Fabrice,

This is an interesting topic to raise, which coincides nicely with some work we’re doing here in the UK to build experience with Qserv. We are considering questions related to: how new data releases will be ingested into Qserv; how different DACs will cooperate if they have different Qserv installations (e.g. different numbers of nodes); and whether different DACs will tailor their data distribution policy in response to the types of queries they most frequently see.

We have more questions than answers at present, but we do have some ongoing activities trying to address these, which may be interesting to others and would be more fruitful with input from a wider group (and, in particular, experts such as yourself).

Do you have thoughts on how to progress this topic?

Thanks again for raising,
George (LSST:UK/ Edinburgh).

Thanks so much for your answers, I’ll take them into account. It is a good idea to have a wider working group on this hot topic. Will let you know soon 🙂

A couple of considerations that should be included.

  • The maximum size for a chunk is a function of the memory installed on the workers. There are serious performance penalties if chunks are too large. Having lots of chunks increases bookkeeping costs, which is a lesser concern until the number of chunks becomes very high.
  • The number of replicas for each chunk. If there are only 2 copies of each chunk (one primary and one replica), there are issues that need to be considered. Among them, if there are significantly more chunks than worker nodes, losing 2 workers will likely cause some chunks to be unavailable.
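
To make the second point concrete, here is a back-of-the-envelope sketch; the worker and chunk counts, and the uniform-random-pair placement, are assumptions for illustration only.

```python
# Back-of-the-envelope sketch (assumed numbers): with 2 replicas per chunk,
# estimate the probability that losing 2 workers leaves some chunk with no
# surviving replica, assuming each replica pair lands on a uniformly random
# worker pair.
from math import comb

n_workers = 30        # assumption
n_chunks = 50_000     # assumption: far more chunks than worker pairs

pairs = comb(n_workers, 2)              # distinct worker pairs
p_chunk_on_failed_pair = 1 / pairs      # a given chunk lives exactly on the 2 failed workers
p_some_chunk_lost = 1 - (1 - p_chunk_on_failed_pair) ** n_chunks

print(f"worker pairs: {pairs}")
print(f"P(some chunk unavailable after losing 2 workers) ~ {p_some_chunk_lost:.4f}")
```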

Disaster recovery should be enabled by DAC and DAC-satellite management. However, I guess “Data Placement” might be involved in disaster recovery (for example, if no tape storage is used, when copying backup data to main site).

I realize “Data Placement” != “Disaster Recovery”, and not to hijack this thread, but this comment is interesting. I had thought of data placement as relevant to how Qserv would implement recover-in-place. And I see disaster recovery as a separate, likely unrelated concern (although the loading method must take into account changes in hardware configs between initial load and recovery). I think as we consider operations for Qserv instead of just the standing up of Qserv, this all becomes related.

FYI, I’ve tried to summarize our discussion in a draft DM technical note:

Furthermore, I’ve tried to compute the I/O throughput required for one Qserv node; comments are welcome!
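
For readers who want the flavour of that calculation, here is a small sketch; the data volume per node, the number of concurrent shared scans, and the scan-time target are assumed values, not the figures from the technote.

```python
# Illustrative estimate of the sequential-read throughput one Qserv worker
# needs; all inputs are assumptions, not the technote's actual figures.

data_per_node_tb = 15      # assumed data hosted by one worker (TB)
scan_time_hours = 12       # assumed target: one full shared scan every 12 h
concurrent_scans = 2       # assumed number of overlapping shared scans

bytes_per_tb = 1e12
throughput_mb_s = (data_per_node_tb * bytes_per_tb * concurrent_scans
                   / (scan_time_hours * 3600) / 1e6)
print(f"required sequential read throughput ~ {throughput_mb_s:.0f} MB/s per node")
```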

We’ve now published the technote: https://dmtn-032.lsst.io.

Hi @jalt,

FYI, we are now testing CEPH with Qserv, but we see ~20% iowait when doing a full scan of the Source tables (~1 TB on 60 disks). In the Qserv VMs, the data disk is mounted using ext4. We are around 2-3x faster than at IN2P3 (10 RAID6 disks), where we also see around 20% iowait. So this does not look so bad for CEPH, and we’ll try to reduce the number of disks.

On your side, do you have any ideas on how to optimize this setup? Should we test other distributed file systems that might be better suited for Qserv (like GlusterFS or RozoFS)? Remember that we would like to optimize massively parallel full scans of large MyISAM files (i.e. ~200 GB).

Let me paraphrase because I want to be sure I am following correctly.

@ IN2P3: docker instances access the CEPH RESTful interface, experiencing 20% iowait
@ NCSA: docker instances access ext4 on RAID 8+2 and are 2-3x faster

And in both cases, the object/file size is ~200GB. Is that an accurate summary?

Sorry I wasn’t clear enough

  • We are comparing the CC-IN2P3 infrastructure, which is bare-metal, and the Galactica infrastructure, which is based on OpenStack + CEPH.
  • On both infrastructures, our tests launch SQL queries which sequentially read large MyISAM files (~2.5 GB currently, expected to be ~200 GB in LSST DR1).
  • At CC-IN2P3: docker instances running on bare metal access 10 RAID6 disks, experiencing 20% iowait.
  • At Galactica: docker instances running on OpenStack VMs access a CEPH RADOS Block Device, also experiencing 20% iowait. It is 2-3x faster than at CC-IN2P3.
  • At Galactica, however, only the data for one worker node is read (~1.5 TB of MyISAM files), sharded across 60 disks. We aim to reduce the number of disks while improving performance, in order to design a low-granularity CEPH storage layout (n Qserv nodes for m OSDs), so that we can scale up linearly to hundreds of Qserv nodes.

Would you have any suggestions or ideas to help us achieve this goal?

If you are concerned about the 20% iowait, benchmark the storage (dd, xdd, ior) for comparison to the rates you are seeing with the application. If you get better rates during the benchmarking, then it is likely an application tuning issue. Galactica may be a noisy platform, so run the test several times for consistency. If the RAID6 has multiple readers, you may throw off performance due to contention (head seek times). Make sure the tests match the application’s access pattern.
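
As a complement to dd/xdd/ior, here is a minimal sketch of a benchmark that mimics the application's access pattern (a single stream reading a large MyISAM data file sequentially); the file path and block size are placeholders.

```python
# Minimal sequential-read benchmark sketch mimicking a MyISAM full scan.
# The path and block size are placeholders; run it several times and on
# files larger than RAM (or drop caches) so the page cache does not
# inflate the results on a shared platform.
import time

PATH = "/qserv/data/mydb/Source_1234.MYD"   # placeholder file path
BLOCK = 1 << 20                              # 1 MiB reads, roughly scan-sized

def sequential_read_mb_s(path: str, block: int = BLOCK) -> float:
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:
        while True:
            buf = f.read(block)
            if not buf:
                break
            total += len(buf)
    elapsed = time.monotonic() - start
    return total / elapsed / 1e6

if __name__ == "__main__":
    print(f"{sequential_read_mb_s(PATH):.0f} MB/s sequential read from {PATH}")
```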

Getting perfect performance from 60 disks is difficult due to bottlenecks and layout on disk. I assume these are 2.5" drives, so maybe a best case of 80 MB/s × 60 = 4.8 GB/s (ish), which exceeds your 10 Gb interfaces. The ideal configuration may be hyper-converged, to eliminate bottlenecks, and use a storage system (CEPH, Gluster, Quobyte) that can recover on failure.
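
Spelling that arithmetic out (the per-disk rate is an assumed best case for 2.5" drives):

```python
# Quick arithmetic behind the 60-disk estimate; the per-disk rate is an
# assumed best case for 2.5" drives.
per_disk_mb_s = 80
n_disks = 60
nic_gbit_s = 10

aggregate_gb_s = per_disk_mb_s * n_disks / 1000           # ~4.8 GB/s best case
nic_gb_s = nic_gbit_s / 8                                  # ~1.25 GB/s ceiling
disks_to_saturate_nic = nic_gb_s * 1000 / per_disk_mb_s    # ~16 disks

print(f"aggregate disk bandwidth ~ {aggregate_gb_s:.1f} GB/s")
print(f"10 GbE ceiling ~ {nic_gb_s:.2f} GB/s (~{disks_to_saturate_nic:.0f} disks already saturate it)")
```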