Planning for verification data set use cases

jalt · September 17, 2015, 11:14pm

As part of the FY16 hardware purchase plan, we have been planning for a significant increase in storage to accommodate the raw data as well as multiple versions of the output. We are thinking somewhere in the vicinity of 2PB. Since there is also a need for an increased number of compute nodes for the verification runs (we are thinking 60ish nodes), the file system of choice for us at this time is GPFS. This is all hardware that, if approved, wouldn’t land until probably the first of the year. When it does land, we would deploy something of this size into NPCF.

I spoke briefly with @jbosch about RFC-95 and how it relates to our plans. I have also seen (somewhat) the Git LFS use cases from @frossie, @jmatt, and @josh. But I’m not sure what else is lingering out there.

So please use this topic to discuss data set use cases. What I see so far:

It will be available to the verification cluster (new hardware).
It needs to be available to the development systems (current hardware, upgraded in FY15).
There may be a need for it directly within VMs.
Git LFS … this one is still underdefined.

Or … do we satisfy all/most use cases with a large batch system plus OpenStack?

jbosch · September 18, 2015, 3:13pm

If we want to use OpenStack VMs beyond CI (I’m not sure what everyone’s plans are in that area, but I’ve gotten that impression), then I think they do need to have access to this storage.

RHL · September 19, 2015, 1:59pm

I don’t know exactly what this means, but we will need command-line access (including ipython/jupyter notebooks) to this data as well as batch.

jmatt · September 21, 2015, 2:41am

Past OpenStack implementations that I’ve been a part of have used the same technology to back the glance (image), swift (object) and cinder (volume) services. I think there are many advantages to this approach. So I’d expect we’d use GPFS to back all of these data stores. But I’m new and don’t know enough specifics, so for now I can only go by what has worked in my past experience. I’ll definitely keep an open mind.

I expect there to be a need for this based on what I’ve discussed with @frossie and @josh. It’s been extremely useful in past scientific clouds. I think it makes sense as long as we have a simple, well defined set of rules on the data. In my opinion, there needs to be consistency between who has access to VMs and the data available from that VM. As long as that requirement is met, it’s good.

The current server implementation of git-lfs requires an S3 API. Swift already has an S3 API and it should work. I could implement our own git-lfs server and back it differently. But everything is working with the current implementation, so no reason to do additional work given the existing S3 middleware for Swift.

jalt · September 21, 2015, 3:33pm

I am not familiar with ipython/jupyter. Is this something you are using at NCSA now? From Nebula? From the development hardware?

jalt · September 21, 2015, 3:46pm

Just being careful here in our descriptions. GPFS is not a backing store for the OpenStack services. GPFS would be an LSST-specific ‘service’ independent of the NCSA center-wide Nebula installation. We could use GPFS to hold additional copies of data that is in swift or cinder if we chose, but we would need to implement that at our project level.

I agree; I certainly do not want to create work but I do want to make sure we support all uses. The only reason I bring up git-lfs is that it was mentioned in the same context as the verification data sets. Is there a specific subset of data sets that will reside in git-lfs? Are you intending this to version computed data? Who contributes to the data in your git-lfs store? Who consumes that data? That’s not yet clear to me.

jalt · September 21, 2015, 5:43pm

Perhaps this will get us on the same page.

At this time, we have three distinct user data stores: (1) NFS systems attached to the LSST development hardware, (2) Nebula’s cinder services, and (3) storage local to VMware instances. We also have some storage on the GPFS Condo and Nearline but, as far as I can tell, we are using these for administrative purposes (a secondary store for certain data, e.g. data challenges). In the future, we plan to add swift storage as well.

From what I can tell, developers seem to be somewhat content (or perhaps not vocally discontent) using NFS storage for their development. Some developers may be using VMware instances, some may be moving to OpenStack; the choice is theirs. In either of the VM scenarios, the user must replicate portions of the datasets from NFS (or externally if it is new data) to the VMs.

@nidever is about to start (using this timing phrase loosely, most likely meaning at some point in 2016) larger integration testing. This will necessitate a larger compute and storage infrastructure than we have now. We are planning to deploy this new infrastructure in NPCF (current dev and OpenStack are in NCSA lab). This is the planned location of the verification data sets though it need not be.

@frossie and her group are making the largest strides with cloud technologies. This is really a specific case of the VM usage mentioned above. Portions of the data sets must be copied as needed.

There is also a copy of Pan-STARRS within a new Qserv environment for SUI integration but that is not necessary for this discussion.

As a side note, I would be very interested in discussing whether object stores could replace GPFS for LSST use. My gut feeling is, at this time, this would have a negative impact on productivity since current software expects a POSIX interface. Correct me as needed.

So … straw man:

New data sets land in GPFS attached to the new integration cluster.
(1) integration testing has typical ‘local’ POSIX access from new cluster
(2) portions of raw data sets are copied to NFS stores for developer use on the developer hardware
(3) VM users will copy data as needed from the NFS stores as they do now

As a consequence, the organization of data sets on the integration cluster can be independent of the organization of data on the development systems. No computed data sets are shared between the integration cluster and the development systems.

Correct as needed.

jbosch · September 21, 2015, 7:19pm

Is connecting the development systems to GPFS a possibility?

I think most developers would be happy to move from NFS to GPFS, if accessing GPFS from the developer machines was a possibility - I think we’d prefer to have just one volume and have everything in it, rather than try to remember what’s in the various /lsstN paths and juggle what’s in them to fit their individual sizes.

I also think that we’d much prefer to have direct access (even somewhat slower access) to the full datasets on GPFS from the development systems rather than have to copy things back and forth. In fact, I think if we don’t do this, it’s likely that most developers will just try to use the new cluster for everything and the development systems may largely go unused.

RHL · September 21, 2015, 10:22pm

IPython notebooks are pretty standard (Google will help!). We’ll need the ability to start a notebook server on a machine that can see the data and connect to it over https. This can be done via ssh tunnels, but a signed cert from NCSA would be better (though the https access is a separate question).

RHL · September 21, 2015, 10:25pm

I don’t think users should need to understand your various layers of file access unless there is a compelling reason. If there is, could you explain the tradeoffs here?

ktl · September 21, 2015, 10:53pm

The issue is whether we need/want to provide global POSIX access from all computing platforms.

There are three computing platforms (current development cluster, OpenStack virtual machines, and to-be-purchased integration cluster) and five storage mechanisms of which only three really have the potential to be globally-visible:

NFS (exists, POSIX, accessible from development cluster, could be global but hard to support)
GPFS (to be bought, POSIX, accessible from integration cluster, could be global, probably with some effort)
Swift, the OpenStack object store (to be installed, not POSIX, global but needs Butler work)
Cinder, the OpenStack block store (exists, POSIX, accessible from OpenStack VMs, probably hard to make it global)
Local storage (on VMs or cluster nodes, POSIX, accessible locally only, hard to make global)
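
The list above can be written down as a small lookup table, which makes it easier to ask "which stores could serve a given need?". This is only an illustrative sketch: the `Store` class, its field names, and the `candidates` helper are hypothetical, and the True/False values simply transcribe the list (the three stores marked as potentially global are NFS, GPFS, and Swift).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    """One storage mechanism from the list above (field names are hypothetical)."""
    name: str
    posix: bool               # offers a POSIX file interface
    potentially_global: bool  # could plausibly be visible from every platform
    note: str                 # caveat taken from the list


STORES = [
    Store("NFS", True, True, "exists; could be global but hard to support"),
    Store("GPFS", True, True, "to be bought; global probably with some effort"),
    Store("Swift", False, True, "to be installed; global but needs Butler work"),
    Store("Cinder", True, False, "exists; probably hard to make global"),
    Store("Local", True, False, "VM or cluster-node disk; local access only"),
]


def candidates(stores, need_posix):
    """Return names of stores that could be global, optionally POSIX-only."""
    return [
        s.name
        for s in stores
        if s.potentially_global and (s.posix or not need_posix)
    ]
```

Calling `candidates(STORES, need_posix=True)` narrows the choice to the stores that could give global POSIX access; with `need_posix=False`, Swift is included as well.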

My understanding is that global POSIX access from OpenStack VMs to NFS and GPFS requires resolution of security issues (although I would hope that at least read-only mounts could be enforced and would not pose much of an issue – but they are of course insufficient for output repositories/reruns).

In lieu of global access, staging models (copying data from where it is to where it needs to be) could be implemented. This is likely to be the initial mechanism for supporting Swift with the Butler, for example, and it could be extended to other storage cases. This can be acceptable as long as the latency is small or can be overlapped, but it will otherwise be a user-visible impact.
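
A staging model in its simplest form might look like the sketch below. It is not how the Butler does or would do it; it only illustrates the idea of copying a file from a shared store to local scratch space on first use and reusing the copy afterwards. The `stage` function and the example paths are hypothetical, and the "is the copy still current?" check (same size, modification time not older) is an assumption chosen for brevity.

```python
import shutil
from pathlib import Path


def stage(source: Path, cache_dir: Path) -> Path:
    """Copy `source` into `cache_dir` unless a current copy is already there.

    Returns the path of the local copy. Freshness is judged by file size and
    modification time only, which is an assumption made for this sketch.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    local = cache_dir / source.name
    src_stat = source.stat()
    if local.exists():
        loc_stat = local.stat()
        same_size = loc_stat.st_size == src_stat.st_size
        not_older = loc_stat.st_mtime >= src_stat.st_mtime
        if same_size and not_older:
            return local  # reuse the staged copy; no transfer needed
    # copy2 preserves the modification time, so the check above works next time
    shutil.copy2(source, local)
    return local
```

A caller would do something like `local = stage(Path("/shared/raw/visit1.fits"), Path("/scratch/cache"))` (both paths are made up) and then read `local` as an ordinary POSIX file. The first call pays the copy latency; later calls return immediately, which is the "latency is small or can be overlapped" condition mentioned above.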

jalt · September 22, 2015, 1:52am

Yes. Getting to the VMs is a little trickier.

jalt · September 22, 2015, 1:57am

GPFS in user-owned VMs is not likely to be an option. NFS/Samba to VMs could be an option.

When is Swift support planned?

ktl · September 22, 2015, 2:05am

It is not in the current baseline, so the formal answer is “currently, never”, although it is something I would like to investigate, if we have the time. To be useful, I would think such investigations should occur by the end of 2016. I can see ways of implementing the staging model in the current Butler (with required registry). A non-staging implementation would almost certainly need a New Butler framework to be reasonable.