I’ve seen a number of mutterings on HipChat about how badly broken the shared stack is on lsst-dev; some excerpts:
@ktl:
The three big problems on lsst-dev are: 1) shared stacks are too big and slow, 2) eups tags (besides bNNNN) not automatically applied to ~lsstsw versions, 3) interesting tags not automatically copied to a usable smaller stack.
@jdswinbank (on the unavailability of IPython on lsst-dev):
Not sure whether I’m more shocked that our core development system is so broken or that, apparently, nobody noticed.
I’ve also seen complaints from @mwv, @rowen, and @merlin that I can’t be bothered to find just now. While validating DM-4692, it’s become apparent that the lack of a usable shared stack on a single development cluster is a real impediment to progress (though this goes well beyond DM-4692; that’s only the most recent reminder).
Well, actually, what’s been a problem is that we don’t have a solution for this situation:
- Developer A builds a stack containing some branches he/she is working on, runs some test data through that stack, generating an output repository somewhere, and discovers a problem that Developer B might be able to help with, so…
- Developer B needs to look at the output repository, re-run the same pipelines with some additional changes to code or configuration, and be able to plot and display the results.
It’s crucial that Developer B not have to transfer any data between machines or compile any code beyond what they need to override relative to Developer A’s environment. Developer B is frequently busy and should be spending most of their time on something else, but with easy access to Developer A’s environment, they might be able to unblock Developer A quickly.
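One practical prerequisite for this workflow is that Developer B can actually read Developer A's output repository. Below is a minimal sketch of a check Developer A could run before asking for help, assuming a POSIX filesystem where both developers share a Unix group. The function name `group_inaccessible` is hypothetical (it is not part of EUPS or the LSST stack), and the check only inspects permission bits; it does not verify group ownership.

```python
import os
import stat


def group_inaccessible(root):
    """Return paths under `root` that group members could not read or enter.

    Illustrative helper only: it looks at group permission bits and
    ignores which group actually owns each path.
    """
    bad = []
    # A directory needs group-read and group-execute to be listed and entered.
    dir_bits = stat.S_IRGRP | stat.S_IXGRP
    for dirpath, dirnames, filenames in os.walk(root):
        if os.lstat(dirpath).st_mode & dir_bits != dir_bits:
            bad.append(dirpath)
        for name in filenames:
            path = os.path.join(dirpath, name)
            # lstat, so a dangling symlink does not raise an error.
            if not os.lstat(path).st_mode & stat.S_IRGRP:
                bad.append(path)
    return sorted(bad)
```

Calling `group_inaccessible("/path/to/output/repo")` (a placeholder path) returns an empty list when everything is group-readable; anything it does return is a candidate for `chmod g+rX`.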
The solution we’ve used on the HSC side (and in the now-somewhat-distant past on shared LSST machines) is to have a shared stack on a single beefy machine, like the one we’ve tried to put together on lsst-dev. This has worked quite well for us, after some initial overhead and training: everybody has to be vigilant about group access via chmod and suid, know their way around EUPS tags, and set up some way to do remote display via (probably) IPython and ds9. I think the key differences between the situation on tiger and the situation on lsst-dev are:
- @price devotes a lot of time to serving as HSC release master, vetting, publishing, and installing new stacks. That lets us make real releases with meaningful (for developers, not just managers) version numbers on timescales determined to be useful by a human (often once every few months, but sometimes as often as once a week).
- We use EUPS versions (with umbrella packages), not tags, to designate releases - so the number of active tags in the shared stack is extremely low.
- We rigorously control more of our third-party packages via EUPS, as LSST used to do.
I’m not sure all of these are important to the success of this development model (especially the last point), but I think we’d need to take some of these steps if we want to make lsst-dev similarly usable. And I think we really need to do that. Or…
…we need to find some other way to support the workflow described above. I know a lot of people are more excited about Nebula than about shared stack management and EUPS reimplementation, and I got the impression at one point that Nebula might also provide a solution to this problem. If that’s the case, I’d love to hear more about how it might work. Either way, we need to get everyone through the overhead/training process for whichever approach we adopt, so the next time Developer A gets stuck, he/she hasn’t already wasted effort by working in the wrong space.