I don’t think we want completely independent datasets living under the data repo — there’s too much potential for confusion. Which directories in the data repo are part of the data repo and which are independent? Hence I suggest we keep everything separate:
- Data root repo: /datasets/decam/repo (or similar); will contain /datasets/decam/repo/rerun for processed results.
- Raw data (uningested into repo): /datasets/decam/raw
- Preprocessed data: /datasets/decam/preprocessed (or .../cp for Community Pipeline?)
Perhaps we don’t understand each other because I’m assuming that we want to keep the raw data independent of the data repo, rather than just ingesting it into the repo. I think that’s desirable because the raw data is sacrosanct, while the data repo can be allowed to evolve.
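
To make the separation concrete, here is a rough sketch of the layout above as paths, with a check that the raw and preprocessed directories sit outside the repo. The variable names and the `is_inside` helper are made up for illustration and are not part of any existing tooling; the paths are the "(or similar)" ones suggested above.

```python
from pathlib import PurePosixPath

# Proposed layout from the list above; names are illustrative only.
ROOT = PurePosixPath("/datasets/decam")
REPO = ROOT / "repo"                  # data root repo
RERUN = REPO / "rerun"                # processed results live inside the repo
RAW = ROOT / "raw"                    # raw data, not ingested into the repo
PREPROCESSED = ROOT / "preprocessed"  # preprocessed data


def is_inside(path, parent):
    """Return True if `path` is `parent` itself or lies beneath it."""
    return path == parent or parent in path.parents


# Directories that are independent of the data repo under this layout.
independent = [p for p in (RAW, PREPROCESSED) if not is_inside(p, REPO)]
```

With this layout, anything under the repo path belongs to the repo, and anything outside it is independent, so the question of which directories are which does not come up.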