Wednesday, November 03, 2010

Data and backups

I don't talk too much on here about the university service stuff that I do - frankly, much of it wouldn't be very interesting to most of my readers. However, this year I'm chairing Rice University's Committee on Research, and we're discussing an issue that many of you may care about: data management and preservation. Generally, principal investigators are assumed to be "responsible custodians" of data taken during research. Note that "data" can mean many things in this context - see here, for example. US federal agencies that sponsor research typically expect PIs to hold on to their data for several years following the conclusion of a project, and to make that data available if requested. In fact, the university is legally responsible for ensuring that the data is retained.

There are many issues that crop up here, but the particular one on which I'd like some feedback is university storage of electronic data. If you're at a university, does your institution provide electronic (or physical, for that matter) storage space for the retention of research data? Do they charge the investigators for that storage? What kind of storage is it, and is the transfer of data from, say, a PI's lab to that storage automated? I'd be very interested in hearing either success stories or horror stories about university or institutional data management.

4 comments:

  1. We're required to keep all our research data (electronic records plus physical notebooks) for five years after the end of a grant. Physical storage space is not provided apart from the PI's offices; electronic storage space in our case is provided by the school.

  2. Talking about the system at two German institutions: Data are supposed to be stored for 10 years. The institutions provide the infrastructure to keep the 0s and 1s somewhere (currently on tape), but no effort is required or made to ensure that one will be able to read the file format of said 0s and 1s in 10 years.

    At my current place, there is a central server and we are supposed to dump our data there. Say, once you've published a paper, you might collect the data that went in, compress the collection and copy it to that server. At another place, this archiving is done together with the backup of the home directories, so all stuff is kept for 10 years.

    At my PhD institution, individual scientists, including grad students, were theoretically responsible for the 10-year storage (so the mandate cannot practically be enforced, because students and postdocs will be gone too soon), and the university provided no infrastructure or support. We left them a bunch of CDs, which I am pretty sure will become unreadable long before the 10 years are over, but, eh, we formally fulfilled the requirement.
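
The "collect the data, compress the collection and copy it to the server" step described in this comment could be scripted roughly as follows. This is only a minimal sketch, not what either institution actually runs: the function names `archive_dataset` and `sha256_of`, the directory layout, and the idea of writing a SHA-256 manifest next to the archive are all assumptions made for illustration. The manifest lets someone check later that the bits survived; it does nothing about whether the file format will still be readable in 10 years.

```python
import hashlib
import tarfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_dataset(data_dir: Path, archive_dir: Path) -> Path:
    """Compress data_dir into archive_dir and write a checksum manifest beside it.

    archive_dir stands in for the central server's mount point (hypothetical);
    it should not be located inside data_dir.
    """
    data_dir = Path(data_dir)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    archive_path = archive_dir / f"{data_dir.name}.tar.gz"
    manifest_path = archive_dir / f"{data_dir.name}.sha256"

    # One "digest  relative/path" line per file, in a stable order.
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    with manifest_path.open("w", encoding="utf-8") as manifest:
        for path in files:
            relative = path.relative_to(data_dir).as_posix()
            manifest.write(f"{sha256_of(path)}  {relative}\n")

    # Gzip-compressed tarball with the directory name as the top-level folder.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(data_dir, arcname=data_dir.name)
    return archive_path
```

A call such as `archive_dataset(Path("paper2010"), Path("/mnt/archive"))` (both paths hypothetical) would leave `paper2010.tar.gz` and `paper2010.sha256` in the archive location.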

  3. Charu, 8:38 PM

    Some of my friends from MPI Stuttgart tell me they have an automated backup system, where every instrument computer is automatically backed up every day (? not sure how often), and dated versions are maintained. That way, you can not only track every bit of data you ever took, but you also CANNOT manipulate any of it. Some say it was because of the Schön scandal that they do this.

  4. At the Oxford Phonetics lab, the lab runs a storage area network with a few terabytes of disks. Basically, each project gets a folder, and all work should get put in the shared storage. We lean on people to leave README files to explain what they did, to document their scripts and leave comments saying where the data came from and where it goes to. And we encourage people to do as much as possible with scripts, so that they automatically leave some documentation.

    Then, when a project is over, the data stays on the spinning disks, eventually being set to read-only.

    It's OK. Some information gets lost one way or another, but we're able to go back and find things when we need them, at least most of the time.

    The University is starting a service to permanently maintain data sets and scripts. That may help some.
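
The "eventually being set to read-only" step from this comment could look something like the sketch below. It is an illustration under stated assumptions, not the lab's actual procedure: it assumes a POSIX filesystem where permission bits are what enforce read-only access, and the names `freeze_project` and `WRITE_BITS` are made up here.

```python
import stat
from pathlib import Path

# Owner, group, and other write-permission bits.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def freeze_project(project_dir: Path) -> int:
    """Clear write permission on project_dir and everything under it.

    Returns the number of entries whose mode was changed. Symbolic links
    are skipped so that nothing outside the project folder is touched.
    """
    project_dir = Path(project_dir)
    # Collect the entries first, then change modes.
    entries = [project_dir, *project_dir.rglob("*")]
    changed = 0
    for entry in entries:
        if entry.is_symlink():
            continue
        mode = entry.stat().st_mode
        new_mode = mode & ~WRITE_BITS
        if new_mode != mode:
            entry.chmod(new_mode)
            changed += 1
    return changed
```

Read and execute bits are left alone, so the README files and scripts stay browsable after the project is frozen. The file owner can still restore write access, so this guards against accidents rather than deliberate tampering.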
