Saturday, April 01, 2023

The problems and opportunities of data

We live in a world of "big data", and this presents a number of challenges for how we handle this at research universities.  Until relatively recently, the domain of huge volume/huge throughput scientific data was chiefly that of the nuclear/particle physics community and then the astronomy community.  The particle physicists in particular have been pioneers in how they handle enormous petabyte quantities of data at crazy high rates.

Advances in technology, though, now make it possible for small university research groups to acquire terabytes of data in an afternoon, thanks to high-speed/high-resolution video recording, hyperspectral imaging, many-GHz bandwidths, etc.  Where things get tricky is that this new volume and pace place demands that researchers, universities, funding agencies, publishers, etc. are generally not equipped to handle in terms of data stewardship.

I've written before about the responsibilities of various people regarding data stewardship.  Data from sponsored research at universities is "owned" by the universities, because they are held responsible by the funding agencies (e.g. NSF, DOE, DOD) for, among other things, maintaining accessible copies for years after the end of the funding agreements.  The enormous volumes of data that can now be generated are problematic.  With the exception of a small number of universities that host supercomputing centers, most academic institutions just do not have the enterprise-class storage capacity (either locally or contracted via cloud storage services) to meet the ever-growing demand.  Ordering a bunch of 8 TB external hard drives and keeping them in a growing stack on your shelves is not a scalable, sustainable plan.  Universities all over are finding out that providers can't really offer "unlimited" capacity without passing costs along to the research institutions.  Agencies are also often not fans of proposals that budget significantly for long-term data retention.  It's not clear that anyone has a long-term solution to this that really meets everyone's needs.  Repositories like Zenodo are great, but someone, somewhere, actually has to pay the costs of operating them.

Further, there is a thriving movement toward open science (with data sharing) and FAIR data principles - making sure that data is findable, accessible, interoperable, and reusable.  In condensed matter physics, this is exemplified by the Materials Genome Initiative and its updated strategic plan.  There is a belief that having this enormous amount of information available (and properly indexed with metadata so that it can be analyzed and used intelligently), combined with machine learning and AI, will lead to accelerated progress in research, design, and discoveries.
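
To make the metadata part of this concrete, here is a toy sketch in Python of the kind of machine-readable record that would need to travel alongside each raw data file for it to be findable and reusable; the field names, file names, and identifiers are purely illustrative, not any particular repository's schema.

import json

# Hypothetical metadata record accompanying a raw measurement file.
# Field names and values are illustrative, not a specific repository schema.
record = {
    "title": "Hyperspectral imaging run, sample A",
    "creators": ["Example Lab, Example University"],
    "date": "2023-04-01",
    "instrument": "hyperspectral camera (example)",
    "file": "run_0042.h5",
    "license": "CC-BY-4.0",
    "keywords": ["condensed matter", "hyperspectral", "raw data"],
    "related_identifiers": [{"doi": "10.xxxx/placeholder"}],
}

# Writing the record as JSON next to the data file keeps it machine-readable
# for later indexing, search, or ML-driven analysis.
with open("run_0042.metadata.json", "w") as f:
    json.dump(record, f, indent=2)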

At the same time, in the US there are increasing concerns about data security, and regulatory action on this front is coming.  University research administrators are looking very hard at all of this, as is the Council on Government Relations, both because of chilling effects across the community and to push to make sure that Congress and the agencies don't saddle universities with mutually incompatible, contradictory policies and requirements.

Meeting all of these needs is going to be a challenge for a long time to come.  If any readers have particular examples of how to handle very-large-volume data retention, I'd appreciate hearing about them in the comments.

5 comments:

Anonymous said...

Very good points!

For the near future, efforts such as the Materials Data Facility (Ben Blaiszik) can offer a solution. Given that data volumes scale exponentially and DOE budgets do not, I expect that at some point commercial players such as Amazon Web Services (AWS) or Microsoft will develop commercial solutions.

Then the ideal output of research can be something like a paper with code and data (like the one proposed by Maxim Ziatdinov 5 years ago using the then-available Google Colab), but with stored data objects and a reference manager. Given the number of new features being introduced on the arXiv recently, I will not be surprised if they implement something like this shortly.

These will of course have to be complemented by intragroup and facility-level cloud ecosystems, but that is a different story (internal operation as opposed to dissemination).

https://colab.research.google.com/github/jupyter-papers/mock-paper/blob/master/FerroicBlocks_mockup_paper_v3a.ipynb

Unknown said...

I am not sure, based on your description, why current cloud storage services won't suffice. Object storage like AWS S3 and its cold-storage variant, Glacier, seems like the most obvious choice for optimizing costs. Of course, graphical and/or API interfaces need to be built to allow users to deliver and retrieve data (it seems like that is what the MDF mentioned in the earlier comment is attempting). Another big challenge is the networking requirements. Trying to move very large data volumes over the internet is typically a bad idea. While the cloud providers offer direct pipes, like AWS Direct Connect, you still need to get your data to a Direct Connect access point (typically found at large data centers). A science-wide service to collect and distribute data will be difficult unless universities and research labs can establish the necessary network infrastructure. This sounds like a decent project for a senior student studying computer systems and engineering.
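
For concreteness, here is a minimal sketch of that pattern using the boto3 Python SDK: upload a file to S3, then set a lifecycle rule that ages it into Glacier. The bucket name, prefix, and 90-day window are placeholders, not a recommendation.

import boto3

s3 = boto3.client("s3")

# Upload a raw data file to a (hypothetical) bucket under a "raw/" prefix.
s3.upload_file("run_0042.h5", "example-lab-data", "raw/run_0042.h5")

# Lifecycle rule: after 90 days (placeholder), transition objects under
# "raw/" to the Glacier storage class to reduce storage costs.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-lab-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)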

Douglas Natelson said...

Anon@2:58, sorry for my unclear wording. I have no doubt that AWS or equivalent could in principle serve in this role, but the cost structure is completely unclear for universities. I do know that previous “unlimited” storage packages for universities through vendors like Google, Box, and Microsoft have been renegotiated repeatedly, as it becomes clear that really *using* large capacity incurs further costs.

Unknown said...

Doug, it's Aaron T. Not sure why it's showing my comments as anonymous. I think a negotiated unlimited storage plan is probably going to cost more than normal per-GB pricing. The provider is going to pass the risk back to you in the negotiated price, and it also does not encourage optimized use of the different storage tiers. I think a more realistic, sustainable approach is to try to negotiate discounted rates on their storage services. Maybe you get a deeper discount on something like Glacier to encourage moving data to colder tiers of storage where appropriate. The challenge, then, is figuring out how the university does chargeback to each lab for the storage it uses. This is a fairly routine activity in the business world, so I don't mean to imply it's an insurmountable issue.
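
As a back-of-the-envelope illustration of the chargeback bookkeeping (the per-GB rates and usage numbers below are placeholders, not real list or negotiated prices):

# Illustrative monthly chargeback per lab across storage tiers.
# Rates ($/GB-month) and usage (GB) are placeholders, not real prices.
RATES = {"standard": 0.023, "infrequent": 0.0125, "glacier": 0.004}

usage = {
    "lab_a": {"standard": 2_000, "glacier": 50_000},
    "lab_b": {"standard": 500, "infrequent": 10_000},
}

for lab, tiers in usage.items():
    cost = sum(RATES[tier] * gb for tier, gb in tiers.items())
    print(f"{lab}: ${cost:,.2f}/month")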

Anonymous said...

There is also something to be said about maintaining future-proof analysis code. My own Python script written for a paper 10 years ago no longer works out of the box. I was a naive graduate student who didn't export all of the code dependencies, and it was a serious undertaking to get the code working again when someone asked for it. The fitting library I used also changed over time, so the fit parameters and error bars aren't exactly the same either. My case wasn't too bad, but I can imagine code robustness being more important when exact values matter.
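
One lightweight habit that would have helped (just a sketch, not a complete solution; the package names are examples of a typical fitting workflow) is to have the analysis script record the exact versions of its dependencies next to its output:

# Record the exact versions of the libraries an analysis used, so the
# environment can be reconstructed years later. Package names are examples.
import json
import sys
from importlib.metadata import version

packages = ["numpy", "scipy", "lmfit"]
env = {"python": sys.version, **{p: version(p) for p in packages}}

with open("analysis_environment.json", "w") as f:
    json.dump(env, f, indent=2)

# A fuller solution is to pin everything (e.g., "pip freeze" or a conda
# environment file) and archive it with the paper's code and data.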