10 years ago I worked at a university that had a couple of people doing research on LHC data. I forget the specifics, but there is a global tiered system for replication of data coming from the LHC so that researchers all around the world can access it.
I probably don’t have it exactly right, but as I recall, raw data is replicated from the LHC to two or three other locations (tier 1). The raw data contains a lot of uninteresting material (think of a DVR/VCR recording a blank TV image), so those tier 1 locations analyze the data and remove all of that unneeded data. This filtered version of the data is then replicated to a dozen or so tier 2 locations. Lots of researchers have access to HPC clusters at those tier 2 locations in order to analyze that data. I believe tier 2 could even request chunks of data from tier 1 that weren’t originally replicated, in the event a researcher had a hunch there might actually be something interesting in the “blank” data that had originally been scrubbed.
The university where I worked had its own HPC cluster that was considered tier 3. It could replicate chunks of data from tier 2 on demand in order to analyze them locally. In practice, our researchers would use tier 2 to do some high-level analysis, and when they found something interesting they would use the tier 3 cluster to do more detailed analysis. This way they could throw a significant amount of our university’s HPC resources at targeted data rather than competing with hundreds of other researchers all trying to do the same thing on the tier 2 clusters.
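For anyone curious, here’s a rough sketch of that fetch-on-demand hierarchy in miniature. This is just an illustration of the idea that lower tiers pull chunks from the tier above and cache them locally; it has nothing to do with the actual WLCG/Rucio tooling, and all the class names and chunk IDs are made up:

```python
# Toy model of tiered on-demand replication (NOT real WLCG software).
from typing import Optional


class TierSite:
    def __init__(self, name: str, parent: Optional["TierSite"] = None):
        self.name = name
        self.parent = parent                      # the tier above us (None at the top)
        self.local_store: dict[str, bytes] = {}   # local working copies

    def fetch(self, chunk_id: str) -> bytes:
        """Return a data chunk, pulling it from the tier above
        if we don't already hold a local working copy."""
        if chunk_id in self.local_store:
            return self.local_store[chunk_id]
        if self.parent is None:
            raise KeyError(f"{chunk_id} not found anywhere in the hierarchy")
        data = self.parent.fetch(chunk_id)   # recurse up toward tier 1
        self.local_store[chunk_id] = data    # keep a local working copy
        return data


# Tier 1 holds the filtered experiment data; tiers 2 and 3 start empty.
tier1 = TierSite("T1-archive")
tier1.local_store["run-4242/chunk-007"] = b"...detector events..."
tier2 = TierSite("T2-regional", parent=tier1)
tier3 = TierSite("T3-university", parent=tier2)

# A tier 3 analysis job pulls the chunk on demand; it ends up cached
# at every tier it passed through on the way down.
tier3.fetch("run-4242/chunk-007")
```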
T1 sites typically hold backed-up replicas of the experiment data to work with; T2 and T3 sites then get local working copies that aren’t backed up and are only kept as long as they’re needed (as long as the delete cycle works).
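The “delete cycle” part might look something like this in miniature: working copies carry a last-used timestamp and a periodic sweep drops anything idle past a retention window. Again, purely an assumed sketch; the function names and the 30-day window are made up, and nothing here is how the real sites implement it:

```python
# Toy model of a T2/T3 delete cycle for unbacked-up working copies.
import time
from typing import Optional

RETENTION_SECONDS = 30 * 24 * 3600   # assumed 30-day retention window

# chunk_id -> (data, last-accessed timestamp)
working_copies: dict[str, tuple[bytes, float]] = {}

def touch(chunk_id: str, data: bytes) -> None:
    """Record that an analysis job just used this chunk."""
    working_copies[chunk_id] = (data, time.time())

def delete_cycle(now: Optional[float] = None) -> None:
    """Periodic sweep: drop working copies idle past the retention window.
    Safe to delete because the authoritative, backed-up replica lives at T1."""
    now = time.time() if now is None else now
    expired = [cid for cid, (_, ts) in working_copies.items()
               if now - ts > RETENTION_SECONDS]
    for cid in expired:
        del working_copies[cid]
```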
Man, planning a 3-2-1 backup strategy for CERN must be a nightmare!
Imagine the offsite storage
A small T3 is likely around 1PB of storage.