Dataset policy & guidelines

Michael Wanek

HPC Engineer

Some context on /gscratch/data

The Klone Data Commons is our cluster-wide, shared dataset storage located at /gscratch/data.

Historically, we've handled requests to add datasets to the Commons on a case-by-case basis. The number of these requests has grown over the past few weeks, so we want to make the guidelines clear. That's the purpose of this blog post, along with the new Data Commons documentation section here.

Requirements

For a dataset to be approved, all of the following criteria must be met:

  1. The requester must create a new documentation page describing the dataset and submit it as a pull request. The page must include:

    • A full description of the dataset, including publication date, licenses, etc.
    • Instructions for using the dataset, e.g. any required modules, the structure of the data, etc.
    • Contact information for dataset maintainers (typically, the group/user submitting the request) and the intended audience or discipline of the data.
  2. The requester must name at least 3 separate groups/labs and 3 specific users who will be using the data.

  3. The requester must email help@uw.edu with:

    • A link to the documentation PR.
    • The following people CC'd: the lab/group owners and all initial users. This will be at least 6 people.
  4. Every person included in the request (again, at least 6) must individually attest that the dataset has been vetted: that, to the best of their knowledge, it contains no material whose download, storage, or use violates any State or Federal law or the rules and policies of UW, including intellectual property laws.
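
If it helps to sanity-check a request before emailing it, the numeric parts of the criteria above can be sketched as a small self-check. This is only an illustration: the DatasetRequest record, its field names, and the unmet_requirements helper are hypothetical and not part of any Hyak tooling, and the URL and names are placeholders. It also cannot check the documentation content or the vetting itself.

```python
from dataclasses import dataclass, field

MIN_GROUPS = 3  # minimum separate groups/labs named in the request
MIN_USERS = 3   # minimum specific users named in the request


@dataclass
class DatasetRequest:
    """Hypothetical record of a request; not an official schema."""
    docs_pr_url: str                  # link to the documentation PR
    group_owners: dict[str, str]      # group/lab name -> owner
    users: set[str]                   # initial users of the dataset
    attested: set[str] = field(default_factory=set)  # people who have attested

    def people(self) -> set[str]:
        """Everyone included in the request: group owners plus initial users."""
        return set(self.group_owners.values()) | self.users


def unmet_requirements(req: DatasetRequest) -> list[str]:
    """Return a description of each criterion the request does not yet meet."""
    problems = []
    if not req.docs_pr_url:
        problems.append("missing link to the documentation PR")
    if len(req.group_owners) < MIN_GROUPS:
        problems.append(f"fewer than {MIN_GROUPS} groups/labs named")
    if len(req.users) < MIN_USERS:
        problems.append(f"fewer than {MIN_USERS} users named")
    missing = sorted(req.people() - req.attested)
    if missing:
        problems.append("no attestation from: " + ", ".join(missing))
    return problems


# Placeholder values for illustration only.
example = DatasetRequest(
    docs_pr_url="https://example.org/docs/pull/0",
    group_owners={"lab-a": "owner_a", "lab-b": "owner_b", "lab-c": "owner_c"},
    users={"user_1", "user_2", "user_3"},
    attested={"owner_a", "owner_b", "owner_c", "user_1", "user_2"},
)
print(unmet_requirements(example))
```

An empty list means the countable criteria are satisfied; the request still needs review by the team.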

Questions?

We hope this clears up our expectations going forward. If you have any questions, please reach out to the team by emailing help@uw.edu with "Hyak" somewhere in the subject or body. Thanks!