The data packages

A data package is nothing more, and nothing less, than a set of files (or even a single file).

Identifying

bgdata identifies each data package with a 4-level structure:

  1. project
  2. dataset
  3. version
  4. build

project and dataset are the main identifiers of a data package. In some cases, a package does not belong to any particular project. For such cases, we use _ as the project name. Some of the bgdata commands will automatically set the project to _ if you do not provide it.

The version is intended to distinguish between incompatible versions of the package, e.g. when you remove some data columns from your files.

The build is an identifier that distinguishes between compatible versions of the same package. Typically, we use the date on which we create the package as the build identifier. However, the build can be anything (as long as it does not start with an alphabetic character), so you might find other builds.
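
The rule on build names can be checked mechanically. The following is a minimal Python sketch, not part of bgdata: is_valid_build is a hypothetical helper name, and it encodes only the constraint stated above, plus the assumption that an empty string is not a usable build.

```python
def is_valid_build(build: str) -> bool:
    """Return True if `build` satisfies the naming rule described above.

    Hypothetical helper, not a bgdata function. A build may be any string
    that does not start with an alphabetic character. Rejecting the empty
    string is an extra assumption made for this sketch.
    """
    return bool(build) and not build[0].isalpha()


print(is_valid_build("20150724"))  # date-style build identifier
print(is_valid_build("master"))    # starts with a letter, so not a build
```

A date such as 20150724 passes the check, while a name such as master does not.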

For example, we use the human genome in many of our projects. There are several versions of the human genome available at http://hgdownload.cse.ucsc.edu/downloads.html#human . We downloaded the data of interest for the hg19 version and created a package using _ as project, genomereference as dataset, hg19 as version and 20150724 as build. Then, we can request this package as:

bgdata get _/genomereference/hg19?20150724
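
To make the four levels of such a query explicit, here is a small Python sketch that splits it into its parts. It is an illustration only: PackageQuery and parse_query are hypothetical names, bgdata's own parsing may differ, and the fallback to _ for a two-part path is an assumption based on the note above about commands that set the project automatically.

```python
from typing import NamedTuple, Optional


class PackageQuery(NamedTuple):
    project: str
    dataset: str
    version: str
    build_or_tag: Optional[str]  # None when the query has no "?..." part


def parse_query(query: str) -> PackageQuery:
    """Split 'project/dataset/version?build' into its four levels.

    Hypothetical helper for illustration; not bgdata's implementation.
    """
    # Everything after the first "?" is the build (or tag), if present.
    path, _sep, build_or_tag = query.partition("?")
    parts = path.split("/")
    if len(parts) == 2:
        # Assumption: a missing project falls back to "_".
        parts.insert(0, "_")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project/dataset/version, got {path!r}")
    project, dataset, version = parts
    return PackageQuery(project, dataset, version, build_or_tag or None)


print(parse_query("_/genomereference/hg19?20150724"))
```

For the query above, the four fields are _, genomereference, hg19 and 20150724.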

Tags

Remembering the build identifiers for all the packages can be painful, and you would need to change the queries in your scripts every time you want a newer build. To avoid this, bgdata supports the concept of tags.

A tag is a pointer to a particular build, and in several bgdata operations you can use a tag instead of a build. bgdata will resolve which build is associated with that tag and use that package.

The advantage of using a tag rather than a build is that, with the same query in your software, you get the most recent version of a particular package simply by keeping the tag up to date. E.g. following the example above, if we ask for the tag master we always get the most recent version:

bgdata get _/genomereference/hg19?master

provided that we keep our tag up to date.

In most cases, bgdata will use the tag master when you do not indicate the build or tag for a particular package.

Important

A tag works essentially as a pointer to a build for a particular project, dataset and version. This means that when asking for a tag you also need to indicate the other parameters.
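
The scoping of tags can be pictured as a lookup table keyed by project, dataset, version and tag. The Python sketch below is an illustration under stated assumptions, not bgdata's implementation: TAGS and resolve_build are hypothetical names, the table content is made up for the example, and treating any value that starts with a letter as a tag is an inference from the rule that builds never start with one.

```python
# Hypothetical in-memory tag table, for illustration only.
# Keys are (project, dataset, version, tag); values are builds.
TAGS = {
    ("_", "genomereference", "hg19", "master"): "20150724",
}


def resolve_build(project, dataset, version, build_or_tag="master"):
    """Return the build for a query, following a tag if one is given.

    Assumption: a value that starts with a letter is a tag; anything
    else is taken to be a build already.
    """
    if build_or_tag and not build_or_tag[0].isalpha():
        return build_or_tag
    # A tag only has meaning together with project, dataset and version.
    key = (project, dataset, version, build_or_tag)
    if key not in TAGS:
        raise KeyError(
            f"no tag {build_or_tag!r} for {project}/{dataset}/{version}"
        )
    return TAGS[key]


print(resolve_build("_", "genomereference", "hg19", "master"))
```

Because the tag is part of a four-element key, asking for master under a different version raises an error in this sketch instead of silently reusing the hg19 build.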