Advanced usage

Understanding the local repository

As mentioned in the package section, bgdata identifies each data package with a 4-level structure: project, dataset, version and build.

In the local repository, the 4-level structure is converted into a 3-level folder structure with this layout: project/dataset/version-build.

For example, for the hg19 version of the human genome, we set the project to _, the dataset to genomereference, the version to hg19 and the build to the date on which the package was created, 20150724.

If you request this package with bgdata (bgdata get _/genomereference/hg19?20150724), once the download finishes your local repository will look like this:

|- .bgdata/
|  |
|  |- genomereference/
|  |  |
|  |  |- hg19-20150724/
|  |  |  |
|  |  |  |- chr1.txt
|  |  |  |- chr2.txt
|  |  |  |- ...
|  |  |  |- .downloaded

This structure makes it easy to map a query made with project, dataset and version to the folder structure.
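The mapping from a query to a folder can be sketched in a few lines of Python. This is only an illustration of the layout described above, not bgdata's own code: the function names parse_query and package_folder, the repository path and the package names in the usage example are all hypothetical, and the sketch assumes the query always carries an explicit build after the ?.

```python
from pathlib import PurePosixPath


def parse_query(query):
    """Split 'project/dataset/version?build' into its four parts."""
    package, _, build = query.partition("?")
    project, dataset, version = package.split("/")
    return project, dataset, version, build


def package_folder(repository, query):
    """Return the project/dataset/version-build folder for a query."""
    project, dataset, version, build = parse_query(query)
    # Version and build share the last folder level, joined by a dash.
    return PurePosixPath(repository) / project / dataset / f"{version}-{build}"


# Hypothetical repository root and package, for illustration only.
print(package_folder("/repo", "myproject/mydataset/v1?20200101"))
```

The four identifier levels collapse into three folder levels because the version and the build are joined in the last path component.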

The .downloaded

The .downloaded file is created after the package has been downloaded and extracted. bgdata uses it internally to check whether the package is present and correct.

The .singlefile

Some data packages contain a .singlefile file. It holds the name of one of the files in the folder. If this file is present, bgdata returns the path to that particular file rather than the path to the folder.

bgdata creates this file automatically if a downloaded package contains only one file. However, a package with more than one file can also use it to make the package easier to use. For example, a tabix file consists of a data file and an index file, but tools that use it only need the path to the data file. For packages consisting of a tabix file, bgdata therefore always returns the path to the data file, as if that were the only file in the package, even though the package contains two files.
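The behaviour described above can be illustrated with a small Python sketch. It is not bgdata's implementation: the function name resolve_package_path is hypothetical, and the sketch assumes that .singlefile holds nothing but a file name, possibly followed by a newline.

```python
from pathlib import Path


def resolve_package_path(folder):
    """Return the file named in .singlefile, or the folder itself."""
    folder = Path(folder)
    marker = folder / ".singlefile"
    if marker.is_file():
        # The marker holds the name of one file inside the folder.
        return folder / marker.read_text().strip()
    # No marker: the caller receives the path to the whole folder.
    return folder
```

For a tabix package, the marker would name the data file, so callers never see the index file that sits next to it.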

The tag files

The build that a tag points to is stored in a file named after the version and the tag (version.tag). For example, the tag file for the hg19 package mentioned above that sets the master tag to the 20150724 build is located at:

|- .bgdata/
|  |
|  |- genomereference/
|  |  |
|  |  |- hg19-20150724/
|  |  |
|  |  |- hg19.master

The tag file only contains a string with the build.
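As an illustration, a tag could be resolved to a build by reading that file. The sketch below is not taken from bgdata: the function name read_tag is hypothetical, and it assumes the tag file sits in the dataset folder and is named version.tag, as in the tree above.

```python
from pathlib import Path


def read_tag(dataset_folder, version, tag):
    """Return the build stored in the '<version>.<tag>' tag file."""
    tag_file = Path(dataset_folder) / f"{version}.{tag}"
    # The file contains only the build string; strip any trailing newline.
    return tag_file.read_text().strip()
```

With the example above, reading hg19.master from the dataset folder would return the string stored there, which in turn identifies the hg19-20150724 folder.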

Cache management

bgdata includes some commands to manage your caches. However, keep in mind that caches are like partial copies of your local repository, so adding a package to a cache is as simple as copying it from the local repository, and removing it is as simple as deleting it.
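Since a cache is like a partial copy of the local repository, the manual equivalent of adding or removing a package can be sketched with the Python standard library. The function names cache_add and cache_remove are hypothetical, and the sketch assumes that a cache uses the same relative folder layout as the local repository. It requires Python 3.8 or later because of the dirs_exist_ok argument.

```python
import shutil
from pathlib import Path


def cache_add(local_repo, cache, package_dir):
    """Copy one package folder from the local repository into a cache."""
    source = Path(local_repo) / package_dir
    target = Path(cache) / package_dir
    # copytree creates any missing parent folders of the target.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target


def cache_remove(cache, package_dir):
    """Delete one package folder from a cache."""
    shutil.rmtree(Path(cache) / package_dir)
```

Here package_dir is the relative project/dataset/version-build path of the package.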

The commands you can use with bgdata cache are:

add     Add a package to the cache
clean   Clean everything
remove  Remove a package from the cache
update  Update packages in caches
add
This command copies a local package into the cache.
clean
This command removes everything in the cache.
remove
This command removes a particular build of a package from the cache.
update

This command removes old builds of packages and copies new ones. Use it with care. The flow is as follows:

  • bgdata resolves which builds are associated with the indicated tags
  • for each cache, bgdata lists the packages that are present. Any build of those packages that is not among the resolved builds is deleted, and the most recent build(s) of those packages are added to the cache.

It is important to note that a package that is not present in the cache will not be updated.
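The flow above can be sketched as a pure planning function. This is an interpretation of the description, not bgdata's code: the function name plan_cache_update and the dictionary shapes are hypothetical, and the sketch assumes that a package in the cache with no resolved build has all of its builds deleted.

```python
def plan_cache_update(resolved, cached):
    """Plan the update of one cache.

    resolved: {package: set of builds the indicated tags resolve to}
    cached:   {package: set of builds currently present in the cache}
    Returns (to_delete, to_add) as lists of (package, build) pairs.
    """
    to_delete, to_add = [], []
    # Only packages already in the cache are considered.
    for package, present in cached.items():
        wanted = resolved.get(package, set())
        # Builds in the cache that no tag resolves to are removed.
        to_delete += [(package, build) for build in sorted(present - wanted)]
        # Resolved builds missing from the cache are copied in.
        to_add += [(package, build) for build in sorted(wanted - present)]
    return to_delete, to_add
```

Because the loop runs over the cache contents only, a package that is resolved but absent from the cache never appears in either list.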

Tags in caches

Tag files can be used in cache repositories. In fact, when you request a particular tag, bgdata looks for it first in the local repository and then in the caches.

Warning

Using tag files in the caches is not recommended. If you do use them, you must update the tag files manually.

Creating your own packages

Building packages

The build command receives the path to a folder (or even a single file) and creates a compressed data package from it. It then uncompresses the package into the local repository and associates that build with the build tag. Thus you can use that tag (e.g. _/genomereference/hg19?build) for your tests.

Uploading packages to the remote

Warning

This section is only for people within our group or people that have set up their own system using bgdata.

Once the package is built, it can be uploaded to the remote with the upload command.

Important

Only packages that have been previously built can be uploaded.

The upload process does not go through HTTP. To prevent external users from uploading packages to our remote repository, the upload process is just a copy of files within the network file system (NFS). Thus, it will only work for people with access to the NFS.

If you have access, you need to edit your configuration file to add:

remote_repository_upload = /path/to/remote

The upload process includes the creation of a metadata file for the uploaded package. This file contains, among other items, a checksum used during the download process.

Fixing your builds

The easiest way to fix your builds is to do it directly in your code, e.g. bgdata get project/dataset/version?build. However, in some cases it is useful to fix the builds of the packages you use without modifying your code. Two typical use cases (there may be many others) are:

  • fixing the builds for reproducibility without modifying your code. Your calls to bgdata get project/dataset/version will return the same build even if you add new builds.
  • making a particular package point to a different tag. This can be useful during development: you associate your new build with a develop tag and force bgdata to use the develop data for that package and the default for the rest.

To fix your builds without explicitly indicating them in your code, set the environment variable BGDATA_BUILDS to the path of a file that sets the builds. Such a file can fix your builds in three different ways:

  1. Indicate a path to a file for a package in the paths section:

    [paths]
    project/dataset/version = /my/local/path
    

    In this case, any call to project/dataset/version will point to /my/local/path. This has no effect if the request indicates a tag or build.

  2. Override your tags in the builds section:

    [builds]
        [[project/dataset/version]]
            master = 20181105
    

    In this case, any request for the master tag of project/dataset/version points to the 20181105 build. The request can be explicit (project/dataset/version?master) or implicit (project/dataset/version, when the default tag is master).

  3. Fix the tags in the tags section:

    [tags]
    project/dataset/version = master
    project = develop
    

    In this case, any call for a data package under project that does not indicate a build or tag will use the develop tag by default, except for the package project/dataset/version, which will use the master tag. Note that this has no effect if you explicitly indicate the build or tag.
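The three mechanisms above can be combined in one small resolution sketch. It is an interpretation of the rules in this section, not bgdata's implementation: the function name resolve, the tuple it returns and the plain dictionaries standing in for the [paths], [builds] and [tags] sections are all hypothetical. Checking [paths] before [tags] is also an assumption, since the text does not state their relative precedence.

```python
def resolve(request, paths=None, builds=None, tags=None, default_tag="master"):
    """Resolve a request to ('path', value) or ('tag_or_build', value)."""
    paths, builds, tags = paths or {}, builds or {}, tags or {}
    package, _, selector = request.partition("?")
    if not selector:
        # [paths] and [tags] only apply when no tag or build is given.
        if package in paths:
            return ("path", paths[package])
        project = package.split("/")[0]
        # A package-level entry wins over a project-level entry.
        selector = tags.get(package, tags.get(project, default_tag))
    # [builds] maps a tag of a package to a concrete build, if listed.
    return ("tag_or_build", builds.get(package, {}).get(selector, selector))


builds = {"project/dataset/version": {"master": "20181105"}}
print(resolve("project/dataset/version", builds=builds))
print(resolve("project/dataset/version?master", builds=builds))
```

Both calls yield the 20181105 build: the first through the default tag, the second through the explicit master tag.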