Repositories

bgdata manages packages through three layers of repositories:

  • remote
  • local
  • caches

Remote

The remote represents a repository that serves as a source of data packages. Currently, it is an HTTP server that contains the compressed data packages and some tags.

When the user requests a package that is not present in the local repository, bgdata downloads it from the remote into the local repository.

In addition, bgdata keeps the tags in sync. If a tag of a particular package is updated in the remote and the user requests that tag while the local tag is out of date, the user gets the latest version from the remote.

Note

bgdata can work in offline mode. In that case, packages will not be downloaded and tags will not be updated.
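
The tag behaviour described above can be pictured with a minimal sketch. It is not bgdata's implementation: the function name resolve_tag and its parameters are hypothetical, and how bgdata actually stores and compares tags is not described here.

```python
from typing import Optional


def resolve_tag(local_version: Optional[str],
                remote_version: Optional[str],
                offline: bool = False) -> Optional[str]:
    """Return the version a tag should point to (illustrative only)."""
    # In offline mode tags are not updated, so the local value is kept.
    # Treating a missing remote value the same way is an assumption.
    if offline or remote_version is None:
        return local_version
    # Otherwise the remote tag wins, which brings an outdated local tag up to date.
    return remote_version
```

With these assumptions, a local tag pointing to an older version is replaced by the remote one when online and left untouched when offline.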

Local

The local repository holds the packages that the user has requested.

While the remote is an HTTP server, the local repository should be a path reachable from the user’s machine.

The main difference from the remote repository, apart from being on the local machine, is that packages are stored uncompressed.

The download process

Packages are downloaded from the remote with the Python package homura, which allows interrupted downloads to be resumed. After the download, bgdata extracts the files if they were compressed.

Once the download and extraction are done, bgdata creates a file named .download containing the date and time of that moment. If this file is missing, bgdata assumes the download has failed and reattempts it.
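
The marker-file check above can be sketched as follows. This is an illustration under assumptions, not bgdata's code: the helper names are hypothetical, the marker is assumed to sit inside the package directory, and the ISO timestamp format is a guess. Only the file name .download comes from the text.

```python
from datetime import datetime
from pathlib import Path

MARKER_NAME = ".download"  # file name taken from the description above


def is_download_complete(package_dir: Path) -> bool:
    """Return True if the completion marker exists in the package directory."""
    return (package_dir / MARKER_NAME).is_file()


def mark_download_complete(package_dir: Path) -> None:
    """Write the current date and time to the marker (format is an assumption)."""
    (package_dir / MARKER_NAME).write_text(datetime.now().isoformat())
```

Under this scheme, a caller would attempt the download again whenever is_download_complete returns False, which also covers the case where the marker was deleted by hand.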

Caches

A cache is an extension of the local repository. Like the local repository, it should be a path reachable from the user’s machine. Moreover, bgdata supports multiple caches.

When the user requests a package, bgdata searches for it first in each cache and then in the local repository.

A cache can have different uses. As an example, we use the scratch space in the nodes of our cluster as a cache for the packages we use recurrently. For the others, we have a local repository reachable through the network file system.

Important

bgdata will not fail just because a cache is not present. This means that you can also use an external hard drive as a cache: if it is not connected, bgdata can still be used.
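
The lookup order and the tolerance for missing caches can be sketched as below. This is a minimal illustration, not bgdata's implementation: find_package is a hypothetical name, and identifying a package by a path relative to the repository root is an assumption about the on-disk layout.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_package(relative_path: str,
                 caches: Iterable[Path],
                 local: Path) -> Optional[Path]:
    """Return the first existing location of a package, or None if not found."""
    # Caches are checked first, in the order given; the local repository is last.
    for root in [*caches, local]:
        if not root.is_dir():
            # An unavailable cache (e.g. an unplugged drive) is skipped, not an error.
            continue
        candidate = root / relative_path
        if candidate.exists():
            return candidate
    return None
```

A None result would correspond to the case where the package has to be downloaded from the remote into the local repository.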