Advanced usage¶
Understanding the local repository¶
As mentioned in the package section, bgdata identifies each data package with a 4-level structure: project, dataset, version and build.
In the local repository, the 4-level structure is converted into a 3-level folder structure following this layout: project/dataset/version-build.
For example, for the hg19 version of the human genome, we set the project to _, the dataset to genomereference, the version to hg19 and the build to the date used to create the package, 20150724.
If you request this package with bgdata (bgdata get _/genomereference/hg19?20150724), after downloading you will have a local repository like this:
|- .bgdata/
| |
| |- genomereference/
| | |
| | |- hg19-20150724/
| | | |
| | | |- chr1.txt
| | | |- chr2.txt
| | | |- ...
| | | |- .downloaded
This structure makes it easy to map a query made with project, dataset and version to the folder structure.
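The mapping from a query to a folder can be sketched in a few lines of Python. This is only an illustration of the layout described above, not bgdata's own code: the names `PackageQuery`, `parse_query` and `package_folder` are hypothetical, and the sketch assumes that the `_` project has no folder of its own, as the example tree suggests.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class PackageQuery(NamedTuple):
    project: str
    dataset: str
    version: str
    selector: Optional[str]  # build or tag given after "?", if any


def parse_query(query: str) -> PackageQuery:
    """Split 'project/dataset/version?selector' into its parts."""
    path, _, selector = query.partition("?")
    project, dataset, version = path.split("/")
    return PackageQuery(project, dataset, version, selector or None)


def package_folder(root: Path, query: PackageQuery, build: str) -> Path:
    """Return the folder of a build, following project/dataset/version-build."""
    # Assumption: the "_" project gets no folder, as in the example tree.
    parts = [] if query.project == "_" else [query.project]
    return root.joinpath(*parts, query.dataset, f"{query.version}-{build}")


query = parse_query("_/genomereference/hg19?20150724")
print(package_folder(Path(".bgdata"), query, query.selector))
```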
The .downloaded¶
The .downloaded file is created after downloading and extracting the package. bgdata uses it internally to check whether the package is present and correct.
The .singlefile¶
In some data packages you will find a .singlefile file. It contains the name of one of the files in the folder. If present, bgdata uses it to return the path to that particular file rather than the path to the folder.
bgdata creates this file automatically if a downloaded package contains only one file. However, some packages use this file even when they contain more than one file, to ease their usage. For example, a tabix file is formed by a data file and an index file, but tools using it only need to receive the path to the data file. For packages consisting of a tabix file, although they contain two files, bgdata always returns the path to the data file as if it were the only file in the package.
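The behaviour described above can be sketched as follows. This is a minimal illustration, not bgdata's implementation: the function name `resolve_package_path` and the file names in the demo are hypothetical, and the sketch assumes the .singlefile content is just a file name, possibly followed by a newline.

```python
import tempfile
from pathlib import Path


def resolve_package_path(package_dir: Path) -> Path:
    """Return the path a request for this package would resolve to."""
    marker = package_dir / ".singlefile"
    if marker.is_file():
        # The marker names one file in the folder: return that file.
        return package_dir / marker.read_text().strip()
    # No marker: the package is the whole folder.
    return package_dir


with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp)
    # Hypothetical tabix-style package: a data file plus its index.
    (pkg / "data.tsv.gz").touch()
    (pkg / "data.tsv.gz.tbi").touch()
    print(resolve_package_path(pkg) == pkg)  # folder, no marker yet
    (pkg / ".singlefile").write_text("data.tsv.gz\n")
    print(resolve_package_path(pkg).name)  # now the data file
```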
The tag files¶
The build that a tag points to is indicated in a file named after the version and the tag. For example, a tag file for the hg19 package mentioned above that sets the master tag to the 20150724 build will be located in:
|- .bgdata/
| |
| |- genomereference/
| | |
| | |- hg19-20150724/
| | |
| | |- hg19.master
The tag file only contains a string with the build.
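Resolving a tag then amounts to reading that file. The sketch below assumes the version.tag naming seen in the tree (hg19.master) and a file whose only content is the build string. `resolve_tag` is a hypothetical helper, not part of bgdata.

```python
import tempfile
from pathlib import Path


def resolve_tag(dataset_dir: Path, version: str, tag: str) -> str:
    """Return the build that `tag` points to for `version`."""
    # Assumption: the tag file sits in the dataset folder as "<version>.<tag>".
    tag_file = dataset_dir / f"{version}.{tag}"
    return tag_file.read_text().strip()


with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "genomereference"
    dataset_dir.mkdir()
    (dataset_dir / "hg19.master").write_text("20150724\n")
    build = resolve_tag(dataset_dir, "hg19", "master")
    print(dataset_dir.name + "/hg19-" + build)
```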
Cache management¶
bgdata includes some commands to manage your caches. However, keep in mind that caches are like partial copies of your local repository, so adding or removing packages from your caches is as simple as copying them from the local repository or deleting them.
The commands you can use with bgdata cache are:
add | Add a package to the cache |
clean | Clean everything |
remove | Remove a package from the cache |
update | Update packages in caches |
- add
- This command will copy a local package into the cache
- clean
- Clean is a command to remove everything in the cache
- remove
- This command will remove a particular build of a package from the cache
- update
Update removes old versions of packages and copies new ones. Use this command with care. The flow is as follows:
- bgdata resolves which builds are associated with the indicated tags.
- For each cache, bgdata gets which packages are present. If a build of a package is not among the resolved builds, it is deleted. The most recent version(s) of those packages are added to the cache.
It is important to note that a package that is not present in the cache will not be updated.
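The update flow can be pictured as computing a plan from two mappings. The sketch below is a literal reading of the steps above, not bgdata's code: `plan_cache_update` and the build values in the demo are hypothetical, and packages are represented as plain strings.

```python
from typing import Dict, List, Set, Tuple

Package = str  # e.g. "project/dataset/version"
Change = Tuple[Package, str]  # (package, build)


def plan_cache_update(
    cached: Dict[Package, Set[str]],
    resolved: Dict[Package, Set[str]],
) -> Tuple[List[Change], List[Change]]:
    """Return (to_remove, to_add) for one cache."""
    to_remove: List[Change] = []
    to_add: List[Change] = []
    # Only packages already present in the cache are considered.
    for package, builds in sorted(cached.items()):
        wanted = resolved.get(package, set())
        # Under this reading, a cached build that no tag resolves to is deleted.
        to_remove += [(package, b) for b in sorted(builds - wanted)]
        to_add += [(package, b) for b in sorted(wanted - builds)]
    return to_remove, to_add


cached = {"_/genomereference/hg19": {"20150724"}}
resolved = {
    "_/genomereference/hg19": {"20161010"},  # hypothetical newer build
    "_/genomereference/hg38": {"20161011"},  # not cached, so left alone
}
print(plan_cache_update(cached, resolved))
```

Note that hg38 does not appear in the plan: it is not in the cache, so it is not updated.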
Tags in caches¶
Tag files can be used in cache repositories. In fact, when you request a particular tag, bgdata looks for it first in the local repository and then in the caches.
Warning
Using tag files in the caches is not recommended, and the user must update them manually.
Creating your own packages¶
Building packages¶
The build command receives the path to a folder (or even a single file) and creates a compressed data package from it. Then it uncompresses the package into the local repository and associates that build with the build tag. Thus you can use that tag (e.g. _/genomereference/hg19?build) for your tests.
Uploading packages to the remote¶
Warning
This section is only for people within our group or people that have set up their own system using bgdata.
Once the package is built, it can be uploaded to the remote using the upload command.
Important
Only packages that have been previously built can be uploaded.
The upload process does not go through HTTP. To prevent external users from updating packages in our remote repository, the upload process is just a copy of files in the network file system (NFS). Thus, it will only work for people with access to the NFS.
If you have access, you need to edit your configuration file to add:
remote_repository_upload = /path/to/remote
The upload process includes the creation of a metadata file for the uploaded package. This file contains, among other items, a checksum used during the download process.
Fixing your builds¶
The easiest way to fix your builds is to do it directly in your code, e.g. bgdata get project/dataset/version?build. However, in some cases it is useful to fix the builds of the packages used without modifying your code.
Two typical use cases are (there might be many others):
- Fixing the builds for reproducibility without modifying your code. Your calls to bgdata get project/dataset/version will return the same build even if you add new builds.
- Making a particular package point to a different tag. This can be useful for development: you associate your new build with a develop tag and force bgdata to use the develop data for that package and the default for the rest.
To fix your builds without indicating them explicitly in your code, set the environment variable BGDATA_BUILDS to the path of a file that sets the builds. Such a file can fix your builds in three different ways:
- Indicate a path to a file for a package in the paths section:
[paths]
project/dataset/version = /my/local/path
In this case, any call to project/dataset/version will point to /my/local/path. This has no effect if the request indicates a tag or build.
- Override your tags in the builds section:
[builds]
[[project/dataset/version]]
master = 20181105
In this case, any request for the master tag of project/dataset/version points to the 20181105 build. The request can be explicit (project/dataset/version?master) or implicit (project/dataset/version, when the default tag is master).
- Fix the tags in the tags section:
[tags]
project/dataset/version = master
project = develop
In this case, any call (that does not indicate the build or tag) to any data package under project will use the develop tag by default, except the package project/dataset/version, which will use master. Note that this will not have any effect if you explicitly indicate the build or tag.
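How the three sections interact can be sketched as a small decision function. This is a reading of the rules above, not bgdata's implementation: the function `select`, its return values and the dictionaries standing in for the parsed file are hypothetical, and the order in which paths is checked before tags is an assumption.

```python
from typing import Dict, Optional, Tuple

DEFAULT_TAG = "master"  # assumed default tag, as in the example above


def select(
    package: str,
    selector: Optional[str],
    paths: Dict[str, str],
    builds: Dict[str, Dict[str, str]],
    tags: Dict[str, str],
) -> Tuple[str, str]:
    """Return ("path", p), ("build", b) or ("selector", s) for a request.

    `selector` is the tag or build given after "?", or None.
    """
    if selector is None:
        # [paths] only applies when no tag or build is indicated.
        if package in paths:
            return ("path", paths[package])
        # [tags]: the package entry wins over the project entry.
        project = package.split("/")[0]
        selector = tags.get(package, tags.get(project, DEFAULT_TAG))
    # [builds]: override the build a tag points to for this package.
    override = builds.get(package, {}).get(selector)
    if override is not None:
        return ("build", override)
    # Otherwise the tag or build is resolved the usual way.
    return ("selector", selector)


pkg = "project/dataset/version"
tags = {pkg: "master", "project": "develop"}
builds = {pkg: {"master": "20181105"}}
print(select(pkg, None, {}, builds, tags))
print(select("project/other/v1", None, {}, builds, tags))
```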