Command Line Interface¶
With the exception of ingest, which ingests the clinical and pipeline metadata in bulk, the candig_repo command line interface is used for all other operations.
The registry contains links to files, as well as some metadata.
Warning
Because the data model objects returned via APIs are created at server start-up, at this time, you have to restart the server for the data you ingest to be reflected.
For instructions on adding metadata in bulk, see ingest.
When you are done ingesting data, you may start up your server instance by running the candig_server command. See Other commands for more information.
Initialize/Remove Dataset¶
This section contains commands that initialize a dataset, give you an overview of the data repository, and delete a dataset.
You do not need to use init to initialize the dataset if you have already prepared a JSON file of clinical information. You can run the ingest command directly and it will take care of everything for you.
init¶
Warning
If you have already prepared a JSON file that conforms to our standard clinical or pipeline metadata, you can run the ingest command directly without running init.
For detailed instructions, see ingest.
The init command initializes a new registry DB at a given file path. Unless you have a clinical JSON file ready that can be ingested with ingest, you need to run this to initialize your DB.
Initialize a data repository
usage: candig_repo init [-h] [-f] registryPath
Positional Arguments¶
registryPath | the location of the registry database |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo init registry.db
list¶
The list command is used to print the contents of a repository to the screen. It is an essential tool for administrators to understand the structure of the repository that they are managing.
Note
The list command is under development. In particular, the output of this command should improve considerably in the near future.
List the contents of the repo
usage: candig_repo list [-h] registryPath
Positional Arguments¶
registryPath | the location of the registry database |
Examples:
$ candig_repo list registry.db
verify¶
The verify command is used to check the integrity of the data in a repository. The command checks each container object in turn and ensures that it can read data from it. Read errors can occur for any number of reasons (for example, a VCF file may have been moved to another location since it was added to the registry), and the verify command allows an administrator to check that all is well in their repository.
Note
The verify command is currently under review.
Verifies the repository by examining all data files
usage: candig_repo verify [-h] registryPath
Positional Arguments¶
registryPath | the location of the registry database |
Examples:
$ candig_repo verify registry.db
add-dataset¶
Creates a new dataset in a repository. A dataset is an arbitrary collection of ReadGroupSets, VariantSets, VariantAnnotationSets and FeatureSets. Each dataset has a name, which is used to identify it in the repository manager.
Warning
If you have already prepared a JSON file that conforms to our standard clinical or pipeline metadata, you can run the ingest command directly without running add-dataset.
For detailed instructions, see ingest.
Add a dataset to the data repo
usage: candig_repo add-dataset [-h] [-A ATTRIBUTES] [-d DESCRIPTION]
registryPath datasetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
Named Arguments¶
-A, --attributes | |
additional attributes for the message expressed as JSON | |
-d, --description | |
The human-readable description of the dataset. |
Examples:
$ candig_repo add-dataset registry.db 1kg -d 'Example dataset using 1000 genomes data'
Adds the dataset with the name 1kg and description 'Example dataset using 1000 genomes data' to the registry database registry.db.
add-dataset-duo¶
Creates or updates the Data Use Ontology (DUO) information for an existing dataset. Note that the dataset must already exist before you can use this command. When you need to update the DUO info, simply run the command again with the updated DUO JSON file.
Add DUO info to a dataset
usage: candig_repo add-dataset-duo [-h]
registryPath datasetName
dataUseOntologyFile
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
dataUseOntologyFile | |
Path to your duo config json file. |
Examples:
$ candig_repo add-dataset-duo registry.db mock1 duo.json
Adds the Data Use Ontology info to the dataset with the name mock1.
To learn about how to prepare a json file that contains DUO info for a dataset, and a list
of DUO IDs that are allowed, see the Data Use Ontology
section under Prepare Data For Ingestion.
remove-dataset¶
Removes a dataset from the repository and recursively removes all objects (ReadGroupSets, VariantSets, etc) within this dataset.
Remove a dataset from the data repo
usage: candig_repo remove-dataset [-h] [-f] registryPath datasetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-dataset registry.db dataset1
Deletes the dataset with name dataset1 from the repository represented by registry.db.
remove-dataset-duo¶
Removes the Data Use Ontology information from an existing dataset.
Remove DUO info from a dataset
usage: candig_repo remove-dataset-duo [-h] [-f] registryPath datasetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-dataset-duo registry.db mock1
Removes the Data Use Ontology info from the dataset with the name mock1.
Add/Remove Clinical & Pipeline Metadata¶
This section contains commands that let you ingest data into the clinical and pipeline metadata tables, as well as the commands that delete them.
The ingest command is the only way to ingest clinical or pipeline data in bulk. It encapsulates all the write operations into a single transaction. To learn about preparing the JSON files for the ingest command, see Prepare Data For Ingestion.
All of the remove commands for clinical records require you to specify the record's name. Note that the name here is actually the record's unique identifier, which is typically composed of the patientId, sometimes along with another ID or timestamp information. This is the same name you see in these clinical or pipeline data records.
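The remove commands documented below all share one shape: candig_repo remove-<record type> registryPath datasetName name. As a minimal sketch, assuming a hypothetical helper named build_remove_command (it is not part of candig-server), the argument list for such a call could be assembled like this, without running anything:

```python
# Hypothetical helper for illustration; not part of candig-server.
# The record types mirror the remove-* commands documented in this section.
RECORD_TYPES = {
    "patient", "enrollment", "treatment", "sample", "diagnosis",
    "tumourboard", "outcome", "complication", "consent", "chemotherapy",
    "immunotherapy", "radiotherapy", "celltransplant", "surgery",
    "study", "slide", "labtest",
}


def build_remove_command(registry_path, dataset_name, record_type, name, force=False):
    """Return the argument list for a `candig_repo remove-<record_type>` call."""
    if record_type not in RECORD_TYPES:
        raise ValueError(f"unknown record type: {record_type!r}")
    command = ["candig_repo", f"remove-{record_type}"]
    if force:
        command.append("-f")  # do not prompt for confirmation
    # The last argument is the record's unique identifier, not a display label.
    command += [registry_path, dataset_name, name]
    return command


if __name__ == "__main__":
    argv = build_remove_command(
        "registry.db", "mock1", "sample", "PATIENT_81202_SAMPLE_33409"
    )
    print(" ".join(argv))
```

The resulting list could be handed to subprocess.run(argv, check=True) on a machine where candig_repo is installed.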
ingest¶
The ingest command is the preferred way to import metadata in bulk. It does not come with candig-server by default. To use it, you need to install candig-ingest by running:
pip install candig-ingest
To import metadata in bulk, you need to have a specially formatted json file. A mock json file is available from https://github.com/CanDIG/candig-ingest/blob/master/candig/ingest/mock_data/clinical_metadata_tier1.json
To ingest the data, run:
usage: ingest registryPath datasetName metadataPath
If the dataset does not exist, ingest will create a new dataset with this name. There is no need to run the init command before running ingest.
Examples:
$ ingest registry.db mock1 mock_data.json
remove-patient¶
Removes a patient.
Remove a Patient from the repo
usage: candig_repo remove-patient [-h] [-f]
registryPath datasetName patientName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
patientName | the name of the patient |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-patient registry.db mock1 PATIENT_81202
remove-enrollment¶
Removes an enrollment.
Remove an Enrollment from the repo
usage: candig_repo remove-enrollment [-h] [-f]
registryPath datasetName enrollmentName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
enrollmentName | the name of the enrollment |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-enrollment registry.db mock1 PATIENT_81202_2005-08-23
remove-treatment¶
Removes a treatment.
Remove a Treatment from the repo
usage: candig_repo remove-treatment [-h] [-f]
registryPath datasetName treatmentName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
treatmentName | the name of the treatment |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-treatment registry.db mock1 PATIENT_81202_2005-08-23
remove-sample¶
Removes a sample.
Remove a Sample from the repo
usage: candig_repo remove-sample [-h] [-f] registryPath datasetName sampleName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
sampleName | the name of the sample |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-sample registry.db mock1 PATIENT_81202_SAMPLE_33409
remove-diagnosis¶
Removes a diagnosis.
Remove a Diagnosis from the repo
usage: candig_repo remove-diagnosis [-h] [-f]
registryPath datasetName diagnosisName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
diagnosisName | the name of the diagnosis |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-diagnosis registry.db mock1 PATIENT_81202_SAMPLE_33409
remove-tumourboard¶
Removes a tumourboard.
Remove a Tumourboard from the repo
usage: candig_repo remove-tumourboard [-h] [-f]
registryPath datasetName tumourboardName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
tumourboardName | |
the name of the tumourboard |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-tumourboard registry.db mock1 PATIENT_81202_SAMPLE_33409
remove-outcome¶
Removes an outcome.
Remove an Outcome from the repo
usage: candig_repo remove-outcome [-h] [-f]
registryPath datasetName outcomeName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
outcomeName | the name of the outcome |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-outcome registry.db mock1 PATIENT_81202_2016-10-11
remove-complication¶
Removes a complication.
Remove a Complication from the repo
usage: candig_repo remove-complication [-h] [-f]
registryPath datasetName
complicationName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
complicationName | |
the name of the complication |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-complication registry.db mock1 PATIENT_81202_2016-10-11
remove-consent¶
Removes a consent.
Remove a Consent from the repo
usage: candig_repo remove-consent [-h] [-f]
registryPath datasetName consentName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
consentName | the name of the consent |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-consent registry.db mock1 PATIENT_81202_2016-10-11
remove-chemotherapy¶
Removes a chemotherapy.
Remove a Chemotherapy from the repo
usage: candig_repo remove-chemotherapy [-h] [-f]
registryPath datasetName
chemotherapyName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
chemotherapyName | |
the name of the chemotherapy |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-chemotherapy registry.db mock1 PATIENT_81202_2016-10-11
remove-immunotherapy¶
Removes an immunotherapy.
Remove an Immunotherapy from the repo
usage: candig_repo remove-immunotherapy [-h] [-f]
registryPath datasetName
immunotherapyName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
immunotherapyName | |
the name of the immunotherapy |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-immunotherapy registry.db mock1 PATIENT_81202_2016-10-11
remove-radiotherapy¶
Removes a radiotherapy.
Remove a Radiotherapy from the repo
usage: candig_repo remove-radiotherapy [-h] [-f]
registryPath datasetName
radiotherapyName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
radiotherapyName | |
the name of the radiotherapy |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-radiotherapy registry.db mock1 PATIENT_81202_2016-10-11
remove-celltransplant¶
Removes a celltransplant.
Remove a Celltransplant from the repo
usage: candig_repo remove-celltransplant [-h] [-f]
registryPath datasetName
celltransplantName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
celltransplantName | |
the name of the celltransplant |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-celltransplant registry.db mock1 PATIENT_81202_2016-10-11
remove-surgery¶
Removes a surgery.
Remove a Surgery from the repo
usage: candig_repo remove-surgery [-h] [-f]
registryPath datasetName surgeryName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
surgeryName | the name of the surgery |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-surgery registry.db mock1 PATIENT_81202_2016-10-11
remove-study¶
Removes a study.
Remove a Study from the repo
usage: candig_repo remove-study [-h] [-f] registryPath datasetName studyName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
studyName | the name of the study |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-study registry.db mock1 PATIENT_81202_2016-10-11
remove-slide¶
Removes a slide.
Remove a Slide from the repo
usage: candig_repo remove-slide [-h] [-f] registryPath datasetName slideName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
slideName | the name of the slide |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-slide registry.db mock1 PATIENT_81202_2016-10-11
remove-labtest¶
Removes a labtest.
Remove a Labtest from the repo
usage: candig_repo remove-labtest [-h] [-f]
registryPath datasetName labtestName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
labtestName | the name of the labtest |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-labtest registry.db mock1 PATIENT_81202_2016-10-11
Add/Remove Genomics Data¶
add-referenceset¶
Adds a reference set derived from a FASTA file to a repository. Each record in the FASTA file will correspond to a Reference in the new ReferenceSet. The input FASTA file must be compressed with bgzip and indexed using samtools faidx. Each ReferenceSet contains a number of metadata values (e.g. species) which can be set using command line options.
You may also ingest raw FASTA (.fa) files.
If you ingest compressed FASTA files, you should download both the .gz.fai and .gz.gzi files, along with the .fa.gz file. If the .gz.fai file is not available, the server will attempt to generate it at ingestion.
Example FASTA files are available from:
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.fai
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.gzi
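Before running add-referenceset on a compressed FASTA file, it can help to confirm that the companion .fai and .gzi files sit next to it. The following is a minimal sketch, assuming a hypothetical helper named missing_fasta_companions (not part of candig-server); it only checks that the files exist and does not validate their contents.

```python
from pathlib import Path


def missing_fasta_companions(fasta_path):
    """Return the companion index files that are absent for a compressed FASTA.

    For a file such as hs37d5.fa.gz, the expected companions are
    hs37d5.fa.gz.fai and hs37d5.fa.gz.gzi. Files that do not end in .gz
    are treated as raw FASTA and nothing is checked for them.
    """
    fasta = Path(fasta_path)
    if not fasta.name.endswith(".gz"):
        return []
    companions = [fasta.with_name(fasta.name + suffix) for suffix in (".fai", ".gzi")]
    return [str(path) for path in companions if not path.exists()]


if __name__ == "__main__":
    for missing in missing_fasta_companions("hs37d5.fa.gz"):
        print(f"missing companion file: {missing}")
```

An empty list means both companion files were found.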
Add a reference set to the data repo
usage: candig_repo add-referenceset [-h] [-A ATTRIBUTES] [-r] [-n NAME]
[-d DESCRIPTION] [--species SPECIES]
[--isDerived ISDERIVED]
[--assemblyId ASSEMBLYID]
[--sourceAccessions SOURCEACCESSIONS]
[--sourceUri SOURCEURI]
registryPath filePath
Positional Arguments¶
registryPath | the location of the registry database |
filePath | The path of the FASTA file to use as a reference set. This file must be bgzipped and indexed. |
Named Arguments¶
-A, --attributes | |
additional attributes for the message expressed as JSON | |
-r, --relativePath | |
store relative path in database | |
-n, --name | The name of the reference set |
-d, --description | |
The human-readable description of the reference set. | |
--species | The species ontology term as a JSON string |
--isDerived | Indicates if this reference set is derived from another |
--assemblyId | The assembly id |
--sourceAccessions | |
The source accessions (pass as comma-separated list) | |
--sourceUri | The source URI |
Examples:
$ candig_repo add-referenceset registry.db hs37d5.fa.gz \
--description "NCBI37 assembly of the human genome" \
--species '{"termId": "NCBI:9606", "term": "Homo sapiens"}' \
--name NCBI37 \
--sourceUri ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
Adds a reference set used in the 1000 Genomes project using the name NCBI37, also setting the species to 9606 (human).
add-ontology¶
Warning
This command, as well as all ontology-related operations, is under review and might undergo changes in the near future.
Adds a new ontology to the repository. The ontology supplied must be a text file in OBO format. If you wish to serve sequence or variant annotations from a repository, a sequence ontology (SO) instance is required to translate ontology term names held in annotations to ontology IDs. Sequence ontology definitions can be downloaded from the Sequence Ontology site.
Adds an ontology in OBO format to the repo. Currently, a sequence ontology (SO) instance is required to translate ontology term names held in annotations to ontology IDs. Sequence ontology files can be found at https://github.com/The-Sequence-Ontology/SO-Ontologies
usage: candig_repo add-ontology [-h] [-r] [-n NAME] registryPath filePath
Positional Arguments¶
registryPath | the location of the registry database |
filePath | The path of the OBO file defining this ontology. |
Named Arguments¶
-r, --relativePath | |
store relative path in database | |
-n, --name | The name of the ontology |
Examples:
$ candig_repo add-ontology registry.db path/to/so-xp.obo
Adds the sequence ontology so-xp.obo to the repository using the default naming rules.
add-variantset¶
Adds a variant set to a named dataset in a repository. Variant sets are currently derived from one or more non-overlapping VCF/BCF files which may be either stored locally or come from a remote URL. Multiple VCF files can be specified either directly on the command line or by providing a single directory argument that contains indexed VCF files. If remote URLs are used then index files in the local file system must be provided using the -I option.
Note: Starting from 0.9.3, you need to specify a patientId and a sampleId. The server does not validate either, so please double-check that the IDs are correct.
Add a variant set to the data repo based on one or more VCF files.
usage: candig_repo add-variantset [-h] [-r] [-I indexFiles [indexFiles ...]]
[-n NAME] [-R REFERENCESETNAME]
[-O ONTOLOGYNAME] [-A ATTRIBUTES] [-a]
registryPath datasetName patientId sampleId
dataFiles [dataFiles ...]
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
patientId | the ID of the patient |
sampleId | the ID of the sample |
dataFiles | The VCF/BCF files representing the new VariantSet. These may be specified either as one or more paths to local files or remote URLs, or as a path to a local directory containing VCF files. Either a single directory argument may be passed or a list of file paths/URLs, but not a mixture of directories and paths. |
Named Arguments¶
-r, --relativePath | |
store relative path in database | |
-I, --indexFiles | |
The index files for the VCF/BCF files provided in the dataFiles argument. These must be provided in the same order as the data files. | |
-n, --name | The name of the VariantSet |
-R, --referenceSetName | |
the name of the reference set to associate with this VariantSet | |
-O, --ontologyName | |
the name of the sequence ontology instance used to translate ontology term names to IDs in this VariantSet | |
-A, --attributes | |
additional attributes for the message expressed as JSON | |
-a, --addAnnotationSets | |
If the supplied VCF file contains annotations, create the corresponding VariantAnnotationSet. |
Examples:
$ candig_repo add-variantset registry.db 1kg PATIENT_123 SAMPLE_123 1kgPhase1/ -R NCBI37
Adds a new variant set to the dataset named 1kg in the repository defined by the registry database registry.db using the VCF files contained in the directory 1kgPhase1 that belong to PATIENT_123 and SAMPLE_123. Note that this directory must also contain the corresponding indexes for these files. We associate the reference set named NCBI37 with this new variant set. Because we do not provide a --name argument, a name is automatically generated using the default name generation rules.
$ candig_repo add-variantset registry.db 1kg PATIENT_123 SAMPLE_123 \
1kgPhase1/chr1.vcf.gz -n phase1-subset -R NCBI37
Like the last example, we add a new variant set to the dataset 1kg, this time with a single VCF file and the corresponding patientId and sampleId. We also specify the name for this new variant set to be phase1-subset.
$ candig_repo add-variantset registry.db 1kg PATIENT_123 SAMPLE_123 \
--name phase1-subset-remote -R NCBI37 \
--indexFiles ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi ALL.chr2.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi \
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz \
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/release/20110521/ALL.chr2.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz
This example performs the same task of creating a subset of the phase1
VCFs, but this time we use the remote URL directly and do not keep a
local copy of the VCF file. Because we are using remote URLs to define
the variant set, we have to download a local copy of the corresponding
index files and provide them on the command line using the --indexFiles option.
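Because the index files must be given in the same order as the data files, a quick positional check before calling add-variantset can catch a swapped or missing index. The sketch below assumes two hypothetical helpers, pair_data_with_indexes and suspicious_pairs; neither is part of candig-server, and the file-name comparison is only a heuristic of this sketch.

```python
import os


def pair_data_with_indexes(data_files, index_files):
    """Pair each VCF/BCF path or URL with the index file at the same position."""
    if len(data_files) != len(index_files):
        raise ValueError(
            f"{len(data_files)} data files but {len(index_files)} index files"
        )
    return list(zip(data_files, index_files))


def suspicious_pairs(pairs):
    """Return pairs whose index file name does not start with the data file name.

    This naming check is a heuristic, not a rule enforced by candig_repo.
    """
    flagged = []
    for data_file, index_file in pairs:
        data_name = data_file.rsplit("/", 1)[-1]  # works for paths and URLs
        if not os.path.basename(index_file).startswith(data_name):
            flagged.append((data_file, index_file))
    return flagged


if __name__ == "__main__":
    pairs = pair_data_with_indexes(
        ["ftp://example.org/chr1.vcf.gz", "ftp://example.org/chr2.vcf.gz"],
        ["chr2.vcf.gz.tbi", "chr1.vcf.gz.tbi"],  # deliberately swapped
    )
    for data_file, index_file in suspicious_pairs(pairs):
        print(f"check ordering: {data_file} is paired with {index_file}")
```

The example.org URLs are placeholders used only to demonstrate the check.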
add-readgroupset¶
Adds a readgroup set to a named dataset in a repository. Readgroup sets are currently derived from a single indexed BAM file, which can be either stored locally or based on a remote URL. If the readgroup set is based on a remote URL, then the index file must be stored locally and specified using the --indexFile option.
Each readgroup set must be associated with the reference set that it is aligned to. The add-readgroupset command first examines the headers of the BAM file to see if it contains information about references, and then looks for a reference set with name equal to the genome assembly identifier defined in the header. (Specifically, we read the @SQ header line and use the value of the AS tag as the default reference set name.) If this reference set exists, then the readgroup set will be associated with it automatically. If it does not (or we cannot find the appropriate information in the header), then the add-readgroupset command will fail. In this case, the user must provide the name of the reference set using the --referenceSetName option.
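To see in advance which reference set name would be picked up, the AS tag can be read from the BAM header text (for instance the output of samtools view -H). The following is a minimal sketch with hypothetical function names. Returning a name only when every @SQ line agrees is an assumption of this sketch and may differ from what the server does.

```python
def assembly_ids_from_header(header_text):
    """Collect the AS (genome assembly identifier) values from @SQ header lines."""
    ids = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # SAM header fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            if field.startswith("AS:"):
                ids.append(field[3:])
    return ids


def default_reference_set_name(header_text):
    """Return the single assembly identifier in the header, or None.

    None means --referenceSetName would have to be given explicitly.
    """
    distinct = set(assembly_ids_from_header(header_text))
    return distinct.pop() if len(distinct) == 1 else None


if __name__ == "__main__":
    header = (
        "@HD\tVN:1.0\tSO:coordinate\n"
        "@SQ\tSN:1\tLN:249250621\tAS:NCBI37\n"
        "@SQ\tSN:2\tLN:243199373\tAS:NCBI37\n"
    )
    print(default_reference_set_name(header))
```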
Add a read group set to the data repo
usage: candig_repo add-readgroupset [-h] [-n NAME] [-R REFERENCESETNAME]
[-A ATTRIBUTES] [-r] [-I INDEXFILE]
registryPath datasetName patientId
sampleId dataFile
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
patientId | the ID of the patient |
sampleId | the ID of the sample |
dataFile | The file path or URL of the BAM file for this ReadGroupSet |
Named Arguments¶
-n, --name | The name of the ReadGroupSet |
-R, --referenceSetName | |
the name of the reference set to associate with this ReadGroupSet | |
-A, --attributes | |
additional attributes for the message expressed as JSON | |
-r, --relativePath | |
store relative path in database | |
-I, --indexFile | |
The file path of the BAM index for this ReadGroupSet. If the dataFile argument is a local file, this will be automatically inferred by appending ‘.bai’ to the file name. If the dataFile is a remote URL the path to a local file containing the BAM index must be provided |
Examples:
$ candig_repo add-readgroupset registry.db 1kg PATIENT_123 SAMPLE_123 \
path/to/HG00114.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
Adds a new readgroup set for an indexed 1000 Genomes BAM file stored on the local file system. The index file follows the usual convention: it is stored in the same directory as the BAM file and has an extra .bai extension. The name of the readgroup set is automatically derived from the file name, and the reference set is automatically set from the BAM header.
$ candig_repo add-readgroupset registry.db 1kg PATIENT_123 SAMPLE_123 candig-example-data/HG00096.bam \
-R GRCh37-subset -n HG00096-subset
Adds a new readgroup set based on a subset of the 1000 genomes reads for the HG00096 sample from the example data used in the reference server. In this case we specify that the reference set name GRCh37-subset be associated with the readgroup set. We also override the default name generation rules and specify the name HG00096-subset for the new readgroup set.
$ candig_repo add-readgroupset registry.db 1kg PATIENT_123 SAMPLE_123 \
-n HG00114-remote \
-I /path/to/HG00114.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase3/data/HG00114/alignment/HG00114.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
Adds a new readgroup set based on a 1000 genomes BAM directly from the NCBI FTP server. Because this readgroup set uses a remote FTP URL, we must specify the location of the .bai index file on the local file system.
add-featureset¶
Warning
You may retrieve the latest version of GENCODE from https://www.gencodegenes.org/human/. You can usually download the GFF3 file from the first row: Comprehensive gene annotation.
Once you retrieve the GFF3 file, unzip it, then use a conversion script to convert the GFF3 file to a SQLite-compatible DB. The script is available from https://github.com/CanDIG/candig-server/blob/develop/scripts/generate_gff3_db.py.
The script, by default, will create composite indexes on (start, end, referenceName) and (geneName, type). This should suffice for most use cases.
If you are using the script mentioned above, ignore the following two paragraphs.
Before you add the feature set, you should index some of the columns in your generated DB. Specifically, make sure that both gene_name and type are indexed. If they are not, queries to this endpoint, and to endpoints that depend on it, e.g., variants/gene/search, will be very slow.
To create a composite index on the aforementioned fields, open the featureset DB you generated in the sqlite browser, then run CREATE INDEX name_type_index ON FEATURE (gene_name, type);
You should carefully review your use-case and index other fields accordingly.
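The CREATE INDEX statement above can also be applied without a graphical browser. The following is a minimal sketch using Python's standard sqlite3 module; the function name create_name_type_index is hypothetical, and IF NOT EXISTS is added here (it is not in the statement above) so that the call is safe to repeat. It assumes the database already contains a FEATURE table with gene_name and type columns, as described above.

```python
import sqlite3


def create_name_type_index(connection):
    """Create the composite (gene_name, type) index and list FEATURE's indexes."""
    with connection:  # commits on success, rolls back on error
        connection.execute(
            "CREATE INDEX IF NOT EXISTS name_type_index "
            "ON FEATURE (gene_name, type)"
        )
    # Each row of PRAGMA index_list has the index name in its second column.
    rows = connection.execute("PRAGMA index_list('FEATURE')").fetchall()
    return [row[1] for row in rows]


if __name__ == "__main__":
    # Stand-in table so the sketch runs on its own; with real data, connect
    # to the generated feature set DB (for example gencode.db) instead.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE FEATURE (gene_name TEXT, type TEXT)")
    print(create_name_type_index(connection))
    connection.close()
```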
Adds a feature set to a named dataset in a repository. Feature sets must be in a ‘.db’ file. An appropriate ‘.db’ file can be generated from a GFF3 file using scripts/generate_gff3_db.py.
Add a feature set to the data repo
usage: candig_repo add-featureset [-h] [-A ATTRIBUTES] [-r]
[-R REFERENCESETNAME] [-O ONTOLOGYNAME]
[-C CLASSNAME]
registryPath datasetName filePath
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
filePath | The path to the converted SQLite database containing Feature data |
Named Arguments¶
-A, --attributes | |
additional attributes for the message expressed as JSON | |
-r, --relativePath | |
store relative path in database | |
-R, --referenceSetName | |
the name of the reference set to associate with this feature set | |
-O, --ontologyName | |
the name of the sequence ontology instance used to translate ontology term names to IDs in this feature set | |
-C, --className | |
the name of the class used to fetch features in this feature set |
Examples:
$ candig_repo add-featureset registry.db 1KG gencode.db \
-R hg37 -O so-xp-simple
Adds the feature set gencode to the registry under the 1KG dataset. The flags set the reference genome to be hg37 and the ontology to use to so-xp-simple.
add-continuousset¶
Adds a continuous set to a named dataset in a repository. Continuous sets must be in a bigWig file. The bigWig format is described here: http://genome.ucsc.edu/goldenPath/help/bigWig.html. There are directions for converting wiggle files to bigWig files on the page also. Files in the bedGraph format can be converted using bedGraphToBigWig (https://www.encodeproject.org/software/bedgraphtobigwig/).
Add a continuous set to the data repo
usage: candig_repo add-continuousset [-h] [-r] [-R REFERENCESETNAME]
[-C CLASSNAME]
registryPath datasetName filePath
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
filePath | The path to the file containing the continuous data |
Named Arguments¶
-r, --relativePath | |
store relative path in database | |
-R, --referenceSetName | |
the name of the reference set to associate with this continuous set | |
-C, --className | |
the name of the class used to fetch features in this continuous set |
Examples:
$ candig_repo add-continuousset registry.db 1KG continuous.bw \
-R hg37
Adds the continuous set continuous to the registry under the 1KG dataset. The flags set the reference genome to be hg37.
init-rnaquantificationset¶
Initializes a new RNA quantification set at the given file path.
Initializes an RNA quantification set
usage: candig_repo init-rnaquantificationset [-h] registryPath filePath
Positional Arguments¶
registryPath | the location of the registry database |
filePath | The path to the resulting Quantification Set |
Examples:
$ candig_repo init-rnaquantificationset repo.db rnaseq.db
Initializes the RNA Quantification Set with the filename rnaseq.db.
add-rnaquantification¶
Adds an RNA quantification to an RNA quantification set.
RNA quantification formats supported are currently kallisto and RSEM.
Add an RNA quantification to the data repo
usage: candig_repo add-rnaquantification [-h] [--biosampleName BIOSAMPLENAME]
[--sampleId SAMPLEID]
[--patientId PATIENTID]
[--readGroupSetName READGROUPSETNAME]
[--featureSetNames FEATURESETNAMES]
[-n NAME] [-d DESCRIPTION] [-t]
[-A ATTRIBUTES]
filePath quantificationFilePath
format registryPath datasetName
Positional Arguments¶
filePath | The path to the RNA SQLite database to create or modify |
quantificationFilePath | |
The path to the expression file. | |
format | format of the quantification input data |
registryPath | the location of the registry database |
datasetName | the name of the dataset |
Named Arguments¶
--biosampleName | |
Biosample Name | |
--sampleId | SampleId |
--patientId | PatientId |
--readGroupSetName | |
Read Group Set Name | |
--featureSetNames | |
Comma separated list | |
-n, --name | The name of the rna quantification |
-d, --description | |
The human-readable description of the RnaQuantification. | |
-t, --transcript | |
sets the quantification type to transcript | |
-A, --attributes | |
additional attributes for the message expressed as JSON |
Examples:
$ candig_repo add-rnaquantification rnaseq.db data.tsv \
kallisto candig-example-data/registry.db brca1 \
--biosampleName HG00096 --featureSetNames gencodev19 \
--readGroupSetName HG00096rna --transcript
Adds the data.tsv in kallisto format to the rnaseq.db quantification set with optional fields for associating a quantification with a Feature Set, Read Group Set, and Biosample.
add-rnaquantificationset¶
When the desired RNA quantifications have been added to the set, use this command to add the set to the registry.
Add an RNA quantification set to the data repo
usage: candig_repo add-rnaquantificationset [-h] [-R REFERENCESETNAME]
[-n NAME] [-A ATTRIBUTES]
registryPath datasetName filePath
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
filePath | The path to the converted SQLite database containing RNA data |
Named Arguments¶
-R, --referenceSetName | the name of the reference set to associate with this RnaQuantificationSet |
-n, --name | The name of the RnaQuantificationSet |
-A, --attributes | additional attributes for the message expressed as JSON |
Examples:
$ candig_repo add-rnaquantificationset registry.db brca1 rnaseq.db \
-R hg37 -n rnaseq
Adds the RNA quantification set rnaseq.db to the registry under the brca1 dataset. The flags set the reference genome to be hg37 and the name of the set to rnaseq.
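The three RNA commands above are normally run in sequence: initialize the set, add one or more quantifications, then register the set. The following is a minimal Python sketch that only assembles those command lines from the examples in this section. The helper name rna_workflow_commands and its default values are hypothetical, and nothing is executed unless you pass each list to subprocess.run on a machine where candig_repo is installed.

```python
import shlex


def rna_workflow_commands(registry, dataset, rna_db, quant_file,
                          fmt="kallisto", reference_set="hg37",
                          set_name="rnaseq"):
    """Return the three candig_repo command lines, in the order to run them."""
    return [
        # 1. Create the RNA quantification set database.
        ["candig_repo", "init-rnaquantificationset", registry, rna_db],
        # 2. Load one expression file into that database.
        ["candig_repo", "add-rnaquantification",
         rna_db, quant_file, fmt, registry, dataset],
        # 3. Register the populated set under the dataset.
        ["candig_repo", "add-rnaquantificationset",
         registry, dataset, rna_db, "-R", reference_set, "-n", set_name],
    ]


commands = rna_workflow_commands("registry.db", "brca1", "rnaseq.db", "data.tsv")
for argv in commands:
    # shlex.join renders each list as a copy-pasteable shell command.
    print(shlex.join(argv))
```

Note that the positional argument order differs between the commands: add-rnaquantification takes the RNA database and expression file first, while the other two take the registry path first.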
add-phenotypeassociationset¶
Adds an RDF object store. The Clinical Genomics Knowledge Base cancer genome database (http://nif-crawler.neuinfo.org/monarch/ttl/cgd.ttl), published by the Monarch project, is the supported format for Evidence.
Adds phenotypes in ttl format to the repo.
usage: candig_repo add-phenotypeassociationset [-h] [-n NAME] [-A ATTRIBUTES]
registryPath datasetName
dirPath
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
dirPath | The path of the directory containing ttl files. |
Named Arguments¶
-n, --name | The name of the PhenotypeAssociationSet |
-A, --attributes | additional attributes for the message expressed as JSON |
Examples:
$ candig_repo add-phenotypeassociationset registry.db dataset1 /monarch/ttl/cgd.ttl -n cgd
remove-referenceset¶
Removes a reference set from the repository. Attempting to remove a reference set that is referenced by other objects in the repository will result in an error.
Remove a reference set from the repo
usage: candig_repo remove-referenceset [-h] [-f] registryPath referenceSetName
Positional Arguments¶
registryPath | the location of the registry database |
referenceSetName | the name of the reference set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-referenceset registry.db NCBI37
Deletes the reference set with name NCBI37 from the repository represented by registry.db.
remove-ontology¶
Removes an ontology from the repository. Attempting to remove an ontology that is referenced by other objects in the repository will result in an error.
Remove an ontology from the repo
usage: candig_repo remove-ontology [-h] [-f] registryPath ontologyName
Positional Arguments¶
registryPath | the location of the registry database |
ontologyName | the name of the ontology |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-ontology registry.db so-xp
Deletes the ontology with name so-xp from the repository represented by registry.db.
remove-variantset¶
Removes a variant set from the repository. This also deletes all associated call sets and variant annotation sets from the repository.
Remove a variant set from the repo
usage: candig_repo remove-variantset [-h] [-f]
registryPath datasetName variantSetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
variantSetName | the name of the variant set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-variantset registry.db dataset1 phase3-release
Deletes the variant set named phase3-release from the dataset named dataset1 from the repository represented by registry.db.
remove-readgroupset¶
Removes a read group set from the repository.
Remove a read group set from the repo
usage: candig_repo remove-readgroupset [-h] [-f]
registryPath datasetName
readGroupSetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
readGroupSetName | the name of the read group set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-readgroupset registry.db dataset1 HG00114
Deletes the read group set named HG00114 from the dataset named dataset1 from the repository represented by registry.db.
remove-featureset¶
Removes a feature set from the repository.
Remove a feature set from the repo
usage: candig_repo remove-featureset [-h] [-f]
registryPath datasetName featureSetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
featureSetName | the name of the feature set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-featureset registry.db 1KG gencode-genes
Deletes the feature set named gencode-genes from the dataset named 1KG from the repository represented by registry.db.
remove-continuousset¶
Removes a continuous set from the repository.
Remove a continuous set from the repo
usage: candig_repo remove-continuousset [-h] [-f]
registryPath datasetName
continuousSetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
continuousSetName | the name of the continuous set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-continuousset registry.db 1KG continuous
Deletes the continuous set named continuous from the dataset named 1KG from the repository represented by registry.db.
remove-rnaquantificationset¶
Removes an RNA quantification set from the repository.
Remove an RNA quantification set from the repo
usage: candig_repo remove-rnaquantificationset [-h] [-f]
registryPath datasetName
rnaQuantificationSetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
rnaQuantificationSetName | the name of the RNA Quantification Set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-rnaquantificationset registry.db dataset1 ENCFF305LZB
Deletes the RNA quantification set named ENCFF305LZB from the dataset named dataset1 from the repository represented by registry.db.
remove-phenotypeassociationset¶
Removes an RDF object store.
Remove a phenotype association set from the repo
usage: candig_repo remove-phenotypeassociationset [-h] [-f]
registryPath datasetName
name
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
name | The name of the phenotype association set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-phenotypeassociationset registry.db dataset1 cgd
add-biosample¶
Warning
This command is deprecated and may be removed in a future release. Use the ingest command to add Sample-related information.
Adds a new biosample to the repository. The biosample argument is a JSON document according to the GA4GH JSON schema.
Add a Biosample to the dataset
usage: candig_repo add-biosample [-h]
registryPath datasetName biosampleName
biosample
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
biosampleName | the name of the biosample |
biosample | the JSON of the biosample |
Examples:
$ candig_repo add-biosample registry.db dataset1 HG00096 '{"individualId": "abc"}'
Adds the biosample named HG00096 to the repository with the individual ID “abc”.
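Because the last positional argument must be a valid JSON document, hand-written quoting is an easy place to make mistakes. One possible way to avoid them is to build the argument with json.dumps and let shlex handle the shell quoting, as in this minimal sketch. The helper name add_biosample_command is hypothetical; the same approach would apply to the individual argument of add-individual.

```python
import json
import shlex


def add_biosample_command(registry, dataset, name, fields):
    """Build the argv list; json.dumps guarantees a valid JSON document."""
    return ["candig_repo", "add-biosample", registry, dataset, name,
            json.dumps(fields)]


argv = add_biosample_command("registry.db", "dataset1", "HG00096",
                             {"individualId": "abc"})
# shlex.join adds the quoting needed to paste the command into a terminal.
print(shlex.join(argv))
```

Passing the list directly to subprocess.run avoids shell quoting altogether, since each element is delivered to the program as a single argument.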
add-individual¶
Warning
This command is deprecated and may be removed in a future release. Use the ingest command to add Patient-related information.
Adds a new individual to the repository. The individual argument is a JSON document following the GA4GH JSON schema.
Add an Individual to the dataset
usage: candig_repo add-individual [-h]
registryPath datasetName individualName
individual
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
individualName | the name of the individual |
individual | the JSON of the individual |
Examples:
$ candig_repo add-individual registry.db dataset1 HG00096 '{"description": "A description"}'
remove-biosample¶
Removes a biosample from the repository.
Remove a Biosample from the repo
usage: candig_repo remove-biosample [-h] [-f]
registryPath datasetName biosampleName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
biosampleName | the name of the biosample |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-biosample registry.db dataset1 HG00096
Deletes the biosample with name HG00096 in the dataset dataset1 from the repository represented by registry.db.
remove-individual¶
Removes an individual from the repository.
Remove an Individual from the repo
usage: candig_repo remove-individual [-h] [-f]
registryPath datasetName individualName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
individualName | the name of the individual |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-individual registry.db dataset1 HG00096
Deletes the individual with name HG00096 in the dataset dataset1 from the repository represented by registry.db.
Other commands¶
candig_server¶
There are a number of optional parameters to start up the server.
When no parameters are set, running candig_server starts the server at http://127.0.0.1:8000.
You may supply your own config file (.py), as indicated below. This config.py sets DATA_SOURCE to a custom location and DEFAULT_PAGE_SIZE to 1500, overriding the default values for both.
DATA_SOURCE = '/home/user/dev/data.db'
DEFAULT_PAGE_SIZE = 1500
usage: candig_server [-h] [--port PORT] [--host HOST] [--config CONFIG]
[--config-file CONFIG_FILE] [--tls] [--gunicorn]
[--certfile CERTFILE] [--keyfile KEYFILE]
[--dont-use-reloader] [--workers WORKERS]
[--timeout TIMEOUT] [--worker_class WORKER_CLASS]
[--epsilon EPSILON] [--version]
[--disable-urllib-warnings]
Examples:
$ candig_server --host 0.0.0.0 --port 3000 --config-file config.py
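Since the config file is ordinary Python, it can be sanity-checked before the server is started. The sketch below writes a throwaway copy of the two settings shown above, executes it with runpy, and collects the upper-case names. How candig_server itself loads the file is not described here, and treating upper-case names as the settings is an assumed convention, so this is only an illustrative pre-flight check.

```python
import runpy
import tempfile
from pathlib import Path

# Write a throwaway config file with the two settings shown above.
config_text = (
    "DATA_SOURCE = '/home/user/dev/data.db'\n"
    "DEFAULT_PAGE_SIZE = 1500\n"
)
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.py"
    config_path.write_text(config_text)
    # run_path executes the file and returns its module-level names.
    namespace = runpy.run_path(str(config_path))

# Keep only upper-case names (assumed convention for settings).
settings = {k: v for k, v in namespace.items() if k.isupper()}
print(settings)
```

A syntax error in the config file would surface here as an exception rather than at server start-up.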
add-peer¶
Adds a new peer server.
Add a peer to the registry by URL.
usage: candig_repo add-peer [-h] [-A ATTRIBUTES] registryPath url
Positional Arguments¶
registryPath | the location of the registry database |
url | The URL of the given resource |
Named Arguments¶
-A, --attributes | additional attributes for the message expressed as JSON |
Examples:
$ candig_repo add-peer registry.db https://candig.test.ca
remove-peer¶
Removes a peer server.
Warning
If you did not include a trailing path when you added the peer URL, one is added automatically. Therefore, as the examples show, if you added https://candig.test.ca, you will need to specify https://candig.test.ca/ when you remove it.
Remove a peer from the registry by URL.
usage: candig_repo remove-peer [-h] [-f] registryPath url
Positional Arguments¶
registryPath | the location of the registry database |
url | The URL of the given resource |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-peer registry.db https://candig.test.ca/
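To work out which URL to pass to remove-peer, the rule in the warning above can be approximated in a few lines of Python. This is a sketch of the documented behaviour only: the function name stored_peer_url is hypothetical, and it models just the case described (an empty path becoming /), not the registry's actual normalization code.

```python
from urllib.parse import urlsplit, urlunsplit


def stored_peer_url(url):
    """Mimic the behaviour described above: an empty path becomes "/"."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


# Both forms map to the same URL, which is the one to remove.
print(stored_peer_url("https://candig.test.ca"))
print(stored_peer_url("https://candig.test.ca/"))
```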
candig_snapshot¶
Creates a report containing clinical, pipeline, genomic, and dataset information, as well as the IDs of the patients, stored in the database.
Create a CanDIG-Server Database Snapshot Report
usage: candig_snapshot [-h] [--markdown] [--html]
[--destination /output/directory/]
database
Positional Arguments¶
database | Path to CanDIG-Server database file |
Named Arguments¶
--markdown | Generate report in markdown format |
--html | Generate report in HTML format |
--destination | Directory where the outputs will be saved |
Examples:
Warning
You must pass at least one of the following arguments to the script: --html, --markdown.
$ candig_snapshot candig-example-data/registry.db --html
$ candig_snapshot candig-example-data/registry.db --markdown
$ candig_snapshot candig-example-data/registry.db --markdown --html
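The "at least one of" requirement in the warning above can be illustrated with a small argparse sketch. This is a hypothetical re-creation of the documented interface, not the tool's own source: the function name parse_snapshot_args, the default destination, and the error message are all assumptions.

```python
import argparse


def parse_snapshot_args(argv):
    """Hypothetical re-creation of the documented interface, not the tool's code."""
    parser = argparse.ArgumentParser(prog="candig_snapshot")
    parser.add_argument("database", help="Path to CanDIG-Server database file")
    parser.add_argument("--markdown", action="store_true")
    parser.add_argument("--html", action="store_true")
    # The default destination is an assumption made for this sketch.
    parser.add_argument("--destination", default=".")
    args = parser.parse_args(argv)
    # Reject invocations that request no output format at all.
    if not (args.markdown or args.html):
        parser.error("pass at least one of --html or --markdown")
    return args


args = parse_snapshot_args(["candig-example-data/registry.db", "--html"])
print(args.html, args.markdown)
```

In this sketch, calling the parser with only the database path exits with a usage error, which is the behaviour the warning implies.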