
CanDIG Server¶
This is the documentation for version 1.4.0 of the candig-server.
The candig-server is a CanDIG project, and can be installed via pip install candig-server.
If you need help or want to report a bug, please refer to our Contribution Guideline.
Setup¶
Introduction¶
The candig-server is a CanDIG project, aiming to provide distributed, secure data access to participating institutions across Canada. You can learn more about our work from https://www.distributedgenomics.ca
It is built on the ga4gh-server with extended functionality, including native support for peer-to-peer communication, clinical and pipeline metadata support, and an upgraded codebase written in Python 3.
Prepare Data For Ingestion¶
Warning
This part is a work in progress.
The candig-server has a general guideline on which data are accepted.
The following sections describe how to prepare data that is accepted by the candig-server, and provide some mock data for testing.
Clinical and Pipeline Metadata¶
File format required for ingestion¶
At this time, the ingest command is the only way to ingest any clinical and pipeline metadata; detailed instructions for ingestion can be found under Command Line Interface.
However, before you run the ingest command, you need to prepare a json file that conforms to the ingest standard format.
For clinical data, the file is a json object whose key is metadata. For pipeline data, the key is pipeline_metadata instead. Make sure you have the correct key specified.
The value of the key is a list of objects. Each object should have the table name as its key, and the record object(s) as its value. It is therefore possible to specify multiple tables in one single object; however, each table can only be specified once, due to the uniqueness of keys within a json object.
As of candig-ingest==1.5.0, if you need to specify, for example, two samples for one patient, you can specify both samples in a single list and make this list the value of the Sample key, as shown below. For all clinical data objects, you always need to specify patientId.
Warning
Please do not include Tier information yourself. Use the load_tiers utility that comes with candig-ingest to load tiers; more details follow.
The following examples only work with candig-ingest>=1.5.0.
{
    "metadata": [
        {
            "Patient": {
                "patientId": "Patient_12345",
                "patientIdTier": 0
            },
            "Sample": [
                {
                    "sampleId": "Sample_1",
                    "sampleIdTier": 0,
                    "patientId": "Patient_12345",
                    "patientIdTier": 4
                },
                {
                    "sampleId": "Sample_2",
                    "sampleIdTier": 0,
                    "patientId": "Patient_12345",
                    "patientIdTier": 4
                }
            ]
        }
    ]
}
Warning
In candig-ingest<=1.4.0, it was recommended that you specify the second sample as an independent object in the list, as shown below. Do not use this format; it is obsolete.
{
    "metadata": [
        {
            "Patient": {
                "patientId": "Patient_12345",
                "patientIdTier": 0
            },
            "Sample": {
                "sampleId": "Sample_1",
                "sampleIdTier": 0,
                "patientId": "Patient_12345",
                "patientIdTier": 4
            }
        },
        {
            "Sample": {
                "sampleId": "Sample_2",
                "sampleIdTier": 0,
                "patientId": "Patient_12345",
                "patientIdTier": 4
            }
        }
    ]
}
A similar structure is used for pipeline metadata; however, for all pipeline metadata objects, you should always include sampleId.
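For illustration, a pipeline metadata file might look like the sketch below. The Extraction table and its fields are an assumed example, not a definitive schema; consult the guideline on accepted data for the exact table and field names.
{
    "pipeline_metadata": [
        {
            "Extraction": {
                "sampleId": "Sample_1",
                "sampleIdTier": 0,
                "extractionId": "Extraction_1",
                "extractionIdTier": 0
            }
        }
    ]
}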
Specify unique identifiers of the object¶
For Patient and Sample records, the unique identifiers are patientId and sampleId, respectively.
For all other clinical records, you have the option to specify localId as their unique identifier.
For example, if you were to ingest a Diagnosis object, you may write:
{
    "metadata": [
        {
            "Patient": {
                "patientId": "Patient_12345",
                "patientIdTier": 0
            },
            "Diagnosis": {
                "localId": "diag_1",
                "sampleType": "metastatic",
                "sampleTypeTier": 2
            }
        }
    ]
}
So, what happens if you do not specify a localId? The ingest command will attempt to construct a unique identifier based on several pre-selected fields, which vary from table to table; they are listed in the following table.
If you specify a localId, it takes precedence, regardless of whether these pre-selected fields are populated.
If you do not specify a localId and the ingest utility is not able to generate an identifier based on these fields, the ingest will fail.
Therefore, we recommend that you pre-populate the localId for clinical records.
Table | Pre-selected fields
---|---
Enrollment | patientId, enrollmentApprovalDate
Consent | patientId, consentDate
Treatment | patientId, startDate
Outcome | patientId, dateOfAssessment
Complication | patientId, date
Tumourboard | patientId, dateOfMolecularTumorBoard
Chemotherapy | patientId, treatmentPlanId, systematicTherapyAgentName
Radiotherapy | patientId, courseNumber, treatmentPlanId, startDate
Immunotherapy | patientId, treatmentPlanId, startDate
Surgery | patientId, treatmentPlanId, startDate
Celltransplant | patientId, treatmentPlanId, startDate
Slide | patientId, slideId
Study | patientId, startDate
Labtest | patientId, startDate
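For example, an Enrollment record ingested without a localId might look like the sketch below (the tier values are placeholders). Judging from the table above, its unique name would presumably be derived from patientId and enrollmentApprovalDate; that derived name is what you would later pass to commands such as remove-enrollment.
{
    "metadata": [
        {
            "Patient": {
                "patientId": "Patient_12345",
                "patientIdTier": 0
            },
            "Enrollment": {
                "patientId": "Patient_12345",
                "patientIdTier": 4,
                "enrollmentApprovalDate": "2005-08-23",
                "enrollmentApprovalDateTier": 2
            }
        }
    ]
}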
How to load tiers¶
If you have a valid json file that is missing tier information, you should use the load_tiers utility provided by candig-ingest to load the tier information.
The load_tiers command is the preferred way to load tier information. It does not come with candig-server by default; to use it, you need to install candig-ingest by running:
pip install candig-ingest
To tier the data, you need to run
usage: load_tiers <project_name> <json_filepath> <tier_filepath> <output_filepath>
Examples:
$ load_tiers pog mock.json tier.tsv mock_tier.json
Mock data for testing¶
We provide some mock data files, should you want to use them to quickly test your server.
Please note that the mock data listed below only contain data for clinical and pipeline metadata.
They are available from https://github.com/CanDIG/candig-ingest/tree/master/candig/ingest/mock_data
Use the clinical_metadata_tier[1,2,3].json files.
Note
If you are interested in testing the 1k genome data, you can find an ingest-compliant clinical mock dataset at https://github.com/CanDIG/candig-ingest/releases/tag/v1.3.1
It contains information for all individuals.
Migrate data from RedCap Cloud¶
If your clinical metadata is on RedCap Cloud, we provide a script that transforms the relevant data into a ready-to-ingest format.
It is available from here: https://github.com/CanDIG/redcap-cloud
Data Use Ontology¶
To enable future automated discovery, we have adopted the use of Data Use Ontology (DUO) Terms to describe our datasets. For the current version of candig-server, you have the option to use a json file to define your dataset.
You can find a list of DUO Terms in this csv file. You can also use an ontology parsing tool of your choice to visualize or parse DUO's raw OWL definition for a more complete list.
Warning
We only support a limited subset of the DUO terms.
Terms whose IDs are between DUO:0000031 and DUO:0000039, as well as DUO:0000022 and DUO:0000025, are not supported, as we expect these terms to be updated in the near future.
If you think an ID should be supported but is not, you can let us know by opening an issue on our GitHub repo.
The supported IDs are listed below.
[
    "DUO:0000001", "DUO:0000004", "DUO:0000005",
    "DUO:0000006", "DUO:0000007", "DUO:0000011", "DUO:0000012", "DUO:0000014",
    "DUO:0000015", "DUO:0000016", "DUO:0000017", "DUO:0000018", "DUO:0000019",
    "DUO:0000020", "DUO:0000021", "DUO:0000024", "DUO:0000026", "DUO:0000027",
    "DUO:0000028", "DUO:0000029", "DUO:0000042"
]
Note
We do not currently provide an API to look up the definitions by their ID.
If one of the supported IDs listed above is not in the csv file provided above, you may be able to look up its definition on EBI's DUO page.
To ingest the DUO Terms, you need to prepare a json file like the one below. You should only specify id in a DUO object, unless the modifier is also required, in which case you specify the id along with the modifier.
{
    "duo": [
        {
            "id": "DUO:0000021"
        },
        {
            "id": "DUO:0000024",
            "modifier": "2020-01-01"
        }
    ]
}
Warning
For now, DUO:0000024 is the only DUO Term that requires a modifier. The modifier must be formatted exactly as YYYY-MM-DD; invalid dates will be rejected.
When your file is ready, run the add-dataset-duo command to populate the DUO information of the dataset, as shown below. Please note that this command will always overwrite any existing DUO information stored.
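For example, assuming your DUO file is saved as duo.json and your dataset is named mock1 (see add-dataset-duo under Command Line Interface for details):
$ candig_repo add-dataset-duo registry.db mock1 duo.json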
Reads, Variants and References Data¶
If you are interested in testing the candig-server with some variants data, we provide a mock dataset here: https://github.com/CanDIG/test_data/releases
Currently, three groups of test data are provided, containing clinical and pipeline metadata as well as variant data. We have provided a loading script; note that you might need to modify the DB path or the dataset name.
We provide three groups of data because we often use them to test federation across three test servers.
Importing a sample FASTA:
- Download http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz
- Run gunzip hg19.fa.gz to unzip
- Import by running candig_repo add-referenceset candig-example-data/registry.db /path/to/hg19a.fa -d "hg19a reference genome" --name hg19a
Importing sample VCFs:
- To work with a certain group, download the tar file and load script.
- Assumptions:
  - the referenceSet is hg19a (this will depend on your data)
- Run tar xvf group1.tar to unarchive
- In group1_load.sh:
  - rename all instances of GRCh37-lite to hg19a (again, this will depend on your data)
  - give the path to registry.db on your file system
  - give the path to all group1/*.tbi files
- Run chmod +x group1_load.sh to make the script executable
- Run ingest registry.db test300 group1_clinphen.json to create a new test300 dataset using the metadata
- Run ./group1_load.sh to run the import (a sketch of what such a script might contain follows)
Development Setup¶
Thanks for your interest in the candig-server. If you have any questions, please let us know.
Warning
This guide is a work in progress, and is incomplete.
Standalone candig-server Setup¶
You will need Python 3.6.x to successfully run the candig-server.
First, create a virtualenv with Python 3.6.x, and run:
git clone https://github.com/CanDIG/candig-server.git
pip install -r dev-requirements.txt
The default DB location for the development server is candig-example-data/registry.db.
Assuming your local python3 is Python 3.6, with sqlite3 and the other required modules built, you can run the following script to set up a dev server in a few minutes. Even better, it will be populated with two datasets of mock data.
The script has been tested on CentOS machines, and we expect the install to run fine on most Unix distributions. You can run it on macOS, but wget likely won't work; either download the files yourself, or use a replacement for wget (e.g., curl -O).
See Command Line Interface for more information regarding ingesting data and starting up server.
Warning
Please note that by default, the script installs the server in the test_server directory. You can either change this, or make sure that you do not already have a test_server directory.
# Wipe previous virtualenv
if [ -d test_server ];
then rm -r test_server;
fi
# Server 1
echo "setting up test_server"
python3 -m venv test_server
cd test_server
source bin/activate
pip install -U pip
pip install -U setuptools
pip install candig-server
pip install candig-ingest==1.3.1
mkdir candig-example-data
wget https://raw.githubusercontent.com/CanDIG/candig-ingest/master/candig/ingest/mock_data/clinical_metadata_tier1.json
wget https://raw.githubusercontent.com/CanDIG/candig-ingest/master/candig/ingest/mock_data/clinical_metadata_tier2.json
ingest candig-example-data/registry.db mock1 clinical_metadata_tier1.json
ingest candig-example-data/registry.db mock2 clinical_metadata_tier2.json
echo "test server set up has completed."
candig_server --host 0.0.0.0 --port 3000
Optionally, you can run two servers with federation set up. Assuming you have server A running at 0.0.0.0:3000, you need to run the script again to set up a server B.
Use this file as the clinical data for your server B, and run your server B at 0.0.0.0:3001.
Now that you have both servers installed, you need to make them peers of each other.
For server A, you need to run
candig_repo add-peer candig-example-data/registry.db http://0.0.0.0:3001
For server B, you need to run
candig_repo add-peer candig-example-data/registry.db http://0.0.0.0:3000
You do not need to have anything running on the peer when you execute the add-peer command. It simply registers that URL as a peer.
Now, you will get federated responses from both servers A and B. You can certainly choose to run them on different ports or different machines; the script makes these assumptions only for your convenience.
Tyk and Keycloak Setup¶
It is possible to set up a local test server with Tyk and Keycloak. These two components provide authentication capabilities.
For more information, refer to https://github.com/CanDIG/candig_tyk.
Production Deployment¶
The candig-server has a configuration file that allows Flask and application-specific configuration values to be set.
Configuration file¶
The candig-server is a Flask application and uses the standard Flask configuration file mechanisms. Many configuration files will be very simple, and will consist of just one directive instructing the server where to find the data repository; for example, we might have:
DATA_SOURCE = "/path/to/registry.db"
For production deployments, we shouldn't need to add much more configuration than this, as the other keys have sensible defaults. However, all of Flask's builtin configuration values are supported, as well as the extra custom configuration values documented here; a combined sketch follows. For information on ingesting data, see Command Line Interface.
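For illustration, a fuller configuration might combine several of the values documented under Configuration Values below (the paths shown are placeholders):
DATA_SOURCE = "/srv/candig/registry.db"
DEFAULT_PAGE_SIZE = 1000
MAX_RESPONSE_LENGTH = 1024 * 1024
REQUEST_VALIDATION = True
ACCESS_LIST = "/srv/candig/access_list.txt"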
When debugging deployment issues, it may be useful to turn on extra debugging information as follows:
DEBUG = True
Warning
Debugging should only be used temporarily and not left on by default. Running the server with Flask debugging enabled is insecure and should never be done in a production environment.
Configuration Values¶
- DEFAULT_PAGE_SIZE
The default maximum number of values to fill into a page when responding to search queries. If a client does not specify a page size in a query, this value is used.
Depending on your audience or your dataset, you could choose to set a number between 1000 and 4000. If you are part of a federated network, it is best to agree on a single number and use it across the federated network.
- MAX_RESPONSE_LENGTH
The approximate maximum size of the server buffer used when creating responses. This is somewhat smaller than the size of the JSON response returned to the client. When a client makes a search request with a given page size, the server will process this query and incrementally build a response until (a) the number of values in the page list is equal to the page size; (b) the size of the internal buffer in bytes is >= MAX_RESPONSE_LENGTH; or (c) there are no more results left in the query.
The default is 1024 * 1024, which is equivalent to roughly 1 MB. You can make this bigger, though we don’t recommend setting anything too big, e.g., bigger than 10 MB.
- REQUEST_VALIDATION
Set this to True to strictly validate all incoming requests to ensure that they conform to the protocol. This may result in clients with poor standards compliance receiving errors rather than the expected results.
This defaults to True. Don’t change this unless you have a particularly good reason to.
Docker Deployment¶
Refer to this page for Docker installation instructions.
Access List Setup¶
An example access list file is shown below. It is a tab-separated file. The server's default name for the file is access_list.txt (not .tsv) for backward compatibility reasons.
You can place your own access_list file anywhere you like, but you will need to specify its location via ACCESS_LIST. In a production environment, you should place the file somewhere secure.
Warning
As of candig-server==1.2.1, it is recommended that you use the letter X to indicate that a user has no access to a dataset, instead of an empty space or a null character.
issuer username project1 project2 project3 projectN
https://candigauth.bcgsc.ca/auth/realms/candig userA 4 4 4 4
https://candigauth.bcgsc.ca/auth/realms/candig userB 4 X 0 1
https://candigauth.uhnresearch.ca/auth/realms/CanDIG userC 4 3 2 1
https://candigauth.uhnresearch.ca/auth/realms/CanDIG userD X X 4 4
Command Line Interface¶
With the exception of using ingest to ingest the clinical and pipeline metadata in bulk, the candig_repo command line interface is used for all other operations.
The registry contains links to files, as well as some metadata.
Warning
Because the data model objects returned via the APIs are created at server start-up, you currently have to restart the server for newly ingested data to be reflected.
For instructions on adding metadata in bulk, see ingest.
When you are done ingesting data, you may start up your server instance by running the candig_server command; see Other commands for more information.
Initialize/Remove Dataset¶
This section contains commands that initialize the dataset, give you an overview of the data repository, and delete the dataset.
You do not need to use init to initialize the dataset if you have already prepared a json file of clinical information. You can run the ingest command directly and it will take care of everything for you.
init¶
Warning
If you have already prepared a json file that conforms to our standard clinical or pipeline metadata, you can run the ingest command directly without running init.
For detailed instructions, see ingest.
The init command initialises a new registry DB at a given file path. Unless you have a clinical json file ready that can be ingested with ingest, you need to run this to initialize your DB.
Initialize a data repository
usage: candig_repo init [-h] [-f] registryPath
Positional Arguments¶
registryPath | the location of the registry database |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo init registry.db
list¶
The list command is used to print the contents of a repository to the screen. It is an essential tool for administrators to understand the structure of the repository that they are managing.
Note
The list command is under development and will be much more sophisticated in the future. In particular, the output of this command should improve considerably in the near future.
List the contents of the repo
usage: candig_repo list [-h] registryPath
Positional Arguments¶
registryPath | the location of the registry database |
Examples:
$ candig_repo list registry.db
verify¶
The verify command is used to check the integrity of the data in a repository. The command checks each container object in turn and ensures that it can read data from it. Read errors can occur for any number of reasons (for example, a VCF file may have been moved to another location since it was added to the registry), and the verify command allows an administrator to check that all is well in their repository.
Note
The verify command is currently under review.
Verifies the repository by examining all data files
usage: candig_repo verify [-h] registryPath
Positional Arguments¶
registryPath | the location of the registry database |
Examples:
$ candig_repo verify registry.db
add-dataset¶
Creates a new dataset in a repository. A dataset is an arbitrary collection of ReadGroupSets, VariantSets, VariantAnnotationSets and FeatureSets. Each dataset has a name, which is used to identify it in the repository manager.
Warning
If you have already prepared a json file that conforms to our standard clinical or pipeline metadata, you can run the ingest command directly without running add-dataset.
For detailed instructions, see ingest.
Add a dataset to the data repo
usage: candig_repo add-dataset [-h] [-A ATTRIBUTES] [-d DESCRIPTION]
registryPath datasetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
Named Arguments¶
-A, --attributes | |
additional attributes for the message expressed as JSON | |
-d, --description | |
The human-readable description of the dataset. |
Examples:
$ candig_repo add-dataset registry.db 1kg -d 'Example dataset using 1000 genomes data'
Adds the dataset with the name 1kg and description 'Example dataset using 1000 genomes data' to the registry database registry.db.
add-dataset-duo¶
Creates or updates Data Use Ontology information for an existing dataset. Note that you must have an existing dataset to use this command. When you need to update the DUO info, simply run the command again with an updated DUO json file.
Add DUO info to a dataset
usage: candig_repo add-dataset-duo [-h]
registryPath datasetName
dataUseOntologyFile
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
dataUseOntologyFile | |
Path to your duo config json file. |
Examples:
$ candig_repo add-dataset-duo registry.db mock1 duo.json
Adds the Data Use Ontology info to the dataset with the name mock1.
To learn how to prepare a json file that contains DUO info for a dataset, and for the list of DUO IDs that are allowed, see the Data Use Ontology section under Prepare Data For Ingestion.
remove-dataset¶
Removes a dataset from the repository and recursively removes all objects (ReadGroupSets, VariantSets, etc) within this dataset.
Remove a dataset from the data repo
usage: candig_repo remove-dataset [-h] [-f] registryPath datasetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-dataset registry.db dataset1
Deletes the dataset with name dataset1 from the repository represented by registry.db.
remove-dataset-duo¶
Removes the Data Use Ontology information from an existing dataset.
Remove DUO info from a dataset
usage: candig_repo remove-dataset-duo [-h] [-f] registryPath datasetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-dataset-duo registry.db mock1
Removes the Data Use Ontology info from the dataset with the name mock1.
Add/Remove Clinical & Pipeline Metadata¶
This section contains commands that let you ingest data into the clinical and pipeline metadata tables, as well as the commands that delete them.
The ingest command is the only way to ingest clinical or pipeline data in bulk. It encapsulates all the write operations into a single transaction. To learn about preparing the json files for the ingest command, see Prepare Data For Ingestion.
All of the remove commands for removing clinical tables require you to specify their name; note that the name here is actually their unique identifier, typically composed of their patientId, sometimes along with some other ID or timestamp information. This is the same name you see in these clinical or pipeline data records.
ingest¶
The ingest command is the preferred way to import metadata in bulk. It does not come with candig-server by default; to use it, you need to install candig-ingest by running:
pip install candig-ingest
To import metadata in bulk, you need to have a specially formatted json file. A mock json file is available from https://github.com/CanDIG/candig-ingest/blob/master/candig/ingest/mock_data/clinical_metadata_tier1.json
To ingest the data, you need to run
usage: ingest registryPath datasetName metadataPath
If the dataset does not exist, ingest will create a new dataset of this name. There is no need to run the init command before running ingest.
Examples:
$ ingest registry.db mock1 mock_data.json
remove-patient¶
Removes a patient.
Remove a Patient from the repo
usage: candig_repo remove-patient [-h] [-f]
registryPath datasetName patientName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
patientName | the name of the patient |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-patient registry.db mock1 PATIENT_81202
remove-enrollment¶
Removes an enrollment.
Remove an Enrollment from the repo
usage: candig_repo remove-enrollment [-h] [-f]
registryPath datasetName enrollmentName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
enrollmentName | the name of the enrollment |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-enrollment registry.db mock1 PATIENT_81202_2005-08-23
remove-treatment¶
Removes a treatment.
Remove a Treatment from the repo
usage: candig_repo remove-treatment [-h] [-f]
registryPath datasetName treatmentName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
treatmentName | the name of the treatment |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-treatment registry.db mock1 PATIENT_81202_2005-08-23
remove-sample¶
Removes a sample.
Remove a Sample from the repo
usage: candig_repo remove-sample [-h] [-f] registryPath datasetName sampleName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
sampleName | the name of the sample |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-sample registry.db mock1 PATIENT_81202_SAMPLE_33409
remove-diagnosis¶
Removes a diagnosis.
Remove a Diagnosis from the repo
usage: candig_repo remove-diagnosis [-h] [-f]
registryPath datasetName diagnosisName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
diagnosisName | the name of the diagnosis |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-diagnosis registry.db mock1 PATIENT_81202_SAMPLE_33409
remove-tumourboard¶
Removes a tumourboard.
Remove a Tumourboard from the repo
usage: candig_repo remove-tumourboard [-h] [-f]
registryPath datasetName tumourboardName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
tumourboardName | |
the name of the tumourboard |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-tumourboard registry.db mock1 PATIENT_81202_SAMPLE_33409
remove-outcome¶
Removes an outcome.
Remove an Outcome from the repo
usage: candig_repo remove-outcome [-h] [-f]
registryPath datasetName outcomeName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
outcomeName | the name of the outcome |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-outcome registry.db mock1 PATIENT_81202_2016-10-11
remove-complication¶
Removes a complication.
Remove a Complication from the repo
usage: candig_repo remove-complication [-h] [-f]
registryPath datasetName
complicationName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
complicationName | |
the name of the complication |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-complication registry.db mock1 PATIENT_81202_2016-10-11
remove-consent¶
Removes a consent.
Remove a Consent from the repo
usage: candig_repo remove-consent [-h] [-f]
registryPath datasetName consentName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
consentName | the name of the consent |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-consent registry.db mock1 PATIENT_81202_2016-10-11
remove-chemotherapy¶
Removes a chemotherapy.
Remove a Chemotherapy from the repo
usage: candig_repo remove-chemotherapy [-h] [-f]
registryPath datasetName
chemotherapyName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
chemotherapyName | |
the name of the chemotherapy |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-chemotherapy registry.db mock1 PATIENT_81202_2016-10-11
remove-immunotherapy¶
Removes an immunotherapy.
Remove an Immunotherapy from the repo
usage: candig_repo remove-immunotherapy [-h] [-f]
registryPath datasetName
immunotherapyName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
immunotherapyName | |
the name of the immunotherapy |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-immunotherapy registry.db mock1 PATIENT_81202_2016-10-11
remove-radiotherapy¶
Removes a radiotherapy.
Remove a Radiotherapy from the repo
usage: candig_repo remove-radiotherapy [-h] [-f]
registryPath datasetName
radiotherapyName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
radiotherapyName | |
the name of the radiotherapy |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-radiotherapy registry.db mock1 PATIENT_81202_2016-10-11
remove-celltransplant¶
Removes a celltransplant.
Remove a Celltransplant from the repo
usage: candig_repo remove-celltransplant [-h] [-f]
registryPath datasetName
celltransplantName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
celltransplantName | |
the name of the celltransplant |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-celltransplant registry.db mock1 PATIENT_81202_2016-10-11
remove-surgery¶
Removes a surgery.
Remove a Surgery from the repo
usage: candig_repo remove-surgery [-h] [-f]
registryPath datasetName surgeryName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
surgeryName | the name of the surgery |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-surgery registry.db mock1 PATIENT_81202_2016-10-11
remove-study¶
Removes a study.
Remove a Study from the repo
usage: candig_repo remove-study [-h] [-f] registryPath datasetName studyName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
studyName | the name of the study |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-study registry.db mock1 PATIENT_81202_2016-10-11
remove-slide¶
Removes a slide.
Remove a Slide from the repo
usage: candig_repo remove-slide [-h] [-f] registryPath datasetName slideName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
slideName | the name of the slide |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-slide registry.db mock1 PATIENT_81202_2016-10-11
remove-labtest¶
Removes a labtest.
Remove a Labtest from the repo
usage: candig_repo remove-labtest [-h] [-f]
registryPath datasetName labtestName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
labtestName | the name of the labtest |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-labtest registry.db mock1 PATIENT_81202_2016-10-11
Add/Remove Genomics Data¶
add-referenceset¶
Adds a reference set derived from a FASTA file to a repository. Each record in the FASTA file will correspond to a Reference in the new ReferenceSet. The input FASTA file must be compressed with bgzip and indexed using samtools faidx. Each ReferenceSet contains a number of metadata values (e.g., species) which can be set using command line options.
You may ingest raw FASTA (.fa) files.
If you ingest compressed FASTA files, you should download both the .gz.fai and .gz.gzi files, along with the .fa.gz file. If the .gz.fai file is not available, the server will attempt to generate it at ingestion.
Example FASTA files are available from:
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.fai
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.gzi
Add a reference set to the data repo
usage: candig_repo add-referenceset [-h] [-A ATTRIBUTES] [-r] [-n NAME]
[-d DESCRIPTION] [--species SPECIES]
[--isDerived ISDERIVED]
[--assemblyId ASSEMBLYID]
[--sourceAccessions SOURCEACCESSIONS]
[--sourceUri SOURCEURI]
registryPath filePath
Positional Arguments¶
registryPath | the location of the registry database |
filePath | The path of the FASTA file to use as a reference set. This file must be bgzipped and indexed. |
Named Arguments¶
-A, --attributes | |
additional attributes for the message expressed as JSON | |
-r, --relativePath | |
store relative path in database | |
-n, --name | The name of the reference set |
-d, --description | |
The human-readable description of the reference set. | |
--species | The species ontology term as a JSON string |
--isDerived | Indicates if this reference set is derived from another |
--assemblyId | The assembly id |
--sourceAccessions | |
The source accessions (pass as comma-separated list) | |
--sourceUri | The source URI |
Examples:
$ candig_repo add-referenceset registry.db hs37d5.fa.gz \
--description "NCBI37 assembly of the human genome" \
--species '{"termId": "NCBI:9606", "term": "Homo sapiens"}' \
--name NCBI37 \
--sourceUri ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
Adds a reference set used in the 1000 Genomes project using the name NCBI37, also setting the species to 9606 (human).
add-ontology¶
Warning
This command, as well as all ontology-related operations are under review. They might undergo changes in the near future.
Adds a new ontology to the repository. The ontology supplied must be a text file in OBO format. If you wish to serve sequence or variant annotations from a repository, a sequence ontology (SO) instance is required to translate ontology term names held in annotations to ontology IDs. Sequence ontology definitions can be downloaded from the Sequence Ontology site.
Adds an ontology in OBO format to the repo. Currently, a sequence ontology (SO) instance is required to translate ontology term names held in annotations to ontology IDs. Sequence ontology files can be found at https://github.com/The-Sequence-Ontology/SO-Ontologies
usage: candig_repo add-ontology [-h] [-r] [-n NAME] registryPath filePath
Positional Arguments¶
registryPath | the location of the registry database |
filePath | The path of the OBO file defining this ontology. |
Named Arguments¶
-r, --relativePath | |
store relative path in database | |
-n, --name | The name of the ontology |
Examples:
$ candig_repo add-ontology registry.db path/to/so-xp.obo
Adds the sequence ontology so-xp.obo to the repository using the default naming rules.
add-variantset¶
Adds a variant set to a named dataset in a repository. Variant sets are currently derived from one or more non-overlapping VCF/BCF files which may be either stored locally or come from a remote URL. Multiple VCF files can be specified either directly on the command line or by providing a single directory argument that contains indexed VCF files. If remote URLs are used then index files in the local file system must be provided using the -I option.
Note: Starting from 0.9.3, you need to specify a patientId and a sampleId. The server does not validate either, so please double-check to make sure the IDs are correct.
Add a variant set to the data repo based on one or more VCF files.
usage: candig_repo add-variantset [-h] [-r] [-I indexFiles [indexFiles ...]]
[-n NAME] [-R REFERENCESETNAME]
[-O ONTOLOGYNAME] [-A ATTRIBUTES] [-a]
registryPath datasetName patientId sampleId
dataFiles [dataFiles ...]
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
patientId | the ID of the patient |
sampleId | the ID of the sample |
dataFiles | The VCF/BCF files representing the new VariantSet. These may be specified either as one or more paths to local files or remote URLs, or as a path to a local directory containing VCF files. Either a single directory argument may be passed or a list of file paths/URLs, but not a mixture of directories and paths. |
Named Arguments¶
-r, --relativePath | |
store relative path in database | |
-I, --indexFiles | |
The index files for the VCF/BCF files provided in the dataFiles argument. These must be provided in the same order as the data files. | |
-n, --name | The name of the VariantSet |
-R, --referenceSetName | |
the name of the reference set to associate with this VariantSet | |
-O, --ontologyName | |
the name of the sequence ontology instance used to translate ontology term names to IDs in this VariantSet | |
-A, --attributes | |
additional attributes for the message expressed as JSON | |
-a, --addAnnotationSets | |
If the supplied VCF file contains annotations, create the corresponding VariantAnnotationSet. |
Examples:
$ candig_repo add-variantset registry.db 1kg PATIENT_123 SAMPLE_123 1kgPhase1/ -R NCBI37
Adds a new variant set to the dataset named 1kg in the repository defined by the registry database registry.db, using the VCF files contained in the directory 1kgPhase1 that belong to PATIENT_123 and SAMPLE_123. Note that this directory must also contain the corresponding indexes for these files. We associate the reference set named NCBI37 with this new variant set. Because we do not provide a --name argument, a name is automatically generated using the default name generation rules.
$ candig_repo add-variantset registry.db 1kg PATIENT_123 SAMPLE_123 \
1kgPhase1/chr1.vcf.gz -n phase1-subset -R NCBI37
Like the last example, we add a new variant set to the dataset 1kg, with one VCF and the corresponding patientId and sampleId. We also specify the name for this new variant set to be phase1-subset.
$ candig_repo add-variantset registry.db 1kg PATIENT_123 SAMPLE_123 \
--name phase1-subset-remote -R NCBI37 \
--indexFiles ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi ALL.chr2.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi \
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz \
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/release/20110521/ALL.chr2.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz
This example performs the same task of creating a subset of the phase1 VCFs, but this time we use the remote URLs directly and do not keep local copies of the VCF files. Because we are using remote URLs to define the variant set, we have to download a local copy of the corresponding index files and provide them on the command line using the --indexFiles option.
add-readgroupset¶
Adds a readgroup set to a named dataset in a repository. Readgroup sets are currently derived from a single indexed BAM file, which can be either stored locally or based on a remote URL. If the readgroup set is based on a remote URL, then the index file must be stored locally and specified using the --indexFile option.
Each readgroup set must be associated with the reference set that it is aligned to. The add-readgroupset command first examines the headers of the BAM file to see if it contains information about references, and then looks for a reference set with name equal to the genome assembly identifier defined in the header. (Specifically, we read the @SQ header line and use the value of the AS tag as the default reference set name.) If this reference set exists, then the readgroup set will be associated with it automatically. If it does not (or we cannot find the appropriate information in the header), then the add-readgroupset command will fail. In this case, the user must provide the name of the reference set using the --referenceSetName option.
Add a read group set to the data repo
usage: candig_repo add-readgroupset [-h] [-n NAME] [-R REFERENCESETNAME]
[-A ATTRIBUTES] [-r] [-I INDEXFILE]
registryPath datasetName patientId
sampleId dataFile
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
patientId | the ID of the patient |
sampleId | the ID of the sample |
dataFile | The file path or URL of the BAM file for this ReadGroupSet |
Named Arguments¶
-n, --name | The name of the ReadGroupSet |
-R, --referenceSetName | |
the name of the reference set to associate with this ReadGroupSet | |
-A, --attributes | |
additional attributes for the message expressed as JSON | |
-r, --relativePath | |
store relative path in database | |
-I, --indexFile | |
The file path of the BAM index for this ReadGroupSet. If the dataFile argument is a local file, this will be automatically inferred by appending ‘.bai’ to the file name. If the dataFile is a remote URL the path to a local file containing the BAM index must be provided |
Examples:
$ candig_repo add-readgroupset registry.db 1kg PATIENT_123 SAMPLE_123 \
path/to/HG00114.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
Adds a new readgroup set for an indexed 1000 Genomes BAM file stored on the local file system. The index file follows the usual convention and is stored in the same directory as the BAM file, with an extra .bai extension. The name of the readgroup set is automatically derived from the file name, and the reference set is automatically set from the BAM header.
$ candig_repo add-readgroupset registry.db 1kg PATIENT_123 SAMPLE_123 candig-example-data/HG00096.bam \
-R GRCh37-subset -n HG00096-subset
Adds a new readgroup set based on a subset of the 1000 Genomes reads for the HG00096 sample from the example data used in the reference server. In this case we specify that the reference set name GRCh37-subset be associated with the readgroup set. We also override the default name generation rules and specify the name HG00096-subset for the new readgroup set.
$ candig_repo add-readgroupset registry.db 1kg PATIENT_123 SAMPLE_123 \
-n HG00114-remote \
-I /path/to/HG00114.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase3/data/HG00114/alignment/HG00114.chrom11.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
Adds a new readgroup set based on a 1000 Genomes BAM served directly from the NCBI FTP server. Because this readgroup set uses a remote FTP URL, we must specify the location of the .bai index file on the local file system.
add-featureset¶
Warning
You may retrieve the latest version of gencode from https://www.gencodegenes.org/human/; you can usually download the GFF3 file from the first row: Comprehensive gene annotation.
Once you retrieve the GFF3 file, unzip it, then use a conversion script to convert the GFF3 file to a SQLite-compatible DB. The script is available from https://github.com/CanDIG/candig-server/blob/develop/scripts/generate_gff3_db.py.
The script, by default, will create composite indexes on (start, end, referenceName) and (geneName, type). This should suffice for most use-cases.
If you are using the script mentioned above, you can ignore the following two paragraphs.
Before you add the feature set, you should make sure to index some of the columns in your generated DB. Specifically, you should make sure that both gene_name and type are indexed. If you don't, queries to this endpoint, and endpoints that depend on it, e.g., variants/gene/search, will be very slow.
To create a composite index on the aforementioned fields, open the featureset DB you generated via the sqlite browser, then run:
CREATE INDEX name_type_index ON FEATURE (gene_name, type);
You should carefully review your use-case and index other fields accordingly; a positional index example follows.
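If you need the positional composite index mentioned above and the conversion script did not create it for you, it can presumably be built the same way (the index name here is arbitrary):
CREATE INDEX position_index ON FEATURE (start, end, referenceName);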
Adds a feature set to a named dataset in a repository. Feature sets must be in a '.db' file. An appropriate '.db' file can be generated from a GFF3 file using scripts/generate_gff3_db.py.
Add a feature set to the data repo
usage: candig_repo add-featureset [-h] [-A ATTRIBUTES] [-r]
[-R REFERENCESETNAME] [-O ONTOLOGYNAME]
[-C CLASSNAME]
registryPath datasetName filePath
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
filePath | The path to the converted SQLite database containing Feature data |
Named Arguments¶
-A, --attributes | |
additional attributes for the message expressed as JSON | |
-r, --relativePath | |
store relative path in database | |
-R, --referenceSetName | |
the name of the reference set to associate with this feature set | |
-O, --ontologyName | |
the name of the sequence ontology instance used to translate ontology term names to IDs in this feature set | |
-C, --className | |
the name of the class used to fetch features in this feature set |
Examples:
$ candig_repo add-featureset registry.db 1KG gencode.db \
-R hg37 -O so-xp-simple
Adds the feature set gencode to the registry under the 1KG dataset. The flags set the reference genome to hg37 and the ontology to so-xp-simple.
add-continuousset¶
Adds a continuous set to a named dataset in a repository. Continuous sets must be in a bigWig file. The bigWig format is described at http://genome.ucsc.edu/goldenPath/help/bigWig.html, which also has directions for converting wiggle files to bigWig files. Files in the bedGraph format can be converted using bedGraphToBigWig (https://www.encodeproject.org/software/bedgraphtobigwig/), as shown below.
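For reference, the usual bedGraphToBigWig invocation looks like the following, where the file names are placeholders and chrom.sizes lists the chromosome lengths of your assembly:
$ bedGraphToBigWig in.bedGraph chrom.sizes out.bw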
Add a continuous set to the data repo
usage: candig_repo add-continuousset [-h] [-r] [-R REFERENCESETNAME]
[-C CLASSNAME]
registryPath datasetName filePath
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
filePath | The path to the file containing the continuous data |
Named Arguments¶
-r, --relativePath | |
store relative path in database | |
-R, --referenceSetName | |
the name of the reference set to associate with this continuous set | |
-C, --className | |
the name of the class used to fetch features in this continuous set |
Examples:
$ candig_repo add-continuousset registry.db 1KG continuous.bw \
-R hg37
Adds the continuous set continuous to the registry under the 1KG dataset. The flag sets the reference genome to hg37.
init-rnaquantificationset¶
Initializes an RNA quantification set.
Initializes an RNA quantification set
usage: candig_repo init-rnaquantificationset [-h] registryPath filePath
Positional Arguments¶
registryPath | the location of the registry database |
filePath | The path to the resulting Quantification Set |
Examples:
$ candig_repo init-rnaquantificationset repo.db rnaseq.db
Initializes the RNA Quantification Set with the filename rnaseq.db.
add-rnaquantification¶
Adds an RNA quantification to an RNA quantification set.
RNA quantification formats currently supported are kallisto and RSEM.
Add an RNA quantification to the data repo
usage: candig_repo add-rnaquantification [-h] [--biosampleName BIOSAMPLENAME]
[--sampleId SAMPLEID]
[--patientId PATIENTID]
[--readGroupSetName READGROUPSETNAME]
[--featureSetNames FEATURESETNAMES]
[-n NAME] [-d DESCRIPTION] [-t]
[-A ATTRIBUTES]
filePath quantificationFilePath
format registryPath datasetName
Positional Arguments¶
filePath | The path to the RNA SQLite database to create or modify |
quantificationFilePath | |
The path to the expression file. | |
format | format of the quantification input data |
registryPath | the location of the registry database |
datasetName | the name of the dataset |
Named Arguments¶
--biosampleName | |
Biosample Name | |
--sampleId | SampleId |
--patientId | PatientId |
--readGroupSetName | |
Read Group Set Name | |
--featureSetNames | |
Comma separated list | |
-n, --name | The name of the rna quantification |
-d, --description | |
The human-readable description of the RnaQuantification. | |
-t, --transcript | |
sets the quantification type to transcript | |
-A, --attributes | |
additional attributes for the message expressed as JSON |
Examples:
$ candig_repo add-rnaquantification rnaseq.db data.tsv \
kallisto candig-example-data/registry.db brca1 \
--biosampleName HG00096 --featureSetNames gencodev19 \
--readGroupSetName HG00096rna --transcript
Adds the data.tsv in kallisto format to the rnaseq.db quantification set with optional fields for associating a quantification with a Feature Set, Read Group Set, and Biosample.
add-rnaquantificationset¶
When the desired RNA quantifications have been added to the set, use this command to add them to the registry.
Add an RNA quantification set to the data repo
usage: candig_repo add-rnaquantificationset [-h] [-R REFERENCESETNAME]
[-n NAME] [-A ATTRIBUTES]
registryPath datasetName filePath
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
filePath | The path to the converted SQLite database containing RNA data |
Named Arguments¶
-R, --referenceSetName | |
the name of the reference set to associate with this RnaQuantificationSet | |
-n, --name | The name of the RnaQuantificationSet |
-A, --attributes | |
additional attributes for the message expressed as JSON |
Examples:
$ candig_repo add-rnaquantificationset registry.db brca1 rnaseq.db \
-R hg37 -n rnaseq
Adds the RNA quantification set rnaseq.db to the registry under the brca1 dataset. The flags set the reference genome to be hg37 and the name of the set to rnaseq.
add-phenotypeassociationset¶
Adds an RDF object store. The cancer genome database Clinical Genomics Knowledge Base http://nif-crawler.neuinfo.org/monarch/ttl/cgd.ttl, published by the Monarch project, is the supported format for Evidence.
Adds phenotypes in ttl format to the repo.
usage: candig_repo add-phenotypeassociationset [-h] [-n NAME] [-A ATTRIBUTES]
registryPath datasetName
dirPath
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
dirPath | The path of the directory containing ttl files. |
Named Arguments¶
-n, --name | The name of the PhenotypeAssociationSet |
-A, --attributes | |
additional attributes for the message expressed as JSON |
Examples:
$ candig_repo add-phenotypeassociationset registry.db dataset1 /monarch/ttl/cgd.ttl -n cgd
remove-referenceset¶
Removes a reference set from the repository. Attempting to remove a reference set that is referenced by other objects in the repository will result in an error.
Remove a reference set from the repo
usage: candig_repo remove-referenceset [-h] [-f] registryPath referenceSetName
Positional Arguments¶
registryPath | the location of the registry database |
referenceSetName | |
the name of the reference set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-referenceset registry.db NCBI37
Deletes the reference set with name NCBI37 from the repository represented by registry.db.
remove-ontology¶
Removes an ontology from the repository. Attempting to remove an ontology that is referenced by other objects in the repository will result in an error.
Remove an ontology from the repo
usage: candig_repo remove-ontology [-h] [-f] registryPath ontologyName
Positional Arguments¶
registryPath | the location of the registry database |
ontologyName | the name of the ontology |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-ontology registry.db so-xp
Deletes the ontology with name so-xp from the repository represented by registry.db.
remove-variantset¶
Removes a variant set from the repository. This also deletes all associated call sets and variant annotation sets from the repository.
Remove a variant set from the repo
usage: candig_repo remove-variantset [-h] [-f]
registryPath datasetName variantSetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
variantSetName | the name of the variant set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-variantset registry.db dataset1 phase3-release
Deletes the variant set named phase3-release from the dataset named dataset1 in the repository represented by registry.db.
remove-readgroupset¶
Removes a read group set from the repository.
Remove a read group set from the repo
usage: candig_repo remove-readgroupset [-h] [-f]
registryPath datasetName
readGroupSetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
readGroupSetName | |
the name of the read group set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-readgroupset registry.db dataset1 HG00114
Deletes the readgroup set named HG00114 from the dataset named dataset1 in the repository represented by registry.db.
remove-featureset¶
Removes a feature set from the repository.
Remove a feature set from the repo
usage: candig_repo remove-featureset [-h] [-f]
registryPath datasetName featureSetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
featureSetName | the name of the feature set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-featureset registry.db 1KG gencode-genes
Deletes the feature set named gencode-genes from the dataset named 1KG in the repository represented by registry.db.
remove-continuousset¶
Removes a continuous set from the repository.
Remove a continuous set from the repo
usage: candig_repo remove-continuousset [-h] [-f]
registryPath datasetName
continuousSetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
continuousSetName | |
the name of the continuous set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-continuousset registry.db 1KG continuous
Deletes the continuous set named continuous from the dataset named 1KG in the repository represented by registry.db.
remove-rnaquantificationset¶
Removes an RNA quantification set from the repository.
Remove an RNA quantification set from the repo
usage: candig_repo remove-rnaquantificationset [-h] [-f]
registryPath datasetName
rnaQuantificationSetName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
rnaQuantificationSetName | |
the name of the RNA Quantification Set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-rnaquantificationset registry.db dataset1 ENCFF305LZB
Deletes the RNA quantification set named ENCFF305LZB from the dataset named dataset1 in the repository represented by registry.db.
remove-phenotypeassociationset¶
Removes an RDF object store.
Remove a phenotype association set from the repo
usage: candig_repo remove-phenotypeassociationset [-h] [-f]
registryPath datasetName
name
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
name | The name of the phenotype association set |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-phenotypeassociationset registry.db dataset1 cgd
add-biosample¶
Warning
This command is deprecated and may be removed in the future. Use the ingest command to add Sample-related information.
Adds a new biosample to the repository. The biosample argument is a JSON document according to the GA4GH JSON schema.
Add a Biosample to the dataset
usage: candig_repo add-biosample [-h]
registryPath datasetName biosampleName
biosample
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
biosampleName | the name of the biosample |
biosample | the JSON of the biosample |
Examples:
$ candig_repo add-biosample registry.db dataset1 HG00096 '{"individualId": "abc"}'
Adds the biosample named HG00096 to the repository with the individual ID “abc”.
add-individual¶
Warning
This command is deprecated and may be removed in the future. Use the ingest command to add Patient-related information.
Adds a new individual to the repository. The individual argument is a JSON document following the GA4GH JSON schema.
Add an Individual to the dataset
usage: candig_repo add-individual [-h]
registryPath datasetName individualName
individual
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
individualName | the name of the individual |
individual | the JSON of the individual |
Examples:
$ candig_repo add-individual registry.db dataset1 HG00096 '{"description": "A description"}'
Adds the individual named HG00096 to the repository with the given description.
remove-biosample¶
Removes a biosample from the repository.
usage: candig_repo remove-biosample [-h] [-f]
registryPath datasetName biosampleName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
biosampleName | the name of the biosample |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-biosample registry.db dataset1 HG00096
Deletes the biosample named HG00096 from the dataset named dataset1 from the repository represented by registry.db.
remove-individual¶
Removes an individual from the repository.
usage: candig_repo remove-individual [-h] [-f]
registryPath datasetName individualName
Positional Arguments¶
registryPath | the location of the registry database |
datasetName | the name of the dataset |
individualName | the name of the individual |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-individual registry.db dataset1 HG00096
Deletes the individual named HG00096 from the dataset named dataset1 from the repository represented by registry.db.
Other commands¶
candig_server¶
The server supports a number of optional start-up parameters. When no parameters are set, running candig_server starts the server at http://127.0.0.1:8000.
You may supply your own config file (.py), as shown below. This config.py specifies the DATA_SOURCE to be at a custom location, and the DEFAULT_PAGE_SIZE to be 1500, overriding the default values for both.
DATA_SOURCE = '/home/user/dev/data.db'
DEFAULT_PAGE_SIZE = 1500
usage: candig_server [-h] [--port PORT] [--host HOST] [--config CONFIG]
[--config-file CONFIG_FILE] [--tls] [--gunicorn]
[--certfile CERTFILE] [--keyfile KEYFILE]
[--dont-use-reloader] [--workers WORKERS]
[--timeout TIMEOUT] [--worker_class WORKER_CLASS]
[--epsilon EPSILON] [--version]
[--disable-urllib-warnings]
Examples:
$ candig_server --host 0.0.0.0 --port 3000 --config-file config.py
add-peer¶
Adds a new peer server to the registry by URL.
usage: candig_repo add-peer [-h] [-A ATTRIBUTES] registryPath url
Positional Arguments¶
registryPath | the location of the registry database |
url | The URL of the given resource |
Named Arguments¶
-A, --attributes | additional attributes for the message expressed as JSON |
Examples:
$ candig_repo add-peer registry.db https://candig.test.ca
remove-peer¶
Removes a peer server.
Warning
If you did not include a trailing slash when you added the peer URL, one is appended automatically. Therefore, as the example shows, if you added https://candig.test.ca, you will need to specify https://candig.test.ca/ when you remove it.
usage: candig_repo remove-peer [-h] [-f] registryPath url
Positional Arguments¶
registryPath | the location of the registry database |
url | The URL of the given resource |
Named Arguments¶
-f, --force | do not prompt for confirmation |
Examples:
$ candig_repo remove-peer registry.db https://candig.test.ca/
candig_snapshot¶
Creates a report containing information about the clinical, pipeline, and genomic data, as well as the datasets and patient IDs, stored in the database.
usage: candig_snapshot [-h] [--markdown] [--html]
[--destination /output/directory/]
database
Positional Arguments¶
database | Path to CanDIG-Server database file |
Named Arguments¶
--markdown | Generate report in markdown format |
--html | Generate report in HTML format |
--destination | Directory where the outputs will be saved |
Examples:
Warning
You must pass at least one of the following arguments to the script:
- --html
- --markdown
$ candig_snapshot candig-example-data/registry.db --html
$ candig_snapshot candig-example-data/registry.db --markdown
$ candig_snapshot candig-example-data/registry.db --markdown --html
Usage¶
API Usage & Sample Queries¶
Warning
This guide is a work in progress, and it is incomplete.
This section provides instructions on API usage. Upon logging in, you will see a dashboard that gives a high-level overview of your authorized datasets.
When you click on the top right corner, you will see an API Info section, where a Swagger UI is provided. That UI does not yet contain any information on the /search and /count endpoints; please refer to the resources here for instructions on using these two endpoints.
Warning
Please note that while reasonable efforts have been made to validate the request and response schemas in the Swagger UI definitions, we cannot, at this time, guarantee that they are 100% accurate. Please open a ticket if you encounter any problems.
Sample queries for all metadata services¶
This includes all of the endpoints under Clinical and Pipeline Metadata Services.
Even though some of the sample queries below are made to a specific endpoint, the same format applies to any pipeline or clinical metadata endpoint. Typically, you will need to change the field in your filters, depending on which table you are querying.
How to write filter objects¶
As you can see from Sample Queries II to IV, it is possible to submit an optional filters list along with your request.
filters is a list of filter objects. There are two types of filter objects.
Type I: Compare the field’s value to a string¶
{
"field": "gender",
"operator": "=",
"value": "female"
}
In the filter object above, you specify gender as the field, female as the value, and = as the operator. This filter object asks the server to find records whose gender is female.
Note
For Type I filter object, the supported operators are >, <, >=, <=, =, ==, !=, contains. Note that = and == are equivalent.
Type II: Check if a record’s field belongs to a list¶
{
"field": "provinceOfResidence",
"operator": "in",
"values": ["British Columbia", "Ontario"]
}
In the filter object above, you specify provinceOfResidence as the field, ["British Columbia", "Ontario"] as the values, and in as the operator.
This filter object asks the server to find records whose provinceOfResidence is either British Columbia or Ontario.
Note
For Type II filter object, in is the only supported operator. Also, you specify values, instead of value.
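If you generate filters programmatically, a small helper can keep the two shapes straight. Below is a minimal, illustrative Python sketch; the helper name make_filter is our own and not part of candig-server.
def make_filter(field, operator, value):
    """Build a Type I or Type II filter object."""
    if operator == "in":
        # Type II: 'in' compares the field against a list under "values".
        if not isinstance(value, list):
            raise ValueError("the 'in' operator requires a list of values")
        return {"field": field, "operator": "in", "values": value}
    # Type I: any other operator compares against a single "value".
    return {"field": field, "operator": operator, "value": value}

filters = [
    make_filter("gender", "=", "female"),
    make_filter("provinceOfResidence", "in", ["British Columbia", "Ontario"]),
]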
Fetch Datasets¶
Description: To fetch all of the datasets, make a query to the datasets/search endpoint with an empty body. You need the datasetId for subsequent queries to other endpoints.
Query:
{}
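As a minimal sketch of how such a request might be submitted programmatically, the Python snippet below POSTs the empty query with the requests library. The base URL and the absence of authentication are assumptions about your deployment; adjust them as needed.
import requests

BASE_URL = "http://127.0.0.1:8000"  # assumption: a local, unauthenticated server

# POST an empty JSON body to fetch every dataset you are authorized to see.
resp = requests.post(f"{BASE_URL}/datasets/search", json={})
resp.raise_for_status()
print(resp.json())  # note the datasetId values for subsequent queries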
Sample Query I¶
Description: Fetch all of the items from the specified dataset from any clinical or pipeline metadata service endpoint.
Query:
{
"datasetId": "WyIxa2dlbm9tZSJd"
}
Sample Query II¶
Description: Fetch all of the items from the patients/search endpoint, whose gender is female.
Note
We support a number of operators in the filters, which include: >, <, >=, <=, =, !=, and contains.
Query:
{
"datasetId": "WyIxa2dlbm9tZSJd",
"filters": [
{
"field": "gender",
"operator": "=",
"value": "female"
}
]
}
Sample Query III¶
Description: Fetch all of the items from the patients/search endpoint, whose gender is female and whose ethnicity is NOT ‘GBR’.
Note
You can specify more than one filter under filters.
{
"datasetId": "WyIxa2dlbm9tZSJd",
"filters": [
{
"field": "gender",
"operator": "=",
"value": "female"
},
{
"field": "ethnicity",
"operator": "!=",
"value": "GBR"
}
]
}
Sample Query IV¶
Description: Fetch all of the items from the patients/search endpoint, whose ethnicity is one of GBR, FIN or ESN.
Note
Note that this query is very different from the previous ones.
To specify a list of values you are interested in, you need to:
- Specify values in your filters, instead of value.
- Specify the list of values you are interested in.
- Specify in as the operator.
{
"datasetId": "WyIxa2dlbm9tZSJd",
"filters": [
{
"field": "ethnicity",
"operator": "in",
"values": [
"GBR",
"FIN",
"ESN"
]
}
]
}
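Putting the pieces together, here is a hedged sketch of submitting this query from Python with the requests library (same assumption as before that a local, unauthenticated server is running):
import requests

BASE_URL = "http://127.0.0.1:8000"  # assumption: a local candig-server

query = {
    "datasetId": "WyIxa2dlbm9tZSJd",
    "filters": [
        {"field": "ethnicity", "operator": "in", "values": ["GBR", "FIN", "ESN"]}
    ],
}
resp = requests.post(f"{BASE_URL}/patients/search", json=query)
resp.raise_for_status()
print(resp.json())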
Sample queries for all variants services¶
This mainly includes the /variantsets/search, /variants/search and /variantsbygene endpoints.
Sample Query I¶
Description: Fetch all of the variantsets associated with a particular dataset.
Endpoint: variantsets/search
Note
This query has the same format as Sample Query I under the metadata services; the same request body works across metadata, variantSets, referenceSets, etc.
{
"datasetId": "WyIxa2dlbm9tZSJd"
}
Sample Query II¶
Description: Search for variants within the range between start and end on chromosome 22, from the designated variantSets.
Endpoint: variants/search
{
"start": "50158561",
"end": "50158565",
"referenceName": "22",
"variantSetIds": [
"yourVariantSetId1",
"yourVariantSetId2"
]
}
Sample Query III¶
Description: Search for variants within the range between start and end on chromosome 22, from all variantsets associated with one particular dataset.
Endpoint: variants/search
Warning
You should never attempt to specify both datasetId and variantSetIds.
{
"datasetId": "WyIxa2dlbm9tZSJd",
"start": "50158561",
"end": "50158565",
"referenceName": "22"
}
Sample Query IV¶
Description: Search for variants that are associated with a particular gene.
Warning
Do not use the /variantsbygenesearch endpoint; it has been deprecated.
Endpoint: /variants/gene/search (formerly /variantsbygenesearch)
{
"datasetId": "WyIxa2dlbm9tZSJd",
"gene": "ABCD"
}
Instructions for /search and /count endpoints¶
You need to write complex queries to use the /search and /count endpoints. A query has 4 mandatory fields: datasetId, logic, components, and results. Queries for both endpoints are largely the same; the differences are explained below.
You always need to specify datasetId in your query.
You may want to look at the sample queries below before reading the how-to instructions.
Warning
You may specify pageToken only when you see a nextPageToken returned in the previous response. You cannot set pageSize for requests made to the /search and /count endpoints.
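As a sketch of how paging could be handled in client code (assuming, as in the earlier examples, a local server at http://127.0.0.1:8000 and that the response carries its token under the nextPageToken key, as described above):
import requests

BASE_URL = "http://127.0.0.1:8000"  # assumption: a local candig-server

def search_all_pages(query):
    """Yield each page of /search results for a complete query body."""
    body = dict(query)  # copy so the caller's query is not mutated
    while True:
        resp = requests.post(f"{BASE_URL}/search", json=body)
        resp.raise_for_status()
        page = resp.json()
        yield page
        token = page.get("nextPageToken")
        if not token:
            break
        # Only set pageToken after a nextPageToken was returned.
        body["pageToken"] = token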
How to write logic¶
Logic is where you specify the relationship between the various components. The only operators you will need are AND and OR. When writing the logic, the operation becomes the key.
Write Logic for multiple components¶
This is possibly the most common use-case, where you want to find records that satisfy multiple filters you set.
For example, conditionA and conditionB and conditionC would be written as below.
In this example, the records will have to satisfy conditions of all three components.
{
"logic": {
"and": [
{
"id": "conditionA"
},
{
"id": "conditionB"
},
{
"id": "conditionC"
}
]
}
}
In the above example, you can replace the and key with or. In that case, the records only need to fulfill the condition of at least one component.
Write Logic for 1 component¶
When you have only one component in your query, you may specify only its id.
{
"logic": {
"id": "condition1"
}
}
For this case only (that is, when your logic contains one single id), you may optionally specify the negate flag, which negates the logic of the component.
{
"logic": {
"id": "condition1",
"negate": true
}
}
Write Logic for nested components¶
It is possible to write more complex logic with multiple nested operations, though the examples above should cover most basic needs.
As an example, the following request is equivalent to A∩B∩(C∪D), which is in turn equivalent to (A∩B)∩(C∪D).
{
"logic": {
"and": [
{
"id": "A"
},
{
"id": "B"
},
{
"or": [
{
"id": "C"
},
{
"id": "D"
}
]
}
]
}
}
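To make the semantics concrete, here is a toy Python evaluator (illustrative only, not server code) showing how such a logic tree resolves once you know whether each component matched a given record:
def evaluate(logic, matches):
    """Evaluate a logic tree against a {component_id: bool} mapping."""
    if "id" in logic:
        result = matches[logic["id"]]
        # The optional negate flag flips the result of a single id.
        return not result if logic.get("negate") else result
    if "and" in logic:
        return all(evaluate(child, matches) for child in logic["and"])
    if "or" in logic:
        return any(evaluate(child, matches) for child in logic["or"])
    raise ValueError("logic must contain 'id', 'and', or 'or'")

# A record matching A and B but neither C nor D fails A∩B∩(C∪D):
tree = {"and": [{"id": "A"}, {"id": "B"}, {"or": [{"id": "C"}, {"id": "D"}]}]}
print(evaluate(tree, {"A": True, "B": True, "C": False, "D": False}))  # False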
How to write components¶
The components part is a list; each item corresponds to a filter applied to a specified table.
Be careful that the id has to match the one you specified in the logic part of your query. It can be almost any string, but the two have to match.
In a component, you specify the table you want to search on as the key.
There are 3 different types of component objects.
Components for Clinical tables¶
{
"components": [
{
"id": "condition1",
"patients": {
"filters": [
{
"field": "provinceOfResidence",
"operator": "!=",
"value": "Ontario"
}
]
}
}
]
}
You write the filter objects the same way you would write for individual endpoints. If you need a reminder on that, check How to write filter objects.
Components for /variants/search endpoint¶
{
"components": [
{
"id": "condition1",
"variants": {
"start": "100000",
"end": "500000",
"referenceName": "1"
}
}
]
}
Note that you can also specify variantSetIds here, which will limit the scope to your list of variantSetIds. If you don't specify any, the server will by default search through all variantSets associated with this dataset.
Components for /variantsbygenesearch endpoint¶
Warning
Note that this component is deprecated and may be removed in subsequent releases. Use Components for /variants/gene/search endpoint instead.
{
"components": [
{
"id": "condition1",
"variantsByGene": {
"gene": "MUC1"
}
}
]
}
Components for /variants/gene/search endpoint¶
{
"components": [
{
"id": "condition1",
"variants": {
"gene": "MUC1"
}
}
]
}
How to write results¶
In the results part of your query, you specify the table you want the server to return. For a query made to the /search endpoint, you can simply specify the table name.
Warning
The only tables accepted here are the clinical metadata tables, as well as variants.
Results section for Clinical tables¶
{
"results": {
"table": "patients",
"fields": ["gender", "ethnicity"]
}
}
Warning
fields is a list of the fields that you want the server to return. It is optional for the /search endpoint, but mandatory for the /count endpoint. If you do not specify it for a /search request, the server will return all fields.
Results section for /variants endpoint¶
Warning
Please be considerate when submitting any /variants request; we recommend not searching more than 1 million base pairs at a time. If you have a lot of variantSets, you should limit your search size to 100,000.
{
"results": {
"table": "variants",
"start": "1232555",
"end": "1553222",
"referenceName": "1"
}
}
Results section for /variantsbygenesearch endpoint¶
Warning
This endpoint is deprecated. Use Results section for /variants/gene/search endpoint.
Warning
Please note that while you need to specify the table name as variantsByGene, it still returns a list of variants in its response.
{
"results": {
"table": "variantsByGene",
"gene": "MUC1"
}
}
Results section for /variants/gene/search endpoint¶
{
"results": {
"table": "variants",
"gene": "MUC1"
}
}
Sample Query I¶
Description: Return a list of patients, whose diseaseResponseOrStatus is “Complete Response”, AND have a courseNumber that is not 100.
Warning
The example query below only works for the /search endpoint, since it does not specify fields.
Query:
{
"datasetId": "yourDatasetId",
"logic": {
"and": [
{
"id": "A"
},
{
"id": "B"
}
]
},
"components": [
{
"id": "A",
"outcomes": {
"filters": [
{
"field": "diseaseResponseOrStatus",
"operator": "==",
"value": "Complete Response"
}
]
}
},
{
"id": "B",
"treatments": {
"filters": [
{
"field": "courseNumber",
"operator": "!=",
"value": "100"
}
]
}
}
],
"results": [
{
"table": "patients"
}
]
}
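To try this query against a running server, you could save the request body above to a file and POST it to the /search endpoint. A minimal sketch, again assuming a local, unauthenticated server:
import json
import requests

BASE_URL = "http://127.0.0.1:8000"  # assumption: a local candig-server

# Assumption: the request body shown above is saved as query.json,
# with yourDatasetId replaced by a real dataset id.
with open("query.json") as f:
    query = json.load(f)

resp = requests.post(f"{BASE_URL}/search", json=query)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))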
Sample Query II¶
Description: Return aggregated stats on the gender and ethnicity of patients who have mutations between 50158561 and 50158565 on chromosome 22, from the list of variantSetIds.
{
"datasetId": "yourDatasetId",
"logic": {
"id": "A"
},
"components": [
{
"id": "A",
"variants": {
"start": "50158561",
"end": "50158565",
"referenceName": "22",
"variantSetIds": [
"yourVariantSetId_1",
"yourVariantSetId_2",
"yourVariantSetId_3",
"yourVariantSetId_4",
"yourVariantSetId_5",
"yourVariantSetId_6",
"yourVariantSetId_7",
"yourVariantSetId_8"
]
}
}
],
"results": [
{
"table": "patients",
"fields": [
"gender",
"ethnicity"
]
}
]
}
Sample Query III¶
Description: Return aggregated stats on the gender and ethnicity of patients who have mutations between 50158561 and 50158565 on chromosome 22.
Note
Since a list of variantSetIds was not specified, the server will attempt to locate all variantSets associated with the dataset. If you have a lot of variantSets associated with this particular dataset, the query might take some time.
{
"datasetId": "yourDatasetId",
"logic": {
"id": "A"
},
"components": [
{
"id": "A",
"variants": {
"start": "50158561",
"end": "50158565",
"referenceName": "22"
}
}
],
"results": [
{
"table": "patients",
"fields": [
"gender",
"ethnicity"
]
}
]
}
Sample Query IV¶
Description: Retrieve all the variants between 50100000 and 50158565 on chromosome 22 associated with the individual HG00105.
{
"datasetId": "yourDatasetId",
"logic": {
"id": "A"
},
"components": [
{
"id": "A",
"patients": {
"filters": [
{
"field": "patientId",
"operator": "==",
"value": "HG00105"
}
]
}
}
],
"results": [
{
"table": "variants",
"start": "50100000",
"end": "50158565",
"referenceName": "22"
}
]
}
Appendix¶
Contribution Guideline¶
Warning
This guide is a work in progress, and is incomplete.
If you encounter a bug or have a problem using the package, please contact us by opening an issue at https://github.com/candig/candig-server.
GitHub workflow¶
We mainly employ three types of branches: feature branches, the develop branch, and the master branch.
Feature branches are used to resolve a limited set of issues, and typically follow the naming convention username/fix_one_particular_issue. When initiating a PR, you should request that it be merged into the develop branch. The commits in individual feature branches are usually squashed, and code review usually happens at this step.
The develop branch is used to host code that has passed all the tests, but may not yet be production-ready. As a developer, you are welcome to play with this branch to test some of the new functionalities.
Commits in the develop branch are merged into the master branch periodically. Releases are tagged directly from the master branch and follow the convention vX.Y.Z (X, Y, and Z are all integers).
A hotfix branch may be employed to fix a production-critical issue found later on the master branch. Hotfixes are patched directly to the master branch, and will be synced back to the develop branch later.
If you would like to contribute code, please fork the package to your own git repository, then initiate a PR to be merged into develop.
Status¶
Please refer to https://github.com/CanDIG/candig-server/releases for the latest release notes.
Upgrade Guidelines¶
This section is mainly prepared for system administrators.
1.4.0¶
Release note available from https://github.com/CanDIG/candig-server/releases/tag/v1.4.0
This release does not require any configuration changes.
1.3.0¶
Release note available from https://github.com/CanDIG/candig-server/releases/tag/v1.3.0
This release introduces a schema update, and thus requires you to back up your existing database, and perform a one-time schema upgrade.
To update the schema of your existing database:
- Activate the virtual environment of your candig-server instance.
- Upgrade the candig-server to 1.3.0
pip install -U candig-server==1.3.0
- Change directory to where your database is, and make a backup copy of it.
- Retrieve the migration script:
wget https://raw.githubusercontent.com/CanDIG/candig-server/v1.3.0/scripts/database_migration/migration.py
- Retrieve the migration schema:
wget https://raw.githubusercontent.com/CanDIG/candig-server/v1.3.0/scripts/database_migration/add_columns.json
- Make sure your virtual environment is still active.
- Run the migration script:
python migration.py --database {path_to_your_database_file} --add_columns add_columns.json
Once the schema migration is done, you may restart your server.
Three newly-generated sets of mock data are now available from https://github.com/CanDIG/candig-ingest/tree/v1.4.0/candig/ingest/mock_data, should you be interested in ingesting any of those.
1.2.3¶
Release note available from https://github.com/CanDIG/candig-server/releases/tag/v1.2.3
This release does not require any configuration changes.
1.2.2¶
Release note available from https://github.com/CanDIG/candig-server/releases/tag/v1.2.2
As indicated in the release note, the previous v1.2.1 release did not correctly preserve backward compatibility for access_list files that use null (not specifying anything) to indicate no access. It also did not correctly process entries that use an empty space.
This release fixes the issue.
If you were already using X to indicate no access when you upgraded to v1.2.1, no further action is required; you may update to this version of candig-server without any additional changes.
If you are still using an empty space or null to indicate no access, do not use the v1.2.1 release of candig-server; use this one instead. It is still recommended to use X over an empty space or null to indicate no access.
- Step 1: Update the candig-server in your virtual environment to v1.2.2.
- Step 2: Replace all empty space with X.
1.2.1¶
Release note available from https://github.com/CanDIG/candig-server/releases/tag/v1.2.1
As indicated in the release note, X is now used to indicate no access. You may see a newly-updated sample access_list file under Production Deployment.
If you have used empty space to indicate no access, you should:
- Step 1: Update the candig-server in your virtual environment to v1.2.1.
- Step 2: Replace all empty space with X.
You should always restart the candig-server service when you update to a new candig-server.
As indicated in the release note, DUO:0000002 and DUO:0000003 are no longer valid DUO ids.
If you have specified one of these two for one or more of your datasets, you should:
- Step 1: Remove DUO:0000002 and DUO:0000003 from the json file that holds info regarding your data use restrictions.
- Step 2: Re-run the add-dataset-duo command.
You should restart the candig-server service when you make changes to the data.