Supporting information
0. Quick reference
All of the data can be downloaded directly from our servers.
The resource is structured in STUDIES (e.g., a publication) with associated SAMPLES from which we reconstruct GENOMES.
1. Genomes, Genes and Annotations
The graphical user interface allows for easy and fast inspection of individual genomes, associated annotations, studies, and samples. Access to multiple datasets is also possible by using the FTP data backend of OMDB.
Alternatively, data can be downloaded via the command line using the OMDB links file. After downloading the file (8 MB, MD5=c1b5f14c9b7899f7300ccf41e62f8681), users have access to links to all genome and genome annotation files on OMDB.
# Download the links file:
$ curl -O https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/catalogs/OMDBv2.0_data.tsv.gz
# Decompress the file:
$ gunzip OMDBv2.0_data.tsv.gz
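The MD5 checksum quoted above can be used to confirm that the links file arrived intact. The following is a minimal Python sketch, assuming that the checksum describes the compressed `OMDBv2.0_data.tsv.gz` file as downloaded (so the check would have to run before decompression). The function names `md5_of_file` and `verify_download` are illustrative and not part of OMDB.

```python
import hashlib

# Checksum quoted in the text; assumed to describe the compressed .gz download.
EXPECTED_MD5 = "c1b5f14c9b7899f7300ccf41e62f8681"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("OMDBv2.0_data.tsv.gz")` returns `True` only if the file on disk matches the published checksum. A `False` result suggests an incomplete or corrupted download, or that the checksum refers to the decompressed file instead.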
The file is tab-separated, has one line per genome, and contains public links to the OMDB data. Example record (fields shown one per line):
GENOME: GARB21-1_SAMN12799101_MAG_00000001
SAMPLE: GARB21-1_SAMN12799101_METAG
STUDY: GARB21-1
GENOME_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.fa.gz
GENES_NT_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.genes.fna.gz
GENES_AA_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.genes.faa.gz
GENES_GFF_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.genes.gff.gz
ANTISMASH_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001-antismash.tar.gz
These links can be used to download files with curl or wget. For example, to download the antiSMASH file for genome GARB21-1_SAMN12799101_MAG_00000001:
# Using cut/grep (if tsv was unzipped):
$ curl -O $(grep "GARB21-1_SAMN12799101_MAG_00000001" OMDBv2.0_data.tsv | cut -f8)
# Or directly:
$ curl -O https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001-antismash.tar.gz
Alternatively, use download.file in R or the requests module in Python to automate downloads.
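As an illustration of automating downloads in Python, here is a minimal sketch. It uses the standard-library `urllib.request` instead of `requests`, so that it runs without extra dependencies. It assumes that the decompressed links file is tab-separated with eight columns in the order of the example record above (consistent with the `cut -f8` command), and that a header row, if present, starts with `GENOME`. The column list and the helper names `read_links`, `select_urls`, and `download` are illustrative and not part of OMDB.

```python
import csv
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# Assumed column order, taken from the example record in the text.
COLUMNS = [
    "GENOME", "SAMPLE", "STUDY", "GENOME_FILE", "GENES_NT_FILE",
    "GENES_AA_FILE", "GENES_GFF_FILE", "ANTISMASH_FILE",
]


def read_links(lines):
    """Parse tab-separated lines into one dict per genome."""
    records = []
    for fields in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and a possible header row.
        if not fields or fields[0] == "GENOME":
            continue
        records.append(dict(zip(COLUMNS, fields)))
    return records


def select_urls(records, column, study=None):
    """Collect the non-empty links in `column`, optionally for one study."""
    return [
        rec[column]
        for rec in records
        if rec.get(column) and (study is None or rec.get("STUDY") == study)
    ]


def download(url, out_dir="."):
    """Stream one URL into out_dir, named after the last path component."""
    target = Path(out_dir) / Path(urlparse(url).path).name
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

For example, after opening `OMDBv2.0_data.tsv` and passing the file handle to `read_links`, calling `download(url)` for every `url` in `select_urls(records, "ANTISMASH_FILE", study="GARB21-1")` would fetch all antiSMASH archives of that study into the current directory.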
2. Catalogs
OMDB genomes and derived genes have been compiled into several catalogs and are released on this page:
Gene Catalog (NT)
Complete genes of all OMDB genomes were called, aggregated, and clustered in nucleotide space at different levels.
Catalog | Genes | Clustering Threshold | Singletons | Sequences | Clusters |
---|---|---|---|---|---|
OMDBv2.0_NT_G_R | 508,832,278 | No clustering | 100% | Sequences - 128GB | Clusters - 5GB |
OMDBv2.0_NT_G_NR100 | 325,384,975 | 100% | 85% | Sequences - 88GB | Clusters - 4GB |
OMDBv2.0_NT_G_NR95 | 103,044,829 | 95% | 57% | Sequences - 27GB | Clusters - 3GB |
Gene Catalog (AA)
Complete genes of all OMDB genomes were called, aggregated, and clustered in amino acid space at different levels.
Catalog | Genes | Clustering Threshold | Singletons | Sequences | Clusters |
---|---|---|---|---|---|
OMDBv2.0_AA_G_R | 508,832,278 | No clustering | 100% | Sequences - 88GB | Clusters - 5GB |
OMDBv2.0_AA_G_NR100 | 249,518,434 | 100% | 79% | Sequences - 46GB | Clusters - 4GB |
OMDBv2.0_AA_G_NR50 | 28,862,112 | 50% | 53% | Sequences - 4GB | Clusters - 4GB |
OMDBv2.0_AA_G_NR30 | 18,342,415 | 30% | 53% | Sequences - 2GB | Clusters - 4GB |
Genome Catalog
All OMDB genomes were compiled into a single file and dereplicated at 100%.
Catalog | Genomes | Clustering Threshold | Singletons | Sequences | Clusters |
---|---|---|---|---|---|
OMDBv2.0_SC_G_R | 69,280,421 | No clustering | 100% | Sequences - 150GB | Clusters - 1GB |
OMDBv2.0_SC_G_NR100 | 68,726,394 | 100% | 99% | Sequences - 145GB | Clusters - 1GB |
Terminology
All catalogs were named with the same structure:
OMDBv2.0_XX_Y_Z
where:
- XX refers to the data type:
  - NT – Genes in nucleotide space
  - AA – Genes in amino acid space
  - SC – Scaffolds
- Y refers to the data source:
  - G – entries come from genomes
  - A – entries come from assemblies (not used for this release)
- Z refers to the data dereplication:
  - R – redundant (no dereplication)
  - NR100 – exact sequences merged into one cluster
  - NR95 – clustered at 95% similarity
  - NR50 – clustered at 50% similarity
  - NR30 – clustered at 30% similarity
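The naming scheme above is regular enough to be decoded programmatically, which can be handy when scripting over several catalogs. The sketch below is one possible way to do this in Python. The function name `parse_catalog_name` and the shape of the returned dictionary are illustrative choices, not something defined by OMDB.

```python
import re

# Codes and meanings as listed in the terminology above.
DATA_TYPES = {
    "NT": "genes in nucleotide space",
    "AA": "genes in amino acid space",
    "SC": "scaffolds",
}
SOURCES = {"G": "genomes", "A": "assemblies"}

NAME_PATTERN = re.compile(
    r"^OMDBv(?P<version>\d+\.\d+)_(?P<data_type>[A-Z]{2})"
    r"_(?P<source>[A-Z])_(?P<derep>R|NR\d+)$"
)


def parse_catalog_name(name):
    """Split a catalog name of the form OMDBv2.0_XX_Y_Z into its parts."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognized catalog name: {name!r}")
    parts = match.groupdict()
    if parts["data_type"] not in DATA_TYPES or parts["source"] not in SOURCES:
        raise ValueError(f"unknown data type or source in: {name!r}")
    derep = parts["derep"]
    return {
        "version": parts["version"],
        "data_type": parts["data_type"],
        "source": parts["source"],
        "redundant": derep == "R",
        # Clustering threshold in percent; None for redundant catalogs.
        "identity": None if derep == "R" else int(derep[2:]),
    }
```

For instance, `parse_catalog_name("OMDBv2.0_NT_G_NR95")` yields data type `NT`, source `G`, and an identity threshold of 95, while a redundant catalog such as `OMDBv2.0_SC_G_R` yields `identity=None`.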
Methods
Redundant catalogs and the catalogs dereplicated at 100% were generated with custom scripts.
The OMDBv2.0_NT_G_NR95 catalog was clustered using mmseqs2 with the following parameters:
mmseqs createdb OMDBv2.0_NT_G_R.fna OMDBv2.0_NT_G_NR95.mmseqs.db --dbtype 2 --shuffle 0
mmseqs cluster OMDBv2.0_NT_G_NR95.mmseqs.db OMDBv2.0_NT_G_NR95.mmseqs.db.9590.cluster mmseqs_tmp --kmer-per-seq-scale 0 --kmer-per-seq 1000 -s 4 --max-seq-len 80000 --remove-tmp-files 0 --cluster-mode 2 --min-seq-id 0.95 --threads 96 --cov-mode 1 -c 0.9 --spaced-kmer-mode 0 --alignment-mode 3 --cluster-reassign 1
mmseqs createtsv OMDBv2.0_NT_G_NR95.mmseqs.db OMDBv2.0_NT_G_NR95.mmseqs.db OMDBv2.0_NT_G_NR95.mmseqs.db.9590.cluster OMDBv2.0_NT_G_NR95.mmseqs.9590.cluster.tsv
The OMDBv2.0_AA_G_NR50 catalog was clustered using mmseqs2 with the following parameters:
mmseqs easy-cluster OMDBv2.0_AA_G_R.faa mmseqs_dir mmseqs_tmp --min-seq-id 0.5 -c 0.9 --cov-mode 1 --threads 96
The OMDBv2.0_AA_G_NR30 catalog was clustered using mmseqs2 with the following parameters:
mmseqs easy-cluster OMDBv2.0_AA_G_R.faa mmseqs_dir mmseqs_tmp --min-seq-id 0.3 -c 0.9 --cov-mode 1 --threads 96
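The cluster tables written by `mmseqs createtsv` and `mmseqs easy-cluster` are two-column, tab-separated files with one `representative<TAB>member` pair per line. The Python sketch below summarizes such a file. It rests on two assumptions: that the cluster files distributed with the catalogs follow this layout, and that the "Singletons" column in the tables above reports the share of clusters with exactly one member. The function names are illustrative.

```python
from collections import Counter


def cluster_sizes(lines):
    """Count members per representative in a representative<TAB>member table."""
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        representative, _member = line.split("\t")[:2]
        sizes[representative] += 1
    return sizes


def singleton_fraction(sizes):
    """Fraction of clusters that contain exactly one member."""
    if not sizes:
        return 0.0
    singletons = sum(1 for count in sizes.values() if count == 1)
    return singletons / len(sizes)
```

Passing an open file handle of a cluster table to `cluster_sizes` gives the number of clusters as `len(sizes)`, and `singleton_fraction(sizes)` gives a value that could be compared against the percentages in the tables, provided the assumptions above hold.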