Supporting information

0. Quick reference

All of the data can be downloaded directly from our servers.

The resource is organized into STUDIES (e.g., a publication), each with associated SAMPLES from which we reconstruct GENOMES.

Repository Structure

1. Genomes, Genes and Annotations

The graphical user interface allows for easy and fast inspection of individual genomes, associated annotations, studies, and samples. Access to multiple datasets is also possible by using the FTP data backend of OMDB.

Alternatively, data can be downloaded via the command line using the OMDB links file. This file (8 MB, MD5=c1b5f14c9b7899f7300ccf41e62f8681) lists the links to all genome and genome annotation files on OMDB.

# Download the links file:
$ curl -O https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/catalogs/OMDBv2.0_data.tsv.gz

# Decompress the file:
$ gunzip OMDBv2.0_data.tsv.gz
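The MD5 checksum given above can be used to confirm that the download is intact. The following is a minimal Python sketch, assuming the checksum refers to the compressed file (OMDBv2.0_data.tsv.gz); in that case it has to run before gunzip, which replaces the .gz file. The function names are illustrative and not part of OMDB.

```python
import hashlib

# Checksum as stated on this page; assumed to apply to the compressed download.
EXPECTED_MD5 = "c1b5f14c9b7899f7300ccf41e62f8681"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


# Example (requires the downloaded file in the working directory):
# print(matches_expected("OMDBv2.0_data.tsv.gz"))
```

If the check fails, the download is either incomplete or the checksum applies to the decompressed file instead; re-downloading is the safer first step.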

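The file has one line per genome, with eight tab-separated fields holding the genome, sample, and study identifiers and the public links to the OMDB data. Example record (one field per line for readability):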

GENOME: GARB21-1_SAMN12799101_MAG_00000001
SAMPLE: GARB21-1_SAMN12799101_METAG
STUDY: GARB21-1
GENOME_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.fa.gz
GENES_NT_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.genes.fna.gz
GENES_AA_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.genes.faa.gz
GENES_GFF_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.genes.gff.gz
ANTISMASH_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001-antismash.tar.gz
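For scripted access, the links file can be loaded into a lookup table keyed by genome identifier. The sketch below assumes the eight fields appear in the order shown in the example record and that the file is tab-separated; whether the file starts with a header row is not stated here, so a row beginning with GENOME is simply skipped. The names FIELDS and read_links are illustrative.

```python
import csv

# Assumed column order, taken from the example record on this page.
FIELDS = [
    "GENOME", "SAMPLE", "STUDY", "GENOME_FILE",
    "GENES_NT_FILE", "GENES_AA_FILE", "GENES_GFF_FILE", "ANTISMASH_FILE",
]


def read_links(handle):
    """Parse the links file from an open text handle into {genome_id: record}."""
    records = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and a possible header row.
        if not row or row[0] == "GENOME":
            continue
        records[row[0]] = dict(zip(FIELDS, row))
    return records


# Example (requires the decompressed file):
# with open("OMDBv2.0_data.tsv", newline="") as handle:
#     links = read_links(handle)
# print(links["GARB21-1_SAMN12799101_MAG_00000001"]["ANTISMASH_FILE"])
```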

These links can be used to download files using curl or wget.

For example, to download the antiSMASH file for genome GARB21-1_SAMN12799101_MAG_00000001:

# Using grep/cut (requires the decompressed TSV file):
$ curl -O $(grep "GARB21-1_SAMN12799101_MAG_00000001" OMDBv2.0_data.tsv | cut -f8)

# Or directly:
$ curl -O https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001-antismash.tar.gz
        

Alternatively, use download.file in R or the requests module in Python to automate downloads.
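A minimal Python sketch of such an automated download is given below. It uses urllib from the standard library rather than requests so that it runs without extra packages, and it assumes the column layout of the example record (the antiSMASH link is the eighth field, as in the cut -f8 command above). The function names are illustrative.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def select_urls(tsv_lines, genome_ids, column=7):
    """Return the links in the given zero-based column for the requested genomes."""
    wanted = set(genome_ids)
    urls = []
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] in wanted and len(fields) > column:
            urls.append(fields[column])
    return urls


def local_filename(url):
    """File name to save a link under: the last component of the URL path."""
    return os.path.basename(urlparse(url).path)


def download(url, out_dir="."):
    """Stream one file to out_dir and return the path it was written to."""
    target = os.path.join(out_dir, local_filename(url))
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Example (requires network access and the decompressed links file):
# with open("OMDBv2.0_data.tsv") as handle:
#     for url in select_urls(handle, ["GARB21-1_SAMN12799101_MAG_00000001"]):
#         print(download(url))
```

Changing the column argument selects a different file type, e.g. column=3 for the genome FASTA under the same layout assumption.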

2. Catalogs

OMDB genomes and derived genes have been compiled into several catalogs and are released on this page:

Gene Catalog (NT)

Complete genes of all OMDB genomes were called, aggregated, and clustered in nucleotide space at different sequence identity thresholds.

Catalog              Genes        Clustering Threshold  Singletons  Sequences  Clusters
OMDBv2.0_NT_G_R      508,832,278  No clustering         100%        128 GB     5 GB
OMDBv2.0_NT_G_NR100  325,384,975  100%                  85%         88 GB      4 GB
OMDBv2.0_NT_G_NR95   103,044,829  95%                   57%         27 GB      3 GB

Gene Catalog (AA)

Complete genes of all OMDB genomes were called, aggregated, and clustered in amino acid space at different sequence identity thresholds.

Catalog              Genes        Clustering Threshold  Singletons  Sequences  Clusters
OMDBv2.0_AA_G_R      508,832,278  No clustering         100%        88 GB      5 GB
OMDBv2.0_AA_G_NR100  249,518,434  100%                  79%         46 GB      4 GB
OMDBv2.0_AA_G_NR50   28,862,112   50%                   53%         4 GB       4 GB
OMDBv2.0_AA_G_NR30   18,342,415   30%                   53%         2 GB       4 GB

Genome Catalog

All OMDB genomes were compiled into a single file and dereplicated at 100%.

Catalog              Genomes     Clustering Threshold  Singletons  Sequences  Clusters
OMDBv2.0_SC_G_R      69,280,421  No clustering         100%        150 GB     1 GB
OMDBv2.0_SC_G_NR100  68,726,394  100%                  99%         145 GB     1 GB

Terminology

All catalog names follow the same structure:

OMDBv2.0_XX_Y_Z, where:

  • XX refers to the data type:
    • NT – Genes in nucleotide space
    • AA – Genes in amino acid space
    • SC – Scaffolds
  • Y refers to the data source:
    • G – entries come from genomes
    • A – entries come from assemblies (not used for this release)
  • Z refers to the data dereplication:
    • R – redundant (no dereplication)
    • NR100 – identical sequences merged into one cluster
    • NR95 – clustered at 95% similarity
    • NR50 – clustered at 50% similarity
    • NR30 – clustered at 30% similarity
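
The naming scheme above can be decoded programmatically, which is convenient when handling several catalogs in one script. The following is a small sketch; the function name, the returned keys, and the generalization to any version number and any NR threshold are assumptions made for illustration.

```python
import re

# OMDBv<version>_<XX>_<Y>_<Z>, following the terminology described above.
NAME_PATTERN = re.compile(
    r"^OMDBv(?P<version>\d+\.\d+)_(?P<data_type>NT|AA|SC)_(?P<source>G|A)_(?P<derep>R|NR\d+)$"
)

DATA_TYPES = {"NT": "genes (nucleotide)", "AA": "genes (amino acid)", "SC": "scaffolds"}
SOURCES = {"G": "genomes", "A": "assemblies"}


def parse_catalog_name(name):
    """Split a catalog name into version, data type, source and identity threshold."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not an OMDB catalog name: {name!r}")
    derep = match["derep"]
    return {
        "version": match["version"],
        "data_type": DATA_TYPES[match["data_type"]],
        "source": SOURCES[match["source"]],
        # None means redundant (no dereplication); otherwise percent identity.
        "identity_percent": None if derep == "R" else int(derep[2:]),
    }


# Example:
# parse_catalog_name("OMDBv2.0_AA_G_NR50")
```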

Methods

Redundant catalogs and the catalogs dereplicated at 100% were generated with custom scripts.

The OMDBv2.0_NT_G_NR95 catalog was clustered with MMseqs2 at 95% sequence identity and 90% target coverage, using the following commands:

mmseqs createdb OMDBv2.0_NT_G_R.fna OMDBv2.0_NT_G_NR95.mmseqs.db --dbtype 2 --shuffle 0

mmseqs cluster OMDBv2.0_NT_G_NR95.mmseqs.db OMDBv2.0_NT_G_NR95.mmseqs.db.9590.cluster mmseqs_tmp --kmer-per-seq-scale 0 --kmer-per-seq 1000 -s 4 --max-seq-len 80000 --remove-tmp-files 0 --cluster-mode 2 --min-seq-id 0.95 --threads 96 --cov-mode 1 -c 0.9 --spaced-kmer-mode 0 --alignment-mode 3 --cluster-reassign 1 

mmseqs createtsv OMDBv2.0_NT_G_NR95.mmseqs.db OMDBv2.0_NT_G_NR95.mmseqs.db OMDBv2.0_NT_G_NR95.mmseqs.db.9590.cluster OMDBv2.0_NT_G_NR95.mmseqs.9590.cluster.tsv

The OMDBv2.0_AA_G_NR50 catalog was clustered with MMseqs2 at 50% sequence identity and 90% target coverage, using the following command:

mmseqs easy-cluster OMDBv2.0_AA_G_R.faa mmseqs_dir mmseqs_tmp --min-seq-id 0.5 -c 0.9 --cov-mode 1 --threads 96

The OMDBv2.0_AA_G_NR30 catalog was clustered with MMseqs2 at 30% sequence identity and 90% target coverage, using the following command:

mmseqs easy-cluster OMDBv2.0_AA_G_R.faa mmseqs_dir mmseqs_tmp --min-seq-id 0.3 -c 0.9 --cov-mode 1 --threads 96
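
The cluster tables written by MMseqs2 can be summarized to obtain cluster and singleton counts comparable to those listed in the catalog tables. The sketch below assumes the usual two-column layout of these files (representative, then member, tab-separated, with each representative also listed as a member of its own cluster) and that the "Singletons" column reports the share of clusters with a single member; neither is stated explicitly on this page. The function name is illustrative.

```python
from collections import Counter


def summarise_clusters(lines):
    """Count clusters, singleton clusters and members in a cluster table."""
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        representative, _member = line.split("\t")[:2]
        sizes[representative] += 1
    n_clusters = len(sizes)
    n_singletons = sum(1 for size in sizes.values() if size == 1)
    return {
        "clusters": n_clusters,
        "singletons": n_singletons,
        "members": sum(sizes.values()),
        "singleton_fraction": n_singletons / n_clusters if n_clusters else 0.0,
    }


# Example (file name taken from the createtsv command above):
# with open("OMDBv2.0_NT_G_NR95.mmseqs.9590.cluster.tsv") as handle:
#     print(summarise_clusters(handle))
```

The file is read line by line, so memory use grows with the number of clusters rather than with the number of sequences.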