Ocean Map | Microbiomics

0. Quick reference

All of the data can be downloaded directly from our servers.

The resource is organized into STUDIES (e.g., a publication), each with associated SAMPLES from which GENOMES are reconstructed.

1. Genomes, Genes and Annotations

The graphical user interface allows for easy and fast inspection of individual genomes, associated annotations, studies, and samples. Access to multiple datasets is also possible by using the FTP data backend of OMDB.

Alternatively, data can be downloaded from the command line using the OMDB links file. This file (8 MB, MD5=c1b5f14c9b7899f7300ccf41e62f8681) lists links to all genome and genome annotation files on OMDB.

# Download the links file:
$ curl -O https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/catalogs/OMDBv2.0_data.tsv.gz

# Decompress the file:
$ gunzip OMDBv2.0_data.tsv.gz
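
The MD5 checksum quoted above can be used to confirm that the download is intact. The following is a minimal Python sketch, assuming the checksum refers to the compressed OMDBv2.0_data.tsv.gz as downloaded (so the check would run before gunzip, or on a copy kept with gunzip -k). The function names are illustrative and not part of OMDB.

```python
import hashlib

# Checksum quoted in the text; assumed here to describe the compressed download.
EXPECTED_MD5 = "c1b5f14c9b7899f7300ccf41e62f8681"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 equals the expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

Calling matches_expected("OMDBv2.0_data.tsv.gz") would then return True for an intact download, under the assumption stated above.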

The file has one line per genome, with tab-separated fields that contain public links to the OMDB data. Example of one record, shown here with one field per line:

GENOME: GARB21-1_SAMN12799101_MAG_00000001
SAMPLE: GARB21-1_SAMN12799101_METAG
STUDY: GARB21-1
GENOME_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.fa.gz
GENES_NT_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.genes.fna.gz
GENES_AA_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.genes.faa.gz
GENES_GFF_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.genes.gff.gz
ANTISMASH_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001-antismash.tar.gz

These links can be used to download files with curl or wget.

For example, to download the antiSMASH file for genome GARB21-1_SAMN12799101_MAG_00000001:

# Using grep/cut on the decompressed TSV (column 8 is ANTISMASH_FILE):
$ curl -O $(grep "GARB21-1_SAMN12799101_MAG_00000001" OMDBv2.0_data.tsv | cut -f8)

# Or directly:
$ curl -O https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001-antismash.tar.gz

Alternatively, use download.file in R or the requests module in Python to automate downloads.
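
As a rough illustration of such automation in Python, the sketch below looks up one genome in the decompressed links file and downloads a chosen file. It assumes the columns appear in the order of the example record above, and it uses the standard library's urllib.request rather than requests to avoid an extra dependency. The names FIELDS, find_links and download are hypothetical helpers, not part of OMDB.

```python
import os
import urllib.request
from urllib.parse import urlparse

# Column order assumed from the example record; adjust if the file differs.
FIELDS = [
    "GENOME", "SAMPLE", "STUDY", "GENOME_FILE",
    "GENES_NT_FILE", "GENES_AA_FILE", "GENES_GFF_FILE", "ANTISMASH_FILE",
]


def find_links(tsv_path, genome_id):
    """Return a dict of field name -> value for one genome, or None if not found."""
    with open(tsv_path) as handle:
        for line in handle:
            row = line.rstrip("\n").split("\t")
            # Match on the first column only, so a header line is skipped naturally.
            if row[0] == genome_id:
                return dict(zip(FIELDS, row))
    return None


def download(url, dest_dir="."):
    """Download url into dest_dir, naming the file after the last URL path component."""
    filename = os.path.basename(urlparse(url).path)
    target = os.path.join(dest_dir, filename)
    urllib.request.urlretrieve(url, target)
    return target
```

With the links file decompressed in the working directory, find_links("OMDBv2.0_data.tsv", "GARB21-1_SAMN12799101_MAG_00000001") returns the record for that genome, and passing its "ANTISMASH_FILE" value to download fetches the archive. Looping over several genome identifiers follows the same pattern.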

2. Catalogs

OMDB genomes and derived genes have been compiled into several catalogs and are released on this page:

Gene Catalog (NT)

Complete genes of all OMDB genomes were called, aggregated, and clustered in nucleotide space at different sequence identity thresholds.

Catalog	Genes	Clustering Threshold	Singletons	Sequences	Clusters
OMDBv2.0_NT_G_R	508,832,278	No clustering	100%	Sequences - 128GB	Clusters - 5GB
OMDBv2.0_NT_G_NR100	325,384,975	100%	85%	Sequences - 88GB	Clusters - 4GB
OMDBv2.0_NT_G_NR95	103,044,829	95%	57%	Sequences - 27GB	Clusters - 3GB

Gene Catalog (AA)

Complete genes of all OMDB genomes were called, aggregated, and clustered in amino acid space at different sequence identity thresholds.

Catalog	Genes	Clustering Threshold	Singletons	Sequences	Clusters
OMDBv2.0_AA_G_R	508,832,278	No clustering	100%	Sequences - 88GB	Clusters - 5GB
OMDBv2.0_AA_G_NR100	249,518,434	100%	79%	Sequences - 46GB	Clusters - 4GB
OMDBv2.0_AA_G_NR50	28,862,112	50%	53%	Sequences - 4GB	Clusters - 4GB
OMDBv2.0_AA_G_NR30	18,342,415	30%	53%	Sequences - 2GB	Clusters - 4GB

Genome Catalog

All OMDB genomes were compiled into a single file and dereplicated at 100%.

Catalog	Genomes	Clustering Threshold	Singletons	Sequences	Clusters
OMDBv2.0_SC_G_R	69,280,421	No clustering	100%	Sequences - 150GB	Clusters - 1GB
OMDBv2.0_SC_G_NR100	68,726,394	100%	99%	Sequences - 145GB	Clusters - 1GB

Terminology

All catalog names follow the same structure:

OMDBv2.0_XX_Y_Z where:

XX refers to the data type:
- NT – Genes in nucleotide space
- AA – Genes in amino acid space
- SC – Scaffolds
Y refers to the data source:
- G – entries come from genomes
- A – entries come from assemblies (not used for this release)
Z refers to the data dereplication:
- R – redundant (no dereplication)
- NR100 – identical sequences merged into one cluster
- NR95 – clustered at 95% similarity
- NR50 – clustered at 50% similarity
- NR30 – clustered at 30% similarity
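
The naming scheme above can be decoded programmatically, which may help when handling several catalogs in a script. The following is a small Python sketch; the function parse_catalog_name and the keys of the dictionary it returns are illustrative choices, not an OMDB interface.

```python
import re

DATA_TYPES = {
    "NT": "genes in nucleotide space",
    "AA": "genes in amino acid space",
    "SC": "scaffolds",
}
SOURCES = {"G": "genomes", "A": "assemblies"}

# OMDBv<version>_XX_Y_Z, where Z is either R or NR followed by a percentage.
NAME_PATTERN = re.compile(
    r"^OMDBv(?P<version>\d+\.\d+)_(?P<data_type>NT|AA|SC)_(?P<source>G|A)_(?P<derep>R|NR\d+)$"
)


def parse_catalog_name(name):
    """Split a catalog name into its version, data type, source, and dereplication level."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError("not a recognized catalog name: " + name)
    derep = match.group("derep")
    return {
        "version": match.group("version"),
        "data_type": DATA_TYPES[match.group("data_type")],
        "source": SOURCES[match.group("source")],
        "redundant": derep == "R",
        # Percent similarity threshold, or None for redundant catalogs.
        "threshold_percent": None if derep == "R" else int(derep[2:]),
    }
```

For instance, parse_catalog_name("OMDBv2.0_AA_G_NR50") reports amino acid genes from genomes with a threshold of 50, while a redundant catalog such as OMDBv2.0_SC_G_R has no threshold.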

Methods

Redundant catalogs and the catalogs dereplicated at 100% were generated with custom scripts.

The OMDBv2.0_NT_G_NR95 catalog was clustered using MMseqs2 with the following commands:

mmseqs createdb OMDBv2.0_NT_G_R.fna OMDBv2.0_NT_G_NR95.mmseqs.db --dbtype 2 --shuffle 0

mmseqs cluster OMDBv2.0_NT_G_NR95.mmseqs.db OMDBv2.0_NT_G_NR95.mmseqs.db.9590.cluster mmseqs_tmp --kmer-per-seq-scale 0 --kmer-per-seq 1000 -s 4 --max-seq-len 80000 --remove-tmp-files 0 --cluster-mode 2 --min-seq-id 0.95 --threads 96 --cov-mode 1 -c 0.9 --spaced-kmer-mode 0 --alignment-mode 3 --cluster-reassign 1 

mmseqs createtsv OMDBv2.0_NT_G_NR95.mmseqs.db OMDBv2.0_NT_G_NR95.mmseqs.db OMDBv2.0_NT_G_NR95.mmseqs.db.9590.cluster OMDBv2.0_NT_G_NR95.mmseqs.9590.cluster.tsv

The OMDBv2.0_AA_G_NR50 catalog was clustered using MMseqs2 with the following command:

mmseqs easy-cluster OMDBv2.0_AA_G_R.faa mmseqs_dir mmseqs_tmp --min-seq-id 0.5 -c 0.9 --cov-mode 1 --threads 96

The OMDBv2.0_AA_G_NR30 catalog was clustered using MMseqs2 with the following command:

mmseqs easy-cluster OMDBv2.0_AA_G_R.faa mmseqs_dir mmseqs_tmp --min-seq-id 0.3 -c 0.9 --cov-mode 1 --threads 96
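
MMseqs2 writes cluster membership as a two-column TSV (representative, then member, with each representative also listed as a member of its own cluster). The Python sketch below summarizes such a file. It assumes the released cluster files follow this layout, and that a singleton is a cluster with exactly one member; whether the Singletons column in the tables above was computed this way is not stated. The function names are illustrative.

```python
from collections import Counter


def cluster_sizes(tsv_path):
    """Count members per representative in a two-column (representative, member) TSV."""
    sizes = Counter()
    with open(tsv_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            sizes[fields[0]] += 1
    return sizes


def singleton_fraction(sizes):
    """Return the fraction of clusters that contain exactly one member."""
    if not sizes:
        return 0.0
    singletons = sum(1 for count in sizes.values() if count == 1)
    return singletons / len(sizes)
```

The Counter holds one entry per cluster, so for the larger catalogs memory use grows with the number of clusters rather than the number of genes.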

Supporting information