Microbiome Summer School 2017 - Metagenomics using Ray Meta
This is the tutorial for the assembly-based metagenomic course of the 2017 Microbiome Summer School.
Software requirements
For this tutorial, you will need the following software :
- Ray Meta : https://github.com/zorino/ray
- Ray Platform (required to install Ray) : https://github.com/sebhtml/RayPlatform
- MPI : https://www.open-mpi.org/software/ompi/v2.1/
- R : https://www.r-project.org/
- Laughing-nemesis : https://github.com/plpla/laughing-nemesis.git
- Python
The tutorial will be performed on the following dataset :
http://perso.genome.ulaval.ca/~plapie01/SummerSchool2017/SummerSchool2017-UsingRayMeta.tar.gz
Introduction
Ray Meta is one of the many softwares that allow the assembly of metagenomes. As explained in the plenary session, each assembly software has its advantages and inconveniences. The main advantage of Ray Meta is that it integrates assembly with taxonomical profiling. It also allows to identify the putative taxonomical origin of contigs. These are tools that are very useful when working with metagenomic data. The scalability of Ray makes it supercomputer friendly and thus useful to work on large metagenomic studies with samples that have hundreds of millions of reads. As a drawback, current versions Ray Meta may output smaller assemblies that other softwares such as MegaHit or MetaSpades. During your metagenomic work, it is important to consider all the available tools in order to make the best of your data. In this tutorial, I will show how to use Ray Meta and interpret its output in order to better understand important concepts relating to assembly-based metagenomics.
Step 0 - Download data
For this tutorial, we will use a syntetic dataset inspired by a genome we once sequenced for a collaborator. So, in theory, we are sequencing a genome from Clostridium difficile…
cd
wget http://perso.genome.ulaval.ca/~plapie01/SummerSchool2017/SummerSchool2017-UsingRayMeta.tar.gz
tar -xf SummerSchool2017-UsingRayMeta.tar.gz
Step 1 - Metagenome assembly using Ray Meta
In the first part of this tutorial, we will assemble a mixed bacterial culture and evaluate the quality of the assembly.
For the purpose of this tutorial, we will use a rarefied dataset that will not provide a complete assembly and then compare the results with a sample with sufficient sequencing depth.
Assemble and profile sample
The command to run the assembly and profiling using Ray Meta is as follow. Please go to the ~/UsingRayMeta/ folder and run the preceding command.
cd ~/UsingRayMeta/
mpiexec -n 2 Ray \
-o \
Sample-Small \
-k \
31 \
-p \
reads/MetaSim_Cdiff8_Ecoli2_50000_fir.fastq \
reads/MetaSim_Cdiff8_Ecoli2_50000_sec.fastq \
-search \
genomes \
-with-taxonomy \
Genome-to-Taxon.tsv \
TreeOfLife-Edges.tsv \
Taxon-Names.tsv
This will take around 45 minutes, so we will continue the tutorial below. Now that it is running, we will examine this command line-by-line.
mpiexec -n 2 Ray \
Mpiexec is the paralellization software used by Ray. The “-n” allows to indicate how many cores should be used for the analysis. For this tutorial, we will be using 2 cores. In a real project, the number of processors will depend on the amount of sequence to be assembled and on the size of the profiling datasets (which we will explain soon). To assemble a single genome, we often use a value of 32, while for complex microbiomes we can use up to 128 or 256 processors.
-o \
Sample-Small \
The “-o” is the output directory.
-k \
31 \
The “-k” is the k-mer length used for the analysis. In a majority of cases a k-mer length of 31 will provide optimal assembly. However, depending on the sequencing depth and on the complexity of the organism or community, a different value for k could be optimal. In doubt, it can be useful to compare assemblies using different values of k. In our lab, it is standard procedure to test 21, 31, 51 and 71 as values of k.
-p \
reads/MetaSim_Cdiff8_Ecoli2_50000_fir.fastq \
reads/MetaSim_Cdiff8_Ecoli2_50000_sec.fastq \
The “-p” tells Ray to use paired-end reads. Therefore, we give it two files, one for read 1 and one for read 2. Alternatively, “-s” can be used for single reads. Several entries of -p or -s can be added to the arguments
-search \
genomes \
The “-search” function allows to quantify the k-mer content based on a reference database. In this example, the genome directory contains three genomes :
- Clostridium_difficile_630.fasta
- Escherichia_coli_K_12_substr__DH10B.fasta
- Leuconostoc_lactis_KCTC_3528_uid68683.fasta
In a real analysis, this directory can contain thousands of bacterial genomes. Several directories can be given to the software by repeating the “-search” argument. For example, if working on the human microbiome, the following sequence file could be included in addition to bacterial genomes (could be : “-search GRCh38”). This is important as k-mers from the host can sometimes cause false positive signals for certain bacteria.
-with-taxonomy \
Genome-to-Taxon.tsv \
TreeOfLife-Edges.tsv \
Taxon-Names.tsv
This final argument tells Ray which taxonomy to use.
- Genome-to-Taxon.tsv : Makes the link between genome sequences and their taxonomical classification. For this tutorial, we use a simplified version of the Genome-to-Taxon.tsv file, since the complete version can be several Gb.
- TreeOfLife-Edges.tsv : Allows reconstitution of the taxonomical tree.
- Taxon-Names.tsv : Gives the names of the nodes for the taxonomical tree (the taxa names).
When adding new genomes to the taxonomy, one must make sure to include the matching information in these three files.
Step 2 - Understanding the Ray output directory
How is our assembly doing? To see at which step the assembly currently is, we can consult the file ElapsedTime.txt.
cd ~/UsingRayMeta/Sample-Small
more ElapsedTime.txt
Precomputed results for the assembly of this sample (but with a lot more reads!) is available in the following directory : ~/UsingRayMeta/Sample-Big.
We will look at the final version of the precomputed assembly. Note that this assembly was done using 48 processors, so it took only 20 minutes to run.
cd ~/UsingRayMeta/Sample-Big
more ElapsedTime.txt
#Step Date Elapsed time Since Beginning
Network testing 2017-06-15T14:47:08 1 seconds 1 seconds
Counting sequences to assemble 2017-06-15T14:47:09 1 seconds 2 seconds
Sequence loading 2017-06-15T14:47:16 7 seconds 9 seconds
K-mer counting 2017-06-15T14:47:36 20 seconds 29 seconds
Coverage distribution analysis 2017-06-15T14:47:43 7 seconds 36 seconds
Graph construction 2017-06-15T14:48:09 26 seconds 1 minutes, 2 seconds
Null edge purging 2017-06-15T14:48:49 40 seconds 1 minutes, 42 seconds
Selection of optimal read markers 2017-06-15T14:49:24 35 seconds 2 minutes, 17 seconds
Detection of assembly seeds 2017-06-15T14:51:39 2 minutes, 15 seconds 4 minutes, 32 seconds
Estimation of outer distances for paired reads 2017-06-15T14:51:52 13 seconds 4 minutes, 45 seconds
Bidirectional extension of seeds 2017-06-15T14:55:36 3 minutes, 44 seconds 8 minutes, 29 seconds
Merging of redundant paths 2017-06-15T15:03:44 8 minutes, 8 seconds 16 minutes, 37 seconds
Generation of contigs 2017-06-15T15:03:53 9 seconds 16 minutes, 46 seconds
Scaffolding of contigs 2017-06-15T15:05:33 1 minutes, 40 seconds 18 minutes, 26 seconds
Counting sequences to search 2017-06-15T15:05:34 1 seconds 18 minutes, 27 seconds
Graph coloring 2017-06-15T15:06:01 27 seconds 18 minutes, 54 seconds
Counting contig biological abundances 2017-06-15T15:06:13 12 seconds 19 minutes, 6 seconds
Counting sequence biological abundances 2017-06-15T15:06:55 42 seconds 19 minutes, 48 seconds
Loading taxons 2017-06-15T15:07:02 7 seconds 19 minutes, 55 seconds
Loading tree 2017-06-15T15:07:22 20 seconds 20 minutes, 15 seconds
Processing gene ontologies 2017-06-15T15:07:34 12 seconds 20 minutes, 27 seconds
Computing neighbourhoods 2017-06-15T15:07:34 0 seconds 20 minutes, 27 seconds
Now, is this assembly any good? To investigate, we will look at the OutputNumbers.txt file.
more OutputNumbers.txt
Contigs >= 100 nt
Number: 13633
Total length: 8137962
Average: 596
N50: 10391
Median: 234
Largest: 265237
Contigs >= 500 nt
Number: 1531
Total length: 5224457
Average: 3412
N50: 85464
Median: 632
Largest: 265237
Scaffolds >= 100 nt
Number: 13619
Total length: 8139296
Average: 597
N50: 15499
Median: 234
Largest: 265237
Scaffolds >= 500 nt
Number: 1517
Total length: 5225791
Average: 3444
N50: 109710
Median: 631
Largest: 265237
How do you interpret these statistics?
Q1: What can you say about this assembly? Is it good? Do we have good coverage of the expected C. difficile genome?
The total length of this assembly is higher than what is expecter for a C. difficile genome. When looking only at contigs longer than 500 nucleotide, the assembly is 5,224,457 nt. The number of contigs is very high for the length of the assembly. Here, we were assembling a Clostridium difficile genome. Thus, we were expecting a genome length around 4 Mb. To troubleshoot these results, we will look at the sequencing statistics.
The number of reads used for assembly can be found in the file NumberOfSequences.txt
more NumberOfSequences.txt
Files: 2
FileNumber: 0
FilePath: MetaSim_Cdiff8_Ecoli2_1000000_fir.fastq
NumberOfSequences: 1000000
FirstSequence: 0
LastSequence: 999999
FileNumber: 1
FilePath: MetaSim_Cdiff8_Ecoli2_1000000_sec.fastq
NumberOfSequences: 1000000
FirstSequence: 1000000
LastSequence: 1999999
Summary
NumberOfSequences: 2000000
FirstSequence: 0
LastSequence: 1999999
The number of reads in the analysis is 2,000,000. The number of nucleotides sequenced is 101 * 2,000,000 = 202,000,000. If we divide it by the expected genome length, we get 202,000,000 / 4e6 = 50X coverage, which is indeed sufficient for assembly.
Further troubleshooting can be done by looking at the CoverageDistributionAnalysis.txt file
more CoverageDistributionAnalysis.txt
k-mer length: 31
Number of k-mers in the distributed de Bruijn graph: 18616594
Lowest coverage observed: 2
MinimumCoverage: 10
PeakCoverage: 17
RepeatCoverage: 24
Number of k-mers with at least MinimumCoverage: 8459298 k-mers
Percentage of vertices with coverage 2: 15.5668 %
DistributionFile: Sample_SS-Big/CoverageDistribution.txt
The information of interest is the number of k-mers with at least MinimumCoverage: 8,459,298 k-mers. This estimates the size of the de bruijn graph that attains the minimum coverage, in this case a depth of 10. If we divide this number by 2 (to account for the two strands of DNA), we can estimate the actual size of the genome or metagenome. In this case, the genome size would be approximately 4,2 Mb, which is what we would expect as a genome size. However, the number of k-mers in the graph is almost twice that size : 18,616,594. This may indicate something is not behaving as expected…
We will look into to profiling results of Ray to investigate the taxonomical content of the sample. This information is found in the BiologicalAbundance folder.
cd BiologicalAbundances
ls
0.Profile.genomes.tsv 0.Profile.TaxonomyRank=no rank.tsv 0.Profile.TaxonomyRank=superkingdom.tsv _Directories.tsv _Taxonomy
0.Profile.TaxonomyRank=class.tsv 0.Profile.TaxonomyRank=order.tsv 20170616_Nemesis.tsv _Frequencies
0.Profile.TaxonomyRank=family.tsv 0.Profile.TaxonomyRank=phylum.tsv _Coloring _GeneOntology
0.Profile.TaxonomyRank=genus.tsv 0.Profile.TaxonomyRank=species.tsv _DeNovoAssembly genomes
The .tsv files provide the proportion of samples that could be identified as taxonomic units (in k-mers).
Q2: Take a few minutes to look at the .tsv files and try to understand what is happening with this assembly.
In *0.Profile.TaxonomyRank=species.tsv*, we can see that 20% of the sample is *E. coli* while the expected *C. difficile* is only 20% of the sample. The *Leuconostoc lactis* signal is background noise (< 10e-6). Yep, this sample is not pure. It is a mixed culture.
To simplify the interpretation when working with mixed population, we can look at the taxa that represent more than 0.01% of a sample. In this example, it is not that useful, but as you can see it removes the background signal from E. coli.
awk -F "\t" '$4>0.0001' 0.Profile.TaxonomyRank=species.tsv
#TaxonIdentifier TaxonName TaxonRank TaxonProportion
562 Escherichia coli species 0.209279
1496 Peptoclostridium difficile species 0.79072
We will move to another directory, with more complex metagenomic sample. In this second assembly, take a look at the files we consulted in this section to answer the following questions :
cd ~/UsingRayMeta/Sample_P4J7-Assembly
How many reads did we use? What is the assembly size? Broken in how many contigs? What is the size of the graph? How does it relate with the size of the assembly? How is the species proportion compared to the lower sequencing depth analysis?
Step 3 - Contig binning
From this complete assembly, we will bin the contigs to separate the Escherichia from the Clostridium.
Some data on contig identification is already available in the files, although we will need to do some processing to make this information more valuable.
The first file to look into is in the BiologicalAbundances/genomes/ContigIdentification.tsv.
cd ~/UsingRayMeta/Sample-Big
more BiologicalAbundances/genomes/ContigIdentifications.tsv
#Contig name K-mer length Contig length in k-mers Contig strand Category Sequence number Sequence name Sequence length in k-mers Matches in contig Contig length ratio Sequence length ratio
contig-0 31 37550 F Clostridium_difficile_630 1 gi|126697566|ref|NC_009089.1| Clostridium difficile 630, complet 4290222 103 0.00274301 2.40E-05
contig-0 31 37550 R Clostridium_difficile_630 1 gi|126697566|ref|NC_009089.1| Clostridium difficile 630, complet 4290222 37401 0.996032 0.00871773
contig-1 31 1601 F Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 1574 0.983136 0.000335886
contig-1 31 1601 R Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 27 0.0168645 5.76E-06
contig-10 31 1401 R Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 1401 1 0.000298969
contig-1000000 31 82506 F Clostridium_difficile_630 1 gi|126697566|ref|NC_009089.1| Clostridium difficile 630, complet 4290222 82447 0.999285 0.0192174
contig-10000000 31 717 F Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 717 1 0.000153005
contig-100000000 31 276 F Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 276 1 5.89E-05
contig-100000001 31 234 F Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 234 1 4.99E-05
contig-100000002 31 254 F Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 254 1 5.42E-05
contig-100000003 31 256 R Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 256 1 5.46E-05
contig-100000004 31 265 F Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 265 1 5.66E-05
contig-100000005 31 248 R Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 248 1 5.29E-05
contig-100000006 31 273 R Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 273 1 5.83E-05
contig-100000007 31 267 R Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 267 1 5.70E-05
contig-100000008 31 261 F Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 261 1 5.57E-05
contig-100000009 31 269 R Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 269 1 5.74E-05
contig-10000001 31 687 R Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 687 1 0.000146604
contig-100000010 31 239 R Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 239 1 5.10E-05
contig-100000011 31 259 R Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 259 1 5.53E-05
contig-100000012 31 1737 R Leuconostoc_lactis_KCTC_3528_uid68683 1149 gi|339305013|ref|NZ_AEOR01001150.1| Leuconostoc lactis KCTC 3528 2075 16 0.00921128 0.00771084
contig-100000012 31 1737 F Clostridium_difficile_630 1 gi|126697566|ref|NC_009089.1| Clostridium difficile 630, complet 4290222 35 0.0201497 8.16E-06
contig-100000012 31 1737 F Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 1620 0.932642 0.000345703
contig-100000012 31 1737 R Escherichia_coli_K_12_substr__DH10B 0 gi|170079663|ref|NC_010473.1| Escherichia coli str. K-12 substr. 4686107 117 0.0673575 2.50E-05
For each contig, we get several information, notably the lenght of the contig, the reference sequence matching the contig and the proportion of the contig matched by the reference sequence. The problem (or the advantage?) with this file is that it includes several entry for each contig (see contig-100000012), providing its similarity with several sequences from the reference database. For a small community like this one, we could analyse this file. However, for more complex metagenome, we need help!
We will not be running the following steps since it take some time. It has already been computed and available in the directory.
Enter the following commands to download the laughing-nemesis last common ancestor tool for Ray Meta.
## Do not run this in the tutorial
cd ~
mkdir software
cd software
git clone https://github.com/plpla/laughing-nemesis.git
We would then use the following command line to run it on our assembly :
## Do not run this in the tutorial
python ~/software/laughing-nemesis/FindLastCommonAncester.py lca -d ~/SummerSchoolMicrobiome/Sample_RVH-2106-Ray-2017-06-06/BiologicalAbundances -t ~/SummerSchoolMicrobiome/UsingRayMeta/TreeOfLife-Edges.tsv
Now, we look at the precomputed results.
cd ~/UsingRayMeta/Sample-Big
more BiologicalAbundances/20170616_Nemesis.tsv
Contig-name Contig_length_in_kmers Contig_mode_kmer_depth Total_colored_kmer LCA_taxon_id LCA_name LCA_rank LCA_score phylum phylum_score class class_score order order_score family family_score genus genus_score species species_score
contig-187000039 149 3 149 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-293000047 95 2 95 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-293000046 106 5 106 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-203000019 138 2 138 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-203000018 129 5 129 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-280000038 101 5 101 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-228000026 122 2 122 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-276000038 103 4 103 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-199000039 133 2 133 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-280000032 99 3 99 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-280000035 97 2 97 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-199000038 178 8 178 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-247000024 105 2 105 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-247000025 98 4 98 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-247000026 113 6 113 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
contig-247000027 118 4 118 562 Escherichia coli species 1 Proteobacteria 1 Gammaproteobacteria 1 Enterobacteriales 1 Enterobacteriaceae 1 Escherichia 1 Escherichia coli 1
In this file, we gather all the beauty of Ray Meta. For each contig, we get the following information :
- Contig-name
- Contig_length_in_kmers
- Contig_mode_kmer_depth - Mode coverage depth of the k-mers in the contig.
- Total_colored_kmer - Number of k-mers in the contig that have been colored by the reference database.
- LCA_taxon_id - Best hit taxonomical association NCBI taxID.
- LCA_name - Best hit taxonomical association taxa name.
- LCA_rank - Best hit taxonomical association rank of the best hit.
- LCA_score - Best hit taxonomical association score
The next colums indicate the best classification at each taxonomical rank and its score. A score > 0.01 indicate a high probability that the taxonomical association is good.
- phylum
- phylum_score
- class
- class_score
- order
- order_score
- family
- family_score
- genus
- genus_score
- species
- species_score
The score, also known as the pl-value, is computed using the following formula : Matches_in_contig^2 / (Contig_length_in_k-mers * Colored_k-mers)
We can now bin contigs and compute the size of each bin that is associated to a species. We will work with R.
Let’s open R.
R
Load the contig identification file.
data <- read.table("~/UsingRayMeta/Sample-Big/BiologicalAbundances/20170616_Nemesis.tsv", header=1, row.names=1, sep="\t")
data[1:5,]
At which rank is the LCA the best ?
summary(data$LCA_rank)
How many families have been identified in our file ?
summary(data$family)
We select the lines for which species are identified with a score greater than 0.01.
data.select <- data[which(data$species_score>=0.01),]
### List species that were detected and sort them based on the number of contigs
sort(summary(droplevels(data.select$species)))
We calculate the sum of the length of contigs identified as Clostridium difficile (it is named Peptoclostridium difficile in this taxonomy).
sum(data.select[which(data.select$species=="Peptoclostridium difficile"),"Contig_length_in_kmers"])
We can generate a table to do that with all species that had at least one contig.
genome.size <- c()
for(species in levels(droplevels(data.select$species))){
genome.size <- c(genome.size, sum(data.select[which(data.select$species==species),"Contig_length_in_kmers"]))
}
names(genome.size) <- levels(droplevels(data.select$species))
sort(genome.size)
We will look at the coverage distribution of contigs overall. This will allow to quality control contig binning and better interpret results.
summary(data[,2])
plot(density(data[,2]))
Now, we look at a specific species.
summary(data.select[which(data.select$species=="Peptoclostridium difficile"),"Contig_mode_kmer_depth"])
plot(density(data.select[which(data.select$species=="Peptoclostridium difficile"),"Contig_mode_kmer_depth"]))
summary(data.select[which(data.select$species=="Escherichia coli"),"Contig_mode_kmer_depth"])
plot(density(data.select[which(data.select$species=="Escherichia coli"),"Contig_mode_kmer_depth"]))
We can also overlap the curve for the two species that we know to be present in this sample.
plot(density(data.select[which(data.select$species=="Peptoclostridium difficile"),"Contig_mode_kmer_depth"]))
lines(density(data.select[which(data.select$species=="Escherichia coli"),"Contig_mode_kmer_depth"]), col="red")
Finally, we may want to generate a list of contigs that probably originate from SPECIES. With this list, we could extract the contigs from the Contigs.fasta file in the assembly directory.
rownames(data.select[which(data.select$species=="Peptoclostridium difficile"),])
Want to try it with more complex data? Repeat the same procedure with THIS file.
data <- read.table("~/UsingRayMeta/Sample_P4J7-Assembly_nemesis.tsv", header=1, row.names=1, sep="\t")
data[1:5,]
Congratulation! You have completed the tutorial. Now, make good art!