Thursday, November 19, 2009

Using Vmatch to combine assemblies

In praise of Vmatch

If I could take only one bioinformatics tool with me to a desert island it would be Vmatch.

In addition to being a versatile alignment tool, Vmatch has many lesser-known features which also leverage suffix trees. The dbcluster option allows Vmatch to cluster similar sequences using Vmatch index files. This method is about 1000x faster than attempting to align all sequences against each other, no matter how clever the algorithm.

Originally presented as means to cluster EST sequences, this option is useful for processing the output from de novo short read assemblers like Velvet and ABYSS. Vmatch -dbcluster allows us to easily create a non-redundant set of contig sequences originating from several assemblies.

Why would we want to combine assemblies?

Every kmer is sacred

Why not just take the one with the highest N50? At the recent Genome Informatics meeting at CSHL, Inanc Birol (ABYSS) said "every kmer is sacred". Our experience with the de novo transcriptome assembly of plants has been that the best set of contigs is spread out all over the parameter space. Longer kmer and cvCut settings can produce a longer set of elite contigs at the expense of omitting lowly expressed (or spurious?) contigs that appear at less stringent settings.

Reads held hostage

To the point above, I would add "every cvCut is also sacred". That doesn't exactly roll off the tongue, but we have seen some instances in which an assembly will have a higher read usage at a cvCut of 10 than at 5. This paradox suggests there is a "contig read hostage" situation, in which reads critical to the formation of longer contigs can be held captive in low coverage short contigs. Raising the cvCut threshold frees up these critical reads to extend more plausible contigs and allowing additional reads from the pool to be recruited into the assembly.

What is a non-redundant set?

The following sequences have some redundancies. NODE2 and NODE3 do not add any information.
The non-redundant set will have only NODE1 and NODE4.
>NODE2 a substring of NODE1
>NODE3 a reverse complement substring of NODE1
>NODE4 a new sequence

How do we create a non-redundant set?

Do not use -nonredundant

No, really. There is an option called -nonredundant which should presumably do what we want, but unfortunately that writes a "representative" member of each cluster to a file, which may or may not be the longest contig. I'm not sure what makes a sequence representative of a cluster, but for this application we want the longest member of each cluster.

On April 29th 2010, Vmatch 2.1.3 was released. The most important change is that the option -nonredundant now delivers the longest sequence from the corresponding cluster (instead
of an unspecified representative). This should make the approach unnecessary.

To create a non-redundant set we will produce cluster files and then extract the longest sequence from each file. Use the following commands to produce your cluster files from an index:
mkvtree -allout -pl -db contigs1.fa contigs2.fa -dna -indexname myIndex
#mkvtree will accept multiple fasta or gzipped fasta files

#if using Vmatch 2.1.3 or later:
vmatch -d -p -l 25 -dbcluster 100 0 -v -nonredundant nonredundantset.fa myIndex > mySeqs.rpt
# thx to Hamid Ashrafi for debugging this syntax

#if using older Vmatch
mkdir sequenceMatches 
#this is where vmatch will put each cluster

vmatch -d -p -l 25 -dbcluster 100 0 mySeqs "(1,0)" -v -s myIndex > mySeqs.rpt
mv mySeqs*.match mySeqs*.fna sequenceMatches

What do these options do?

  • -d direct matches (forward strand)
  • -p palindromic matches (reverse strand)
  • -l search length (set this below your shortest sequence)
  • -dbcluster queryPerc targetPerc - runs an internal alignment of the index created by mkvtree.
    • The two numeric arguments specify what percentage of your query sequence (the smaller) is involved in an alignment to the cluster sentinel/target superstring. For our purposes we require 100% of our query sequence substring to match the target. We don't care what percentage of the target is aligned, so set the second parameter to 0.
    • The third argument to dbcluster is the index prefix name that vmatch will give to the THOUSANDS of cluster .fna and .match files it will create.
    • The fourth argument "(1,0)" specifies that we want to keep singletons (in a file called mySeqs.single.fna) and that there is no limit to the number of sequences in an acceptable cluster.

  • -v verbose report (redirected to a the .rpt)
  • -s create the individual cluster fasta files and match reports

Consult the vmatch manual for fuzzy matches and more examples:

The standard output details the clusters and singlets.
0:761437: NODE_83_length_31_cov_22.451612857642: 
NODE_106_length_31_cov_22.451612621409: NODE_152_length_29_cov_27.758621749981: 
NODE_185_length_29_cov_27.7586211:761861: NODE_531_length_424_cov_26.851416805590: 
NODE_1413_length_320_cov_28.187500837480: NODE_1236_length_320_cov_28.187500858407: 
NODE_937_length_320_cov_28.187500765510: NODE_1542_length_108_cov_34.870369621979: 
NODE_786_length_425_cov_32.602352750915: NODE_1290_length_321_cov_34.392525

Extract the longest sequence from your cluster files

If using Vmatch 2.1.3 or later this is unnecessary
Here is a perl script, which we will call, to do that
#print the longest sequence in a fasta file
use strict;my %seqs;my $defline;
while (<>) {
  if ( /^#/ || /^\n/ ) {
    #comment line or empty line do nothing
  }elsif (/>/) {
    s/>//g;$defline = $_;
    $seqs{$defline} = "";
  }elsif ($defline) {
    $seqs{$defline} .= $_;
    die('invalid FASTA file');
my $max = $defline;
foreach my $def ( keys %seqs ) {
  $max = ( length( $seqs{$def} ) > length( $seqs{$max} ) ) ? $def : $max;
print ">" . $max . "\n" . $seqs{$max} . "\n";

We want the longest member of each cluster and all sequences in the singletons file:
for f in sequenceMatches/*fna;
if [ "$f" = "sequenceMatches/mySeqs.single.fna" ];
cat $f >> mySeqs_longest_seqs.fa;
else perl $f >> mySeqs_longest_seqs.fa;

That's it - now you have a comprehensive and non-redundant set of the longest contigs from a number of assemblies.


  1. very useful! thank you for posting such a detailed work flow.

  2. thanks for posting this, very interesting...

    any thoughts on how to convert such a combined assembly to ace format?

  3. Thanks for posting this! I found it incredibly useful.