Nucleotide sequence alignment with BlastN

This blog post introduces the concept of sequence alignment, the BLAST algorithm, and an example of how to use it on Qarnot.

Sequence alignment

Definition

Deoxyribonucleic acid, also known as DNA, carries the basic information of any living organism. This information, the genetic code, orchestrates the other levels of the system, from proteins to cells, tissues and organs. This is why the analysis of DNA, RNA and protein sequences has become a key challenge in biology, and why comparing huge amounts of sequencing data was one of the first obvious problems to solve.

Sequence alignment is a bioinformatics method for arranging and comparing two sequences, usually of the same kind (DNA, RNA or protein). Typically there are two input datasets, each containing one or more sequences. The first dataset contains the query, i.e. the sequence(s) we want to analyse. The second is called the reference, or database: the set of sequences the query is compared against.

The final output is generally a human-readable text file showing the mismatches and gaps between the queries and the reference sequence(s). A score is attached to each alignment result, based on the similarity and sequence complexity.
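
To make this output more concrete, here is a minimal sketch that summarises a pair of sequences that have already been aligned. The function name `summarise_alignment` and the scoring values (+1 per match, -1 per mismatch, -2 per gap position) are assumptions chosen for illustration; they are not the scoring scheme of BLAST or of any other tool.

```python
def summarise_alignment(query, reference, match=1, mismatch=-1, gap=-2):
    """Count matches, mismatches and gap positions in two aligned strings.

    Both strings must have the same length; '-' marks a gap.
    The scoring values are illustrative assumptions.
    """
    if len(query) != len(reference):
        raise ValueError("aligned sequences must have the same length")
    counts = {"matches": 0, "mismatches": 0, "gaps": 0}
    midline = []
    for q, r in zip(query, reference):
        if q == "-" or r == "-":
            counts["gaps"] += 1
            midline.append(" ")
        elif q == r:
            counts["matches"] += 1
            midline.append("|")
        else:
            counts["mismatches"] += 1
            midline.append(" ")
    counts["score"] = (counts["matches"] * match
                       + counts["mismatches"] * mismatch
                       + counts["gaps"] * gap)
    return counts, "".join(midline)


# Print the pair with a match line between the two sequences
counts, midline = summarise_alignment("ACGT-ACGT", "ACCTAAC-T")
print("ACGT-ACGT")
print(midline)
print("ACCTAAC-T")
print(counts)
```

The line of `|` characters printed between the two sequences is the same kind of visual aid found in pairwise alignment reports: a bar marks a matching position, a blank marks a mismatch or a gap.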

Objectives

Aligning DNA sequences makes it possible to interpret the differences as point mutations, such as Single Nucleotide Polymorphisms (SNPs) or Single Nucleotide Variants (SNVs), or as insertions and deletions. Alignment is used with High-Throughput Sequencing (HTS) data to match the query sequences against a known sequence, or de novo. In the same way, RNA sequence alignment can be used to quantify gene expression. Finally, protein sequence alignment reveals conserved regions and motifs, giving a functional point of view based on the most representative amino acids. Another representation of this kind of alignment is a sequence logo.
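
As a toy illustration of how an alignment is read as mutations, the sketch below walks through two aligned sequences and lists substitutions, insertions and deletions relative to the reference. The function name `call_variants` and the tuple layout are assumptions made for this example; a real variant caller also takes read depth and base quality into account, which is out of scope here.

```python
def call_variants(aligned_query, aligned_reference):
    """List the differences between two aligned sequences.

    Returns tuples (kind, reference_position, reference_base, query_base).
    Positions are 1-based and counted on the ungapped reference; an
    insertion is reported at the last reference position before it.
    """
    if len(aligned_query) != len(aligned_reference):
        raise ValueError("aligned sequences must have the same length")
    variants = []
    ref_pos = 0
    for q, r in zip(aligned_query, aligned_reference):
        if r != "-":
            ref_pos += 1
        if q == r:
            continue
        if r == "-":
            # Base present in the query only
            variants.append(("insertion", ref_pos, "-", q))
        elif q == "-":
            # Base present in the reference only
            variants.append(("deletion", ref_pos, r, "-"))
        else:
            # Single-base difference
            variants.append(("substitution", ref_pos, r, q))
    return variants


variants = call_variants("ACGTTAC-T", "ACCT-ACGT")
for variant in variants:
    print(variant)
```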

BLAST

Basic Local Alignment Search Tool (BLAST) is a tool, widely used through its web interface, for finding regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and computes the statistical significance of the matches. There are different tools depending on the type of sequence data; in this article, we focus on blastn, which aligns nucleotide sequences.
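
BLAST is a heuristic: rather than comparing every position of the query with every position of the database, it first looks for short words shared by both sequences and then extends those hits. The sketch below is a deliberately simplified illustration of that seed-and-extend idea, not BLAST's implementation. The function names, the word size of 4 (real blastn uses longer words) and the rule of extending only while bases match exactly are all assumptions made to keep the example short.

```python
from collections import defaultdict


def index_words(sequence, word_size):
    """Map every word of length word_size to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_size + 1):
        index[sequence[i:i + word_size]].append(i)
    return index


def find_seeds(query, reference, word_size=4):
    """Return (query_start, reference_start) pairs of exact word matches."""
    ref_index = index_words(reference, word_size)
    seeds = []
    for q_start in range(len(query) - word_size + 1):
        word = query[q_start:q_start + word_size]
        for r_start in ref_index.get(word, []):
            seeds.append((q_start, r_start))
    return seeds


def extend_seed(query, reference, q_start, r_start, word_size=4):
    """Extend a seed in both directions while the bases keep matching.

    Returns (q_start, q_end, r_start, r_end) with end indices exclusive.
    """
    q_end, r_end = q_start + word_size, r_start + word_size
    while (q_start > 0 and r_start > 0
           and query[q_start - 1] == reference[r_start - 1]):
        q_start, r_start = q_start - 1, r_start - 1
    while (q_end < len(query) and r_end < len(reference)
           and query[q_end] == reference[r_end]):
        q_end, r_end = q_end + 1, r_end + 1
    return q_start, q_end, r_start, r_end


query = "TTGACCA"
reference = "AAGGACCATT"
seeds = find_seeds(query, reference)
print(seeds)
q_start, q_end, r_start, r_end = extend_seed(query, reference, *seeds[0])
print(query[q_start:q_end], reference[r_start:r_end])
```

Indexing the reference once and looking up each query word is what makes the search fast: the cost of the lookup does not depend on where in the reference the word occurs.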

BLAST on Qarnot

In this part we describe a simple example of using BLAST, and more specifically the blastn tool, on Qarnot with the Python SDK. We will align a list of query DNA sequences against a list of reference DNA sequences.

Tutorial

Before we start, you need to create a Qarnot account. We offer 15€ worth of computation on your subscription.

First, in a Qarnot_blastn_example folder, create a folder named blastn_resources and save the following data inside it. The data contains two local sequences.

Download the resources here.

Now, let’s use the Qarnot Python SDK to launch the distributed calculation. Save the following script as run.py in your Qarnot_blastn_example folder. In this script, you need to enter the Qarnot token linked to your account to use our platform.

#!/usr/bin/env python3
import sys
import qarnot

client_token = "<<<XXX put your token here XXX>>>"
docker_repo = "ncbi/blast"
docker_tag = "2.10.1"

def workflow():

    # Create the task that builds the BLAST database (one instance)
    task = conn.create_task('blastn-createdb', 'docker-batch', 1)
    # Create the input bucket
    input_bucket = conn.create_bucket('blastn-ref')
    input_bucket.sync_directory("blastn_resources/dataset-blastn")
    # Create the results bucket
    output_bucket = conn.create_bucket('blastn-db')
    # Select the Docker image that the task runs
    task.constants['DOCKER_REPO'] = docker_repo
    task.constants['DOCKER_TAG'] = docker_tag
    # Attach the buckets to the task
    task.resources.append(input_bucket)
    task.results = output_bucket
    
    task.constants['DOCKER_CMD'] = "makeblastdb -in chr6.fna -dbtype nucl -parse_seqids -out chr6"
    
    # Launch the database task on Qarnot and wait until it has finished
    task.submit()
    # Poll the task periodically and relay its output
    last_state = ''
    done = False
    while not done:
        if task.state != last_state:
            last_state = task.state
            print("** {}".format(last_state))
        done = task.wait(5)
        sys.stdout.write(task.fresh_stdout())
        sys.stderr.write(task.fresh_stderr())

    # In case of failure, display the first error and stop here
    if task.state == 'Failure':
        print("** Errors: %s" % task.errors[0])
        return "An error happened"
    # The database is ready: run the alignment against it
    return launch_blastn(output_bucket)

def launch_blastn(db_bucket):

    # Create the task that runs the alignment (one instance)
    task = conn.create_task('blastn-sequencing', 'docker-batch', 1)
    # Input bucket: the same 'blastn-ref' bucket that workflow() filled
    input_bucket = conn.create_bucket('blastn-ref')
    # Create the results bucket
    output_bucket = conn.create_bucket('blastn-align')
    # Select the Docker image that the task runs
    task.constants['DOCKER_REPO'] = docker_repo
    task.constants['DOCKER_TAG'] = docker_tag
    # Attach the input and database buckets to the task
    task.resources = [ input_bucket, db_bucket ]
    task.results = output_bucket
    
    task.constants['DOCKER_CMD'] = "blastn -db chr6 -query hla-b.fsa -out results.out"
    # Launch the alignment task on Qarnot and wait until it has finished
    task.submit()
    # Poll the task periodically and relay its output
    last_state = ''
    done = False
    while not done:
        if task.state != last_state:
            last_state = task.state
            print("** {}".format(last_state))
        done = task.wait(5)
        sys.stdout.write(task.fresh_stdout())
        sys.stderr.write(task.fresh_stderr())

    # In case of failure, display the first error and skip the download
    if task.state == 'Failure':
        print("** Errors: %s" % task.errors[0])
        return "An error happened"
    # Download results.out into the local "results" folder
    task.download_results("results")
    return "The sequence alignment finished successfully"

conn = qarnot.connection.Connection(client_token=client_token)
workflow()

In the Qarnot_blastn_example folder, follow these steps to set up a Python virtual environment. Then, you can run the Python script from your terminal by typing chmod +x run.py and then ./run.py.
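
The script above stores the token directly in run.py. If you would rather keep it out of the source file, one option is to read it from an environment variable. The sketch below is only a suggestion: the helper `read_token` and the variable name `QARNOT_TOKEN` are names chosen for this example, not something defined by the SDK.

```python
import os


def read_token(env_var="QARNOT_TOKEN"):
    """Return the API token stored in the given environment variable.

    The variable name is a hypothetical choice made for this example.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            "Set the {} environment variable to your Qarnot token".format(env_var)
        )
    return token


# In run.py, the hardcoded assignment could then become:
# client_token = read_token()
```

With this in place, you would export the variable in your terminal before running ./run.py, and the script fails early with a clear message if the token is missing.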

To summarize, the workflow provided by this project submits two consecutive tasks:

Build a database based on a reference sequence;
Align a sequence (query) from a fasta file to the previously built database.

The database is built by the first task and passed to the second through a Qarnot bucket, which lets the tasks run in a stateless mode.
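
Both tasks in run.py wait for completion with the same polling loop. As a possible refactoring, the loop can be moved into one helper that both functions call. This is a sketch, not part of the original script: the name `wait_for_task` and its `out` and `err` parameters are our own, and it relies only on the task attributes already used in run.py (state, wait, fresh_stdout, fresh_stderr and errors).

```python
import sys


def wait_for_task(task, poll_seconds=5, out=sys.stdout, err=sys.stderr):
    """Poll a submitted task until it ends; return True unless it failed.

    Uses only the task attributes that run.py already relies on.
    """
    last_state = ''
    done = False
    while not done:
        # Report each state change once
        if task.state != last_state:
            last_state = task.state
            print("** {}".format(last_state), file=out)
        done = task.wait(poll_seconds)
        # Relay whatever the task printed since the last poll
        out.write(task.fresh_stdout())
        err.write(task.fresh_stderr())
    if task.state == 'Failure':
        print("** Errors: %s" % task.errors[0], file=err)
        return False
    return True
```

Each function would then call task.submit() followed by wait_for_task(task), and move on to the next step only when the helper returns True.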

You can then follow the task details in your own console or on the Qarnot console by clicking on your task. When it is finished, the alignment results (results.out) will be downloaded to your computer.
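
The command in run.py writes BLAST's default report, which is meant to be read by a person. If you want to post-process the hits, blastn can also write tab-separated output when called with -outfmt 6. Note that this would be a change to DOCKER_CMD, not something the script above does. The sketch below parses the twelve default columns of that format; the function name `parse_tabular` and the sample line are made up for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")
INT_COLUMNS = ("length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send")


def parse_tabular(handle):
    """Yield one dict per hit from a BLAST tabular (-outfmt 6) stream."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip empty lines and comment lines
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        yield hit


# Made-up example line, only to show the shape of the data
sample = "query1\tsubject1\t98.50\t200\t3\t0\t1\t200\t1001\t1200\t1e-100\t350\n"
hits = list(parse_tabular(io.StringIO(sample)))
print(hits[0]["sseqid"], hits[0]["pident"], hits[0]["length"])
```

For a real run, you would open the downloaded file instead of the sample string and, for instance, keep only the hits whose evalue is below a threshold of your choice.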

Going further

The exercise here is straightforward and walks you through the first steps of submitting a simple bioinformatics job on Qarnot through the Python SDK. You could also take advantage of Qarnot jobs to manage the task dependencies instead of the Python code introduced here. Don’t hesitate to read the documentation further, or to contact us for advice or support with your needs and projects.
