Generating DNA barcodes with Numpy

Thu 04 January 2018

Recently, as part of a collaboration with another lab at MSKCC, we have been constructing a library of DNA oligonucleotides. The aim of the project itself is hush-hush, but I'd like to take a minute today and write about a less glamorous job that underpins the library construction: generating short random DNA sequences that identify specific groups of subsequences. These are known as DNA barcodes.

In contemporary sequencing experiments, it is common for technological reasons to pool samples from different conditions, or even from different experiments, as they are processed by an NGS sequencer. DNA barcodes are used to de-multiplex the reads from these experiments post-sequencing. It's surprising how far this technology can be pushed; a recent paper from Cusanovich et al. shows how to use combinatorial indexing with barcodes to get very cost-effective single-cell ATAC-seq.

So, people need barcodes to do sequencing. While software to generate them does exist, the solutions I found were either intended to generate a very small number of codes (on the order of 10-500), or provided only codes of an unsuitable fixed length. Some just plain did not work (sources not cited here to avoid embarrassment).

What I needed was a barcode generator which could produce:

  • approximately 80k barcodes
  • 12 base pairs in length
  • each of which should be Hamming distance 2 or greater from every other member of the set
  • each of which should have GC content within a specified range
  • each of which should not have runs of the same base of length three or greater

This is one of those jobs which began as a chore I had to get through on the way to something more rewarding. So I set about trying to Just Get It Over With, and wrote the simplest generator I could imagine: generate all possible 12mers, then winnow away at successive chunks to find a set of barcodes that satisfies the constraints. In the process I quickly found that my naive solution, while seemingly correct, was far too slow. So I made some computational lemonade, and ended up exploring some fun technologies for accelerating Python code. This post takes liberal inspiration from Jake Vanderplas' excellent blog, where he sometimes writes about cool Numpy tricks & moves, as well as his PyCon talk (video), where he discusses exactly this type of work. Have a watch, it's time well spent.

The first task was to whip up a little script to generate all N-mers. This part was easy enough:

In [1]:
def allstrings(alphabet, length):
    """Find the list of all strings of 'alphabet' of length 'length'"""

    if length == 0:
        return []

    # start from single letters, then prepend one letter per pass
    c = [[a] for a in alphabet]
    for _ in range(1, length):
        c = [[x] + y for x in alphabet for y in c]

    return c
In [2]:
length = 12
alphabet = ['A', 'C', 'G', 'T']

my_strings = allstrings(alphabet, length)

Presto, we've got a load of strings. Here's a sample from the file I wrote to disk:

lski1877:barcodes zamparol$ head -n 10 all_codes.txt
AAAAAAAAAAAA
AAAAAAAAAAAC
AAAAAAAAAAAG
AAAAAAAAAAAT
AAAAAAAAAACA
AAAAAAAAAACC
AAAAAAAAAACG
AAAAAAAAAACT
AAAAAAAAAAGA
AAAAAAAAAAGC
In [3]:
my_strings[0:10]
Out[3]:
[['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
 ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'C'],
 ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'G'],
 ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'T'],
 ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'C', 'A'],
 ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'C', 'C'],
 ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'C', 'G'],
 ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'C', 'T'],
 ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'G', 'A'],
 ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'G', 'C']]
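
As an aside, the standard library can do this enumeration for us: itertools.product computes the same Cartesian product, and does so lazily. A minimal sketch of the alternative (not the code I used):

from itertools import product

# lazily yields all 4**12 combinations as tuples, joined here into strings
all_kmers = (''.join(p) for p in product('ACGT', repeat=12))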

Let's convert this list of lists into a list of strings, with a bit of code thrown in to randomize the input order. Without the shuffle, the code spends a lot of time doing little except rejecting strings beginning with AAA...

In [20]:
import random

random.shuffle(my_strings)
my_codes = [''.join(l) for l in my_strings]

Now a little code to read each of these and build a set satisfying the desired criteria. The first attempt is a pure Python implementation that works with strings and uses list comprehensions to express the criteria for eliminating candidates.

In [12]:
def verify_mismatch_constraints_python(barcode, barcode_set, threshold=3):
    ''' Test if the current barcode is at least `threshold` mismatches away
    from every element of the barcode set '''

    if len(barcode_set) == 0:
        return True
    mismatch_counts = [_mismatches(bc, barcode) for bc in barcode_set]
    return min(mismatch_counts) >= threshold


def _mismatches(bc, barcode):
    ''' Count the positions at which the two barcodes differ '''
    counts = [1 if barcode[p] != bc[p] else 0 for p in range(len(barcode))]
    return sum(counts)


def verify_content_constraints_python(barcode, low = 0.10, high = 0.90):
    ''' Make sure the GC content of the barcode lies strictly inside (low, high),
    and that there are no runs of three identical nucleotides '''
    triples = ['AAA','TTT','GGG','CCC']
    gc_score = [1 if b == 'C' or b == 'G' else 0 for b in barcode]
    gc_content = sum(gc_score) / (1.0 * len(barcode))

    has_triple = any(t in barcode for t in triples)
    return (low < gc_content) and (gc_content < high) and not has_triple

def process_batch_python(batch, barcodes):
    ''' Apply validation rules to each potential barcode in the batch '''
    for barcode in batch:
        if verify_content_constraints_python(barcode) and \
           verify_mismatch_constraints_python(barcode, barcodes):
            barcodes.append(barcode)
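
As a quick sanity check, the helpers behave as expected on a few hand-picked strings (hypothetical values, not part of the original run):

# hypothetical spot-checks of the helpers above
assert _mismatches('AAAATTTTGGGG', 'AAAATTTTGGGC') == 1
assert verify_content_constraints_python('ACGTACGTACGT')        # GC = 0.5, no triples
assert not verify_content_constraints_python('AAAGGGAAAGGG')    # contains AAA and GGG runs
assert verify_mismatch_constraints_python('ACGTACGTACGT', [])   # empty set accepts anything
assert not verify_mismatch_constraints_python('ACGTACGTACGT', ['ACGTACGTACGA'])  # only 1 mismatch away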
    
In [24]:
import time

def get_codes_python(batch_size=10000, max_codes=5000):
    ''' Grow the set of barcodes one by one, processing candidates in batches of batch_size '''

    batch = []
    codes = []
     
    for barcode in my_codes:
        if len(codes) >= max_codes:
            break
            
        batch.append(barcode)
        if len(batch) == batch_size:
            t0 = time.time()
            process_batch_python(batch, codes)
            t1 = time.time()
            delta_t = t1 - t0
            batch = []
            print("result of batch processing is ", len(codes), " qualified codes")
            print("took ", delta_t)
        
            
    return codes

Let's try running this a few times below:

In [25]:
batch_size = 10000
max_codes = 5000
%timeit get_codes_python(batch_size, max_codes)
result of batch processing is  5008  qualified codes
took  37.605226039886475
result of batch processing is  5008  qualified codes
took  38.993963956832886
result of batch processing is  5008  qualified codes
took  38.834331035614014
result of batch processing is  5008  qualified codes
took  44.860349893569946
result of batch processing is  5008  qualified codes
took  38.340277910232544
result of batch processing is  5008  qualified codes
took  37.76196217536926
result of batch processing is  5008  qualified codes
took  36.844419956207275
result of batch processing is  5008  qualified codes
took  37.73810791969299
39.1 s ± 2.47 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

When trying to generate the full set of barcodes, the bottleneck in verify_mismatch_constraints was serious enough that I gave up after about 30 minutes. I tried to recast this as an opportunity to become more familiar with some of the technologies under the umbrella of numeric Python. Inspired by Jake Vanderplas' talk at PyCon, my first step was to recode the strings into Numpy arrays, and to re-express the constraints as operations on those arrays. This way, without having to come up with a better algorithm for generating compliant codes, I could re-express my computation so that faster Numpy routines, rather than the Python interpreter, do the heavy lifting.

In [26]:
import random
import numpy as np
import numba

str_to_num = {'A': 1, 'T': 2, 'G': 3, 'C': 4}
num_to_str = {1: 'A', 2: 'T', 3: 'G', 4: 'C'}

def _str_to_num(barcode):
    converted_list = [str_to_num[b] for b in barcode]
    return np.asarray(converted_list,dtype='int64')

def _num_to_str(array):
    converted_list = list(array)
    return [num_to_str[b] for b in converted_list]
    
def dna_to_numeric(batch):
    ''' convert list of barcode strings in batch to a numpy array '''
    numeric_batch = np.zeros((len(batch),12),dtype=np.int64)
    for i in range(len(batch)):
        numeric_batch[i,:] = _str_to_num(batch[i])
    return numeric_batch

def numeric_to_dna(numeric_batch):
    ''' convert numpy array of barcodes in batch to a list of strings '''
    batch = np.apply_along_axis(_num_to_str, axis=1, arr=numeric_batch)
    batch_list = batch.tolist()
    return [''.join(b) for b in batch_list]

def verify_mismatch_constraints_numpy(barcode, barcode_set, threshold=2):
    ''' Test if the current barcode is at least `threshold` mismatches away
    from every element of the barcode set '''

    # broadcast subtraction of barcode from each code in the barcode set
    if barcode_set.shape[0] > 0:
        mismatch_array = barcode_set - barcode
        mismatch_counts = np.count_nonzero(mismatch_array, axis=1)
        return mismatch_counts.min() >= threshold
    else:
        return True

def verify_content_constraints_numpy(barcode, low = 0.10, high = 0.90):
    ''' Make sure the GC content of the barcode lies strictly inside (low, high) '''
    gc_content = barcode[np.logical_or(barcode == 3, barcode == 4)].shape[0] / (1.0 * barcode.shape[0])
    return (low < gc_content) and (gc_content < high)

@numba.jit(nopython=True)
def _verify_rep_constraint(barcode, run):
    # slide a window of length run.shape[0] along the barcode, looking for an exact match
    for i in range(barcode.shape[0] - run.shape[0] + 1):
        result = True
        for j in range(run.shape[0]):
            result = result and (barcode[i + j] == run[j])
        if result:
            return False    # found a run, reject this barcode
    return True             # did not find a run, accept this barcode

def verify_no_rep_constraints(barcode):
    ''' Ensure that there are no runs of three identical nucleotides '''
    for i in range(1, 5):
        if not _verify_rep_constraint(barcode, np.array([i, i, i])):
            return False
    return True

def process_batch_numpy(numeric_batch, barcodes, last_index=0, max_codes=80000):
    ''' apply validation rules to each potential barcode in the batch '''
    np.random.shuffle(numeric_batch)
    for barcode in numeric_batch:
        
        if last_index == max_codes:
            break
        
        if verify_mismatch_constraints_numpy(barcode, barcodes[0:last_index,:]) and \
           verify_content_constraints_numpy(barcode) and \
           verify_no_rep_constraints(barcode):
            barcodes[last_index,:] = barcode
            last_index += 1
            
    return last_index
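
To make the broadcasting step concrete, here is a toy example (the values are mine, not from the run below): subtracting a candidate from an (N, 12) array fans the candidate out across all N rows, and np.count_nonzero collapses each row of differences into a Hamming distance.

import numpy as np

codes = np.array([[1, 2, 3, 4],
                  [1, 2, 3, 3],
                  [4, 4, 4, 4]])
candidate = np.array([1, 2, 3, 4])

diff = codes - candidate                    # (3, 4) - (4,) broadcasts over rows
hamming = np.count_nonzero(diff, axis=1)    # array([0, 1, 3])
print(hamming.min() >= 2)                   # False: an identical code is in the set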
In [31]:
import time

def get_codes_numpy(batch_size, max_codes):
    ''' Grow the set of barcodes one by one, ensuring they satisfy the mismatch,
    GC-content, and repetition constraints '''
    
    # bookkeeping for code array
    last_index = 0
    batch = []
    codes = np.empty((max_codes,12),dtype=np.int64)
    
    for barcode in my_codes:
        if last_index >= max_codes:
            break
            
        batch.append(barcode)
        if len(batch) == batch_size:
            print("beginning with ", last_index, " barcodes...")
            print("read ", len(batch), " lines")

            # convert to numpy
            batch_array = dna_to_numeric(batch)
            t0 = time.time()
            last_index = process_batch_numpy(batch_array, codes, last_index, max_codes)
            t1 = time.time()
            delta_t = t1 - t0
            batch = []
            print("result of batch processing is ", last_index, " qualified codes")
            print("time to process was: ", delta_t)

        

    str_codes = numeric_to_dna(codes[0:last_index,:])
    return str_codes
In [32]:
batch_size = 10000
max_codes = 5000
%timeit get_codes_numpy(batch_size, max_codes)
beginning with  0  barcodes...
read  10000  lines
result of batch processing is  5000  qualified codes
time to process was:  1.1384248733520508
beginning with  0  barcodes...
read  10000  lines
result of batch processing is  5000  qualified codes
time to process was:  1.0833053588867188
beginning with  0  barcodes...
read  10000  lines
result of batch processing is  5000  qualified codes
time to process was:  1.063485860824585
beginning with  0  barcodes...
read  10000  lines
result of batch processing is  5000  qualified codes
time to process was:  1.070199966430664
beginning with  0  barcodes...
read  10000  lines
result of batch processing is  5000  qualified codes
time to process was:  1.1409730911254883
beginning with  0  barcodes...
read  10000  lines
result of batch processing is  5000  qualified codes
time to process was:  1.7324938774108887
beginning with  0  barcodes...
read  10000  lines
result of batch processing is  5000  qualified codes
time to process was:  1.1075289249420166
beginning with  0  barcodes...
read  10000  lines
result of batch processing is  5000  qualified codes
time to process was:  1.0884857177734375
1.32 s ± 223 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
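
After a plain (non-timed) call like str_codes = get_codes_numpy(batch_size, max_codes), it's cheap to spot-check the returned strings against the original constraints; a minimal sketch:

# assumes str_codes holds the list returned by get_codes_numpy
for code in str_codes:
    assert len(code) == 12
    assert not any(t in code for t in ('AAA', 'TTT', 'GGG', 'CCC'))
    gc = sum(b in 'GC' for b in code) / 12.0
    assert 0.10 < gc < 0.90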

Notes about the (mostly) Numpy solution

  • The computation time is 39.1 s ± 2.47 s per loop for the Python solution versus 1.32 s ± 223 ms per loop for the Numpy solution, a speedup of about 30X. Pretty decent! This is not a precise estimate of the full job, though, since the cost of verify_mismatch_constraints is $O(N^2)$ with $N$ the intended size of the set of barcodes; by the end of my search, I was spending far more time validating codes than the few seconds shown here. It also assumes you can load all your candidate strings into memory before processing.
  • The expensive operation is still computing the Hamming distance of each potential new barcode to all existing barcodes. But where the Python implementation does this in two Python loops via list comprehensions, it is now pushed down into compiled Numpy code via broadcasting and an np.count_nonzero + .min() reduction.
  • The constraints on GC content are handled much the same way, but the constraint that finds and rejects barcodes with runs of 3 monomers is actually a bit of a hack using Numba. Maybe there is a snappy way to express this in Numpy, but I had a hard time finding one (though I didn't experiment with any convolution operations in scipy), especially since the size of the domain is so small. So, I got to try out another fun technology for accelerating Python code. The downside to Numba is that, to get the full benefit of generated LLVM byte-code replacing your Python code, you need to be very careful about how you express computations. The vectorization inherent to Numpy code is not very well supported yet, and neither is some of the syntactic sugar that Python & Numpy provide for creating iterators. The version here is maybe my third or fourth attempt; each successive attempt stripped out more Pythonic idioms that the Numba JIT compiler does not support (yet). Ending up with for j in range(run.shape[0]): felt ... odd to have in Python code. But Numba is in active development, and is getting better all the time.
  • A way to improve this further would be to double down on Numpy in verify_mismatch_constraints_numpy. Instead of comparing potential barcodes against the set of existing codes one by one, we could sample a set of candidate strings, compute their pairwise Hamming distances as a matrix, and then select those which are farther than a given threshold from all other strings in the set; a sketch of this idea follows below. This would also provide an easy way to validate the final set.
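
Here is a rough sketch of that idea (the names and the greedy selection are my own, not from the code above): compute the full pairwise Hamming-distance matrix for a batch with one broadcast comparison, then keep codes greedily.

import numpy as np

def pairwise_hamming(codes):
    ''' codes: (N, L) integer array -> (N, N) matrix of Hamming distances.
    The (N, N, L) boolean intermediate gets large, so keep batches modest. '''
    return np.count_nonzero(codes[:, None, :] != codes[None, :, :], axis=2)

def greedy_select(codes, threshold=2):
    ''' keep codes one at a time, so that every kept pair is >= threshold apart '''
    dist = pairwise_hamming(codes)
    kept = []
    for i in range(codes.shape[0]):
        if all(dist[i, j] >= threshold for j in kept):
            kept.append(i)
    return codes[kept]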

Feel free to grab this code, or just clone the repo I made for it here.