Sequence Generator Project (Python)


I developed a Python program to generate synthetic promoter sequences with specific characteristics to simulate and study promoter regulation. The project aimed to help in understanding how different motifs within a promoter region influence transcription.

Project Details: The program creates 100 DNA sequences, each 500 nucleotides long. These sequences include:

  • A core promoter with a TATA-box and Inr motif, positioned randomly within a specific range upstream of the transcription start site (TSS), which is fixed at nucleotide 450.
  • Specific start and end sequences: the first 4 nucleotides are always “AGGT” and the last 4 are “CGAA” to allow cloning using Golden Gate assembly.
  • Between nucleotides 50 and 400, random placement of transcription factor binding sites for PDX1, MAFA, and/or SP1 (between 1 and 6 binding sites per sequence).
  • The rest of the sequence is randomly filled with nucleotides (A, C, G, and T) with equal probability.

The sequences are then written to a file named promoters.txt. By simulating these promoter sequences, the project allows for experimentation with different motif combinations, helping to explore their regulatory potential in transcription.
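
The layout described above can be spot-checked on any single sequence. The following is a minimal sketch, not part of the original program: the function name check_layout is hypothetical, and the TATA window (Python indices 419 to 430) is an assumption derived from the 450-random.randint(24,31) placement used in the generation code.

```python
def check_layout(seq):
    """Return a list of layout problems for one promoter sequence (empty if none)."""
    problems = []
    if len(seq) != 500:
        problems.append("length is %d, expected 500" % len(seq))
    if not seq.startswith("AGGT"):
        problems.append("missing AGGT at the 5' end")
    if not seq.endswith("CGAA"):
        problems.append("missing CGAA at the 3' end")
    # The Inr dinucleotide fills Python indices 448-449 (nucleotides 449-450 counting from 1).
    if seq[448:450] not in ("CA", "CG", "TA", "TG"):
        problems.append("no Inr dinucleotide ending at nucleotide 450")
    # Assumed window: a TATAA starting at index 419..426 ends no later than index 430.
    if "TATAA" not in seq[419:431]:
        problems.append("no TATAA in the expected window upstream of the TSS")
    return problems


# Hand-built example that satisfies every rule (filler is all C for readability).
example = "AGGT" + "C" * 415 + "TATAA" + "C" * 24 + "CA" + "C" * 46 + "CGAA"
print(check_layout(example))
```

The check deliberately ignores the transcription factor binding sites, since their number and positions vary from sequence to sequence.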

This project contributes to synthetic biology by enabling the creation of large libraries of synthetic promoters, allowing for experimentation on how different motifs and their positions influence gene expression. It could also help in designing effective gene regulation tools in biological research.
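
To study how motif positions vary across a library, it helps to list where each motif occurs in a sequence. The sketch below is one possible way to do that and is not part of the original program. The function name find_motifs and the MOTIFS dictionary are hypothetical, and the pairing of each transcription factor with a motif is an assumption based on the order in which the code lists them (PDX1, MAFA, SP1 against 'TCTAAT', 'TGCA', 'CCGCCC').

```python
import re

# Assumed pairing of names and motifs, taken from the order used in the script.
MOTIFS = {"PDX1": "TCTAAT", "MAFA": "TGCA", "SP1": "CCGCCC", "TATA": "TATAA"}


def find_motifs(seq, motifs=MOTIFS):
    """Return a sorted list of (start_index, name) for every motif occurrence.

    Start indices are 0-based Python string indices.
    """
    hits = []
    for name, motif in motifs.items():
        # A lookahead pattern reports overlapping occurrences as well.
        for match in re.finditer("(?=%s)" % re.escape(motif), seq):
            hits.append((match.start(), name))
    return sorted(hits)


print(find_motifs("AATCTAATGGTGCACC"))
```

Because random filler and the junctions around inserted sites can create additional matches by chance, the number of hits is not guaranteed to equal the number of sites the generator placed.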


#First, we import random to generate random sequences
#We then import os.path to check whether the output file already exists
import random
import os.path

#First, we use a function to randomly generate a sequence 
def create_start_sequence():
    nucleotides = ['A','C','G','T'] #we define the nucleotides in order to generate the random sequences
    sequence_to_return = 'AGGT'
    i=0
    while i < 492: #we add one random nucleotide at a time until 492 have been added (the 8 remaining nucleotides are
        #the two 4 nt Golden Gate sites on either side of the sequence)
        sequence_to_return += nucleotides[random.randint(0,3)]
        i+=1 #this advances the counter
    sequence_to_return += 'CGAA' #this adds the final 4 nts of the Golden Gate site on the end
    return sequence_to_return #this returns the finished 500 nt sequence to the caller


#Below we run code to keep generating sequences until the sequence_to_return meets our criteria of no TATA motifs and none of our 
#desired TF binding motifs so we have full control of the presence of these sites within the specified range (1 TATA motif and 1-6 binding motifs)
def generate_sequence():
    #we specify the binding sites that we don't want in the original sequence
    binding_sites = ['TCTAAT','TGCA','CCGCCC'] 
    # We first define the string that we will add to, this will be returned by the function.
    # Here, we also define all possible nucleotides and Inr motifs (4 combinations) within our sequence.
    generated_sequence = ''
    nucleotides = ['A','C','G','T']
    lnr_motifs = ['CA','CG','TA','TG']
    #this runs the function above to create a 'sequence_to_return' 
    generated_sequence = create_start_sequence() 
    n = 0
    while n<3: #n is set to 2 max as there are 3 binding site options
        if binding_sites[n] in generated_sequence or ('TATAA' in generated_sequence): #TATAA is also checked
            generated_sequence = create_start_sequence() #the sequence_to_return generated by our previous function becomes the main 
            #starting sequence if it meets our criteria of NO TATAA motif and NO binding sites
            n = 0
        else:
            n += 1 #keep checking every binding site until all are checked


    # Now we choose a random Inr motif by using random.randint(0,3) as the index into
    # our previously defined list lnr_motifs.
    lnr_motif_chosen = lnr_motifs[random.randint(0,3)]
    #We use string slicing here to place lnr_motif_chosen so that its second nucleotide is nucleotide 450 (the TSS).
    generated_sequence = generated_sequence[:448] + lnr_motif_chosen + generated_sequence[450:] #[:448] keeps indices 0-447,
    #the motif fills indices 448-449 (nucleotides 449-450 counting from 1), and [450:] keeps everything after it.
    
    # We now generate the random position for the TATA box, it must be 24-31 nucleotides upstream of the 2nd 
    # nucleotide of the Inr motif sequence, hence 450-random.randint(24,31)   
    tata_box_position = 450-random.randint(24,31)
    # We use string slicing here to place the TATA in the random position determined above, tata_box_position.
    generated_sequence = generated_sequence[:tata_box_position] + 'TATAA' + generated_sequence[tata_box_position+5:] 

    # Now, we use random.randint(1,6) to randomly pick a number between 1 and 6, corresponding to how many binding
    # sites there will be between nucleotides 50 and 400.
    no_of_binding_sites = random.randint(1,6)
    # Here, we make a list of the binding sites corresponding to the TFs PDX1, MAFA, SP1 that are allowed in our 
    # sequence
    binding_sites = ['TCTAAT','TGCA','CCGCCC']
    # We now use a list comprehension to create a list of random integers between 0 and 2 (inclusive). The number of 
    # integers we create corresponds to the random number of binding sites determined above. The numbers in the
    # list will be used as indices into the list binding_sites, so we are essentially choosing a random element
    # from the binding_sites list each time.
    binding_sites_included = [random.randint(0, 2) for i in range(no_of_binding_sites)]
    # Now we create a list called binding_sites_positions to store the positions of the binding sites. The reason we
    # do this will become more clear in the next part of the code.
    binding_sites_positions = []

    # We now make a for loop that iterates no_of_binding sites number of times.
    for i in range(no_of_binding_sites):
        # The current_binding_site variable stores the binding site we will put into the string. The binding site
        # we use in this iteration is determined by the random integer binding_sites_included[i], which is then used
        # as the index for current_binding_site.
        current_binding_site = binding_sites[binding_sites_included[i]]
        # The position of the current_binding_site is determined by random.randint() which chooses a number between
        # 50 and 394/396 depending on the length of the current_binding site. We do 400-len(current_binding_site) so
        # that the end of the binding site does not go past the 400 nucleotide limit.
        current_binding_site_pos = random.randint(50,400-len(current_binding_site)) 
        #randint includes both end points, so 50 and 400-len(current_binding_site) are both possible start positions
        
        # We now use a while loop to go through the list binding_sites_positions to make sure that there are no
        # conflicts with the placement of the current binding site. For example, if in a previous iteration a 
        # binding site was placed at nucleotide 250, we want to make sure the binding site in the current iteration
        # does not overwrite it by starting at nucleotide 252.
        j=0
        while j < len(binding_sites_positions):
            # To accomplish this, we check whether the random position chosen for the current binding site,
            # current_binding_site_pos, is within 6 nucleotides of any previous binding site position stored
            # in the list binding_sites_positions.
            if (current_binding_site_pos<=(binding_sites_positions[j]+6)) and (current_binding_site_pos>=binding_sites_positions[j]-6):
                # If there is a conflict, a new random current_binding_site_pos is chosen and j is reset to
                # zero so that every position in binding_sites_positions is checked again from the start.
                current_binding_site_pos = random.randint(50,400-len(current_binding_site))
                j=0
            else:
                # No conflict with this position, so j is incremented to move on to the next one.
                j+=1

        # After we have got rid of any potential conflicts, we use string slicing to place the current binding site
        # in its correct random position - current_binding_site_pos.
        generated_sequence = generated_sequence[:current_binding_site_pos] + current_binding_site + generated_sequence[(current_binding_site_pos+len(current_binding_site)):]  
        
        # current_binding_site_pos is then added to the binding_sites_positions list so that conflicts with binding
        # sites in future iterations can be checked.
        binding_sites_positions.append(current_binding_site_pos)

    # The final string generated_sequence is now returned
    return generated_sequence
    
# Now to write to a file, we use the following code. First, we check if a file called promoters.txt already exists
# or not using os.path.isfile("promoters.txt"). If it does exist, we open it in "w" mode, which overwrites it. If it
# doesn't, we open it in "x" mode, which creates a new file.
if os.path.isfile("promoters.txt"):
    f = open("promoters.txt", "w")
else:
    f = open("promoters.txt", "x") #the file is created/overwritten in the same directory as the notebook
    
# We now use a while loop to iterate 100 times, and write the string returned by generate_sequence to the file
# in each iteration. Hence, 100 sequences are created and the task is finished.

#we use a counter to run the function 100 times
k=0 
while k < 100:
    f.write('Random sequence number ') #We add a quick description for each sequence
    f.write(str(k+1)) #we do k+1 to label each sequence from 1 to 100
    f.write('   ') 
    f.write(generate_sequence()+'\n\n') #we generate with two \n to create a break between each generated sequence
    k+=1
# We now close the file stream.
f.close()
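
Once promoters.txt has been written, the sequences can be read back for further analysis. The following is a minimal sketch that assumes the record format produced above (the label 'Random sequence number N', three spaces, the sequence, then a blank line). The function name parse_promoters is hypothetical and not part of the original program.

```python
def parse_promoters(text):
    """Map each sequence number to its sequence, given the contents of promoters.txt."""
    sequences = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip the blank separator lines between records
        # The sequence is the last whitespace-separated field on the line.
        label, seq = line.rsplit(None, 1)
        # The label ends with the sequence number, e.g. 'Random sequence number 7'.
        number = int(label.split()[-1])
        sequences[number] = seq
    return sequences


# Small in-memory example in the same format as the output file.
sample = "Random sequence number 1   ACGT\n\nRandom sequence number 2   TTGA\n\n"
print(parse_promoters(sample))
```

To use it on the real file, read the file contents first, for example with open("promoters.txt") as f: sequences = parse_promoters(f.read()).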