+ indep. WoS citations

Python and Networks // Homework 2017-02-21 // Nucleotide pair frequency

Problem
Read the fasta (.fa) formatted sequence of Chromosome I of the human genome. List which bases are seen adjacent to each other: AT and TA should be both counted as AT. For each pair count how many times it appears. For example: AC 123456, AT 234567, etc.

Solution (example)

1.  Python code (hist.py)

import sys

# === read arguments ===

_, infile_fasta, outfile = sys.argv

# infile:  ftp://ftp.ncbi.nlm.nih.gov/genomes/ 
#          Homo_sapiens/Assembled_chromosomes/seq/hs_alt_CHM1_1.1_chr1.fa.gz
# outfile: for each base pair the number of times it is in the sequence

# === function definitions ===

def read_dna_fasta(infile_fasta):

    # open the input fasta file for reading
    with open(infile_fasta,"r") as f:
        # discard the first line
        tmp = f.readline()
        # read all lines into a single string, remove newlines
        return f.read().replace("\n","")

# -------------------

def compute_hist(dna,pair2use):

    # loop through the list of adjacent base pairs
    for pos in (range(len(dna)-1)):
        # the two bases in sorted order
        pair = ''.join(sorted(dna[pos:pos+2]))
        # IF we have not yet seen this sorted pair,
        if pair not in pair2use:
            # THEN set its usage to zero
            pair2use[pair] = 0
        # in all cases: increment its usage by 1
        pair2use[pair] += 1

# -------------------

def print_hist_basePairs(outfile,pair2use):

    # open the output file for writing
    with open(outfile,"w") as f:
        # print file header
        f.write("# Base pair (two bases in sorted order)\n")
        f.write("#\tUsage (number of times it is seen) in Human ChrI\n")
        f.write("\n") # this blank line separates the header from the data
        # print base pairs in descending order of usage
        for pair in sorted(pair2use,key=pair2use.get,reverse=True):
            f.write("%s\t%d\n" % (pair,pair2use[pair]))

# === main ===

# DNA sequence
dna = read_dna_fasta(infile_fasta)

# compute histogram, for example,
# pair2use{AT} is the number of times that either AT or TA appears in the sequence
pair2use = {}; compute_hist(dna,pair2use)

# print the histogram
print_hist_basePairs(outfile,pair2use)

2.  How to use the Python code

gunzip hs_alt_CHM1_1.1_chr1.fa.gz
python3 hist.py hs_alt_CHM1_1.1_chr1.fa hist.txt

3.  Output (hist.txt)

# Base pair (two bases in sorted order)
#       Usage (number of times it is seen) in Human ChrI

AT      31344667
CT      29827530
AG      29791927
GT      27980567
AC      27947710
NN      23304170
TT      21627108
AA      21583959
CC      12387013
GG      12372989
CG      12348719
AN      2140
NT      2076
GN      1071
CN      1017