+ indep. WoS citations

Python and networks - 2016-05-10

1.  Problem: Characteristic length in text and DNA

Download the plain text version of “The Origin of Species by Means of Natural Selection” by Charles Darwin. Replace any sequence of consecutive whitespaces with a single space character. Save the "Contents" section of the book in a text variable. Each group of \(n\) characters appearing in this text is called an \(n\)-gram. For \(n=1,2,\ldots,20\) compute \(f_\mathrm{txt}(n)\), which is the number of occurrences of the most frequently used \(n\)-gram. Here "txt" means text. Similarly, compute \(f_\mathrm{dna}(n)\) for the first 10,000 nucleotides in Chromosome 1 of baker's yeast (Saccharomyces cerevisiae).

Plot \(f_\mathrm{txt}(n)\) and \(f_\mathrm{dna}(n)\) together. How and why do they differ?

2.  Solution

2.1  Python code (top-ngram.py)

# number of occurrences of the most frequently used n-gram 
# as a function of n in two data sets
import random; import sys; import re

script, infileTXT, infileDNA, groupSizeMax, outfile = sys.argv

# === function defitions ===

def find_top_ngram_usage(data,n,n2top):

    # the frequency (number of times) that each n-grams of length n is used
    nGramFreq = {}

    # take the starting position of each substring of length n
    for i in (range(0, len(data)-int(n)+1)):
        # extract the substring (group) that starts at this position
        subStr = data[i:i+int(n)]
        # if we have already seen this substring, then increment its counter
        if subStr in nGramFreq:
            nGramFreq[subStr] += 1
        else:
        # else: set its counter to 1
            nGramFreq[subStr] = 1

    # result: save the number of occurrences of the most frequently used n-gram
    n2top[n] = max(nGramFreq.values())

# === main ===

# read text file into a single string 
txt = open(infileTXT,'rbU').read()
# save the "Contents" section of the book
txt = re.findall('CONTENTS\.\s+?([\d\D]+?INDEX\.)', txt).pop(0)
# replace any sequence of whitespaces with a single space character
txt = ' '.join(txt.split())
# txt_n2top{n}: most frequently used n-gram of size n
txt_n2top = {}

# read file containing the DNA sequence
dna = open(infileDNA,'rbU').read()
# remove first line
dna = re.sub(r'^.+?\n', r'', dna)
# remove newlines and spaces
dna = re.sub(r'\s+', r'', dna)
# use only the first 10,000 nucleotides of the sequence
dna = dna[0:10000]
# dna_n2top{n}: most frequently used n-gram of size n
dna_n2top = {}

# open outfile unbuffered, print file header
o = open(outfile, "w", 0)
o.write("# Group size (n)\n")
o.write("#\tNumber of occurrences of the most frequently used n-gram of size n\n")
o.write("#\tText\n")
o.write("#\t\tDNA\n")
o.write("\n")

# list all possible group sizes, n
for n in range( 1, 1+int(groupSizeMax) ):

    # find most frequently used n-gram of size n
    find_top_ngram_usage(txt,n,txt_n2top)
    find_top_ngram_usage(dna,n,dna_n2top)

    # print output data
    o.write("%d\t%d\t%d\n" % (n,txt_n2top[n],dna_n2top[n]))

2.2  Usage of python code

The two input files are 2009.txt.utf-8 and chr01.fsa

python top-ngram.py 2009.txt.utf-8 chr01.fsa 10 top-ngram.txt

2.3  Output (top-ngram.txt)

# Group size (n)
#       Number of occurrences of the most frequently used n-gram of size n
#       Text
#               DNA

1       1071    3268
2       169     1206
3       131     473
4       128     189
5       78      80
6       32      45
7       32      28
8       29      17
9       15      12
10      14      10
11      9       9
12      9       8
13      9       7
14      8       6
15      8       5
16      8       4
17      8       3
18      8       2
19      5       2
20      5       1

2.4  Gnuplot command file (top-ngram.gnu)

# settings
se term posts lands color enh dash "Helvetica" 22
se xlab "n  [size of n-gram]"
se ylab "Number of occurrences of the\nmost frequently used n-gram"
se grid
se o 'top-ngram.ps'
se key top right
se log y

# plot
p [-1:21][.7:5000] \
\
'top-ngram.txt' u 1:2 w p pt 1 ps 2 lt 3 lw 3 ti 'Text', \
\
''              u 1:3 w p pt 2 ps 2 lt 1 lw 3 ti 'DNA'

2.5  Using gnuplot and converting the image

gnuplot top-ngram.gnu
convert -geometry 320 -rotate 90 -sharpen 5 top-ngram.ps top-ngram.png

2.6  Output (top-ngram.png)