+ indep. WoS citations

Python and Networks // Homework 2017-03-07 // Degree distribution of a computed network

Problem

  1. Take a ~1MB text file from the web or your computer. Examples:
     \(\rightarrow\)  sudo ls -lR / |head -25000 > dat.txt
     \(\rightarrow\)  wget http://www.gutenberg.org/cache/epub/1661/pg1661.txt -O dat.txt
  2. Consider the 3-character substrings of the entire text file. For each 3-char substring (let's call it a node) consider the set of all different substrings that this node is adjacent to. Example: The string "apple orange" (without quotes) contains 10 different 3-char substrings (nodes): "app", "ppl", etc. Within this string the substring "ora" has 2 different neighbors: "le " and "nge".
  3. For each node the number of its neighbors is the degree of this node.
  4. Plot the c.c.d.f. of node degrees.

Solution (example)

1.  Python code (ccdf.py)

import sys
import matplotlib.pyplot as plt

# === read command-line arguments ===

_, inFile, outFileTxt, outFileImg = sys.argv

# === function definitions ===

def saveNei(node2neiSet,node_A,node_B):
    # save node_B as a neighbor of node_A

    # IF we have no neighbors saved for node_A yet,
    if node_A not in node2neiSet:
        # THEN declare the set of its neighbors
        node2neiSet[node_A] = set()
    # In all cases: add node_B to the set of neighbors
    # of node_A. IF node_B is already a neighbor of
    # node_A, THEN the set remains unchanged.
    node2neiSet[node_A].add(node_B)

# ---------------

def inFile2nw(inFile,node2neiSet):

    # open the input file for reading
    with open(inFile,"r") as f:
        # read into a single string
        txt = f.read()
        # move a sliding window of length 3+3=6 over "txt"
        # "pos" is the index of the first char of the window
        # for example, if len(txt)=7, then pos = 0, 1
        for pos in range(len(txt)-5):
            # s1: the 3-char substring starting at the
            # character that has the index "pos"
            s1 = txt[pos:pos+3]
            # s2: the right neighbor of s1
            s2 = txt[pos+3:pos+6]
            # save s2 as a neighbor of s1
            saveNei(node2neiSet,s1,s2)
            # save s1 as a neighbor of s2
            saveNei(node2neiSet,s2,s1)

# ---------------

def nw2ccdf(node2neiSet,ccdf):

    # histogram: h(d) is the number of nodes with degree d
    h = {}
    # node degree: length of the set of neighbors
    for d in map(lambda node: len(node2neiSet[node]), node2neiSet):
        # IF we have not seen this node degree yet,
        if d not in h:
            # THEN set its number to zero
            h[d] = 0
        # in all cases: remember that we have found
        # one more node with node degree d
        h[d] += 1

    # ---------------------------------------------------------
    # !!! we assume here that each node has at least 1 neighbor
    #               in other words: there are no isolated nodes
    # ---------------------------------------------------------
    # with this assumption the number of nodes is the number
    # of those nodes that do have at least one neighbor
    n = len(node2neiSet)

    # set the ccdf, starting from the bottom
    ccdf_now = 1.0
    # loop through the list of node degrees in ascending order
    for d in sorted(h):
        ccdf[d] = ccdf_now
        ccdf_now -= 1.0 * h[d] / n

    # test: after the for loop "ccdf_now" should be 0.0
    #print(ccdf_now)

# ---------------

def writePlotCcdf(ccdf,outFileTxt,outFileImg):

    # open text output file
    with open(outFileTxt,"w") as f:
        # print file header
        f.write("# Node degree\n")
        f.write("#\Probability of nodes with this or a higher degree\n")
        f.write("\n") # blank line to separate header from data
        # print data
        for d in sorted(ccdf):
            f.write("%d\t%g\n" % (d,ccdf[d]))

    # initialize the output figure
    fig, ax = plt.subplots()
    # set log-log axes, clip non-positive values
    ax.set_xscale("log", nonposx='clip')
    ax.set_yscale("log", nonposy='clip')
    # set limits of the axes
    ax.set_xlim(left=0.7,right=2000)
    ax.set_ylim(bottom=0.00005,top=2)
    # set axis labels and plot title
    ax.set_xlabel('d: number of neighbors of a node (character triplet)',labelpad=2)
    ax.set_ylabel('CCDF(d): probability of nodes with at least d neighbors',labelpad=2)
    plt.title("Node degree CCDF in the network of\nadjacent character triplets in 0.5MB English text")
    # plot data points: ccdf values (y) vs ccdf keys (x)
    x = sorted( ccdf.keys( ) )
    y = [ ccdf[ k ] for k in x ]
    plt.loglog(x, y, marker='o', markeredgecolor='r', markerfacecolor='w', linestyle='None')
    # export figure and close the figure
    fig.savefig(outFileImg)
    plt.close(fig)

# === main ===

# node2neiSet[n]: the set of the neighbors of node <n>
node2neiSet = {}; inFile2nw(inFile,node2neiSet)

# ccdf{d} is the number of nodes that have a node degree equal to or above d
ccdf = {}; nw2ccdf(node2neiSet,ccdf)

# write and plot the cdf
writePlotCcdf(ccdf,outFileTxt,outFileImg)

2.  How to use the python code

python3 dat.txt ccdf.txt ccdf.png

3.  Output text file (ccdf.txt)

Download the data file ccdf.txt:

4.  Output image: CCDF of node degrees (ccdf.png)

Output image for the Sherlock Holmes text: