+ indep. WoS citations

Python and Networks // Homework 2017-03-07 // Degree distribution of a computed network

Problem

1. Take a ~1MB text file from the web or your computer. Examples:
$$\rightarrow$$  sudo ls -lR / |head -25000 > dat.txt
$$\rightarrow$$  wget http://www.gutenberg.org/cache/epub/1661/pg1661.txt -O dat.txt
2. Consider the 3-character substrings of the entire text file. For each 3-char substring (let's call it a node) consider the set of all different substrings that this node is adjacent to. Example: The string "apple orange" (without quotes) contains 10 different 3-char substrings (nodes): "app", "ppl", etc. Within this string the substring "ora" has 2 different neighbors: "le " and "nge".
3. For each node the number of its neighbors is the degree of this node.
4. Plot the c.c.d.f. of node degrees.

Solution (example)

## 1.  Python code (ccdf.py)

import sys
import matplotlib.pyplot as plt

# === read command-line arguments ===

_, inFile, outFileTxt, outFileImg = sys.argv

# === function definitions ===

def saveNei(node2neiSet,node_A,node_B):
# save node_B as a neighbor of node_A

# IF we have no neighbors saved for node_A yet,
if node_A not in node2neiSet:
# THEN declare the set of its neighbors
node2neiSet[node_A] = set()
# In all cases: add node_B to the set of neighbors
# of node_A. IF node_B is already a neighbor of
# node_A, THEN the set remains unchanged.

# ---------------

def inFile2nw(inFile,node2neiSet):

# open the input file for reading
with open(inFile,"r") as f:
# read into a single string
# move a sliding window of length 3+3=6 over "txt"
# "pos" is the index of the first char of the window
# for example, if len(txt)=7, then pos = 0, 1
for pos in range(len(txt)-5):
# s1: the 3-char substring starting at the
# character that has the index "pos"
s1 = txt[pos:pos+3]
# s2: the right neighbor of s1
s2 = txt[pos+3:pos+6]
# save s2 as a neighbor of s1
saveNei(node2neiSet,s1,s2)
# save s1 as a neighbor of s2
saveNei(node2neiSet,s2,s1)

# ---------------

def nw2ccdf(node2neiSet,ccdf):

# histogram: h(d) is the number of nodes with degree d
h = {}
# node degree: length of the set of neighbors
for d in map(lambda node: len(node2neiSet[node]), node2neiSet):
# IF we have not seen this node degree yet,
if d not in h:
# THEN set its number to zero
h[d] = 0
# in all cases: remember that we have found
# one more node with node degree d
h[d] += 1

# ---------------------------------------------------------
# !!! we assume here that each node has at least 1 neighbor
#               in other words: there are no isolated nodes
# ---------------------------------------------------------
# with this assumption the number of nodes is the number
# of those nodes that do have at least one neighbor
n = len(node2neiSet)

# set the ccdf, starting from the bottom
ccdf_now = 1.0
# loop through the list of node degrees in ascending order
for d in sorted(h):
ccdf[d] = ccdf_now
ccdf_now -= 1.0 * h[d] / n

# test: after the for loop "ccdf_now" should be 0.0
#print(ccdf_now)

# ---------------

def writePlotCcdf(ccdf,outFileTxt,outFileImg):

# open text output file
with open(outFileTxt,"w") as f:
f.write("# Node degree\n")
f.write("#\Probability of nodes with this or a higher degree\n")
f.write("\n") # blank line to separate header from data
# print data
for d in sorted(ccdf):
f.write("%d\t%g\n" % (d,ccdf[d]))

# initialize the output figure
fig, ax = plt.subplots()
# set log-log axes, clip non-positive values
ax.set_xscale("log", nonposx='clip')
ax.set_yscale("log", nonposy='clip')
# set limits of the axes
ax.set_xlim(left=0.7,right=2000)
ax.set_ylim(bottom=0.00005,top=2)
# set axis labels and plot title
ax.set_xlabel('d: number of neighbors of a node (character triplet)',labelpad=2)
ax.set_ylabel('CCDF(d): probability of nodes with at least d neighbors',labelpad=2)
plt.title("Node degree CCDF in the network of\nadjacent character triplets in 0.5MB English text")
# plot data points: ccdf values (y) vs ccdf keys (x)
x = sorted( ccdf.keys( ) )
y = [ ccdf[ k ] for k in x ]
plt.loglog(x, y, marker='o', markeredgecolor='r', markerfacecolor='w', linestyle='None')
# export figure and close the figure
fig.savefig(outFileImg)
plt.close(fig)

# === main ===

# node2neiSet[n]: the set of the neighbors of node <n>
node2neiSet = {}; inFile2nw(inFile,node2neiSet)

# ccdf{d} is the number of nodes that have a node degree equal to or above d
ccdf = {}; nw2ccdf(node2neiSet,ccdf)

# write and plot the cdf
writePlotCcdf(ccdf,outFileTxt,outFileImg)


## 2.  How to use the python code

python3 dat.txt ccdf.txt ccdf.png


## 3.  Output text file (ccdf.txt)

Download the data file ccdf.txt:

## 4.  Output image: CCDF of node degrees (ccdf.png)

Output image for the Sherlock Holmes text: