
Python and Networks // Homework 2017-02-28 // Distribution of property values

Problem
Download the 2016 Property Tax List data file for Atlantic County, New Jersey, US. On the download page go to "Raw data \(\rightarrow\) 2016 \(\rightarrow\) Atlantic". Each line of the file describes a single property, and characters 439 to 447 of each line (counting from 1) hold the net value of that property. Plot the cumulated complementary histogram (CCH) of net values. Horizontal axis: \(n\), the net value of a property. Vertical axis: the number of properties with a net value equal to or above \(n\).

Example: \(CCH(n=3000)\) is the number of properties that have a net value equal to or above 3,000 USD.
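As a minimal sketch of this definition, the snippet below evaluates the CCH on a small made-up list of values. The function name `cch_at` and the sample numbers are illustrative assumptions and are not taken from the tax data.

```python
def cch_at(values, n):
    """Number of entries in values that are equal to or above n."""
    return sum(1 for v in values if v >= n)

# Made-up property values in USD (not real data)
sample_values = [0, 100, 100, 3000, 5000, 250000]

for n in (0, 100, 3000, 300000):
    print(n, cch_at(sample_values, n))
# prints:
# 0 6
# 100 5
# 3000 3
# 300000 0
```

Counting directly like this costs one pass over the data per value of \(n\), so it is only meant to illustrate the definition on small inputs.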

Solution (example)

1.  Python code (atlantic.py)

# Compute p.d.f. of property values from a single county of New Jersey
import sys

# === Function definitions ===

def read_values(inFile,values):

    # Open data file for reading
    with open(inFile,"r") as f:
        # Read the data file line by line
        for line in f:
            # extract the net value: characters 439 to 447
            value = line[438:447]
            # remove leading zeroes from the value
            value = value.lstrip("0")
            # IF the value is a non-empty string
            if value:
                # THEN save it as an integer
                values.append( int(value) )
            # ELSE: save it as a zero
            else:
                values.append(0)

# ---------------

def computePdf_printToStdout(numbers):

    # --- Compute the histogram ---
    # declare the histogram of numbers
    h = {}
    # fill up the histogram
    for number in numbers:
        # IF we have NOT yet seen the current number as a key
        if number not in h:
            # THEN declare this key and set the value to zero
            h[number] = 0
        # in all cases: add one to the value
        h[number] += 1

    # --- Compute the p.d.f. (probability density function) ---
    # define all values of the p.d.f.
    pdf = { int(value): 1.0 * h[value] / len(numbers) for value in h }

    # --- Print the p.d.f. and the c.c.d.f. ---
    # File header
    print("# Property value (an integer)")
    print("#\tSame property value, but zero replaced with \"-\"")
    print("#\t\tHistogram (number of properties with this value)")
    print("#\t\t\tCCH (complementary cumulated histogram: number of properties reaching this value)")
    print("#\t\t\t\tPDF (probability density function)")
    print("#\t\t\t\t\tCCDF (complementary cumulated probability density function)")
    print("") # single empty line to separate header from data
    # Initialize CCH and CCDF for printing
    cch = len(numbers)
    ccdf = 1
    # Print data -- Assuming that the property value is an integer
    for prop_value in sorted(pdf.keys(),key=float):
        # Print property value
        sys.stdout.write("%d\t" % prop_value)
        # Print again, or (if zero) print "-"
        if 0 == prop_value:
            sys.stdout.write("-\t")
        else:
            sys.stdout.write("%d\t" % prop_value)
        # Print Histogram, CCH, PDF, CCDF
        sys.stdout.write("%d\t%d\t%g\t%g" % (h[prop_value],cch,pdf[prop_value],ccdf))
        # Change CCH and CCDF
        cch -= h[prop_value]
        ccdf -= pdf[prop_value]
        # Print newline character
        sys.stdout.write("\n")

# === main ===

# Read the list of property values, data source:
# http://www.state.nj.us/treasury/taxation/lpt/TaxListSearchPublicWebpage.shtml
values = []
read_values("Atlantic16.txt",values)

# Compute the p.d.f. of values, Print it to stdout
computePdf_printToStdout(values)
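The histogram and the running CCH computed above can also be sketched more compactly with `collections.Counter` from the standard library. This is an alternative illustration, not the solution's own code: the function name `cch_table` is hypothetical, and it returns rows instead of printing them.

```python
from collections import Counter

def cch_table(numbers):
    """Return (value, count, cch) rows sorted by value.

    cch is the number of entries equal to or above the row's value.
    """
    counts = Counter(numbers)      # histogram: value -> number of occurrences
    remaining = len(numbers)       # before the smallest value, all entries remain
    rows = []
    for value in sorted(counts):
        rows.append((value, counts[value], remaining))
        remaining -= counts[value] # entries with this value drop out for larger n
    return rows

print(cch_table([0, 100, 100, 200]))
# prints: [(0, 1, 4), (100, 2, 3), (200, 1, 1)]
```

Sorting the distinct values once and subtracting counts as they are passed gives the whole CCH in a single sweep, which is the same idea as the `cch -= h[prop_value]` step in the listing.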

2.  How to run the Python code

The program needs this input data file in the current directory: Atlantic16.txt

python3 atlantic.py > atlantic-16-dist.txt
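Before running the script on the full file, the column positions can be checked on a synthetic line. The sketch below builds a made-up 447-character record (not real tax data) and confirms that the 1-based characters 439 to 447 correspond to the Python slice `[438:447]`. The helper name `net_value` is hypothetical, and treating any non-numeric field as zero is an assumption of this sketch.

```python
def net_value(line):
    """Return the integer stored in 1-based characters 439..447 of a record."""
    field = line[438:447]   # 0-based start 438; the end index 447 is exclusive
    digits = field.strip()
    # Assumption of this sketch: an empty or non-numeric field counts as zero
    return int(digits) if digits.isdigit() else 0

# Made-up record: 438 filler characters, then a 9-digit zero-padded value
fake_line = "X" * 438 + "000125300" + "\n"
print(net_value(fake_line))   # prints: 125300

# A line that is too short yields an empty field and therefore zero
print(net_value("short line\n"))   # prints: 0
```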

3.  Output text file (atlantic-16-dist.txt)

Download the data file: atlantic-16-dist.txt (0.5MB)

First 10 lines of the data file:

# Property value (an integer)
#       Same property value, but zero replaced with "-"
#               Histogram (number of properties with this value)
#                       CCH (complementary cumulated histogram: number of properties reaching this value)
#                               PDF (probability density function)
#                                       CCDF (complementary cumulated probability density function)

0       -       741     155923  0.00475235      1
100     100     2352    155182  0.0150844       0.995248
200     200     1178    152830  0.00755501      0.980163
...

Same data file for 2010: atlantic-10-dist.txt (0.5MB)
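One way to sanity-check an output file is to verify that each row's CCH minus its histogram count equals the CCH of the next row. The sketch below does this for the three data rows quoted above; the function names `read_dist` and `is_consistent` are hypothetical, and the parser assumes the whitespace-separated six-column layout shown in the excerpt.

```python
def read_dist(lines):
    """Parse (value, histogram, cch) from output rows; skip '#' and blank lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        # Columns: value, value-or-dash, histogram, CCH, PDF, CCDF
        rows.append((int(cols[0]), int(cols[2]), int(cols[3])))
    return rows

def is_consistent(rows):
    """True if CCH drops by exactly the histogram count from row to row."""
    return all(a[2] - a[1] == b[2] for a, b in zip(rows, rows[1:]))

sample = [
    "# Property value (an integer)",
    "",
    "0\t-\t741\t155923\t0.00475235\t1",
    "100\t100\t2352\t155182\t0.0150844\t0.995248",
    "200\t200\t1178\t152830\t0.00755501\t0.980163",
]
print(is_consistent(read_dist(sample)))   # prints: True
```

To check a whole file, pass an open file object instead of the sample list, for example `read_dist(open("atlantic-16-dist.txt"))`.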

4.  Gnuplot command file (atlantic.gnu)

se term post col enh "Helvetica,20"
se o "atlantic.ps"
se log xy
se key samplen .1
se xlab "Property value (USD) in Atlantic county, NJ"
se ylab "Number of properties reaching this value"
se xtic ("100" 100, "10^4" 1e+4, "10^6" 1e+6, "10^8" 1e+8) 
se ytic ("1" 1, "100" 100, "10^4" 1e+4, "10^6" 1e+6) 

p [50:2e+9][.5:1e+6] \
\
'atlantic-10-dist.txt' u 2:4 ti '2010' w p ps 1.5 pt 1 lw 3 lt 1, \
'atlantic-16-dist.txt' u 2:4 ti '2016' w p ps 1.5 pt 6 lw 2 lt 3

# converting ps to png
# convert -rotate 90 -geometry 400 -sharpen 5 atlantic.ps atlantic.png

5.  Output image: Cumulated Complementary Histogram