computer science python

Curse Yi
pa6.pdf

CptS 111 — PA #6 Due Monday, Apr. 9, 2018

A SMALL VENTURE INTO FINANCIAL FORENSICS

Please read through this lab carefully because some of the code you’ll need to use is explained.

Header information. Your program must include the following information in the header.

• Name

• CptS 111, Spring 2018

• Programming Assignment #6

• Date

• Name of program

• Brief description and/or purpose of the program; include sources

Background. In this assignment you’ll develop a program to inspect data sets to determine the pos-

sibility that they’ve been forged. A program such as the one you’ll write can actually be used in the

field of investigation known as financial forensics or forensic accounting.

Suppose you have a set of numerical data that represents several orders of magnitude of numbers: the

diameters of spheres in the universe, the populations of every town and city in the world, the number

of hairs on every human head in Pullman, the heights of all human beings in the world in centimeters,

the voting counts for every county in the United States—you name it. Now let’s assume you have N

datapoints in your set, each associated with a number. Let’s further assume that the leading digit of

this number can be 1, 2, and so on, up to 9, but not zero, and that the number can’t be negative.

You’re asked “How many times is the leading digit in your set 1? 2? 3? 9?” You might say something

like, “I have no idea” or “I’ll have to write a Python program to figure it out.” However, the person

doesn’t want you to look at the data. She just wants to know the approximate number of times a

particular number is the leading digit. You think about this for a minute and decide that a reasonable

guess would be N/9 because all nine digits probably occur as the leading digit the same number of

times. Good thinking! You now feel quite pleased with yourself, right?

In fact, if you had checked your data set and found that each of the nine digits occurred N/9 times

(i.e., with uniform probability), then it’s very likely that your data set is either fake or has been fiddled

with! The correct answer is that the probability P that a leading digit is n, where 1 ≤ n ≤ 9, is

P(n) = log 10

(

1 + 1

n

)

. (1)

Multiplying this by 100 gives the approximate percentage of the data set that should have the leading

digit n. The following table shows the leading digit n, the probability of this digit being the leading

digit for any datapoint, and the percent of a data set that will have this leading digit:

1

n log 10

(

1 + 1 n

)

percent

1 0.30103 30.1

2 0.17609 17.6

3 0.12494 12.5

4 0.09691 9.7

5 0.07918 7.9

6 0.06695 6.7

7 0.05799 5.8

8 0.05115 5.1

9 0.04576 4.6

To get a sense of what this means, suppose your data set contains the numbers of friends a million

different Facebook users have. For about 301, 030 users, their numbers of friends will start with a 1,

but only about 45, 760 users will have numbers of friends starting with a 9. And if you think this is

a fluke, you can look at 10 million users. You’ll get about the same percentages. This behavior of

leading digits is called Benford’s law,1, and you can use Benford’s law to check the legitimacy of a

data set. In fact, Benford’s law can be extended to predict the distribution of digits beyond the first

digit. Thus, even if forgers know about Benford’s law, it won’t be easy for them to forge numbers

while simultaneously satisfying the reason for their forgery in the first place!

How do you implement Benford’s law for financial forensics? First you need to create a tally list with

nine counters, one for each possible digit, all initialized to zero. The easiest way to do this in Python

is to use operator overloading and multiply a list with a single zero by 9 as in the following:

1 >>> # Create a tally list with nine zeros.

2 >>> tally = [ 0 ] * 9

3 >>> print(tally)

4 [0, 0, 0, 0, 0, 0, 0, 0, 0]

In line 2, operator overloading is used to create the list. Lines 3 and 4 show that this really works.

Next you have to figure out how to identify the leading digit of the datapoint. Datapoints might be

floats, integers, or even strings, and you need to be able to deal with any of these three types. One

way to do this is to convert all the datapoints to strings (and you know how to do this; also, note that

if the datapoints are already strings, it doesn’t hurt, i.e., str() with a string argument just returns a

string). Now, regardless of their original form, the datapoints are all strings. Furthermore, it’s easy to

identify the leading digit; it’s simply the first character of the string. A string can be indexed just like

a tuple or list, so the first character of a string named s is simply s[0].

But wait. There’s a slight flaw in this thinking. Suppose the datapoints are originally strings as in the

following:

1 >>> # Data given as strings.

1en.wikipedia.org/wiki/Benford’s_law

2

2 >>> data = [’ 998.64’, ’1965.01’, ’ 394.66’, ’2027.22’, ’1234.97’]

3 >>> for datapt in data:

4 ... s = str(datapt) # Ensure datapoint is a string.

5 ... lead = s[0] # Get first character of string.

6 ... print(lead) # Note: Leading digit is a string.

7 ...

8

9 1

10

11 2

12 1

There are 5 datapoints in our data as shown in line 2, but we only end up with 3 leading digits. What

has gone wrong? If you look closely at the data in line 2, you’ll see that there are four spaces for

the integer portion of the datapoints. However, for 998.64 and 394.66, the integer portion has only

3 characters and the fourth position is a blank space. Thus, when we select the first character for

these two datapoints, we end up with a blank space! Clearly we need to remove blank spaces because

a blank space is a character. To remove blank spaces, we use the string method strip(). This

removes whitespace from both sides of the datapoint. Let’s see how it works.

1 >>> # Data given as strings.

2 >>> data = [’ 998.64’, ’1965.01’, ’ 394.66’, ’2027.22’, ’1234.97’]

3 >>> for datapt in data:

4 ... # Ensure datapoint is a string and strip whitespace.

5 ... s = str(datapt).strip()

6 ... lead = s[0] # Get first character of string.

7 ... print(lead) # Note: Leading digit is a string.

8 ...

9 9

10 1

11 3

12 2

13 1

Voila! It worked, so now we know how to handle blank spaces (actually, because we’re only interested

in the leading digit, we could have just used lstrip() to remove whitespace to the left of the

datapoint). All that remains is to figure out how to increment the tally for each occurrence of a

leading digit. Let’s consider how this might be done. Recall that we have a list of nine counters,

each counter representing a digit between 1 and 9 and each addressable by a unique index. We can

associate tally[0] with the leading digit 1, tally[1] with the leading digit 2, and so on up to

tally[8] with the leading digit 9. Notice that the relationship between a leading digit and an index

is the same regardless of which digit we consider, i.e., tally[lead digit - 1] uses the correct

index for each leading digit. Thus, we can tally leading digits without knowing what the leading digits

are.

3

Program: Using Benford’s Law to Detect Falsified Data

The program you must write will prompt the user for the name of a file that contains the data to be

analyzed. You want to determine how well the data satisfy Benford’s law. It is assumed that every line

in the file starts with some text followed by a comma. There can be an arbitrary amount of additional

text, but any text ultimately ends with a comma. It is further assumed that the last entry in the line

is a numeric datapoint, i.e., this value follows the last comma in the line. However, there may be an

arbitrary number of spaces between the comma and the datapoint. For example, a line might contain

the name of a lake, a comma, the country in which the lake exists, a comma, and then the area of the

lake. We don’t care at all about the text in the file; we only care that there is some text and that it

and our numeric datapoint are separated by a comma. Why is this? Because we can split our line on

commas; split() converts a string into a list of elements, and we can use the proper index to select

the last element in this list which will contain the numeric datapoint we want to analyze.

You’ll need to open the data file using the open() function. Recall that open() takes two argu-

ments: the name of the file to be opened and the character that specifies whether the file is to be

opened for reading or writing (’r’ for reading, ’w’ for writing). You’ll then use the opened file

as the iterable in a for-loop. In fact, you can nest the input() function used to prompt for the

filename inside the open() function and save yourself one line of code, e.g.:

file = open(input(’Enter filename: ’), ’r’)

for line in file:

# Do something with a line in the file at each loop iteration.

After you process the data file and complete the tally list, the next step is to display the contents

of the list in such a way that you can tell how well the data in the file satisfy Benford’s law. To do

this you can compare the measured tally of the occurrences of each digit to the theoretical number of

occurrences as predicted by Benford’s law. A function called show digits() is provided for you

on the programming assignment Web page so you don’t have to write a function yourself. The input

to this function is your tally list.

In an IDLE editor window, create a program called benford.py. This program mainly consists of

a main() function and a call to main(). It must do the following:

• Create a tally list that has nine elements, each of which is initially zero.

• Prompt the user for the name of a data file and open it.

• Use a for-loop to read each line in the file, splitting each line on the commas (line.split(’,’))

to create a list. The value after the final comma is the datapoint of which you want the lead-

ing digit. You should not make any assumptions about the number of spaces between the final

comma and the datapoint at the end of the line, i.e., use strip() as shown above. (Note

that you don’t have to convert the datapoint to a string because it’s already a string. The entire

4

data file is made up of ASCII characters!) For the given digit, increment the tally list as

appropriate. Don’t forget to convert your digit to an integer.

• Call the show digits() function after the tally has been completed. You may include this

function in your code or else import it. If you import it, don’t forget that it needs to be in your

working directory or else in the path searched by Python.

• Print tally.

Don’t forget to include docstrings in your code!!!

To test your code for a small data file, download pa6 test.dat from the homework Web page.

You should get the following results:

1 >>> import pa6

2 Enter file name: pa6_test.dat

3 Total number of datapoints = 13

4 Number of occurrences of leading digits:

5 digit measured predicted

6 1 1 4

7 2 1 2

8 3 1 2

9 4 1 1

10 5 1 1

11 6 1 1

12 7 2 1

13 8 2 1

14 9 3 1

15 [1, 1, 1, 1, 1, 1, 2, 2, 3]

This particular set of data is small and was created just to provide one way for you to test your code

(you’re encouraged to come up with other ways). However, the distribution of leading digits is rather

suspicious (according to Benford’s law) and thus it is likely (and altogether true) that this data was

forged rather than a result of some real process. Note that show digits() rounds the results of

Benford’s law to obtain integers, so the total number of predicted digits may not agree with the actual

number of datapoints. This is the case here: there are a total of 14 predicted values and yet there

are only 13 datapoints in the original data. We won’t worry about this discrepancy (which can be

corrected using conditional statements).

The programming assignment Web site provides a much larger data file. This file, called county pop.dat,

provides the population of every county in the United States according to the US Census Bureau.

There are a total of 3,141 counties reported in this file, some of which are shown below:

1 Alabama, Autauga County, 49960

2 Alabama, Baldwin County, 171769

5

3 ...

4 Alaska, Yakutat City and Borough, 689

5 ...

6 New York, Kings County, 2528050

7 ...

8 Washington, Whitman County, 41229

9 ...

10 Wyoming, Washakie County, 7816

11 Wyoming, Weston County, 6854

Download this file, and use it as the input data file for your program. You should get the following

results:

1 Enter file name: county_pop.dat

2 Total number of datapoints = 3141

3 Number of occurrences of leading digits:

4 digit measured theory

5 1 969 946

6 2 584 553

7 3 366 392

8 4 317 304

9 5 214 249

10 6 210 210

11 7 178 182

12 8 151 161

13 9 152 144

14 [969, 584, 366, 317, 214, 210, 178, 151, 152]

Rather amazingly the predicted number of times the data should have a leading digit of 6 is 210 and

the number of times there is actually a county population that starts with a leading digit of 6 is also

210. There are differences between the measured and predicted values for the other leading digits, but

the differences aren’t large. Thus, we can be fairly confident that the US Census Bureau is not forging

its data (although it isn’t likely that the data are entirely accurate).

Submission information. Use Blackboard to submit a zipped file called

<first initial+lastname> pa6.zip. You should know how to zip a file by now. If you

still don’t, find out from your TA. Submit PA #6 as a zip file up to 3 times before the deadline. Don’t

forget that style counts, and don’t forget docstrings.

6