
Note:

1) A large number of input cells are hidden.

2) Some formatting in the PDF file may differ from the original report.

A Statistical Analysis of the Latin and English

versions of the Aeneid by Virgil

EIN 3235 Evaluation of Engineering Data

Section U21, Winter 2026, Group No. 89

John Smith

PID: 1111111

Sam Johnson

PID: 2222222

Abstract

For this research project we applied statistical analysis to both the English and Latin
versions of the epic poem the Aeneid so that we could compare the two versions, focusing
specifically on word lengths. We hope to use these data to draw accurate conclusions about
the two versions, such as their complexity and length. To accomplish this we used
Mathematica, a computational tool that performs our calculations and produces our
graphical summaries. Our data show that the Latin version uses longer words on average
than the English version and even demonstrates a more sophisticated use of language.

January, 2026

1. Introduction

The Aeneid is a Latin epic poem in 12 books, written by Virgil between 29 and 19 BC, that tells the story of a Trojan
named Aeneas and how he became the ancestor of the Romans. This project uses statistical analysis to compare the
English and Latin versions of the Aeneid and to draw conclusions from the data. Through this research, one not only
learns how to conduct statistical analysis on a real-world data set, but also hones the engineering skill of drawing
well-reasoned conclusions from the evidence collected.

Rationale

Statistical analysis is important in every discipline of engineering, from electrical to civil, and honing these skills
is one of the many responsibilities of engineering students who hope to be successful in the field. This research is
therefore aimed at practicing different statistical analysis techniques using computer-based mathematical tools such
as Mathematica.

Purpose

The purpose of this project is not only to make accurate and significant comparisons between the English and Latin

versions of the “Aeneid,” but more importantly to practice statistical analysis using Wolfram’s Mathematica.

Research Questions

• How does one use Wolfram Mathematica?
• What is statistical analysis?
• What is the difference between a sample and the population?
• What do the different kinds of graphical summaries say about the data set?
• Which numerical summaries are appropriate to make conclusions about a data set?
• How is statistical analysis useful in engineering?

2. Statistics concepts in engineering

Engineering and science are deeply rooted in the measurement and analysis of changing variables, from the electrical
engineer measuring the resistance of different resistors, to the chemist calculating the number of moles in a gaseous
substance, to the zoologist counting the members of a certain species per square mile in the African savannas.
Wherever there is a measurement, however, there is also the chance of measurement error, and that is where statistical
analysis comes into play. Statistical analysis cannot guarantee an accurate measurement, but it does provide an
estimate of the likely error. By repeating measurements, statistics lets us estimate experimental variables more
accurately, so that conclusions are less likely to be the result of coincidence. Statistical analysis also allows
large amounts of data to be described with easy-to-read graphical summaries and to be represented by a single numerical
value, often described as the “central” measure of the data. With these graphical summaries and central values, one
can even make predictions in engineering through the application of probability. This is often done in manufacturing,
in combination with sampling techniques, to test different manufacturing methods and decide on the best one. Other
examples of statistical concepts in engineering and science are shown below:

• Quality control applies statistical analysis to the factors involved in manufacturing to decide whether or not a
sample should be accepted.
• Reliability engineering measures how consistently a system performs a specific function under specified conditions
for a predetermined amount of time [3].
• Biostatistics uses statistical analysis to study biological phenomena and observations, extending as far as medical
applications [4].
• Machine learning is a subfield of computer science, branching from artificial intelligence, that uses statistical
concepts to recognize patterns and build algorithms from data [5].

3. Problem statement

This project uses statistical analysis to draw conclusions about two versions of the Aeneid, the English and the
Latin, focusing on the analysis of word lengths. An additional aspect of the project is to compare each version to
its respective dictionary and draw further conclusions about the two versions. Mathematica handles the calculations
on the large amounts of data and provides the different types of graphical summaries that support those conclusions.

4. Description of data

Mathematica’s built-in example data includes the Latin and English versions of the Aeneid of Virgil. By using the
command “ExampleData[]” and specifying the language, we can access both versions and store each one as a string.

2 EIN-3235-Project-example-(!)-v03.nb

english = ExampleData[{"Text", "AeneidEnglish"}];
latin = ExampleData[{"Text", "AeneidLatin"}];

To perform the analysis, we convert each text into an array of numbers, where each number is the length of one word.
We also split some compound words so that they count as two words, and trim some words to remove possessives.

englishArray = StringSplit @ english;
latinArray = StringSplit @ latin;

(* Split Latin words with "---" in them e.g. "iussi---miserum!---septena" *)
latinArray = StringSplit[#, "---"] & /@ latinArray // Flatten;

(* Split English words with "-" in them e.g. "great-grandsire" *)
englishArray = StringSplit[#, "-"] & /@ englishArray // Flatten;

(* Trim words like ~*this?% *)
(* R in the printed notebook is assumed to abbreviate RegularExpression *)
englishArray = StringTrim[#, RegularExpression["\\W+"]] & /@ englishArray;
latinArray = StringTrim[#, RegularExpression["\\W+"]] & /@ latinArray;

(* Get rid of possessive "'s" e.g. "father's" *)
englishArray = StringTrim[#, "'s"] & /@ englishArray;

(* Remove cases of 0 word length since these aren't words *)
englishArray = Select[englishArray, StringLength @ # != 0 &];
latinArray = Select[latinArray, StringLength @ # != 0 &];

englishLengthArray = StringLength /@ englishArray;
latinLengthArray = StringLength /@ latinArray;
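To make the effect of each cleaning step easier to follow, here is a minimal sketch of the same pipeline applied to one short sentence. The sentence and the variable names `toy`, `words`, and `lengths` are hypothetical and are not taken from the Aeneid data.

```wolfram
(* Hypothetical toy input, not taken from the Aeneid texts *)
toy = "Arms, and the man I sing: his great-grandsire's fate!";

words = StringSplit[toy];                                     (* split on whitespace *)
words = Flatten[StringSplit[#, "-"] & /@ words];              (* break hyphenated compounds *)
words = StringTrim[#, RegularExpression["\\W+"]] & /@ words;  (* strip punctuation at both ends *)
words = StringTrim[#, "'s"] & /@ words;                       (* drop possessive endings *)
words = Select[words, StringLength[#] != 0 &];                (* discard empty strings *)

lengths = StringLength /@ words
(* → {4, 3, 3, 3, 1, 4, 3, 5, 9, 4} *)
```

Note that the order of the steps matters: punctuation is trimmed before the possessive, so a token such as "grandsire's," would first lose the comma and then the "'s".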

This project analyzes two main data sets, the English version and the Latin version of the Aeneid, which will be
referred to as “English” and “Latin” respectively. For English, the population consists of every word across all 12
books, giving 106,578 data points with lengths ranging from 1 to 15. Surprisingly, the most frequent word length in
this data set is only 3 characters, with 24,785 words, or approximately 23% of the English data. In contrast, the
Latin population contains only 63,743 words, and its most frequent word length is 5 characters, with 11,626 words, or
approximately 18% of the Latin data. Additional numerical summaries are provided below:

                   English Aeneid        Latin Aeneid
# of Words         106,578               63,743
Mean               4.45272               5.76154
95% Mean CI        {4.44112, 4.46432}    {5.74442, 5.77866}
Std. Dev.          1.93219               2.20522
Q1                 3                     4
Q2                 4                     6
Q3                 6                     7
IQR                3                     3
Skewness Coeff.    0.718498              0.226205

Just by looking at the means of both arrays we can conclude that Latin words are somewhat longer than English words,
by more than one letter on average. Comparing the standard deviations, we can also state that Latin words vary
slightly more in their number of letters.
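The summaries in the table can be computed with a handful of built-in functions. The sketch below uses a small hypothetical list of word lengths. The confidence interval uses the normal approximation mean ± z·s/√n, which is consistent with the intervals reported in the table but is an assumption about how they were obtained, since the original input cells are hidden.

```wolfram
(* Hypothetical sample of word lengths, used only for illustration *)
lengths = {3, 4, 2, 5, 7, 3, 6, 4, 3, 8};

n = Length[lengths];
mean = N[Mean[lengths]];                 (* 4.5 for this list *)
sd = N[StandardDeviation[lengths]];      (* sample standard deviation *)

(* Normal-approximation 95% confidence interval for the mean *)
z = Quantile[NormalDistribution[0, 1], 0.975];
ci = mean + {-1, 1} z sd/Sqrt[n];

{q1, q2, q3} = Quartiles[lengths];       (* Q1, median, Q3 *)
iqr = q3 - q1;
skew = N[Skewness[lengths]];             (* positive when the right tail is longer *)
```

For a sample as small as this one a Student t interval would be more appropriate; with over 60,000 words, as in the report, the two are practically identical.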

5. Relevant statistical and graphical summaries

Mathematica allows us to perform a deeper level of data analysis, so we will take advantage of its tools. Let’s start
with some graphics:

[Figure: two histograms, “English” and “Latin”; x-axis: Word Length, y-axis: Frequency]

These two graphics show the distribution of the words by length. As we stated before, Latin words are usually longer:
the three most frequently used lengths in English are 4, 5, and 6, whereas in Latin they are 6, 7, and 8. Boxplots
allow us to draw additional conclusions about the datasets.
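Graphics of this kind can be produced as sketched below. The two lists are hypothetical stand-ins for `englishLengthArray` and `latinLengthArray`, and the plot options are one reasonable choice rather than the ones used in the hidden input cells.

```wolfram
englishLengths = {3, 4, 2, 5, 7, 3, 6, 4, 3, 8};    (* hypothetical *)
latinLengths = {5, 6, 4, 7, 9, 5, 8, 6, 3, 10};     (* hypothetical *)

(* Bin width 1, so that each bar corresponds to one word length *)
Histogram[englishLengths, {1}, "Count",
  AxesLabel -> {"Word Length", "Frequency"}, PlotLabel -> "English"]

(* Side-by-side box plots with outliers drawn as individual points *)
BoxWhiskerChart[{englishLengths, latinLengths}, "Outliers",
  ChartLabels -> {"English Aeneid", "Latin Aeneid"},
  PlotLabel -> "Aeneid Word Lengths"]
```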

[Figure: “Aeneid Word Lengths”. Boxplots of word length for the English Aeneid and the Latin Aeneid; y-axis: Word Length]

Again, Latin usually uses longer words, but this type of graphical analysis lets us draw additional conclusions. The
maximum word length in English is 15 letters, while in Latin it is 16. Also, the median in English is 4, which is less
than the mean of 4.45, while the median in Latin is 6, which is more than the mean of 5.76.

A good way to save time when analyzing a population (in this case, all the words of the book) is to extract a random
part of the text. We extract a sample of 1000 random words and compare it to the entire population to see how well it
represents the word lengths of each version of the book.

[Figure: two histograms, “English Sample v. Population” and “Latin Sample v. Population”, each overlaying Population
and Sample; x-axis: Word Length, y-axis: Relative Frequency]

For both English and Latin, the shape of the simple random sample’s graphic looks extremely similar to that of the
population. It is therefore reasonable to state, initially, that the sample is a good representation of the total
population. The slight differences between the sample graphic and the population graphic are due to what we call
“sampling variation”.
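A comparison of this kind can be sketched as follows. The population here is a hypothetical list of random integers standing in for the real word lengths, and the seed is fixed only to make the sketch repeatable.

```wolfram
SeedRandom[1];  (* fixed seed so the sketch is repeatable *)

population = RandomInteger[{1, 15}, 100000];   (* hypothetical stand-in for the word lengths *)
sample = RandomSample[population, 1000];       (* simple random sample without replacement *)

(* Overlay both data sets on a relative-frequency scale *)
Histogram[{population, sample}, {1}, "Probability",
  ChartLegends -> {"Population", "Sample"},
  AxesLabel -> {"Word Length", "Relative Frequency"}]

(* The two means should be close but not identical: sampling variation *)
{N[Mean[population]], N[Mean[sample]]}
```

Re-running the last three expressions with a different seed gives a slightly different sample histogram and mean each time, which is exactly the sampling variation described above.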

So that we may make better and more accurate conclusions based on evidence, we can compare the word lengths

used in the books with the word lengths in the dictionaries, both English and Latin. In doing this we can make some other

observations.

[Figure: two histograms, “English Dictionary vs. Aeneid” and “Latin Dictionary vs. Aeneid”, each overlaying
Dictionary and Aeneid; x-axis: Word Length, y-axis: Relative Frequency]

With this new comparison, we see that the words used in the English version of the Aeneid are noticeably shorter than
those in the English dictionary. The words used in the Latin version are closer in length to those in the Latin
dictionary, even though they, too, are shorter than the dictionary’s. This difference may arise because the Latin
text is the original, whereas the English version had to be translated from Latin; the translator may have favored
shorter words in an effort to use archaic English vocabulary.
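The dictionary word lengths can be obtained along the following lines. This is a sketch that assumes the report used Mathematica's built-in `DictionaryLookup` data and that a "Latin" dictionary is available in the installed version; the hidden input cells may have used a different source.

```wolfram
(* DictionaryLookup[{"language", All}] returns the full word list for that language *)
englishDict = DictionaryLookup[{"English", All}];
latinDict = DictionaryLookup[{"Latin", All}];   (* assumes a Latin dictionary is installed *)

englishDictLengths = StringLength /@ englishDict;
latinDictLengths = StringLength /@ latinDict;

(* Mean dictionary word length for each language *)
N[Mean /@ {englishDictLengths, latinDictLengths}]
```

Because a dictionary lists each word once, long rare words weigh as much as short common ones, which is one reason a running text tends to look shorter by comparison.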

We can appreciate this same difference with boxplots.

[Figure: “Dictionaries v. Aeneids”. Boxplots of word length for the English Aeneid, English Dictionary, Latin Aeneid,
and Latin Dictionary; y-axis: Word Length]

Another strategy for comparing the Latin and English versions of the Aeneid to their corresponding dictionaries is to
first remove all duplicate words from the texts. Our reasoning is that the Aeneid represents language used in a
natural environment, whereas a dictionary is essentially a catalogue of words. We feel that the following is a fairer
comparison, or at least gives us another perspective from which to look at the data.
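Removing duplicates can be sketched as below. The token list is hypothetical, and lower-casing before deduplication is an assumption made here so that "Arma" and "arma" count as one word; the report does not say whether its hidden cells did the same.

```wolfram
words = {"arma", "virumque", "cano", "arma", "Arma", "cano"};  (* hypothetical tokens *)

(* Lower-casing first treats "Arma" and "arma" as the same word *)
uniqueWords = DeleteDuplicates[ToLowerCase /@ words];

uniqueLengths = StringLength /@ uniqueWords
(* → {4, 8, 4} *)
```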

[Figure: two histograms, “English Dictionary vs. Aeneid (Duplicate words removed)” and “Latin Dictionary vs. Aeneid
(Duplicate words removed)”, each overlaying Dictionary and Aeneid; x-axis: Word Length, y-axis: Relative Frequency]

We can appreciate this same difference with boxplots.

[Figure: “Dictionaries v. Aeneids (Duplicate words removed)”. Boxplots of word length for the English Aeneid, English
Dictionary, Latin Aeneid, and Latin Dictionary; y-axis: Word Length]

6. Statistical model of the sample, model limitations, and error estimation

Now we will use another tool to compare the word length distribution with standard distribution models.

{ExtremeValueDistribution[…, 1.56917], NormalDistribution[…, 1.93218], GumbelDistribution[…, 2.2096]}

[Figure: “English: ExtremeValueDistribution[…, 1.56917]”. Histogram of word length with the fitted distribution;
x-axis: Word Length, y-axis: Relative Frequency]

[Figure: “English: NormalDistribution[…, 1.93218]”. Histogram of word length with the fitted distribution;
x-axis: Word Length, y-axis: Relative Frequency]

[Figure: “English: GumbelDistribution[…, 2.2096]”. Histogram of word length with the fitted distribution;
x-axis: Word Length, y-axis: Relative Frequency]

Qualitatively, we can see that of the Extreme Value, Normal, and Gumbel distributions, Extreme value provides the

closest fit to the English word length data. We need more objective measures, however, so we turn to Mathematica’s

automated hypothesis testing routines.
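A fit-and-test workflow of this kind might look like the sketch below. The data are hypothetical rounded draws from a skewed distribution, and `EstimatedDistribution` followed by `DistributionFitTest` is an assumption about what the hidden cells did, not a record of them. The symbols `a`, `b`, `m`, and `s` are placeholder parameters and must not have values assigned.

```wolfram
SeedRandom[2];

(* Hypothetical data: rounded draws from a skewed distribution, standing in for word lengths *)
data = N[Round[RandomVariate[ExtremeValueDistribution[3.5, 1.5], 5000]]];

(* Maximum-likelihood estimates of the two parameters of each family *)
fits = EstimatedDistribution[data, #] & /@ {
    ExtremeValueDistribution[a, b],
    NormalDistribution[m, s],
    GumbelDistribution[a, b]};

(* One table of test statistics and p-values per fitted distribution *)
DistributionFitTest[data, #, {"TestDataTable", All}] & /@ fits
```

Each table lists the statistic and p-value for every applicable test, and the null hypothesis in each case is that the data follow the fitted distribution.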

ExtremeValueDistribution

                    Statistic          P-Value
Anderson-Darling    1534.59            0.
Cramér-von Mises    271.816            1.22125 × 10^-14
Pearson χ²          3.32422 × 10^6     1.770295561886984 × 10^-721374

NormalDistribution

                    Statistic          P-Value
Anderson-Darling    2284.24            0.
Cramér-von Mises    403.19             1.46549 × 10^-14
Pearson χ²          3.32578 × 10^6     2.360941648595737 × 10^-721714

GumbelDistribution

                    Statistic          P-Value
Anderson-Darling    4710.37            0.
Cramér-von Mises    787.407            3.07532 × 10^-14
Pearson χ²          106,578            4.547810396561363 × 10^-22825

Here we show the statistics and p-values for three well-known hypothesis testing metrics, Anderson-Darling,
Cramér-von Mises, and Pearson χ². The p-values are very low so that means good fits, right?

                              Anderson-Darling    Cramér-von Mises    Pearson χ²
Extreme Value Distribution    Reject              Reject              Reject
Normal Distribution           Reject              Reject              Reject
Gumbel Distribution           Reject              Reject              Reject

According to Mathematica, all of our fits are rejected. A very low p-value is in fact evidence against a fit: it means
that data this far from the hypothesized distribution would be very unlikely if that distribution were the true one.
It turns out that on very large data sets, measures like p-values are less than helpful in determining whether or not
your data follow a certain distribution [2]. Even though Extreme Value looked like a good fit, the selected metrics
give no objective evidence that it is in fact the underlying distribution of our dataset. Intuitively this makes
sense, since the more data you have, the less likely it is that any deviation from a particular distribution is simply
due to chance. Conversely, getting a “Do Not Reject” in one of these tests only tells you that there is not enough
evidence to reject the null hypothesis that the data are distributed according to a certain distribution. It’s not so
much an affirmation as it is a non-negation...

Let’s do the same thing for the Latin Aeneid.

[Figure: “Latin: ExtremeValueDistribution[…, 2.01059]”. Histogram of word length with the fitted distribution;
x-axis: Word Length, y-axis: Relative Frequency]

                    Statistic          P-Value
Anderson-Darling    1037.77            0.
Cramér-von Mises    170.284            1.74305 × 10^-14
Pearson χ²          1.30118 × 10^6     3.407713710437922 × 10^-282193

[Figure: “Latin: NormalDistribution[…, 2.2052]”. Histogram of word length with the fitted distribution;
x-axis: Word Length, y-axis: Relative Frequency]

                    Statistic          P-Value
Anderson-Darling    599.336            0.
Cramér-von Mises    103.911            8.54872 × 10^-15
Pearson χ²          1.30138 × 10^6     8.197044723339416 × 10^-282235

[Figure: “Latin: GumbelDistribution[…, 2.29716]”. Histogram of word length with the fitted distribution;
x-axis: Word Length, y-axis: Relative Frequency]

                    Statistic          P-Value
Anderson-Darling    1400.08            0.
Cramér-von Mises    220.964            1.23235 × 10^-14
Pearson χ²          63,743             1.902569727444612 × 10^-13594

As one can imagine, the null hypothesis is rejected for all of these models.

Let’s try a different method. This time instead of using the above testing methods we’ll use probability plots and the

Kolmogorov-Smirnov test.
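Both tools are available as built-in functions. The sketch below uses hypothetical continuous data; the parameter names `m` and `s` are placeholders for the values to be estimated.

```wolfram
SeedRandom[3];

data = RandomVariate[NormalDistribution[5.8, 2.2], 2000];    (* hypothetical data *)
fit = EstimatedDistribution[data, NormalDistribution[m, s]];

ProbabilityPlot[data, fit]          (* points near the diagonal indicate agreement *)
KolmogorovSmirnovTest[data, fit]    (* returns a p-value by default *)
```

One caveat worth keeping in mind: the Kolmogorov-Smirnov test is designed for continuous data, while word lengths are integers with many ties, so very small p-values are to be expected against any continuous candidate distribution.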

[Figures: probability plots for each data set and fitted distribution, with both axes running from 0.0 to 1.0. The
Kolmogorov-Smirnov test result reported under each plot is listed below.]

                                      Kolmogorov-Smirnov Test
English: ExtremeValueDistribution     5.82245506223 × 10^-1507
English: NormalDistribution           1.459174069333 × 10^-2516
English: GumbelDistribution           4.15762291105 × 10^-3002
Latin: ExtremeValueDistribution       2.877367316234 × 10^-1135
Latin: NormalDistribution             6.745649140213 × 10^-559
Latin: GumbelDistribution             1.391285073208 × 10^-1146

The values reported are Kolmogorov-Smirnov p-values, so a higher value indicates a better fit. We therefore conclude
that, although no distribution fits well, the best of the candidate distributions for the English and Latin data are
the Extreme Value Distribution and the Normal Distribution, respectively.

Next, we’ll try to model the probability of each character given the previous character. This is known as a bigram
model, and its formulation is as follows [1]:

$$P(w_i \mid w_{i-1}) = \frac{\mathrm{Count}(w_{i-1}, w_i)}{\mathrm{Count}(w_{i-1})} \tag{1}$$

Simple enough, let’s do this for all of the unique character tuples in our dataset. First we simplify the problem by

making everything lower case...

engLower = ToLowerCase @ english;

Then we find the unique characters.

engUnique = Union @ Characters @ engLower

{!, (, ), -, ., ,, ;, ", ?, ', :,  , a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z}

Next we construct our bigram transitional probability matrix. This takes a second...

engBigram = Table[StringCount[engLower, pre ~~ pro] / StringCount[engLower, pre], {pre, engUnique}, {pro, engUnique}];
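Equation (1) can be checked by hand on a short string. The sketch below uses a hypothetical three-word text, and also illustrates a subtlety of `StringCount` that affects doubled letters: by default it does not count overlapping matches.

```wolfram
toy = "quiet quick queen";   (* hypothetical text *)

(* P(u | q) = Count(q, u) / Count(q): every "q" here is followed by "u" *)
StringCount[toy, "qu"]/StringCount[toy, "q"]
(* → 1 *)

(* Overlapping matches are skipped by default, which undercounts runs of a repeated letter *)
StringCount["aaa", "aa"]                     (* → 1 *)
StringCount["aaa", "aa", Overlaps -> True]   (* → 2 *)
```

For ordinary text, runs of three or more identical characters are rare, so the default setting is unlikely to change the transition matrix much, but `Overlaps -> True` matches the definition in equation (1) more literally.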

And now let’s visualize our results.

[Figure: “Bigram transitional probability” for the English text. Matrix plot over the characters in engUnique; rows:
“From this character...”, columns: “...To this character”; color legend with ticks from -6 to 6]

What can we glean from this information? We can see, for example, that punctuation marks like “!”, “?”, “,”, etc. are
very likely to be followed by a space. We can also see that it is almost certain that if you see a “q”, it will be
followed by “u”! This kind of analysis reveals things not necessarily specific to the Aeneid, but to the English
language in general.

Before we do the Latin version, let’s try to generate our own English sentences based on these transitional
probabilities. If we do the following...

engAdjBigram = N[Piecewise[{{Infinity, # == 0}, {#, True}}] & /@ Flatten @ engBigram // Partition[#, Length @ engUnique] &];

[Figure: visualization of the English bigram transition probabilities, with the individual probability values shown
as labels]

engDMP = DiscreteMarkovProcess[First @ First @ Position[engUnique, "q"], WeightedAdjacencyGraph[engAdjBigram]];

We can construct and visualize a discrete Markov process model of our data.

Now let’s generate some text...

engUnique[[Last /@ RandomFunction[engDMP, {0, 50}]["Path"]]] // StringJoin

qup,'arva-ssoqu."mv't,'ezl?),. nyevy; mixe. f-vo,.

It turns out that a bigram model doesn’t produce very interesting character strings. This makes sense: we’re only using information about the character directly preceding the one we’re trying to predict, and English doesn’t really work that way. But look, the “qu” happened!
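The idea behind the bigram model can be sketched outside Mathematica as well. The following is a minimal Python illustration, not the code used in this report: it estimates transition probabilities from adjacent character pairs and then samples a first-order Markov chain starting at “q”. The toy corpus and the names `bigram_transitions` and `generate` are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def bigram_transitions(text):
    """Return {char: {next_char: probability}} estimated from adjacent character pairs."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }

def generate(transitions, start, length, rng):
    """Sample up to `length` characters from the first-order Markov chain, starting at `start`."""
    out = [start]
    # Stop early only if the current character was never seen with a successor.
    while len(out) < length and out[-1] in transitions:
        options = transitions[out[-1]]
        out.append(rng.choices(list(options), weights=list(options.values()))[0])
    return "".join(out)

# Toy corpus (hypothetical stand-in for the full English text).
corpus = "the quick queen quietly questioned the quiet squire"
transitions = bigram_transitions(corpus)
sample = generate(transitions, "q", 51, random.Random(0))
print(sample)
```

Because each character is drawn using only the one before it, the output keeps local patterns such as “qu” but has no word-level structure, which matches the behaviour described above.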

Anyway, let’s build the bigram transitional probability matrix for the Latin version of the Aeneid in one go, with a very “succinct” line of code.

[Figure: Bigram transitional probability matrix for the Latin version of the Aeneid. Vertical axis: “From this character...”; horizontal axis: “...To this character”. Both axes list the characters ! ( ) - ` [ ] . , ; ? ' : a b c d e f g h i k l m n o p q r s t u v x y z.]

Nice. With this we can see that there are a few interesting similarities and differences between the Latin and English languages. In both Latin and English, “q” is almost certainly followed by “u”, but quite unlike in English, in Latin a “k” is almost certainly followed by an “a”. As non-Latin speakers, this is something we did not know!
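A comparison of this kind amounts to reading off, for each row of the transition matrix, the column with the largest probability. The sketch below shows one way to do that in Python. It is an illustration under stated assumptions: the helper `most_likely_next` is hypothetical, the Latin sample is only the opening line of the Aeneid, and the English sample is a made-up phrase, so the numbers say nothing about the full texts.

```python
from collections import Counter, defaultdict

def most_likely_next(text):
    """Map each character to its most frequent successor and that successor's relative frequency."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    result = {}
    for current, followers in counts.items():
        nxt, n = followers.most_common(1)[0]
        result[current] = (nxt, n / sum(followers.values()))
    return result

latin_sample = "arma virumque cano troiae qui primus ab oris"  # opening line of the Aeneid
english_sample = "the quiet queen quit the quarrel"            # made-up English phrase

for name, text in [("Latin", latin_sample), ("English", english_sample)]:
    print(name, most_likely_next(text)["q"])
```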

7. Conclusions

In this study we used a number of different methods to analyze both the English and Latin versions of the Aeneid by

Virgil, and tried to uncover interesting ways of comparing and contrasting the two texts from a statistical and probabilistic

point of view.

Research Methods

We used a variety of methods implemented in Mathematica to statistically analyze both the word lengths and the inter-character probabilistic dependencies of our two datasets. Various graphical summaries of the word lengths, including histograms and boxplots, were produced and analyzed, and graphical probabilistic models of the bigram transitional probabilities of individual characters in the two datasets were constructed and simulated as discrete Markov processes.

Findings

Our study included a number of interesting findings. We found the mean word length of the English version of the Aeneid (4.45) to be lower than that of the Latin version (5.67). On the other hand, the Latin version uses far fewer words to tell the same story: 63,743 compared to 106,578 in the English version. This suggests that Latin, in its typical usage, consists of fewer and longer words, each encoding on average more information than an English word. These findings are also borne out by the comparisons made to the English and Latin dictionaries, which show a greater sophistication in the usage of Latin vocabulary when compared to English.
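The figures quoted above can be combined into a rough estimate of the total number of letters in each version, since total letters is approximately the word count multiplied by the mean word length. The short Python check below uses only the four numbers reported in this section. Because the means are rounded to two decimals, the totals are approximations rather than exact counts.

```python
# Figures reported in the Findings paragraph.
english_words, english_mean_len = 106_578, 4.45
latin_words, latin_mean_len = 63_743, 5.67

# Total letters is approximately (number of words) x (mean word length).
english_letters = english_words * english_mean_len
latin_letters = latin_words * latin_mean_len

print(f"Word-count ratio (English/Latin):  {english_words / latin_words:.2f}")
print(f"Mean-length ratio (Latin/English): {latin_mean_len / english_mean_len:.2f}")
print(f"Approx. total letters, English:    {english_letters:,.0f}")
print(f"Approx. total letters, Latin:      {latin_letters:,.0f}")
```

By this estimate the English version needs about 1.67 times as many words as the Latin one, while Latin words are only about 1.27 times longer, giving roughly 474,000 letters for the English text against roughly 361,000 for the Latin text.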

We showed that the word lengths of the English and Latin versions of the Aeneid could not justifiably be said to follow any of the Extreme Value, Normal, or Gumbel distributions. Probability plots were constructed and Kolmogorov-Smirnov values computed to show that, of these three distributions, the Extreme Value distribution most closely fit the English word length data, and the Normal distribution most closely fit the Latin word length data.
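For readers unfamiliar with the Kolmogorov-Smirnov value, it is the largest vertical gap between the empirical cumulative distribution of the data and the cumulative distribution of the candidate model; a smaller value means a closer fit. The Python sketch below illustrates the computation. Several things in it are assumptions rather than the method used in this report: the parameters are fitted by the method of moments, “Extreme Value” is taken to mean the maximum-type Gumbel distribution and “Gumbel” the minimum-type one (which is how Mathematica appears to name them), and the data are the word lengths of a few opening words of the Aeneid rather than the full text.

```python
import math
import statistics

def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of `data` and a model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

# Toy data: word lengths from the opening of the Aeneid (not the full dataset).
words = "arma virumque cano troiae qui primus ab oris italiam fato profugus laviniaque venit litora".split()
lengths = [len(w) for w in words]

mean = statistics.mean(lengths)
sd = statistics.stdev(lengths)

# Method-of-moments parameters (an assumption; other fitting methods give other values).
EULER_GAMMA = 0.5772156649015329
beta = sd * math.sqrt(6) / math.pi
alpha_max = mean - EULER_GAMMA * beta  # location for the maximum-type distribution
alpha_min = mean + EULER_GAMMA * beta  # location for the minimum-type distribution

candidates = {
    "Normal": lambda x: 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2)))),
    "Extreme Value (max)": lambda x: math.exp(-math.exp(-(x - alpha_max) / beta)),
    "Gumbel (min)": lambda x: 1 - math.exp(-math.exp((x - alpha_min) / beta)),
}

results = {name: ks_statistic(lengths, cdf) for name, cdf in candidates.items()}
for name, d in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: D = {d:.3f}")
```

The candidate printed first has the smallest gap and is therefore the closest of the three for the toy data; with only a handful of words this ranking is not meaningful for the full texts.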

We also found interesting probabilistic dependencies between the characters of English and Latin as they are used in the Aeneid. For example, we found that some similarities in the bigram transitional probabilities between characters are preserved between the languages, such as the high likelihood of a “q” being followed by a “u”, while others are markedly different, such as the near-certain probability in Latin that a “k” is followed by an “a”, whereas in English a “k” is almost never followed by an “a”. These facts hint at a subtle and interesting relationship between the two languages.

Implications

Our study revealed a number of interesting statistical relationships, detailed above, between the English and Latin versions of the Aeneid. These relationships help shed light on how the two languages relate to each other.

8. Learning outcomes

In response to our research questions, we have arrived at the following learning outcomes:

• How does one use Wolfram Mathematica?

  • Mathematica is an interesting and varied language useful for high-level general programming. In creating this report we made extensive use of Wolfram’s references and guides.

• What is statistical analysis?

  • Statistical analysis involves quantifying various aspects of a dataset such as the mean, standard deviation, and quartiles. These single-number summaries are useful for drawing far-reaching conclusions about the data one is working with, and they help abstract away what can often amount to millions of individual data points.

• What is the difference between a sample and the population?

  • Samples are groups of data points drawn from a larger dataset known as the population. In our analysis of the Latin and English versions of the Aeneid, we found that a sample of a mere 1,000 words was good enough to accurately characterize many important aspects of the full population datasets of more than 100,000 and 63,000 words for the English and Latin versions, respectively.

• What do the different kinds of graphical summaries say about the data set?

  • Our study made extensive use of graphical summaries to capture essential aspects of our datasets in a way that was both informative and accurate. Chief among these methods are the histogram and the box plot, which help visualize important properties of the dataset such as its center, spread, and shape (a box plot, for instance, marks the median and quartiles).

• Which numerical summaries are appropriate to make conclusions about a data set?

  • We used a wide variety of numerical summaries to characterize and draw conclusions from our datasets, including the mean and standard deviation of the word lengths in each version of the text as well as the bigram transitional probabilities of characters in the text.

• How is statistical analysis useful in engineering?

  • Engineering makes extensive use of statistical analysis for characterizing the uncertainty inherent in a quantitative analysis of the natural world. Our study continues in this vein, attempting to make sense of one of the most essential aspects of human behavior, language, in a precise and informative way.
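The numerical summaries and the sample-versus-population idea from the list above can be illustrated with a short Python sketch. It is not the analysis performed in this report: the population here is synthetic (100,000 made-up “word lengths”), the helper `summarize` is hypothetical, and the fixed random seed is only there to make the run repeatable.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 100,000 synthetic "word lengths" between 1 and 12.
population = [min(12, max(1, round(rng.gauss(4.5, 2.5)))) for _ in range(100_000)]

# A simple random sample of 1,000 values drawn without replacement.
sample = rng.sample(population, 1_000)

def summarize(values):
    """Mean, sample standard deviation, and quartiles of a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }

for name, values in [("population", population), ("sample", sample)]:
    print(name, {key: round(value, 2) for key, value in summarize(values).items()})
```

Comparing the two printed lines shows how closely a sample of 1,000 values tracks the summaries of the full population.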

9. Individual contribution of each team member

John Smith (50%)

1. Description of Datasets

2. Relevant statistical and graphical summaries

3. Statistical model of the sample, model limitations, and error estimation

4. Conclusions

5. Learning Outcomes

Sam Johnson (50%)

1. Abstract

2. Introduction

3. Problem Statement

4. Statistics concepts in Engineering

5. Description of Datasets

10. Acknowledgement

We’d like to acknowledge FIU for providing computing resources.
