Symbolic regression using genetic programming

Assignment 2 Due Monday Nov 4th 11:55pm

Your base assignment is to implement symbolic regression. Refer to the rough mark breakdown below.

Genetic Programming

The first portion of the assignment is to implement the GP for symbolic regression. I strongly encourage you to use an already created GP system for this assignment (although, if you want, by all means make your own from scratch). Two popular ones are ECJ [1] (Java) and DEAP [2] (Python). Feel free to use any system you like, but please check with me first if you want to use one implemented in a language other than Java, Python, C, C++, or C#. If you want, you can use mine [3], but note that it only works on symbolic regression and is not that well documented (but it is very good at symbolic regression).

[1] https://cs.gmu.edu/~eclab/projects/ecj/
[2] https://deap.readthedocs.io/en/master/
[3] https://github.com/jameshughes89/jGP

Any academic misconduct will be investigated fully and I will push for the maximum allowable penalty.

I have provided you with a collection of data in CSV format. For the most part, I generated this data by randomly generating data points, passing them through a function, and then adding a little bit of noise to the output. See if you can reverse engineer the functions I used to create the data. All the data is formatted such that the first n-1 columns are the independent variables and the nth (last) column is the dependent variable. For example, in ‘d1.csv’ there are two columns. The first column we will call x and the second we will call y. We need to find some function of x that will predict y, so y ≈ f(x). If we had 3 columns, we would want z ≈ f(x,y). Note that I write ‘approximately equal to’ because you may not find the exact functions, but you will likely still get a close approximation. Ultimately, your goal is to use symbolic regression to try to find f.
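
As a rough illustration of the column layout described above, the following sketch splits each CSV row into its independent variables and its dependent variable, then scores a candidate function by mean squared error. The names `load_dataset` and `mean_squared_error`, the sample values, and the assumption that the files have no header row are all made up for this example; they are not taken from the provided data.

```python
import csv
import io


def load_dataset(lines):
    """Split each CSV row into (independent variables, dependent variable).

    Assumes every field is numeric and that there is no header row.
    """
    rows = []
    for record in csv.reader(lines):
        if not record:
            continue  # skip blank lines
        values = [float(v) for v in record]
        rows.append((values[:-1], values[-1]))
    return rows


def mean_squared_error(f, rows):
    """Average squared difference between f(inputs) and the last column."""
    return sum((f(*xs) - y) ** 2 for xs, y in rows) / len(rows)


# Made-up stand-in for a two-column file such as d1.csv (x, y).
sample = io.StringIO("0.0,1.0\n1.0,3.1\n2.0,4.9\n")
rows = load_dataset(sample)


def candidate(x):
    """A hypothetical guess at f."""
    return 2 * x + 1


print(mean_squared_error(candidate, rows))
```

To read a real file, something like `with open("d1.csv", newline="") as fh: rows = load_dataset(fh)` should work under the same assumptions. Mean squared error is only one possible error measure.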

Assuming you get everything working and you don’t have any serious problems, you will automatically get +10 on your assignment.
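
For anyone building a GP from scratch, here is a minimal sketch of tree-based GP for single-variable symbolic regression. It is not ECJ, DEAP, or jGP, and every choice in it is an assumption made for illustration: the function set, protected division, the depth cap, tournament selection, the 90/10 crossover/mutation split, and the made-up target y = x*x + x.

```python
import math
import operator
import random


def protected_div(a, b):
    # Protected division: return 1.0 instead of dividing by (near) zero.
    return a / b if abs(b) > 1e-9 else 1.0


# Function set: name -> callable; every function here takes two arguments.
FUNCTIONS = {"add": operator.add, "sub": operator.sub,
             "mul": operator.mul, "div": protected_div}


def random_tree(depth, rng):
    """Grow a random expression as nested tuples: (name, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else float(rng.randint(-2, 2))
    return (rng.choice(list(FUNCTIONS)),
            random_tree(depth - 1, rng), random_tree(depth - 1, rng))


def evaluate(tree, x):
    if isinstance(tree, tuple):
        return FUNCTIONS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))
    return x if tree == "x" else tree


def mse(tree, rows):
    """Mean squared error over (x, y) rows; non-finite results score inf."""
    try:
        err = sum((evaluate(tree, x) - y) ** 2 for x, y in rows) / len(rows)
    except OverflowError:
        return math.inf
    return err if math.isfinite(err) else math.inf


def depth(tree):
    return 1 + max(depth(c) for c in tree[1:]) if isinstance(tree, tuple) else 0


def paths(tree, prefix=()):
    """Yield the index path of every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))


def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace(tree, path, new):
    """Return a copy of tree with the subtree at path swapped for new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]


def vary(a, b, rng, max_depth=6):
    """Subtree crossover (90%) or subtree mutation (10%), with a depth cap."""
    if rng.random() < 0.9:
        donor = get(b, rng.choice(list(paths(b))))
    else:
        donor = random_tree(2, rng)
    child = replace(a, rng.choice(list(paths(a))), donor)
    return child if depth(child) <= max_depth else a


def tournament(pop, fits, rng, k=3):
    return pop[min(rng.sample(range(len(pop)), k), key=lambda i: fits[i])]


def run_gp(rows, pop_size=200, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [random_tree(3, rng) for _ in range(pop_size)]
    best, best_fit = None, math.inf
    for _ in range(generations):
        fits = [mse(t, rows) for t in pop]
        i = min(range(pop_size), key=lambda j: fits[j])
        if best is None or fits[i] < best_fit:
            best, best_fit = pop[i], fits[i]  # remember the best tree seen
        pop = [vary(tournament(pop, fits, rng), tournament(pop, fits, rng), rng)
               for _ in range(pop_size)]
    return best, best_fit


# Made-up target for demonstration: y = x*x + x on 21 points in [-1, 1].
rows = [(i / 10, (i / 10) ** 2 + i / 10) for i in range(-10, 11)]
best, fit = run_gp(rows, pop_size=100, generations=10)
print(best, fit)
```

The sketch handles a single input variable only. Supporting n-1 inputs would mean adding one terminal per column and passing a row of values to `evaluate`.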

Typed Problem

If you choose to do a typed problem you can gain an additional +4 points. You will have to pick your own problem, so go find your own data to play with. I recommend checking out the UCI machine learning repository [4]. You must demonstrate to me that you have it working and that the problem is sufficiently challenging (and typed) to obtain the additional points.
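
Assuming ‘typed’ here refers to strongly typed GP, where every primitive declares the types of its arguments and of its return value, the sketch below shows the core idea: tree generation only picks primitives and terminals whose type matches what the parent expects. The primitive set, the terminal values, and the function names are hypothetical choices for illustration and are not tied to any particular GP library.

```python
import operator
import random


def if_then_else(cond, a, b):
    return a if cond else b


# Primitive set: name -> (callable, argument types, return type).
PRIMITIVES = {
    "add": (operator.add, (float, float), float),
    "mul": (operator.mul, (float, float), float),
    "lt": (operator.lt, (float, float), bool),
    "if_then_else": (if_then_else, (bool, float, float), float),
}

# Terminals grouped by type; "x" stands for the float input variable.
TERMINALS = {float: ["x", 1.0, 2.0], bool: [True, False]}


def generate(wanted, depth, rng):
    """Grow a random tree whose root returns the wanted type."""
    candidates = [n for n, (_, _, ret) in PRIMITIVES.items() if ret is wanted]
    if depth == 0 or not candidates or rng.random() < 0.2:
        return rng.choice(TERMINALS[wanted])
    name = rng.choice(candidates)
    arg_types = PRIMITIVES[name][1]
    return (name, *[generate(t, depth - 1, rng) for t in arg_types])


def evaluate(tree, x):
    if isinstance(tree, tuple):
        func = PRIMITIVES[tree[0]][0]
        return func(*(evaluate(child, x) for child in tree[1:]))
    return x if tree == "x" else tree


def type_of(tree):
    if isinstance(tree, tuple):
        return PRIMITIVES[tree[0]][2]
    return bool if isinstance(tree, bool) else float


def well_typed(tree):
    """Check that every child returns the type its parent expects."""
    if not isinstance(tree, tuple):
        return True
    arg_types = PRIMITIVES[tree[0]][1]
    children = tree[1:]
    return len(children) == len(arg_types) and all(
        type_of(c) is t and well_typed(c) for c, t in zip(children, arg_types))


rng = random.Random(0)
tree = generate(float, 3, rng)
print(tree, evaluate(tree, 0.5))
```

Crossover and mutation in a typed system need the same restriction: a subtree may only be swapped for another subtree of the same return type.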

Modifications

You can earn +1 for each modification you implement (max +2); be sure to make them obvious to me. For this assignment you may not use elitism as a modification. I do not care if you go out and find modifications or if you invent your own, just be sure to convince me that you deserve the extra points. If your improvement is not obvious to me, or if I deem it not significant enough, you will not get the marks. If you choose to do the writeup, explain these modifications within the writeup. If you do not do the writeup, at least include a text file with a description of your modifications so I know what they are.

Writeup

You can obtain a maximum of +9 for the writeup, with +5 coming from the base report and the remaining +4 from the +1 items described below (LaTeX, references/citations, figures and tables, and statistics). WARNING: The writeup is not trivial to do well and will take some time to write. The writeup will be marked qualitatively by a marker. There is no precise best way to structure a writeup and it is difficult to know exactly what should be included. A portion of these marks will be dedicated to prose, understandability, continuity, spelling, grammar, content, and effectiveness. You can find an example of an article I wrote for publication this year to get a sense of what is good, but there is no one right way to do it and I would not recommend making your report look like mine (I’m simply giving it to you as an example, and you can find a lot more online). Below is a list of ideas on what to include:

• Introduction
  o What are you doing?
  o Small literature review?
    ▪ What has worked well in the past on this problem?
• Explain the problem/data
  o What is GP?
  o What is symbolic regression?
    ▪ How is it different from basic linear regression?
  o What is typed GP?
• Explain your algorithm
  o How did you implement your GP exactly? Enhancements?
• Explain your analysis methodology
  o What will you compare it to?
  o How will you compare?
    ▪ Means? Distributions? P-values? Interquartile ranges? Other statistics?
• Explain the results and discuss them
  o What happened? How did they compare to random? Other algorithms? Comparison to known best? Summary statistics?
  o You’ll want visualizations here if you choose to do them.
    ▪ Plot given data along with your models’ predictions.
• Conclusions and possible future directions
  o How good was it?
• Bibliography
  o References, if you use them.

[4] https://archive.ics.uci.edu/ml/index.php

Again, do note that the marks for this portion will be more qualitative and it will be difficult to know what’s good beforehand. The content is up to you to decide, and your decision making on what to include is part of the assignment and course learning objectives. There is no required length for the report, but please do NOT make it longer than 8 pages in double-column format.

Note that falsifying results is an academic offense and it will be investigated fully.

LaTeX

You can obtain +1 if you complete your report in LaTeX. I will not teach you how to use it; however, there are many tutorials available online.

I recommend using Overleaf (https://www.overleaf.com?r=aaca39d4&rm=d&rs=b) [5]. It’s an online editor that takes care of a lot of the annoying setup legwork you’d otherwise have to do. If you do want to have a local copy on your computer, I recommend using MiKTeX and TeXstudio.

If going for the LaTeX marks, you must use the IEEE or ACM conference templates (IEEE is probably easier).

If you are doing references, it is recommended that you use BibTeX.

[5] Please use this referral link.

References/Citations

You can obtain +1 if you include a sufficient literature review and have proper references/citations. Don’t worry too much about your formatting. There is no correct number of references to include; just do what makes sense in your situation. It is up to the marker to determine whether you are awarded the +1.

If using LaTeX, BibTeX will make your life easier. Google Scholar also REALLY makes your life easy. If you search for an article/book on Scholar, select the blue quote icon ( ” ), and then at the bottom of the popup ‘Cite’ window you will see ‘BibTeX’. Click this and copy the entry into your BibTeX (.bib) file. Perfect citations and references every time (assuming Google has them right)!

Figures and Tables

Include effective use of figures, tables, etc. in your writeup for +1 mark. Examples include an algorithm flow diagram, a table of parameter settings, tables of results, learning curves, distributions, etc. The marker determines if you receive the mark or not.

Statistics

Include proper statistics and comparisons for +1 mark. Given the stochastic nature of the algorithm, people typically do at least 30 runs of any experiment (run 30x with the same settings on the same problem instance) so that the results can be compared statistically. If you want to compare two parameter settings, compare the distributions of errors from the 30 runs under each setting. What statistics should you use? What should you actually be comparing? That’s up to you to decide. The marker determines if you receive the mark or not.
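
As one possible way to compare two sets of 30 runs, the sketch below reports the median and interquartile range of each error distribution and estimates a p-value with a permutation test on the difference of medians. This is only one option among many (a rank-based test such as Mann-Whitney U is another common choice), and the error values here are randomly generated placeholders, not real results.

```python
import random
import statistics


def summarize(errors):
    """Median and interquartile range of one batch of final errors."""
    q1, median, q3 = statistics.quantiles(errors, n=4)
    return {"median": median, "iqr": q3 - q1}


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the absolute difference of medians."""
    rng = random.Random(seed)
    observed = abs(statistics.median(a) - statistics.median(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign runs to the two groups
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.median(left) - statistics.median(right)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Placeholder data: 30 made-up final errors for each of two settings.
data_rng = random.Random(1)
errors_a = [abs(data_rng.gauss(0.10, 0.02)) for _ in range(30)]
errors_b = [abs(data_rng.gauss(0.20, 0.02)) for _ in range(30)]

print(summarize(errors_a), summarize(errors_b))
print(permutation_test(errors_a, errors_b))
```

In practice `errors_a` and `errors_b` would hold the final error of each of your 30 runs under the two settings being compared.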

Submission Details

- Submit via Moodle by 11:55pm on the due date
- Include all your code and any special running instructions, if necessary
- Include your writeup
- Include anything else you think the marker needs.