Symbolic regression using genetic programming
Assignment 2 Due Monday Nov 4th 11:55pm
Your base assignment is to implement symbolic regression. Refer to the rough mark breakdown below.
Genetic Programming
The first portion of the assignment is to implement the GP for symbolic regression. I strongly encourage
you to use an existing GP system for this assignment (although you are welcome to write your own
from scratch). Two popular ones are ECJ [1] (Java) and DEAP [2] (Python). Feel free to use any
system you like, but please check with me first if you want to use one implemented in a language other
than Java, Python, C, C++, or C#. If you want, you can use mine [3], but note that it only works on symbolic
regression and is not well documented (although it is very good at symbolic regression).
[1] https://cs.gmu.edu/~eclab/projects/ecj/
[2] https://deap.readthedocs.io/en/master/
[3] https://github.com/jameshughes89/jGP
Any academic misconduct will be investigated fully and I will push for the maximum allowable penalty.
I have provided you with a collection of data in CSV format. For the most part, I generated this data
by randomly sampling data points, passing them through a function, and then adding a little
noise to the output. See if you can reverse engineer the functions I used to create the data. All the data
is formatted such that the first n-1 columns are the independent variables and the nth (last) column is
the dependent variable. For example, in ‘d1.csv’ there are two columns. The first column we will call x
and the second we will call y. We need to find some function of x that will predict y. So, y ≈ f(x). If we
had 3 columns, we would want z ≈ f(x,y). Note that I write "approximately equal to" because you may not
find the exact functions, but you will likely still get a close approximation. Ultimately, your goal is to use
symbolic regression to try to find f.
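As a rough illustration of the column convention above, here is a minimal Python sketch that splits each CSV row into independent variables and a dependent variable, then scores a candidate function by mean squared error. The function names, the header-skipping behaviour, and the sample rows are assumptions made for this example (the sample is made-up data, not the contents of any provided file), and mean squared error is only one possible fitness measure.

```python
import csv
import io


def load_dataset(file_obj):
    """Split each CSV row into (inputs, target); the last column is the target."""
    inputs, targets = [], []
    for row in csv.reader(file_obj):
        if not row:
            continue  # skip blank lines
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            continue  # assumption: a non-numeric row is a header and is skipped
        inputs.append(values[:-1])
        targets.append(values[-1])
    return inputs, targets


def mean_squared_error(candidate, inputs, targets):
    """Average squared gap between candidate(*row) and the recorded target."""
    total = 0.0
    for row, target in zip(inputs, targets):
        total += (candidate(*row) - target) ** 2
    return total / len(targets)


# Made-up stand-in for a two-column file (NOT the real assignment data).
sample = io.StringIO("0.0,1.0\n1.0,3.0\n2.0,5.0\n")
X, y = load_dataset(sample)


def guess(x):
    """A hand-written candidate; a GP system would evolve such functions."""
    return 2.0 * x + 1.0


error = mean_squared_error(guess, X, y)
print(error)
```

With a real file, the same loader could be called as `load_dataset(open("d1.csv", newline=""))`. A GP system would then search for the candidate with the lowest error instead of relying on a hand-written guess.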
Assuming you get everything working and you don’t have any serious problems, you will automatically
get +10 on your assignment.
Typed Problem
If you choose to do a typed problem you can gain an additional +4 points. You will have to pick your own
problem, so go find your own data to play with. I recommend checking out the UCI Machine Learning
Repository [4]. You must demonstrate to me that you have it working and that the problem is sufficiently
challenging (and typed) to obtain the additional points.
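To give a sense of what "typed" means here, the following is a minimal Python sketch in which every primitive declares its argument types and return type, and random trees are only built from children whose types match. The primitive set, the tuple-based tree layout, and all function names are hypothetical choices for this example; this is not how ECJ, DEAP, or any particular GP system implements typing.

```python
import random

# Hypothetical primitive set: name -> (argument types, return type, function).
PRIMITIVES = {
    "add": ((float, float), float, lambda a, b: a + b),
    "mul": ((float, float), float, lambda a, b: a * b),
    "less": ((float, float), bool, lambda a, b: a < b),
    "if_then_else": ((bool, float, float), float, lambda c, a, b: a if c else b),
}
# Terminals grouped by the type they produce; "x" stands for the input variable.
TERMINALS = {float: ["x", 1.0, 2.0], bool: [True, False]}


def random_tree(return_type, depth, rng):
    """Build a random tree (at most `depth` deep) whose root yields return_type."""
    candidates = [name for name, (_, ret, _) in PRIMITIVES.items()
                  if ret is return_type]
    if depth == 0 or not candidates:
        return rng.choice(TERMINALS[return_type])
    name = rng.choice(candidates)
    # Each child is generated to match the declared type of its argument slot.
    children = [random_tree(arg_type, depth - 1, rng)
                for arg_type in PRIMITIVES[name][0]]
    return (name, *children)


def output_type(tree):
    """Return the type a tree produces; raise TypeError if it is ill-typed."""
    if isinstance(tree, tuple):
        name, *children = tree
        arg_types, ret, _ = PRIMITIVES[name]
        actual = tuple(output_type(child) for child in children)
        if actual != arg_types:
            raise TypeError(f"{name} expects {arg_types}, got {actual}")
        return ret
    return float if tree == "x" else type(tree)


def evaluate(tree, x):
    """Evaluate a tree for a given value of the input variable x."""
    if isinstance(tree, tuple):
        name, *children = tree
        func = PRIMITIVES[name][2]
        return func(*(evaluate(child, x) for child in children))
    return x if tree == "x" else tree


rng = random.Random(0)
tree = random_tree(float, 3, rng)
print(tree, output_type(tree), evaluate(tree, 1.5))
```

In an untyped setup, `("add", True, 1.0)` would be a legal tree. Here `output_type` rejects it, and `random_tree` never generates it in the first place. Crossover and mutation would need the same kind of type check, which this sketch does not cover.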
Modifications
If you implement modifications for +1 each (max +2), be sure to make them obvious to me. For this
assignment you may not use elitism as a modification. I do not care if you go out and find modifications
or if you invent your own; just be sure to convince me that you deserve the extra points. If your
improvement is not obvious to me, or if I deem it not significant enough, you will not get the marks. If
you choose to do the writeup, explain these modifications within the writeup. If you do not do the
writeup, at least include a text file with a description of your modifications so I know what they are.
Writeup
You can obtain a maximum of +9 for doing a writeup, with +5 coming from the base report. WARNING:
The writeup is not trivial to do well and will take some time to write. It will be marked more
qualitatively by a marker. There is no single best way to structure a writeup, and it is difficult to know
exactly what should be included. A portion of these marks will be dedicated to prose, understandability,
continuity, spelling, grammar, content, and effectiveness. You can find an example of an article I wrote
for publication this year to get a sense of what is good, but there is no one right way to do it and I would
not recommend making your report look like mine (I’m simply giving it to you as an example, and you
can find a lot more online). Below is a list of ideas on what to include:
• Introduction
o What are you doing?
o Small literature review?
▪ What has worked well in the past on this problem?
• Explain the problem/data
o What is GP?
o What is symbolic regression?
▪ How is it different from basic linear regression?
o What is typed GP?
[4] https://archive.ics.uci.edu/ml/index.php
• Explain your algorithm
o How did you implement your GP exactly? Enhancements?
• Explain your analysis methodology
o What will you compare it to?
o How will you compare?
▪ Means? Distributions? P-values? Interquartile ranges? Other statistics?
• Explain the results and discuss them
o What happened? How did they compare to random? Other algorithms? Comparison to
known best? Summary statistics?
o You’ll want visualizations here if you choose to do them.
▪ Plot given data along with your models’ predictions.
• Conclusions and possible future directions
o How good was it?
• Bibliography
o References, if you use them.
Again, do note that the marks for this portion will be more qualitative and it will be difficult to know
what’s good beforehand. The content is up to you to decide, and your decision making on what to
include is part of the assignment and course learning objectives. There is no required length for the
report, but please do NOT make it longer than 8 pages in double-column format.
Note that falsifying results is an academic offense and it will be investigated fully.
LaTeX
You can obtain +1 if you complete your report in LaTeX. I will not teach you how to use it; however, there
are many tutorials available online.
I recommend using Overleaf (https://www.overleaf.com?r=aaca39d4&rm=d&rs=b) [5]. It’s an online editor
that takes care of a lot of the annoying setup legwork you’d have to do. If you do want to have a local copy
on your computer, I recommend using MiKTeX and TeXstudio.
If going for the LaTeX marks, you must use the IEEE or ACM conference templates (IEEE is probably
easier).
If you are doing references, it is recommended that you use BibTeX.
References/Citations
You can obtain +1 if you include a sufficient literature review and have proper references/citations.
Don’t worry too much about your formatting. There is no correct number of references to include, just
do what makes sense in your situation. It is up to the marker to determine if you will be awarded the +1.
[5] Please use this referral link.
If using LaTeX, BibTeX will make your life easier. Google Scholar also REALLY makes your life easy. If you
search for an article/book on Scholar, select the blue quote image ( ” ), and then at the bottom of the
popup ‘Cite’ window you will see ‘BibTeX’. Click this and copy the entry into your BibTeX (.bib) file. Perfect
citations and references every time (assuming Google has them right)!
Figures and Tables
Include effective use of figures, tables, etc. in your writeup for +1 mark. Examples include an algorithm
flow diagram, a table of parameter settings, tables of results, learning curves, distributions, etc. The
marker determines if you receive the mark or not.
Statistics
Include proper statistics and comparisons for +1 mark. Given the stochastic nature of the algorithm,
people typically do at least 30 runs of any experiment (run 30x with the same settings on the same
problem instance) to have enough samples for a meaningful statistical comparison. If you wanted to
compare parameter settings, you would compare the distributions of errors from the 30 runs under each
of the two sets of settings. What statistics should you use?
What should you actually be comparing? That’s up to you to decide. The marker determines if you
receive the mark or not.
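As one possible way to compare two sets of 30 runs, here is a minimal Python sketch that summarizes each error distribution and estimates a p-value with a two-sided permutation test on the difference in means. The choice of test, the function names, and the trial count are assumptions made for this example, and the error values are synthetic placeholders, not real GP results. Other statistics may suit your experiment better.

```python
import random
import statistics


def summarize(errors):
    """Return the median, interquartile range, and mean of a list of errors."""
    q1, _, q3 = statistics.quantiles(errors, n=4)
    return {
        "median": statistics.median(errors),
        "iqr": q3 - q1,
        "mean": statistics.fmean(errors),
    }


def permutation_p_value(a, b, trials=10_000, seed=0):
    """Two-sided permutation test on the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        # Reassign the pooled errors to the two groups at random.
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.fmean(left) - statistics.fmean(right)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)


# Synthetic placeholder errors standing in for 30 runs per parameter setting.
rng = random.Random(1)
errors_a = [rng.gauss(1.0, 0.2) for _ in range(30)]
errors_b = [rng.gauss(0.8, 0.2) for _ in range(30)]

print(summarize(errors_a))
print(summarize(errors_b))
print(permutation_p_value(errors_a, errors_b))
```

In practice, `errors_a` and `errors_b` would hold the final error of each of your 30 runs under each setting. A permutation test is used here because it makes no assumption about the shape of the error distribution.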
Submission Details
- Submit via Moodle by 11:55pm on the due date
- Include all your code and special running instructions if necessary
- Include your writeup
- Include anything else you think the marker needs.