Programming Java
CIS 573 Software Engineering Fall 2017 Homework #1 Introduction In this assignment, you are asked to implement and test a Java application that attempts to decrypt a document that has been encrypted using a substitution cipher. The algorithm that you will use is described below, and then your program will estimate the accuracy of this approach using a technique called cross-validation. This should not be particularly challenging for you: by now you should feel comfortable writing reasonably large programs in Java, specifically designing classes, doing file I/O, using data structures, manipulating Strings, implementing basic algorithms, understanding whether your code works correctly, etc. The purpose of this assignment is to get you thinking about some of the concepts we will cover in this course, particularly as they relate to “software quality.” Anyone with even a little bit of Java experience should be able to write this program: you, however, are being asked to write a “good” program, and to justify why you think it’s good. Although you are free to discuss the intent and purpose of this assignment with your classmates, the work you submit must be your own: you should not be sharing or discussing code with anyone else, nor should you be sharing answers to the writeup questions. Likewise, you may not submit code you found online, in textbooks, etc. See “Academic Honesty and Collaboration” below for more explanation. If you are uncertain of the policies, please see the course syllabus in Canvas or ask a member of the instruction staff. This assignment is due in Canvas on Friday, Sept 15, at 5:00pm; see submission instructions below. Late submissions will be subject to a penalty of 10% per day. Background As you know, a document can be encrypted in order to hide its contents from unauthorized viewers. A simple encryption algorithm is substitution cipher, in which each letter of the “plaintext” document is replaced by another letter to produce the “ciphertext,” and all instances of that (plaintext) letter are always transformed to the same (ciphertext) letter. See http://en.wikipedia.org/wiki/Substitution_cipher The ciphertext can then be decrypted by doing the reverse transformation, thus resulting in the original plaintext.
If the mapping/transformation of plaintext characters to ciphertext characters is not known, frequency analysis can be used to attempt to guess the mapping and decrypt the document. The intuition is that letters that appear frequently in the ciphertext are most likely mapped to letters that appear frequently in the language, whereas letters that appear frequently in the ciphertext are unlikely to be mapped to letters that don’t appear frequently in the language. See: http://en.wikipedia.org/wiki/Frequency_analysis To determine the frequency with which letters appear in the (plaintext) language, a corpus of documents can be analyzed and a model can be created, which ranks the letters according to their frequency. The model can then be applied during decryption in order to guess the mapping. For instance, in general English the most frequently appearing letters are E, T, A, O, I, and N, in that order. Thus, using this approach, whichever letter appears most frequently in the ciphertext can be mapped to E, the next most frequently appearing letter can be mapped to T, and so on. There are, of course, more complex approaches – for instance using bigram and trigram analysis, looking for most common words, etc. – but this is a simple approach that works reasonably well. How well does it really work, though? A simple way of validating this approach would be to encrypt a document, use a corpus of other documents to create a model, apply frequency analysis to decrypt the ciphertext, and compare the decrypted ciphertext to the original plaintext. Then we could use some metric (such as the percentage of letters correctly decrypted) to measure success. To prevent bias from measuring the correctness for just one document, we could use cross-validation to measure each document in the corpus. However, to prevent overfitting, the document that we’re measuring should be removed from the corpus when building the model. See: http://en.wikipedia.org/wiki/Cross-validation_(statistics) That is, for each document T in the corpus:
1. Encrypt T to create ciphertext C 2. Remove T from the corpus and use the remaining plaintext documents to build a
frequency model 3. Apply the model to the ciphertext C to get the decrypted document D 4. Compare the original plaintext document T to the decrypted document D
The overall metric is then the percentage of letters correctly decrypted across all documents in the corpus. Specification This section describes the specification that your program must follow. Some parts are intentionally under-specified; you are free to interpret those parts any way you like,
within reason, but you should ask a member of the instruction staff if you have any questions. Input File Specification The documents in the corpus can be assumed to be plain-text files consisting of lowercase letters, uppercase letters, digits, punctuation, whitespace, etc. You may assume that the characters in the files are all ASCII encoded but you may not assume that the documents are necessarily in English, i.e. you cannot assume an a priori frequency distribution of the characters. You can, of course, create your own corpus as you develop your program, but the correctness of your program will be evaluated using the documents in the corpus we provide you. This corpus is available in the Files section of Canvas under “Homework Files” and then “Homework #1.” The corpus consists of Shakespeare’s comedies. Functional Specification The program must be implemented in Java and must adhere to the following specification: 1. The class containing your program’s “main” method must be called “Main” and must be in the “edu.upenn.cis573.hwk1” package. This convention is used to make the TAs’ lives easier! 2. The runtime argument to the program (i.e. argument to the “main” method) should be the name of the directory containing the documents; the user should be able to specify either an absolute path or a relative path, and you can assume that you do not need to traverse into subdirectories to find files. Do not prompt the user for this information! This should be specified when the program is run, e.g. using a Run Configuration in your Eclipse project. The program should display an appropriate error message and immediately terminate upon any of the following conditions:
• the number of runtime arguments is not correct • the specified directory does not exist • the specified directory cannot be opened be opened for reading (e.g. because of
file permissions) • the specified directory is empty; hint: take a look at the java.io.File class if you
don’t know how to determine this 3. Once the program starts, it should use cross-validation on the files in the specified directory (as described above), to estimate the accuracy of the letter frequency analysis approach to decrypting text files. Although you are free to organize your code however you see fit (though you will need to
justify your design after implementing the program), keep in mind that your program needs to be able to:
• encrypt a plaintext document using a substitution cipher of your choosing • build a frequency model from a corpus of documents (minus the document you’ve
encrypted) – note that you may not use the published distribution of letters in English; you must build the model from the corpus of text
• apply the frequency model to the ciphertext to attempt to decrypt it • compare the decrypted ciphertext to the original plaintext • repeat these steps for all documents in the corpus
If there is a tie in your frequency model, i.e. two or more letters occur the same number of times, you may break the ties when performing decryption however you like. For the encryption, you may not use an “identity cipher” that simply transforms a letter to itself. You may want to use something simple like a Caesar cipher (https://en.wikipedia.org/wiki/Caesar_cipher) or create something more complicated. Also, although the documents in the corpus may consist of uppercase and lowercase letters, encryption should be case-insensitive, so that ‘A’ and ‘a’ both get transformed to the same letter, e.g. ‘k’. This means that your ciphertext should not have any uppercase letters. For simplicity, you only need to encrypt characters in the ranges ‘a’-‘z’ and ‘A’-‘Z’, i.e. the 26 lowercase and 26 uppercase letters used in English. That is, you do not need to encrypt punctuation, whitespace, digits, other characters that are not letters used in English, etc. and you should leave those “as-is” in the ciphertext. However, you should not include those characters when measuring the effectiveness of the approach, since they were never encrypted in the first place! 4. As your program performs cross-validation, it should print to the console (using System.out.println) the number of correctly and incorrectly decrypted letters for each document, followed by the total number and then finally the overall accuracy to two decimal places. The output must be formatted like this: apple.txt: 5372 correct, 2644 incorrect banana.txt: 73021 correct, 44292 incorrect carrot.txt: 8201 correct, 3322 incorrect Total: 86594 correct, 50258 incorrect Accuracy: 63.27% The program output will be graded by an automatic grading program so it must adhere to this convention. The order in which you print the files is not important, but the filenames must not include the directory, and you must have the right information, in this order,
with the appropriate punctuation. Likewise, your program must not print out any debugging info, intermediate calculations, etc., otherwise the autograder will be confused and that would be bad. 5. If any unexpected error occurs during the execution of the program, an appropriate error message should be shown and the program should terminate gracefully (no stack traces!). But what about…? You may have found a number of places in which this specification is ambiguous or under-specified. That is somewhat intentional as it simplifies things like error handling and corner cases. If there is something that appears to be unspecified or ambiguous, you may handle those situations any way you like, within reason. As long as your program adheres to what’s written above, you will be fine. We’re not trying to trick you, we promise! But if you’re unsure, please ask a member of the instruction staff for help. One last thought: Please keep in mind that whether or not this is a “good” approach to solving the problem of decrypting text is not the point of the assignment. So although you may not agree with this approach or you may think of a better one, please adhere to this specification. Thank you. J Testing In addition to implementing the program described above, you must also implement automated tests to show that you have attempted to determine whether your code works correctly. We recommend (but do not require) that you use a testing framework such as JUnit. If you have never used JUnit before, this may be a good time to learn (https://www.tutorialspoint.com/junit/ is a good place to start), but we will discuss JUnit in more detail later in the course. If you choose to write tests outside of JUnit, e.g. by having your own classes with “main” methods, that is fine, too, as long as they are somehow automating the process of checking whether the results of the code are correct. You may choose to test individual methods as you would in unit testing, or test the whole program all at once (i.e., perform “system testing”), as long as the process is automated. Your test code should be in the “edu.upenn.cis573.hwk1” package or a sub-package below that.
There are no other specifications for the tests themselves per se: you only need to demonstrate that you have attempted to automatically test your program, and in the write- up below you will have to justify why you think you have “enough” tests. But in theory, the more testing you do, the more likely it is that your code is correct, and thus the higher your score will be on this assignment. In theory. J Before you begin I know, I know, you want to start coding! But before you begin, think about the design of your program, as you will need to justify it after you’re done implementing it. In particular, think about all of the things that could potentially change in this program: • the format of the input files (e.g. JSON instead of plain-text) • the source of the input data (e.g. a database or web service instead of local text
files) • the appearance of the output (e.g. a Java Swing GUI instead of a command-line UI) • the encryption algorithm that is used (e.g. something more complex than a
substitution cipher) • the decryption algorithm that is used (e.g. bigram frequency instead of letter
frequency)
Then try to design your program in such a way that it would be easy for someone else to change the code if any of these things changed. You may be thinking to yourself “yeah, but I know that no one else will ever read this code and that it will never need to change!” And, well, that's probably true. But maybe it’s not? Maybe you’ll have to revise this code in a future assignment? Heh heh heh... And even if you don’t, in this class we’re asking you to at least consider “what if?” you do need to reuse or change the code in the future (because this is certainly a concern in “the real world”), and to design/write it in such a way that makes that easier. Analysis Implementing the program is the easy part of this assignment! What we’re more interested in is getting you to think about your own notions of what is “good” software. In order for you to get full credit on this assignment, you must create a PDF document in which you answer all of the following questions. There are no “right” answers (yet), but you should answer these honestly and show that you have thought about what makes your code “good.”
1. Describe four specific and distinct ways in which the design of your classes is “good.” In particular, why would it be easy to modify your code if any of the above things (format of data, format of input file, etc.) were to change? 2. Describe four specific and distinct ways (not related to design) in which the style of your code is “good.” In particular, why is it easy to read or understand your code? 3. Describe three specific and distinct ways in which your code is efficient in terms of execution time and/or memory usage. Note that you may not merely describe ways in which it is “not inefficient,” like “I could have made it worse.” Rather, you should describe decisions you actively made in order to make the code efficient. 4. Briefly (3-4 sentences) describe the testing that you performed for this assignment, e.g. what framework/tools you used, what parts of your code you tested, how you determined the inputs to use and the expected/correct output, etc. Additionally, answer these two questions: (1) What made you think that you had done “enough” testing, i.e. why did you stop testing? (2) On a scale of 0-100%, how certain are you that your code is correct? Explain/justify your answer. Yes, this is a trick. J Please be sure to include your name in the PDF as well! For #1-3, please note that each “way” you state needs to be distinct; it is not sufficient to simply provide separate examples of the same thing. For instance, for #2 you may say that you used good variable names; that only counts as one “way” and you cannot use more than one example toward the four overall “ways.” Help! I’m stuck! If you are having trouble with the algorithm that you’re supposed to use (i.e., the steps that you need to go through in order to solve the problem), then we suggest doing the following (in this order) to try to figure it out:
1. Walk away for 30 minutes or so. Sometimes it’s best to let your brain rest for a bit, and then come back to the problem somewhat refreshed.
2. If you’re still stuck, then try to go see a member of the instruction staff during office hours, which are posted in Piazza.
3. If you can’t make it to office hours, look on Piazza to see if anyone else has asked a similar question (and gotten a response). If not, post a question there. If you’re afraid that asking your question would reveal the solution, then make it a “private question” that only the instruction staff can see.
Note that “look on Internet” is not on this list!!! And neither is “ask a friend”!!! Although the purpose of this assignment is not primarily to measure your ability to think algorithmically (after all, we’re giving you the algorithm!), you should be coming up with the solution on your own to whatever extent possible.
If you are having trouble with the implementation (e.g., a bug you can't fix, some code that won’t compile, etc.), then we suggest doing the following (in this order) to try to figure it out:
1. Look in your notes from your previous Java courses. Pretty much all the Java you need to know should have been covered in a previous course.
2. If that’s not helping, check out the online Java documentation at http://docs.oracle.com/javase/tutorial/. This is not the same as “go on stackoverflow”!! You should be striving for a mastery of Java, not just a solution to this one particular problem.
3. If you’re still stuck, then try to go see a member of the instruction staff during office hours.
4. If you can’t make it to office hours, look on Piazza and post your question if you don’t see a similar one already there, though see the note below about what’s appropriate and what’s not.
The moral of the story: try to figure it out by yourself first, then use reference material, then ask a member of the instruction staff or use Piazza. But do not go into “the wild” for help! If you are really struggling with this assignment (e.g. it’s taking more than 10-12 hours), particularly if you do not have much Java experience, please speak with the instructor. Which brings us to…. Getting help on Piazza Although we encourage you to use Piazza to get help from the instruction staff, keep in mind that this is a graduate-level course and that proficiency in Java is a prerequisite. As such, members of the instruction staff will not answer Piazza questions along the lines of “here’s my code, can you tell me what’s wrong with it?” Unless requested by a member of the instruction staff, you should never post large pieces of code (either as an attachment or inline) in Piazza and ask us to debug it for you. Rather, if you need help debugging your code, you should describe the steps you have taken so far to debug it, along with your speculation as to what’s wrong with it, and only include code snippets as part of your description. If you truly have no idea what’s wrong with your code and cannot do the above, then you may want to consider rewriting it (if even you don’t understand it, how can you expect someone else to?) or go discuss it with a member of the instruction staff during office hours.
One last thing about Piazza... Although you are encouraged to post questions on Piazza if you need help or if you need clarification, please be careful about accidentally revealing solutions. For instance, please do not post questions along the lines of “My program says that the accuracy is 51.4% for tamingoftheshrew.txt; is that right?” It’s important that all students determine for themselves whether their program is working correctly. If you think your question might accidentally reveal too much, please post it as a private question and we will redistribute it if appropriate to do so. Which brings us to…. Academic Honesty and Collaboration Put simply, you must work on this assignment alone. The point of the assignment is for you to think about your own notions of “good” software, and you can’t do that if someone else is writing the code for you! You may discuss the assignment specification with other students, as well as your general implementation strategy and observations, but you absolutely must not discuss or share code, or the program output, or your answers to the writeup questions. As described in the course syllabus in Canvas, this includes but is not limited to:
• co-authoring code, either through pair programming or distributing the work • sharing and distributing code, i.e., one student writes it and allows other students
to see it (even if the other students further modify it) • sketching out code together on paper, a whiteboard, etc., even if you type it up
separately • reviewing another student's code “just to see how they did it” • helping another student debug/troubleshoot his or her code
Although you can, of course, look online for help with the Java API and things like that, you should not be receiving any help from outside sources, including students not taking this course, and you definitely should not be copy/pasting code from the Internet. You are expected to implement all of this on your own and then contemplate your own notions of “good” software. Also, on this assignment you may not use third-party libraries other than the standard Java API. That is, although you are of course free to use classes in the java.lang, java.util, java.io, etc. packages, you may not use libraries from other packages, e.g. org.apache.commons, without permission from the instructor. Suspected violation of these policies will be referred to the Office of Student Conduct. Students who are found to have violated the policies will receive a score of 0 on this assignment, in addition to other sanctions as deemed appropriate by the OSC.
If you run into problems, please ask a member of the teaching staff for help before trying to find help online or asking your friends! I’d rather give you an extension on the assignment than have to send you to OSC! Submission All deliverables are due by Friday, September 15, at 5:00pm. Late submissions will be penalized by 10% if 0-24 hours late, 20% for 24-48 hours late, and so on, up to one week, after which they will no longer be accepted. To submit your assignment, put all of your code (including tests) and your writeup into a single zip file. Please be sure that your code is in a folder called “src”, and that you follow the instructions regarding the package and class names as described above. Also, in your PDF, please include a section describing any known bugs, e.g. features that are not completely implemented, potentially incorrect outputs, or corner cases that were not addressed. The penalty for these bugs will be slightly reduced if you admit them in advance. Submit the zip file into the “Homework #1” assignment in Canvas. You may submit as many times as you’d like; only the last version will be graded. Grading This assignment is worth a total of 100 points (which is the standard point total for most assignments in this course). 60% of the score is related to adhering to the specifications listed above. As mentioned, although the assignment is somewhat under-specified, we are not trying to trick you with weird corner cases or error handling. If your program matches what’s written above, you should be fine, regardless of how you handle the unspecified parts. The remaining 40% of the score is related to your answers to the analysis questions. Although there are no “right” answers at this point (after all, we haven’t taught you any of this yet!), the grade will mostly be determined based on the amount of insight and level of detail and specificity that you provide. The more you write, the better! Updated: 26 August 2017, 3:30pm