Creating a shell script
CSCI 132: Practical Unix and Programming Adjunct: Trami Dang Assignment 4 Fall 2018
Assignment 4 [1]
This set of exercises will strengthen your ability to write relatively simple shell scripts using various filters. As always, your goals should be clarity, efficiency, and simplicity. The assignment has two parts.
1. The background context that was provided in the previous assignment is repeated here for your convenience. A DNA string is a sequence of the letters a, c, g, and t in any order, whose length is a multiple of three [2]. For example, aacgtttgtaaccagaactgt is a DNA string of length 21. Each sequence of three consecutive letters is called a codon. For example, in the preceding string, the codons are aac, gtt, tgt, aac, cag, aac, and tgt.
Your task is to write a script named codonhistogram that expects a file name on the command line. This file is supposed to be a DNA text file, which means that it contains only a DNA string with no newline characters or white space characters of any kind; it is a sequence of the letters a, c, g, and t of length 3n for some n. The script must count the number of occurrences of every codon in the file, assuming the first codon starts at position 1 [3], and it must output the number of times each codon occurs in the file, sorted in order of decreasing frequency. For example, if dnafile is a file containing the DNA string aacgtttgtaaccagaactgt, then the command
codonhistogram dnafile
should produce the following output:
3 aac
2 tgt
1 cag
1 gtt
because there are 3 aac codons, 2 tgt, 1 cag, and 1 gtt. Notice that the frequency comes first, then the codon name.
[1] This is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
[2] This is really just a simplification to make the assignment easier. In reality, it is not necessarily a multiple of 3.
[3] Those of you who know a little about genomics know that the open reading frame can be shifted to get a different set of codons. I want any of you who know this much to assume that there is only one open reading frame: the one starting at position 1.
Important: If two or more codons have the same frequency, your script should break the tie using alphabetical order of the codons. In this example, cag and gtt each occur just once, but because c precedes g, cag comes before gtt above.
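One possible way to combine standard filters for the counting and tie-breaking rules above is sketched below. It is an illustration under assumptions, not a required solution: the function name count_codons is hypothetical, the input is assumed to have already passed the error checks, and fold, sort, uniq, and awk are only one workable choice of filters.

```shell
#!/bin/bash
# Sketch: count the codons in a DNA string read from standard input.
# count_codons is a hypothetical helper name, not part of the assignment.
count_codons() {
    tr -d '\n' |                 # drop a trailing newline, if present
    fold -w 3 |                  # split the string into 3-letter codons, one per line
    sort |                       # bring identical codons together
    uniq -c |                    # prefix each distinct codon with its count
    awk '{ print $1, $2 }' |     # normalize spacing to "count codon"
    sort -k1,1nr -k2,2           # count descending, then codon alphabetically
}

# Example run on the DNA string used in the text.
printf 'aacgtttgtaaccagaactgt' | count_codons
```

The second sort key is what breaks ties: lines with equal counts are ordered by the codon field, so cag is listed before gtt.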
Error checking: The script should check that it has at least one argument. If it is missing an argument, it should exit with the usage message
codonhistogram <dnafile>
If it has an argument, it must check that the argument is the name of an ordinary file that it can read. If it is not, it must exit with the message
codonhistogram: cannot open file <filename argument> for reading
It must check that the file has a number of characters that is a multiple of 3 and that it has only the characters a, c, g, and t and no others. If the file ends with a newline character, then the total length, counting that newline, must be a multiple of 3 plus 1. For any file not satisfying these constraints, it must exit with an error message.
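The checks described above could be organized along the following lines. This is a sketch under assumptions: check_dnafile is a hypothetical function name, the wording of the last two error messages is invented because the assignment does not specify it, and the function returns 1 where a finished script would exit.

```shell
#!/bin/bash
# Sketch of the argument and file checks for codonhistogram.
# check_dnafile is a hypothetical helper name. The text of the last two
# messages is an assumption; the assignment only asks for "an error message".
check_dnafile() {
    if [ "$#" -lt 1 ]; then
        echo "codonhistogram <dnafile>" >&2
        return 1
    fi
    local file=$1
    if [ ! -f "$file" ] || [ ! -r "$file" ]; then
        echo "codonhistogram: cannot open file $file for reading" >&2
        return 1
    fi

    local total other newline=0
    total=$(( $(wc -c < "$file") ))                  # every byte in the file
    other=$(( $(tr -d 'acgt' < "$file" | wc -c) ))   # bytes other than a, c, g, t
    # Command substitution strips a trailing newline, so an empty result
    # for a non-empty file means that the last byte is a newline.
    if [ "$total" -gt 0 ] && [ -z "$(tail -c 1 "$file")" ]; then
        newline=1
    fi

    # The only non-DNA byte allowed is a single newline at the very end.
    if [ "$other" -ne "$newline" ]; then
        echo "codonhistogram: $file contains characters other than a, c, g, t" >&2
        return 1
    fi
    if [ $(( (total - newline) % 3 )) -ne 0 ]; then
        echo "codonhistogram: the DNA string in $file is not a multiple of 3 long" >&2
        return 1
    fi
}
```

A script could call check_dnafile "$@" || exit 1 before doing any counting, so that the filters only ever see a validated file.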
2. Write a script called atomcoordinates that will accept the name of a PDB file as its only command line argument. Given this PDB file, it will find all lines that start with the word ATOM and will display, for each line that it finds, a line of output containing the atom's serial number and coordinates. For example, a line in the PDB file that looks like this:
ATOM     18  CB  GLN A   3      83.556  52.126  45.080  1.00 26.06           C
would result in the following output line being displayed:
18 83.556 52.126 45.080
because the atom's serial number is 18 and its coordinates are 83.556, 52.126, and 45.080.
How do you know where this information is? In a PDB file, the data is in specific columns. In particular, the atom's serial number is always in columns 7 through 11, and the three coordinates start in column 31 and end in column 54. Therefore, your script has to extract the serial number and the coordinates from these columns and display them. Your job is to decide which filters can achieve this. This will take some research. Figure out which filters will work the best.
Error checking: Your script must check that it has at least one command line argument, and that it is a file that it can read. It must display a message if either of these is not true.
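As one illustration of column-based extraction, the sketch below uses awk's substr to pull the fields out by position. Several things here are assumptions rather than requirements: the function name atom_coordinates and the helper trim are hypothetical, the error message wording is invented, and the split of columns 31 through 54 into three 8-column fields follows the PDB format rather than anything stated in this handout. Other filters, such as grep with cut, may serve equally well.

```shell
#!/bin/bash
# Sketch of atomcoordinates written as a function.
# atom_coordinates is a hypothetical name and the messages are assumed wording.
atom_coordinates() {
    if [ "$#" -lt 1 ]; then
        echo "usage: atomcoordinates <pdbfile>" >&2
        return 1
    fi
    if [ ! -f "$1" ] || [ ! -r "$1" ]; then
        echo "atomcoordinates: cannot open file $1 for reading" >&2
        return 1
    fi
    awk '
        # Remove the padding blanks around a fixed-width field.
        function trim(s) { gsub(/^ +| +$/, "", s); return s }
        /^ATOM/ {
            serial = trim(substr($0, 7, 5))    # columns 7-11
            x = trim(substr($0, 31, 8))        # columns 31-38 (assumed field width)
            y = trim(substr($0, 39, 8))        # columns 39-46 (assumed field width)
            z = trim(substr($0, 47, 8))        # columns 47-54 (assumed field width)
            print serial, x, y, z
        }
    ' "$1"
}
```

Reading the fields by position rather than by whitespace keeps the output correct even when two adjacent values fill their columns completely and have no blank between them.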
Grading Rubric
This homework is graded on a 100-point scale. Each script is worth 50 points. Each script is graded foremost on its correctness, meaning that it does exactly what the assignment states it must do, in detail. Correctness is worth 70% of the grade. The script is then graded on its clarity, simplicity, and efficiency, as described above. These qualitative measures are worth 30% of the grade.
Submitting the Homework
Due Date: This assignment is due by the end of the day (i.e. 11:59PM, EST) on Wednesday, October 31st. I will let the class know when the Blackboard submission for this assignment opens.
If you complete the assignment before I announce that the Blackboard submission is open, you may email your assignment to me, but only as a zip archive.
Submission details
Submit either a PDF showing the commands you entered, with screenshots of all output, or a zip file. For remote logins: ssh to eniac.cs.hunter.cuny.edu with your valid username and password, and then ssh into any cslab host.
1. In your own home directory, create a directory named assignment4_username where username is your Linux Lab account username.
2. Put copies of the two scripts that you have written into this directory. Make sure they are named codonhistogram and atomcoordinates.
3. Run the commands:
$ zip -r assignment4_username.zip assignment4_username/
$ chmod 755 assignment4_username.zip
This will create the file assignment4_username.zip. For Linux Lab users: once you have made the zip file, navigate to its location in the file system and upload it to Blackboard. For anyone working on the assignment remotely, use the scp command to securely copy it to your local computer, and then upload the file to Blackboard.
$ scp <Your.Username@eniac.cs.hunter.cuny.edu>:<path_of_zip_file> <desired_path>
There is no whitespace on either side of the colon. Your login, Your.Username@eniac.cs.hunter.cuny.edu, is named before the colon. The <path_of_zip_file>, named after the colon, is the absolute path of the zip file on the remote machine. Then type a space and specify the <desired_path> on your local file system where you would like to put your zip file. If you run the command properly, it should bring up a password prompt from eniac.cs.hunter.cuny.edu. The zip file will be placed in your specified location. Now you are ready to upload your zip file to Blackboard.