Due Friday, June 30 at 10pm

This first lab is relatively quick, and should get you up to speed working with the command line, basic shell commands, an editor, and a small shell program.

Materials

You will need these materials for this assignment.

  • Alice in Wonderland, which you will find as plaintext in file ~cs50/public_html/Labs/Lab1/alice-gutenberg.txt (Credit: Project Gutenberg)

  • Shakespeare’s sonnets, which you will find as plaintext files in directory ~cs50/public_html/Labs/Lab1/sonnets/ (Credit: Project Gutenberg)

Curious how we produced the sonnets directory from Gutenberg’s sonnets.html file? Look at the bottom of this page.

Preparation

Set up for your work in this course, if you have not already:

$ cd
$ mkdir -p cs50/labs/lab1
$ chmod go-rwx cs50
$ cd cs50/labs/lab1

These commands create a directory ~/cs50/labs/lab1, prevent others from peeking at your work, and change the working directory to lab1 so you’re ready for the work below.
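If you want to confirm that the chmod did what you intended, ls -ld shows a directory's permissions. The sketch below repeats the setup inside a scratch directory, so nothing in your real account is touched; demo_home is a hypothetical stand-in for your home directory, and this check is optional, not part of the lab.

```shell
# demo_home is a hypothetical stand-in for your home directory.
demo_home=$(mktemp -d)
cd "$demo_home"
mkdir -p cs50/labs/lab1            # -p creates cs50, labs, and lab1 as needed
chmod go-rwx cs50                  # remove read/write/execute for group and others
perms=$(ls -ld cs50 | cut -c1-10)  # keep just the type and permission columns
echo "$perms"                      # the last six characters should all be dashes
```

Restricting cs50 itself is enough: without execute permission on cs50, other users cannot reach labs or lab1 beneath it.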

If you would prefer to work out the initial solutions on your laptop, use the scp command to “secure copy” the files to your laptop:

[MacBook ~]$ scp flume:~cs50/public_html/Labs/Lab1/alice-gutenberg.txt .
alice-gutenberg.txt                           100%  170KB 139.2KB/s   00:01
[MacBook ~]$ scp -r flume:~cs50/public_html/Labs/Lab1/sonnets .
XI                                            100%  717    11.9KB/s   00:00
CXLIII                                        100%  669    10.5KB/s   00:00
...

Later, use scp to push your solutions back to your Linux account, test them there, and then submit them from there.

Assignment

  1. Write a single bash command/pipeline that will extract the text of Alice in Wonderland into a file alice.txt in your lab1 directory. (Note that the provided file has header and footer material added by Project Gutenberg. We want only the stuff Lewis Carroll wrote.)

  2. Write a single bash command/pipeline that will read alice.txt and print a table of contents.

  3. Write a single bash command/pipeline that will read alice.txt and print the words, in order, exactly one word per line, into alicewords.txt. (A word is a sequence of letters or a single letter.)

  4. Write a single bash command/pipeline that will read alicewords.txt and print the number of times the word “Alice” appears, regardless of how it is capitalized.

  5. Write a single bash command/pipeline that will read alice.txt and print the number of times the word “wonder” appears, regardless of how it is capitalized; note that “wonder” can be part of a word, e.g., wondering, wonderland, wondered, etc. These appearances count too.

  6. Write a single bash command/pipeline that will read alicewords.txt and print the top-10 most-common words, regardless of capitalization.

  7. Write a single bash command/pipeline that will read alicewords.txt and print the number of unique words, regardless of capitalization, that are not stop words. Use the sorted list of stop words in ~cs50/public_html/Labs/Lab1/stopwords.txt.

     Note: We’re going to keep things simple here; the words “account”, “accounts”, and “accounting” are all unique words for our purposes, though a fancier solution would stem them all to “account”. Given our definition of word, above, the word “account’s” would appear in alicewords.txt as “account” and “s”, and the latter will be stripped out as a stop word.

     You may find comm useful here.

  8. Write a bash script called shake.sh that allows the user to search for a word in all of Shakespeare’s sonnets.

    • For each matching sonnet, the script prints one line: the sonnet number, a colon, a space, the first line of the sonnet, then an ellipsis (...).
    • If the user provides too few or too many arguments, it should print “usage: shake.sh searchword” and exit with status 1.
    • If the script cannot find the sonnets directory in the expected location, it should print “cannot find sonnets directory” and exit with status 2.
    • Write three test cases; put each test’s command in its own single-line file shaketest# (shaketest1, shaketest2, shaketest3). Run each test and save its output in a separate file, e.g., bash shaketest1 > shaketest1.out.

For example,

$ ./shake.sh spring
CII:   My love is strengthen'd, though more weak in seeming;...
CIV:   To me, fair friend, you never can be old,...
I:   From fairest creatures we desire increase,...
LIII:   What is your substance, whereof are you made,...
LXIII:   Against my love shall be as I am now,...
XCVIII:   From you have I been absent in the spring,...
$ ./shake.sh computer
$
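The mechanics of the test files described above can be tried with a stand-in command first. This is only a sketch: demotest1 and its echo command are hypothetical placeholders, and your real shaketest files would invoke shake.sh instead.

```shell
demo=$(mktemp -d)
cd "$demo"
# demotest1 holds exactly one command line (a placeholder, not a real test).
echo 'echo "pretend this is a search result"' > demotest1
bash demotest1 > demotest1.out     # run the test; its standard output lands in the .out file
status=$?                          # exit status of the command that just ran
cat demotest1.out
echo "exit status: $status"
```

Note that > captures standard output only; anything a command writes to standard error still appears on the terminal unless you redirect that as well.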

How many hits will you get if you shake love? :-)
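Question 7 mentions comm. As a reminder of how it behaves, here is a small sketch on two toy word lists; the file names and words are made up for illustration and are not the lab's data. comm compares two sorted files line by line, and the options -2 and -3 suppress the lines unique to the second file and the lines common to both, leaving only the lines unique to the first.

```shell
demo=$(mktemp -d)
# Two toy lists, each already in sorted order (comm expects sorted input).
printf '%s\n' alice hatter queen rabbit the > "$demo/first.txt"
printf '%s\n' a of the > "$demo/second.txt"
# -23 hides columns 2 and 3, so only lines found solely in first.txt remain.
only_in_first=$(comm -23 "$demo/first.txt" "$demo/second.txt")
echo "$only_in_first"              # prints alice, hatter, queen, rabbit (one per line)
rm -r "$demo"
```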

What to hand in, and how

Even if you’ve worked out the solutions on your laptop, you must place them in your cs50/labs/lab1 directory on the department Linux servers. (Use scp to copy files between your laptop and the servers.) You should test them there before submitting.

For the command/pipeline questions, write all your answers in a single file commands.txt. Include each command and its output.

One approach: If you use Apple Terminal, open a fresh Terminal window, ssh to Linux, cd to your directory, type each command in sequence; then, use Terminal’s menu “Shell… Export Text as…” and save it to file commands.txt. If you’ve worked out all the solutions in other windows, you can copy-paste the commands into this new window and you’ll have a nice, neat file to turn in.

When finished, you should have the following files:

$ ls -1
alice.txt
alicewords.txt
commands.txt
shake.sh
shaketest1
shaketest1.out
shaketest2
shaketest2.out
shaketest3
shaketest3.out
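If you would rather not compare that listing by eye, a short loop can report anything missing. This is a minimal sketch, not part of the submit procedure; check_lab1_files is a hypothetical helper name, and it tests only that each file exists, not that its contents are correct.

```shell
# check_lab1_files [dir]: report expected lab files missing from dir (default: current directory).
# The parenthesized body runs in a subshell, so its variables stay local.
check_lab1_files() (
    dir=${1:-.}
    missing=0
    for f in alice.txt alicewords.txt commands.txt shake.sh \
             shaketest1 shaketest1.out shaketest2 shaketest2.out \
             shaketest3 shaketest3.out
    do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    echo "$missing file(s) missing"
    [ "$missing" -eq 0 ]           # succeed only when nothing is missing
)

if check_lab1_files .; then
    echo "all expected files are present"
fi
```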

Then, you can submit your lab:

~cs50/labs/submit 1

Make sure it confirms success.

If you wish to use one of your 24-hour extensions, run this command before the deadline:

~cs50/labs/submit 1 extension

This command deletes any submission you may have made previously, and leaves a single file “extension” there as an indication you are requesting an extension. When you are later ready to submit your work, do so as above:

~cs50/labs/submit 1

If you submit after the deadline (the original deadline, or your postponed deadline if you used an extension), any pre-deadline submission will be overwritten. We will grade the first submission present at 0h, 24h, 48h, or 72h after the original deadline. To avoid confusion, please blitz cs50 AT cs.dartmouth.edu if you want a late submission to be graded instead of your on-time submission.

Bursting an HTML file into many sonnets

This section is brought to you by curiosity.

How did we produce hundreds of files in the sonnets directory from Gutenberg’s sonnets.html file? A little sed and a little awk:

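# sed: drop Gutenberg's header and footer, strip the HTML markup and blank lines.
# awk: the line after each <h3> names a sonnet; the lines that follow go into that file.
mkdir sonnets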
sed \
 -e '1,/^\*\*\* START/d' \
 -e '/^End of Project Gutenberg/,$d' \
 -e 's/<br \/>//' \
 -e 's/&nbsp;/ /g' \
 -e '/<\/p>/d' \
 -e '/<\/h3>/d' \
 -e '/class="poem"/d' \
 -e '/<pre>/d' \
 -e '/^$/d' sonnets.html \
| awk '\
 /<h3>/ {printing=0; header=1; next;}\
 header {sonnet="sonnets/" $1; print sonnet; printing=1; header=0; next;}\
 printing {print > sonnet}'
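To see the awk output-redirection idiom in isolation, here is a toy sketch. It is a simplified stand-in, not the pipeline above: the input is five made-up lines, a line starting with # plays the role of the header, and the poems directory is hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo"
mkdir poems
printf '%s\n' '# I' 'first poem, line one' 'first poem, line two' \
              '# II' 'second poem, line one' \
| awk '
  /^#/ { if (out) close(out); out = "poems/" $2; next }  # header line names the next output file
  out  { print > out }                                   # body lines go to the current file
'
ls poems                           # lists the two files, I and II
```

Closing each file before moving to the next is a small design choice here: it keeps the number of simultaneously open files at one, which matters on awk implementations with a low limit.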

I don’t expect you to learn awk and you will not need it for this assignment. But it’s a great little language!