HW 1: Fun and Games with Regular Expressions

Assigned: 10/1/2014

Due: 10/8/2014

In this homework, you will get some practice working with regular expressions. There are five parts to the assignment. Please email me a single .zip or .tar.gz file containing your solutions to all five parts, and include "CS655 Homework" somewhere in the subject line.

As always, please do not hesitate to email me if you have any questions!

Part 1: FSA to Regular Expression

For each of the following two FSAs, write the corresponding regular expression. Hint: remember that the double line represents a valid final (stopping) state.

Part 2: Regular Expression to FSA

For each of the following two expressions, create a corresponding FSA. You may use whatever tool you like to draw it, but I suggest GraphViz and Dot. On a Mac, Homebrew will install both (along with some dependencies: "brew install cairo pango graphviz").

  1. d*bc(ca*)*
  2. (b+a+c)*
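
If you go the GraphViz route, a small script can write the Dot source for you. The following is only a sketch: the function name fsa_to_dot, the state names, and the toy automaton (for the expression ab*, which is not one of the assigned expressions) are all made up for illustration.

```python
def fsa_to_dot(transitions, start, finals):
    """Return Dot source for a finite-state automaton.

    transitions: list of (source, label, target) tuples
    start: name of the start state
    finals: set of final (accepting) state names
    """
    lines = ["digraph fsa {", "  rankdir=LR;", "  node [shape=circle];"]
    # Final states are drawn with a double circle.
    for state in sorted(finals):
        lines.append(f"  {state} [shape=doublecircle];")
    # A tiny point node supplies the incoming arrow for the start state.
    lines.append("  __start [shape=point];")
    lines.append(f"  __start -> {start};")
    # One labelled edge per transition.
    for src, label, dst in transitions:
        lines.append(f'  {src} -> {dst} [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Toy automaton for ab* (hypothetical example, not an assigned expression).
toy = fsa_to_dot(
    [("q0", "a", "q1"), ("q1", "b", "q1")],
    start="q0",
    finals={"q1"},
)
print(toy)
```

Saving the printed text to a file such as fsa.dot and running "dot -Tpng fsa.dot -o fsa.png" should render the diagram.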

Part 3: Counting Sonnets

William Shakespeare wrote a number of sonnets. This file from Project Gutenberg contains all of them, along with some other content. Your task is to write a program (in the language of your choice) to count the number of sonnets contained in the file. Hint: you'll note that although the document is mostly unstructured, there are patterns to its formatting that can be used to identify where a sonnet starts and stops.

To complete this part of the assignment: write your program and determine the number of sonnets in the file. Include your source code in the submission .zip/.tar.gz file, along with a paragraph telling me how many sonnets your program identified and a little bit about how you tackled the problem.
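
To illustrate the general idea of counting items by a formatting pattern, here is a minimal sketch on made-up text. It assumes, purely for the sake of example, that each item is introduced by a line containing nothing but a number; the real file may well be formatted differently, so inspect it before settling on a pattern. The names HEADING and count_headings are hypothetical.

```python
import re

# Assumption for this toy example only: a heading is a line that holds
# nothing but digits, optionally surrounded by spaces or tabs.
HEADING = re.compile(r"^[ \t]*\d+[ \t]*$", re.MULTILINE)


def count_headings(text):
    """Count the lines in text that look like a bare numeric heading."""
    return len(HEADING.findall(text))


# Made-up sample text, not taken from the Project Gutenberg file.
sample = """Some front matter

  1

First toy line
Second toy line

  2

Another toy line
"""

print(count_headings(sample))
```

A pattern this simple will also count any other line that happens to contain only a number, so it is worth checking a few matches by hand.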

Part 4: Collecting and Counting Surnames

The characters in Jane Austen's Pride and Prejudice lived in a more formal era than our own, and often referred to one another by their surnames (family names). For example, the female protagonist is often referred to as "Miss Bennet," and the male protagonist as "Mr. Darcy."

Write a program using regular expressions to extract as many of the surnames as you can from the story, and compute a frequency table with how many times each one occurs. To complete this part of the assignment, turn in your program's source code, your frequency table, and a paragraph describing your approach to solving this problem.
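
One possible starting point is to look for an honorific followed by a capitalized word, and to tally the results. The sketch below runs on a made-up sentence; the list of honorifics is an assumption and certainly incomplete, and the names TITLED_NAME and surname_counts are hypothetical.

```python
import re
from collections import Counter

# Assumed, incomplete list of honorifics; the optional "." covers "Mr." and "Mr".
TITLED_NAME = re.compile(r"\b(?:Mr|Mrs|Miss|Lady|Sir|Colonel)\.?\s+([A-Z][a-z]+)")


def surname_counts(text):
    """Return a Counter of the capitalized words that follow an honorific."""
    return Counter(TITLED_NAME.findall(text))


# Made-up sentence, not a quotation from the novel.
sample = "Miss Bennet smiled at Mr. Darcy, and Mr. Darcy bowed to Mrs. Bennet."

for name, count in surname_counts(sample).most_common():
    print(name, count)
```

Note one limitation of this approach: in a phrase such as "Miss Elizabeth Bennet" it would capture the given name rather than the surname, so your own solution will need to handle that case.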

If you have fun doing this, here are a couple of ideas for extending this part of the assignment (and getting extra credit!):

Part 5: Regex Golf

Regex Golf is a game wherein you try to construct a regular expression that matches every word in one set while matching none of the words in a second set (i.e., with no false positives). There are many levels, and they get difficult quickly. For this part of the assignment, please attempt the first five levels (up through Abba), though of course you should feel free to go for as many levels as you like if you are having fun! Turn in your expression for each level, along with a paragraph describing your experience. Was it fun? Frustrating? What new regex tricks did you learn?
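
If you would like to test candidate expressions offline, a few lines of code can report which words a pattern misses and which it matches by mistake. This is only a sketch: the function name check and the word lists are made up and are not taken from the game, and Python's regex dialect may differ in places from the one the game uses.

```python
import re


def check(pattern, should_match, should_not_match):
    """Return (missed, false_positives) for pattern against two word lists."""
    rx = re.compile(pattern)
    # Words we wanted to match but did not.
    missed = [w for w in should_match if not rx.search(w)]
    # Words we wanted to avoid but matched anyway.
    false_positives = [w for w in should_not_match if rx.search(w)]
    return missed, false_positives


# Made-up word lists, not taken from the game.
wanted = ["foobar", "barfoo", "food"]
unwanted = ["bar", "baz", "fob"]

print(check("foo", wanted, unwanted))
print(check("fo", wanted, unwanted))
```

A pattern solves the toy level when both returned lists are empty; here the shorter pattern "fo" is too greedy because it also matches "fob".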