Counting Words with Java 8

Sadly, in my day job, I am not yet able to use the awesomeness that is Java 8. However, from time to time, I like to kill a little time solving programming challenges, and I try to use Java 8 for those.

Today’s challenge came from /r/dailyprogrammer on Reddit. It was a pretty straightforward challenge – given a text file, count the number of occurrences of each word. This turns out to be very easy to do with streams!

We need to do the following operations:

  1. Read in all the lines from the file.
  2. Break up each line into words.
  3. Count each occurrence of the word.
  4. Sort the result.
  5. Print it out.

For simplicity’s sake, let’s assume a word is defined as any group of characters separated by whitespace.

Here’s the code:

       Files.lines(Paths.get(args[0]))
            .flatMap(line -> Stream.of(line.split("\\s+")))
            .map(String::toLowerCase)
            .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum))
            .entrySet()
            .stream()
             .sorted((a, b) -> a.getValue().equals(b.getValue()) ? a.getKey().compareTo(b.getKey()) : b.getValue() - a.getValue())
            .forEach(System.out::println);

Let’s break it down.

Files.lines reads a file and returns a Stream of its lines. But we want words, not lines. No problem. Stream.flatMap takes a function that returns a Stream and applies it to each element, giving us a mini-stream of words for each line. flatMap then flattens all those streams into one big Stream containing every word in the file. In this case, we split each line on whitespace to form the words. Then we map each word through String::toLowerCase so that we’re doing a case-insensitive word count.
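
To see what flatMap is doing, here’s a tiny standalone example, using a hypothetical two-line input in place of the file:

       // Hypothetical input: a stream of two lines instead of a file.
       Stream.of("the cat", "the dog")
             .flatMap(line -> Stream.of(line.split("\\s+")))
             .forEach(System.out::println); // prints: the, cat, the, dog (one per line)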

Now that we have a Stream of all the words in the file, we can start processing. What we want is a Map<String, Integer> that maps each word to its number of occurrences. Collectors.toMap does this for us. The first argument is a function that returns the key in the map. In this case, the key is just the word itself, which explains the somewhat pointless-looking word -> word. The second argument is a function that returns the value in the map. Here’s where it gets tricky. We’re using the three-argument version of Collectors.toMap, which handles the case where two elements map to the same key. The third argument is a merge function that combines the two colliding values into a new value.

To sum up the number of occurrences of each word, we start with a value of 1. Here’s what happens. Say the word “cat” appears 3 times in the input file. The value function produces a 1 for each occurrence, but all three occurrences share the key “cat”, so they collide. To get the word count, we want to add those values (of 1 each) whenever a collision happens, so we use Integer::sum as the merge function, leaving “cat” mapped to 3.
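
Here’s the same idea in miniature, using a hypothetical in-memory word list in place of the file:

       // Hypothetical in-memory input. "cat" appears three times, so
       // Integer::sum merges the 1s: 1 + 1 + 1 = 3.
       Map<String, Integer> counts = Stream.of("cat", "dog", "cat", "cat")
             .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));
       System.out.println(counts); // e.g. {cat=3, dog=1} (HashMap order isn't guaranteed)

(Collectors.groupingBy paired with Collectors.counting is another common way to build a word count, though it produces Long counts rather than Integer.)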

The hard part is done, but we still need to sort and print the results. Because collect is a terminal operation, we’ll need a new stream to proceed. Calling stream() on the resulting Map’s entrySet will give us the stream we need.

To do the sorting, our comparison function first checks the word counts. If two words have the same count, they are sorted alphabetically. Otherwise, they are sorted by count in descending order, so the most frequent words come first.
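
As an aside, the same ordering can also be expressed with the Comparator combinators instead of a hand-rolled lambda; some may find this more readable:

             .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                     .thenComparing(Map.Entry.comparingByKey()))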

Lastly, we print the sorted stream to the console to get the output.
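
Each map entry prints in key=value form. For example, given a hypothetical input file containing only the line “the cat and the dog”, the output would be:

       the=2
       and=1
       cat=1
       dog=1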

To summarize, Java 8 lambdas and streams are insanely cool and I hope I get more experience with them soon!

Word Chains and Graphs

Recently, I’ve been working through Dave Thomas’s Code Kata problems in an effort to sharpen my programming skills. The early ones were fairly easy, and I got through them pretty quickly. That changed once I got to Kata 19, Word Chains.

The idea sounds simple enough. The puzzle is to write a program that will transform a starting word to an ending word. One letter is changed at a time, and each word in the chain must be a real word.

I initially coded a seemingly easy solution, but it only worked in a small subset of cases. I didn’t think the problem through enough before jumping into code. In my original implementation, each step in the chain had to move a letter into the position it occupies in the ending word. This greedy approach worked for a case like “cat” to “dog” (cat, cot, cog, dog), but not for a trickier case like “the” to “end” (the, tie, die, did, aid, and, end), which needs intermediate steps that don’t directly move toward the target.

Clearly, I was on the wrong path.

Finally, it occurred to me that I would have to calculate all the possible word transformations, then find the shortest path. Uh-oh. Sounds like a job for a graph. Now, I haven’t looked at graph theory since college. I spent a couple of days researching the topic, even watching some interesting MIT lectures on algorithms (which I highly recommend).

With that refresher behind me, I set to coding my new solution. After loading the list of words from the dictionary file, a graph is constructed: the starting node is the starting word, and from there all possible edges are recursively calculated. In a classic case of over-engineering, I then implemented Dijkstra’s algorithm to find the shortest path.

The solution was plenty fast, finding any word chain I could throw at it in well under a second. But as I thought about it more, I realized Dijkstra was a bit of overkill. Part of Dijkstra’s algorithm involves measuring the weights of different edges to find the shortest path. However, in this case, all the edges are equal in weight (a weight of 1).

With all edge weights equal, Dijkstra’s algorithm is essentially a breadth-first search. Realizing that, I wrote a new implementation that finds the shortest path using a simple breadth-first search.

After getting that working, I found another optimization. Due to the way the breadth-first search traverses the graph, the graph doesn’t need to be generated ahead of time. When the search visits a node (a word), the next words can be generated right at that step, building the graph on the fly. This reduces memory usage, since only the portion of the graph the search actually reaches is ever built.
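
Here’s a rough sketch of what generating the next words might look like, assuming the dictionary has been loaded into a Set<String> called dictionary (the actual implementation is in the GitHub repo linked at the end). For each position in the word, it tries every other letter and keeps the candidates that are real words:

		// Hypothetical sketch: find all dictionary words that differ from
		// the given word by exactly one letter.
		List<String> getNextWords(String word) {
			List<String> nextWords = new ArrayList<>();
			char[] letters = word.toCharArray();
			for (int i = 0; i < letters.length; i++) {
				char original = letters[i];
				for (char c = 'a'; c <= 'z'; c++) {
					if (c == original) {
						continue;
					}
					letters[i] = c;
					String candidate = new String(letters);
					if (dictionary.contains(candidate)) {
						nextWords.add(candidate);
					}
				}
				// Restore the original letter before moving to the next position.
				letters[i] = original;
			}
			return nextWords;
		}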

The key to the breadth-first search is a Map that associates each node with its predecessor on the shortest path. This Map is also used to remember which words have already been visited:

		// Maps each discovered word to its predecessor on the shortest path.
		// It also doubles as the set of words that have been visited.
		Map<String, String> previous = new HashMap<>();

		Queue<String> q = new LinkedList<>();
		q.add(fromWord);

		String word = null;
		while (!q.isEmpty() && !(word = q.remove()).equals(toWord)) {
			// Generate this word's neighbors on the fly.
			List<String> nextWords = getNextWords(word);
			for (String nextWord : nextWords) {
				// Only record and enqueue words we haven't seen yet.
				if (!previous.containsKey(nextWord)) {
					previous.put(nextWord, word);
					q.add(nextWord);
				}
			}
		}

When the search completes, the shortest path can be constructed from the gathered predecessor data. Starting with the ending word, the path is built in reverse by looking up each word’s predecessor. This continues until the starting word is reached:

		// Build the path in reverse: prepend each word, then follow its
		// predecessor link until we arrive back at the starting word.
		List<String> path = new ArrayList<>();
		String word = toWord;
		while (!word.equals(fromWord)) {
			path.add(0, word);
			word = previous.get(word);
		}

		path.add(0, fromWord);

With that, the shortest word chain is calculated. This was a tough challenge for me, but it was interesting to re-learn all that graph theory that I had forgotten.

You can see the full Java solution on my GitHub page here:
https://github.com/joeattardi/codekata/tree/master/wordchains