Counting Words with Java 8

Sadly, in my day job, I am not yet able to use the awesomeness that is Java 8. However, from time to time, I like to kill a little time solving programming challenges, and I try to use Java 8 for those.

Today’s challenge came from /r/dailyprogrammer on Reddit. It was a pretty straightforward challenge – given a text file, count the number of occurrences for each word. This turns out to be very easy to do with streams!

We need to do the following operations:

  1. Read in all the lines from the file.
  2. Break up each line into words.
  3. Count each occurrence of the word.
  4. Sort the result.
  5. Print it out.

For simplicity’s sake, let’s assume a word is defined as any group of characters separated by whitespace.
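To make that definition concrete, here’s a small sketch of what splitting on whitespace actually gives you. The class and method names (SplitDemo, words) are made up for illustration; only String.split and the "\\s+" regex come from the solution itself.

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Splits a line on runs of whitespace, using the same regex as the word count.
    static List<String> words(String line) {
        return Arrays.asList(line.split("\\s+"));
    }

    public static void main(String[] args) {
        // Spaces and tabs both count as separators.
        System.out.println(words("the quick  brown\tfox")); // [the, quick, brown, fox]
        // Punctuation stays attached to the word under this definition.
        System.out.println(words("Hello, world!"));         // [Hello,, world!]
    }
}
```

One thing to watch for with this definition: a line that starts with whitespace, or a completely empty line, produces an empty string as its first "word".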

Here’s the code:

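       // Needs java.nio.file.Files, java.nio.file.Paths,
       // java.util.stream.Collectors and java.util.stream.Stream.
       Files.lines(Paths.get(args[0]))
            .flatMap(line -> Stream.of(line.split("\\s+")))
            .map(String::toLowerCase)
            .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum))
            .entrySet()
            .stream()
            // Compare counts with equals(): == on boxed Integers compares references.
            .sorted((a, b) -> a.getValue().equals(b.getValue()) ? a.getKey().compareTo(b.getKey()) : b.getValue() - a.getValue())
            .forEach(System.out::println);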

Let’s break it down.

Files.lines reads a file and returns a Stream of its lines. But we want words, not lines. No problem. Stream.flatMap takes a function that turns each element into a Stream of its own. Here, that function splits the line on whitespace, which gives us a mini-stream of words for each line. Then flatMap flattens all those Streams into one big Stream containing all the words in the file. Finally, we map each word through String::toLowerCase so that we’re doing a case-insensitive word count.
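Here’s a minimal sketch of just the flatMap and toLowerCase steps, using a couple of in-memory strings instead of a file so it’s easy to try out. FlatMapDemo and toWords are names I made up for the example.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FlatMapDemo {
    // Turns a stream of lines into a flat list of lower-cased words.
    static List<String> toWords(Stream<String> lines) {
        return lines
                .flatMap(line -> Stream.of(line.split("\\s+"))) // one mini-stream per line, flattened
                .map(String::toLowerCase)                        // case-insensitive from here on
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<String> lines = Stream.of("The cat sat", "on THE mat");
        System.out.println(toWords(lines)); // [the, cat, sat, on, the, mat]
    }
}
```

If we had used map instead of flatMap, we would have ended up with a Stream of Streams, one per line, rather than a single Stream of words.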

Now that we have a Stream of all the words in the file, we can start processing. What we want is a Map<String, Integer> that maps each word to the number of occurrences. Collectors.toMap does this for us. The first argument is a function that returns the key in the map. In this case, the key is just the word, which explains the somewhat pointless-looking word -> word. The second argument is a function that returns the value in the map. Here’s where it gets tricky. We’re using the three-argument version of Collectors.toMap, which handles key collisions. The third argument is a merge function that combines the two values of a colliding key into a new value.

To sum up the number of occurrences of each word, we start with a value of 1. Here’s what happens. Say the word “cat” appears 3 times in the input file. Collectors.toMap will then see the key “cat” three times, each time with a value of 1. To get the word count, each time the key collides we want to add the new 1 to the count so far. So we use Integer::sum to do this for us.
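The “cat” example can be tried on its own. This is a small sketch with a hard-coded stream of words; ToMapDemo and count are illustrative names, and the toMap call is the same one used in the solution.

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ToMapDemo {
    // Key: the word itself. Value: 1 per occurrence. Merge: add the values on a key collision.
    static Map<String, Integer> count(Stream<String> words) {
        return words.collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(Stream.of("cat", "dog", "cat", "cat"));
        System.out.println(counts.get("cat")); // 3
        System.out.println(counts.get("dog")); // 1
    }
}
```

Without the third argument, the two-argument toMap would throw an IllegalStateException as soon as it hit the second “cat”, because it doesn’t know what to do with a duplicate key.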

The hard part is done, but we still need to sort and print the results. Because collect is a terminal operation, we’ll need a new stream to proceed. Calling stream() on the resulting Map’s entrySet gives us the stream we need: one entry per word, holding the word and its count.

To do the sorting, our comparison function first checks the word counts. If two words have the same count, they are sorted in alphabetical order. Otherwise, they are sorted by number of occurrences, highest first.
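Here’s one way to spell that comparison out as a named method, as a sketch rather than the only way to do it. SortDemo, byCountThenWord and sortedWords are made-up names. I’m using Integer.compare on the counts because getValue() returns a boxed Integer, and comparing boxed Integers with == checks whether they are the same object, not whether they hold the same number.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SortDemo {
    // Highest count first; words with equal counts in alphabetical order.
    static int byCountThenWord(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
        int byCount = Integer.compare(b.getValue(), a.getValue()); // b before a: descending
        return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    }

    // Returns just the words, in sorted order.
    static List<String> sortedWords(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .sorted(SortDemo::byCountThenWord)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("dog", 1000);
        counts.put("cat", 1000);
        counts.put("the", 2500);
        counts.put("mat", 1);
        System.out.println(sortedWords(counts)); // [the, cat, dog, mat]
    }
}
```

The counts of 1000 are deliberate: small Integer values are cached by the JVM, so a reference comparison can appear to work in a quick test and then break on bigger files.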

Lastly, we print the sorted stream to the console to get the output.
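For anyone who wants to run the whole thing, here’s a sketch of the pipeline wrapped in a complete class. WordCount and count are names I picked for the example. It makes two additions to the pipeline as described: the stream from Files.lines is opened in a try-with-resources block so the file gets closed (the Javadoc for Files.lines recommends this), and empty strings are filtered out, which is my own assumption about what should count as a word.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class WordCount {
    // Returns word/count entries, highest count first, ties in alphabetical order.
    static List<Map.Entry<String, Integer>> count(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) { // closes the file when done
            return lines
                    .flatMap(line -> Stream.of(line.split("\\s+")))
                    .filter(word -> !word.isEmpty())     // assumption: skip empty tokens
                    .map(String::toLowerCase)
                    .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum))
                    .entrySet()
                    .stream()
                    .sorted((a, b) -> {
                        int byCount = Integer.compare(b.getValue(), a.getValue());
                        return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                    })
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Expects the path to a text file as the first argument.
        count(Paths.get(args[0])).forEach(System.out::println);
    }
}
```

Each entry prints as word=count, one per line.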

To summarize, Java 8 lambdas and streams are insanely cool and I hope I get more experience with them soon!
