Example: Hitchhikers Guide
Suggested approach: read the (assumed perfectly legal) copy of the Hitchhikers book text using the slurp function, then:
- Use a regular expression to create a collection of individual words, e.g. #"[a-zA-Z0-9|']+"
- Convert all the words to lower case so they match the common words source
- remove the common English words used in the book, leaving more context-specific words
- Calculate the frequencies of the remaining words, returning a map of word & word count pairs
- sort-by the word count values in the map, which returns a sequence of word & word count pairs
- reverse the collection so the most commonly used word is the first element
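To see the steps on a small input before running them against the whole book, here is a minimal sketch. The sample text, the tiny common-word set and the word-frequencies function name are all made up for illustration; only the pipeline itself follows the steps listed above.

```clojure
(require '[clojure.string :as string])

;; Hypothetical sample text and common-word set, for illustration only
(def sample-text "The ship hung in the sky. The ship didn't move.")
(def sample-common-words #{"the" "in" "a" "of"})

(defn word-frequencies
  "Returns word & word count pairs, most frequent word first."
  [text common-words]
  (->> text
       (re-seq #"[a-zA-Z0-9|']+")   ; split the text into words
       (map string/lower-case)      ; match the lower-case common words
       (remove common-words)        ; a set works as a predicate
       frequencies                  ; map of word -> count
       (sort-by val)                ; ascending by count
       reverse))                    ; most common word first

(first (word-frequencies sample-text sample-common-words))
;; => ["ship" 2]
```

Passing a set to remove works because a Clojure set can be called as a function that returns the element when it is present and nil otherwise.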
(def book (slurp "http://clearwhitelight.org/hitch/hhgttg.txt"))

(def common-english-words
  (-> (slurp "https://www.textfixer.com/tutorials/common-english-words.txt")
      (clojure.string/split #",")
      set))

;; using a function to pull in any book
(defn get-book [book-url]
  (slurp book-url))

(defn -main [book-url]
  (->> (get-book book-url)
       (re-seq #"[a-zA-Z0-9|']+")
       (map #(clojure.string/lower-case %))
       (remove common-english-words)
       frequencies
       (sort-by val)
       reverse))

;; Call the program
(-main "http://clearwhitelight.org/hitch/hhgttg.txt")
Deconstructing the code in the repl
(defn -main [book-url]
  (->> (get-book book-url)
       #_(re-seq #"[a-zA-Z0-9|']+")
       #_(map #(clojure.string/lower-case %))
       #_(remove common-english-words)
       #_frequencies
       #_(sort-by val)
       #_reverse))
With every other expression discarded by the #_ reader macro, the -main function only returns the result of the (get-book book-url) call. To see what each of the other lines does, remove the #_ from the front of an expression and re-evaluate the -main function in the REPL.
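The same technique is easier to follow on a short string than on the whole book. The sketch below uses a made-up sample string and hypothetical names (sample, stage-1, stage-2) to show the result after enabling one more step each time.

```clojure
(require '[clojure.string :as string])

;; Hypothetical sample text standing in for the book
(def sample "Don't Panic, don't panic")

;; Only re-seq is active; the reader discards every form prefixed with #_
(def stage-1
  (->> sample
       (re-seq #"[a-zA-Z0-9|']+")
       #_(map string/lower-case)
       #_frequencies))
;; stage-1 => ("Don't" "Panic" "don't" "panic")

;; Removing one #_ enables the next step in the pipeline
(def stage-2
  (->> sample
       (re-seq #"[a-zA-Z0-9|']+")
       (map string/lower-case)
       #_frequencies))
;; stage-2 => ("don't" "panic" "don't" "panic")
```

Because #_ discards the whole form that follows it at read time, the threading macro never sees the disabled steps.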
Off-line sources of Hitchhikers book and common English words
(def book (slurp "./hhgttg.txt"))

(def common-english-words
  (-> (slurp "common-english-words.txt")
      (clojure.string/split #",")
      set))
Original concept from Misophistful: Understanding thread macros in clojure
The slurp function holds the contents of the whole file in memory, so it may not be appropriate for very large files. If you are dealing with a large file, consider reading it lazily, for example one line at a time, or using Java IO (e.g. java.io.FileReader). See the Clojure I/O cookbook and The Ins & Outs of Clojure for examples.
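As one possible way to avoid loading a large file in one go, the sketch below reads it line by line with clojure.java.io/reader and line-seq. The function name word-frequencies-from-file is hypothetical, and the sketch assumes words never span a line break.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

(defn word-frequencies-from-file
  "Counts words one line at a time rather than slurping the whole file.
  common-words is a set of lower-case words to leave out."
  [path common-words]
  (with-open [reader (io/reader path)]
    (->> (line-seq reader)                          ; lazy sequence of lines
         (mapcat #(re-seq #"[a-zA-Z0-9|']+" %))     ; words from each line
         (map string/lower-case)
         (remove common-words)
         frequencies                                ; eager, finishes before the reader closes
         (sort-by val)
         reverse)))

;; Example call, using the off-line file names from above:
;; (word-frequencies-from-file "./hhgttg.txt" common-english-words)
```

The frequencies step consumes the lazy line sequence fully while the reader is still open, so nothing lazy escapes the with-open form. The word count map itself still lives in memory, which is usually far smaller than the text.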