A First Exercise in Natural Language Processing with Python: Counting Hapaxes

A first exercise

Counting hapaxes (words which occur only once in a text or corpus) is a simple problem that exercises both basic data structures and some fundamental tasks of natural language processing (NLP): tokenization (dividing a text into words), stemming, and part-of-speech tagging for lemmatization. That makes it a good exercise for getting started with NLP in a new language or library.

As a first exercise in implementing NLP tasks with Python, then, we’ll write a script which outputs the count and a list of the hapaxes in the following paragraph (our script can also be run on an arbitrary input file). You can follow along, or try it yourself and then compare your solution to mine.

Cory Linguist, a cautious corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link. Now, if Cory Linguist, a careful corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link, see that YOU, in creating a corpus of courtship correspondence, corrupt not a crucial link.

To keep things simple, ignore punctuation and case. To make things complex, count hapaxes in all three of word form, stemmed form, and lemma form. The final program (hapaxes.py) is listed at the end of this post. The sections below walk through it in detail for the beginning NLP/Python programmer.

Natural language processing with Python

There are several NLP packages available to the Python programmer. The best known is the Natural Language Toolkit (NLTK), which is the subject of the popular book Natural Language Processing with Python by Bird et al. NLTK focuses on education and research and has a rather sprawling API. Pattern is a Python package for data mining the web which includes submodules for language processing and machine learning. Polyglot is a language library focusing on “massive multilingual applications.” Many of its features support over 100 languages (but it doesn’t seem to have a stemmer or lemmatizer built in). And there is Matthew Honnibal’s spaCy, an “industrial strength” NLP library focused on performance and integration with machine learning models.

If you don’t already know which library you want to use, I recommend starting with NLTK because there are so many online resources available for it. The program below offers three solutions to counting hapaxes, one in plain Python and two using the NLTK library:

  • Word forms - counts unique spellings (normalized for case). This uses plain Python (no NLP packages required)

  • NLTK stems - counts unique stems using a stemmer provided by NLTK

  • NLTK lemmas - counts unique lemma forms using NLTK’s part of speech tagger and interface to the WordNet lemmatizer

Installation

This tutorial assumes you already have Python installed on your system and have some experience using the interpreter. I recommend referring to each package’s project page for installation instructions, but here is one way using pip. As explained below, NLTK is an optional dependency; without it the script runs only its plain-Python word-form count.

# Install NLTK:
$ pip install nltk

# Download required NLTK data packages
$ python -c 'import nltk; nltk.download("wordnet"); nltk.download("averaged_perceptron_tagger_eng"); nltk.download("omw-1.4")'

Optional dependency on Python modules

It would be nice if our script didn’t depend on any particular NLP package so that it could still run even if one or more of them were not installed (using only the functionality provided by whichever packages are installed).

One way to implement a script with optional package dependencies in Python is to try to import a module, and if we get an ImportError exception we mark the package as uninstalled (by setting a variable with the module’s name to None) which we can check for later in our code:

[hapaxes.py: 59-88]
### Imports
#
# Import some Python 3 features to use in Python 2
from __future__ import print_function
from __future__ import unicode_literals

# gives us access to command-line arguments
import sys

# The Counter collection is a convenient layer on top of
# python's standard dictionary type for counting iterables.
from collections import Counter

# The standard python regular expression module:
import re

try:
    # Import NLTK if it is installed
    import nltk

    # This imports NLTK's implementation of the Snowball
    # stemmer algorithm
    from nltk.stem.snowball import SnowballStemmer

    # NLTK's interface to the WordNet lemmatizer
    from nltk.stem.wordnet import WordNetLemmatizer
except ImportError:
    nltk = None
    print("NLTK is not installed, so we won't use it.")

Tokenization

Tokenization is the process of splitting a string into lexical ‘tokens’ — usually words or sentences. In languages with space-separated words, satisfactory tokenization can often be accomplished with a few simple rules, though ambiguous punctuation can cause errors (such as mistaking a period after an abbreviation for the end of a sentence). Some tokenizers use statistical inference (trained on a corpus with known token boundaries) to recognize tokens.
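
To make the abbreviation problem concrete, here is a minimal sketch of a rule-based sentence splitter. The helper name naive_sentences and its single rule are assumptions made for illustration, not anything from NLTK: it splits wherever a period, question mark, or exclamation mark is followed by whitespace.

```python
import re


def naive_sentences(text):
    """Split after '.', '?' or '!' when followed by whitespace.

    Hypothetical helper for illustration only.
    """
    return re.split(r'(?<=[.?!])\s+', text.strip())


sample = "Dr. Linguist built a corpus. It was small."
sentences = naive_sentences(sample)

# The period after "Dr" is mistaken for a sentence boundary,
# so the two sentences come back as three pieces.
for piece in sentences:
    print(repr(piece))
```

A statistical tokenizer avoids this kind of error by learning which periods usually mark abbreviations rather than sentence ends.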

In our case we need to break the text into a list of words in order to find the hapaxes. But since we are not interested in punctuation or capitalization, we can make tokenization very simple by first normalizing the text to lower case and stripping out every punctuation symbol:

[hapaxes.py: 90-109]
def normalize_tokenize(string):
    """
    Takes a string, normalizes it (makes it lowercase and
    removes punctuation), and then splits it into a list of
    words.

    Note that everything in this function is plain Python
    without using NLTK (although as noted below, NLTK provides
    some more sophisticated tokenizers we could have used).
    """
    # make lowercase
    norm = string.lower()

    # remove punctuation
    norm = re.sub(r'(?u)[^\w\s]', '', norm)  # (1)

    # split into words
    tokens = norm.split()

    return tokens
1 Remove punctuation by replacing everything that is not a word (\w) or whitespace (\s) with an empty string. The (?u) flag at the beginning of the regex enables unicode matching for the \w and \s character classes in Python 2 (unicode is the default with Python 3).

Our tokenizer produces output like this:

>>> normalize_tokenize("This is a test sentence of white-space separated words.")
['this', 'is', 'a', 'test', 'sentence', 'of', 'whitespace', 'separated', 'words']
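
As a rough illustration of what the \w and \s character classes keep and discard, the sketch below applies the same pattern to a string with accented letters, a hyphen, and an apostrophe. It assumes Python 3, where Unicode matching is the default, so the (?u) flag is left out; strip_punctuation is a hypothetical name for this demonstration.

```python
import re


def strip_punctuation(text):
    """Lowercase the text and drop everything that is not a word
    character or whitespace (same pattern as normalize_tokenize)."""
    return re.sub(r'[^\w\s]', '', text.lower())


# Escapes are used for the accented letters: \u00ef = i with diaeresis,
# \u00e9 = e with acute accent, \u00e0 = a with grave accent.
sample = "Na\u00efve caf\u00e9-goers don't say d\u00e9j\u00e0 vu!"
tokens = strip_punctuation(sample).split()
print(tokens)
```

The accented letters survive because they count as word characters, while the hyphen and the apostrophe are removed, so "café-goers" and "don't" each collapse into a single unpunctuated token.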

Instead of simply removing punctuation and then splitting words on whitespace, we could have used one of the tokenizers provided by NLTK, specifically the word_tokenize() function. It first splits the text into sentences using a pre-trained English sentence tokenizer (sent_tokenize), and then finds words using regular expressions that follow Penn Treebank tokenization conventions.

# We could have done it this way (requires the 'punkt' data
# package, named 'punkt_tab' in recent NLTK releases):
from nltk.tokenize import word_tokenize
tokens = word_tokenize(norm)

The main advantage of word_tokenize() is that it will turn contractions into separate tokens. But using Python’s standard split() is good enough for our purposes.

Counting word forms

We can use the tokenizer defined above to get a list of words from any string, so now we need a way to count how many times each word occurs. Those that occur only once are our word-form hapaxes.

[hapaxes.py: 111-125]
def word_form_hapaxes(tokens):
    """
    Takes a list of tokens and returns a list of the
    wordform hapaxes (those wordforms that only appear once)

    For wordforms this is simple enough to do in plain
    Python without an NLP package, especially using the Counter
    type from the collections module (part of the Python
    standard library).
    """

    counts = Counter(tokens)  # (1)
    hapaxes = [word for word in counts if counts[word] == 1]  # (2)

    return hapaxes
1 Use the convenient Counter class from Python’s standard library to count the occurrences of each token. Counter is a subclass of the standard dict type; its constructor takes any iterable of hashable items (here, our list of tokens) and builds a dictionary whose keys are the distinct items and whose values are the number of times each item appeared.
2 This list comprehension creates a list from the Counter dictionary containing only the dictionary keys that have a count of 1. These are our hapaxes.
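
If Counter is new to you, this small sketch (using a made-up word list) shows the behavior the function relies on: counts are looked up like dictionary values, and unseen keys report zero instead of raising a KeyError.

```python
from collections import Counter

# A made-up token list, just for illustration.
counts = Counter(["a", "corpus", "a", "link", "corpus", "a"])

print(counts["a"])            # number of times "a" was seen
print(counts["missing"])      # unseen keys count as zero
print(counts.most_common(1))  # the single most frequent item with its count

# The same filter used in word_form_hapaxes(), written with items():
hapaxes = [word for word, n in counts.items() if n == 1]
print(hapaxes)
```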

Stemming and Lemmatization

If we use our two functions to first tokenize and then find the hapaxes in our example text, we get this output:

>>> text = "Cory Linguist, a cautious corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link. Now, if Cory Linguist, a careful corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link, see that YOU, in creating a corpus of courtship correspondence, corrupt not a crucial link."
>>> tokens = normalize_tokenize(text)
>>> word_form_hapaxes(tokens)
['now', 'not', 'that', 'see', 'if', 'corrupt', 'you', 'careful', 'cautious']

Notice that ‘corrupt’ is counted as a hapax even though the text also includes two instances of the word ‘corrupted’. That is expected because ‘corrupt’ and ‘corrupted’ are different word-forms, but if we want to count word roots regardless of their inflections we must process our tokens further. There are two main methods we can try:

  • Stemming uses an algorithm (and/or a lookup table) to remove the suffix of tokens so that words with the same base but different inflections are reduced to the same form. For example: ‘argued’ and ‘arguing’ are both stemmed to ‘argu’.

  • Lemmatization reduces tokens to their lemmas, their canonical dictionary form. For example, ‘argued’ and ‘arguing’ are both lemmatized to ‘argue’.
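
The difference between the two methods can be sketched with toy versions in plain Python. Both helpers below are hypothetical and deliberately crude: toy_stem strips a handful of suffixes (it is not the Porter algorithm), and TOY_LEMMAS is a three-entry table standing in for a real dictionary resource such as WordNet.

```python
def toy_stem(word):
    """Strip one of a few common suffixes, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Hypothetical lookup table; a real lemmatizer consults a full dictionary.
TOY_LEMMAS = {"argued": "argue", "arguing": "argue", "argues": "argue"}


def toy_lemmatize(word):
    """Return the dictionary form if we know it, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)


words = ["argued", "arguing", "argues"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)   # every form is cut down to the same non-word stem
print(lemmas)  # every form maps to the same dictionary word
```

Note that a stem only needs to be consistent across inflections; it does not have to be a real word. A lemma is always a dictionary entry.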

Stemming with NLTK

In 1980 Martin Porter published a stemming algorithm which has become a standard way to stem English words. His algorithm was implemented so many times, and with so many errors, that he later created a programming language called Snowball to help clearly and exactly define stemmers. NLTK includes a Python port of the Snowball implementation of an improved version of Porter’s original stemmer:

[hapaxes.py: 127-143]
def nltk_stem_hapaxes(tokens):
    """
    Takes a list of tokens and returns a list of the word
    stem hapaxes.
    """
    if not nltk:  # (1)
        # Only run if NLTK is loaded
        return None

    # Apply NLTK's Snowball stemmer algorithm to tokens:
    stemmer = SnowballStemmer("english")
    stems = [stemmer.stem(token) for token in tokens]

    # Filter down to hapaxes:
    counts = nltk.FreqDist(stems)  # (2)
    hapaxes = counts.hapaxes()  # (3)
    return hapaxes
1 Here we check if the nltk module was loaded; if it was not (presumably because it is not installed), we return without trying to run the stemmer.
2 NLTK’s FreqDist class subclasses the Counter container type we used above to count word-forms. It adds some methods useful for calculating frequency distributions.
3 The FreqDist class also adds a hapaxes() method, which is implemented exactly like the list comprehension we used to count word-form hapaxes.
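
The relationship between Counter and a class like FreqDist can be sketched without NLTK. HapaxCounter below is a hypothetical stand-in written for this illustration; it is not NLTK's implementation and shows only the hapaxes() idea.

```python
from collections import Counter


class HapaxCounter(Counter):
    """Hypothetical stand-in for nltk.FreqDist: a Counter subclass
    with one extra convenience method."""

    def hapaxes(self):
        # Same filter as the list comprehension in word_form_hapaxes().
        return [item for item in self if self[item] == 1]


counts = HapaxCounter(["corrupt", "corrupt", "care", "now"])
result = counts.hapaxes()
print(sorted(result))
```

Because the subclass inherits everything from Counter, the counting itself is unchanged; only the filtering step moves into a method.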

Running nltk_stem_hapaxes() on our tokenized example text produces this list of stem hapaxes:

>>> nltk_stem_hapaxes(tokens)
['now', 'cautious', 'that', 'not', 'see', 'you', 'care', 'if']

Notice that ‘corrupt’ is no longer counted as a hapax (since it shares a stem with ‘corrupted’), and ‘careful’ has been stemmed to ‘care’.

Lemmatization with NLTK

NLTK provides a lemmatizer (the WordNetLemmatizer class in nltk.stem.wordnet) which tries to find a word’s lemma form with help from the WordNet corpus (which can be downloaded by running nltk.download() from an interactive python prompt — refer to “Installing NLTK Data” for general instructions).

In order to resolve ambiguous cases, lemmatization usually requires tokens to be accompanied by part-of-speech tags. For example, the lemma for ‘rose’ depends on whether it is used as a noun or a verb:

>>> lemmer = WordNetLemmatizer()
>>> lemmer.lemmatize('rose', 'n') # tag as noun
'rose'
>>> lemmer.lemmatize('rose', 'v') # tag as verb
'rise'

Since we are operating on untagged tokens, we’ll first run them through an automated part-of-speech tagger provided by NLTK (it uses a pre-trained perceptron tagger originally by Matthew Honnibal: “A Good Part-of-Speech Tagger in about 200 Lines of Python”). The tagger requires its pre-trained model, the 'averaged_perceptron_tagger_eng' data package downloaded in the installation step above (older NLTK releases name it 'averaged_perceptron_tagger'); it can also be fetched by running nltk.download() from an interactive Python prompt.

[hapaxes.py: 145-166]
def nltk_lemma_hapaxes(tokens):
    """
    Takes a list of tokens and returns a list of the lemma
    hapaxes.
    """
    if not nltk:
        # Only run if NLTK is loaded
        return None

    # Tag tokens with part-of-speech:
    tagged = nltk.pos_tag(tokens)  # (1)

    # Convert our Treebank-style tags to WordNet-style tags.
    tagged = [(word, pt_to_wn(tag))
                     for (word, tag) in tagged]  # (2)

    # Lemmatize:
    lemmer = WordNetLemmatizer()
    lemmas = [lemmer.lemmatize(token, pos)
                     for (token, pos) in tagged]  # (3)

    return nltk_stem_hapaxes(lemmas)  # (4)
1 This turns our list of tokens into a list of 2-tuples: [(token1, tag1), (token2, tag2)…​]
2 We must convert between the tags returned by pos_tag() and the tags expected by the WordNet lemmatizer. This is done by applying the pt_to_wn() function (defined below) to each tag.
3 Pass each token and POS tag to the WordNet lemmatizer.
4 If a lemma is not found for a token, then it is returned from lemmatize() unchanged. To ensure these unhandled words don’t contribute spurious hapaxes, we pass our lemmatized tokens through the word stemmer for good measure (which also filters the list down to only hapaxes).

As noted above, the tags returned by pos_tag() are Penn Treebank style tags while the WordNet lemmatizer uses its own tag set (defined in the nltk.corpus.reader.wordnet module, though that is not very clear from the NLTK documentation). The pt_to_wn() function converts Treebank tags to the tags required for lemmatization:

[hapaxes.py: 168-199]
def pt_to_wn(pos):
    """
    Takes a Penn Treebank tag and converts it to an
    appropriate WordNet equivalent for lemmatization.

    A list of Penn Treebank tags is available at:
    https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    """

    from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ, ADV

    pos = pos.lower()

    if pos.startswith('jj'):
        tag = ADJ
    elif pos == 'md':
        # Modal auxiliary verbs
        tag = VERB
    elif pos.startswith('rb'):
        tag = ADV
    elif pos.startswith('vb'):
        tag = VERB
    elif pos == 'wrb':
        # Wh-adverb (how, however, whence, whenever...)
        tag = ADV
    else:
        # default to NOUN
        # This is not strictly correct, but it is good
        # enough for lemmatization.
        tag = NOUN

    return tag
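
To try the tag conversion without NLTK installed, the mapping can be sketched as a small prefix table. This assumes that the WordNet constants are the single-character strings 'n', 'v', 'a' and 'r', which is how NLTK defines NOUN, VERB, ADJ and ADV; treebank_to_wordnet and PREFIX_MAP are hypothetical names, and the sketch matches every tag by prefix where pt_to_wn() uses exact comparison for 'md' and 'wrb'.

```python
# Assumed values of NOUN, VERB, ADJ and ADV in nltk.corpus.reader.wordnet.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"

# (Treebank tag prefix, WordNet tag) pairs mirroring the branches of pt_to_wn().
PREFIX_MAP = [
    ("JJ", ADJ),   # adjectives: JJ, JJR, JJS
    ("MD", VERB),  # modal auxiliary verbs
    ("RB", ADV),   # adverbs: RB, RBR, RBS
    ("VB", VERB),  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    ("WRB", ADV),  # wh-adverbs
]


def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet tag, defaulting to NOUN."""
    tag = tag.upper()
    for prefix, wn_tag in PREFIX_MAP:
        if tag.startswith(prefix):
            return wn_tag
    return NOUN


# Hand-written (word, tag) pairs in the shape that pos_tag() returns.
tagged = [("corpus", "NN"), ("corrupted", "VBD"), ("crucial", "JJ"), ("not", "RB")]
converted = [(word, treebank_to_wordnet(tag)) for word, tag in tagged]
print(converted)
```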

Make it a script

You can play with the functions we’ve defined above by typing (copy-and-pasting) them into an interactive Python session. If we save them all to a file, then that file is a Python module which we could import and use in a Python script. To use a single file as both a module and a script, our file can include a construct like this:

if __name__ == "__main__":
    # our script logic here
    pass

This works because when the Python interpreter executes a script (as opposed to importing a module), it sets the top-level variable __name__ equal to the string "__main__" (see also: What does if __name__ == “__main__”: do?).
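
A common variation, sketched here with a hypothetical main() function that is not part of hapaxes.py, is to keep the script logic in a function and call it from the guarded block. The logic can then be exercised from an importing module or a test without running the script.

```python
import sys


def main(argv):
    """Hypothetical entry point: report which text source would be used."""
    if len(argv) > 1:
        # Only the first command-line argument is used; extras are ignored.
        return argv[1]
    return "<built-in sample text>"


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when imported.
    print("Reading from:", main(sys.argv))
```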

In our case, our script logic consists of reading any input files if given, running all of our hapax functions, then collecting and displaying the output. To see how it is done, scroll down to the full program listing below.
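
The count table is laid out with Python's format specification mini-language. As a small sketch of how that works, the row format from the script right-aligns the label in 14 columns and centers the count in 8:

```python
# Same row format string as in hapaxes.py.
row_fmt = "{:>14}{:^8}"

header = row_fmt.format("", "Count")    # empty label column, centered heading
line = row_fmt.format("Wordforms", 9)   # right-aligned label, centered count

print(header)
print(line)
print(len(header), len(line))  # both rows have the same total width
```

Because every row is padded to the same width, the labels line up on their right edge regardless of their length.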

Running it

To run the script, first download and save hapaxes.py. Then:

$ python hapaxes.py

Depending on which NLP packages you have installed, you should see output like:

               Count
     Wordforms   9
    NLTK-stems   8
   NLTK-lemmas   8

-- Hapaxes --
Wordforms:    careful, cautious, corrupt, if, not, now, see, that, you
NLTK-stems:   care, cautious, if, not, now, see, that, you
NLTK-lemmas:  care, cautious, if, not, now, see, that, you

Try also running the script on an arbitrary input file:

$ python hapaxes.py somefilename

# run it on itself and note that
# source code doesn't give great results:
$ python hapaxes.py hapaxes.py

hapaxes.py listing

The entire script is listed below and available at hapaxes.py.

hapaxes.py
  1"""
  2A sample script/module which demonstrates how to count hapaxes (tokens which
  3appear only once) in an untagged text corpus using plain python and NLTK.
  4It counts and lists hapaxes in three different ways:
  5
  6    * Wordforms - counts unique spellings (normalized for case). This uses
  7    plain Python (no NLTK required)
  8
  9    * NLTK stems - counts unique stems using a stemmer provided by NLTK
 10
 11    * NLTK lemmas - counts unique lemma forms using NLTK's part of speech
 12    tagger and interface to the WordNet lemmatizer.
 13
 14The nltk module is optional. If it is not installed, only the plain python code
 15will be run.
 16
 17Usage:
 18
 19    python hapaxes.py [file]
 20
 21If 'file' is given, its contents are read and used as the text in which to
 22find hapaxes. If 'file' is omitted, then a test text will be used.
 23
 24Example:
 25
 26Running this script with no arguments:
 27
 28    python hapaxes.py
 29
 30Will process this text:
 31
 32    Cory Linguist, a cautious corpus linguist, in creating a corpus of
 33    courtship correspondence, corrupted a crucial link. Now, if Cory Linguist,
 34    a careful corpus linguist, in creating a corpus of courtship
 35    correspondence, corrupted a crucial link, see that YOU, in creating a
 36    corpus of courtship correspondence, corrupt not a crucial link.
 37
 38And produce this output:
 39
 40                Count
 41         Wordforms   9
 42        NLTK-stems   8
 43       NLTK-lemmas   8
 44
 45    -- Hapaxes --
 46    Wordforms:    careful, cautious, corrupt, if, not, now, see, that, you
 47    NLTK-stems:   care, cautious, if, not, now, see, that, you
 48    NLTK-lemmas:  care, cautious, if, not, now, see, that, you
 49
 50
 51Notice that the stems and lemmas methods do not count "corrupt" as a hapax
 52because it also occurs as "corrupted". Notice also that "Linguist" is not
 53counted as the text is normalized for case.
 54
 55See also the Wikipedia entry on "Hapax legomenon"
 56(https://en.wikipedia.org/wiki/Hapax_legomenon)
 57"""
 58
 59### Imports
 60#
 61# Import some Python 3 features to use in Python 2
 62from __future__ import print_function
 63from __future__ import unicode_literals
 64
 65# gives us access to command-line arguments
 66import sys
 67
 68# The Counter collection is a convenient layer on top of
 69# python's standard dictionary type for counting iterables.
 70from collections import Counter
 71
 72# The standard python regular expression module:
 73import re
 74
 75try:
 76    # Import NLTK if it is installed
 77    import nltk
 78
 79    # This imports NLTK's implementation of the Snowball
 80    # stemmer algorithm
 81    from nltk.stem.snowball import SnowballStemmer
 82
 83    # NLTK's interface to the WordNet lemmatizer
 84    from nltk.stem.wordnet import WordNetLemmatizer
 85except ImportError:
 86    nltk = None
 87    print("NLTK is not installed, so we won't use it.")
 88
 89
 90def normalize_tokenize(string):
 91    """
 92    Takes a string, normalizes it (makes it lowercase and
 93    removes punctuation), and then splits it into a list of
 94    words.
 95
 96    Note that everything in this function is plain Python
 97    without using NLTK (although as noted below, NLTK provides
 98    some more sophisticated tokenizers we could have used).
 99    """
100    # make lowercase
101    norm = string.lower()
102
103    # remove punctuation
104    norm = re.sub(r'(?u)[^\w\s]', '', norm) # <1>
105
106    # split into words
107    tokens = norm.split()
108
109    return tokens
110
111def word_form_hapaxes(tokens):
112    """
113    Takes a list of tokens and returns a list of the
114    wordform hapaxes (those wordforms that only appear once)
115
116    For wordforms this is simple enough to do in plain
117    Python without an NLP package, especially using the Counter
118    type from the collections module (part of the Python
119    standard library).
120    """
121
122    counts = Counter(tokens) # <1>
123    hapaxes = [word for word in counts if counts[word] == 1] # <2>
124
125    return hapaxes
126
127def nltk_stem_hapaxes(tokens):
128    """
129    Takes a list of tokens and returns a list of the word
130    stem hapaxes.
131    """
132    if not nltk: # <1>
133        # Only run if NLTK is loaded
134        return None
135
136    # Apply NLTK's Snowball stemmer algorithm to tokens:
137    stemmer = SnowballStemmer("english")
138    stems = [stemmer.stem(token) for token in tokens]
139
140    # Filter down to hapaxes:
141    counts = nltk.FreqDist(stems) # <2>
142    hapaxes = counts.hapaxes() # <3>
143    return hapaxes
144
145def nltk_lemma_hapaxes(tokens):
146    """
147    Takes a list of tokens and returns a list of the lemma
148    hapaxes.
149    """
150    if not nltk:
151        # Only run if NLTK is loaded
152        return None
153
154    # Tag tokens with part-of-speech:
155    tagged = nltk.pos_tag(tokens) # <1>
156
157    # Convert our Treebank-style tags to WordNet-style tags.
158    tagged = [(word, pt_to_wn(tag))
159                     for (word, tag) in tagged] # <2>
160
161    # Lemmatize:
162    lemmer = WordNetLemmatizer()
163    lemmas = [lemmer.lemmatize(token, pos)
164                     for (token, pos) in tagged] # <3>
165
166    return nltk_stem_hapaxes(lemmas) # <4>
167
168def pt_to_wn(pos):
169    """
170    Takes a Penn Treebank tag and converts it to an
171    appropriate WordNet equivalent for lemmatization.
172
173    A list of Penn Treebank tags is available at:
174    https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
175    """
176
177    from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ, ADV
178
179    pos = pos.lower()
180
181    if pos.startswith('jj'):
182        tag = ADJ
183    elif pos == 'md':
184        # Modal auxiliary verbs
185        tag = VERB
186    elif pos.startswith('rb'):
187        tag = ADV
188    elif pos.startswith('vb'):
189        tag = VERB
190    elif pos == 'wrb':
191        # Wh-adverb (how, however, whence, whenever...)
192        tag = ADV
193    else:
194        # default to NOUN
195        # This is not strictly correct, but it is good
196        # enough for lemmatization.
197        tag = NOUN
198
199    return tag
200
201if __name__ == "__main__":
202    """
203    The code in this block is run when this file is executed as a script (but
204    not if it is imported as a module by another Python script).
205    """
206
207    # If no file is provided, then use this sample text:
208    text = """Cory Linguist, a cautious corpus linguist, in creating a
209    corpus of courtship correspondence, corrupted a crucial link. Now, if Cory
210    Linguist, a careful corpus linguist, in creating a corpus of courtship
211    correspondence, corrupted a crucial link, see that YOU, in creating a
212    corpus of courtship correspondence, corrupt not a crucial link."""
213
214    if len(sys.argv) > 1:
215        # We got at least one command-line argument. We'll ignore all but the
216        # first.
217        with open(sys.argv[1], 'r') as file:
218            text = file.read()
219            try:
220                # in Python 2 we need a unicode string
221                text = unicode(text)
222            except NameError:
223                # in Python 3 'unicode()' is not defined
224                # we don't have to do anything
225                pass
226
227    # tokenize the text (break into words)
228    tokens = normalize_tokenize(text)
229
230    # Get hapaxes based on wordforms, stems, and lemmas:
231    wfs = word_form_hapaxes(tokens)
232    stems = nltk_stem_hapaxes(tokens)
233    lemmas = nltk_lemma_hapaxes(tokens)
234
235    # Print count table and list of hapaxes:
236    row_labels = ["Wordforms"]
237    row_data = [wfs]
238
239    # only add NLTK data if it is installed
240    if nltk:
241        row_labels.extend(["NLTK-stems", "NLTK-lemmas"])
242        row_data.extend([stems, lemmas])
243
244    # sort hapaxes for display
245    row_data = [sorted(row) for row in row_data]
246
247    # format and print output
248    rows = zip(row_labels, row_data)
249    row_fmt = "{:>14}{:^8}"
250    print("\n")
251    print(row_fmt.format("", "Count"))
252    hapax_list = []
253    for row in rows:
254        print(row_fmt.format(row[0], len(row[1])))
255        hapax_list += ["{:<14}{:<68}".format(row[0] + ":", ", ".join(row[1]))]
256
257    print("\n-- Hapaxes --")
258    for row in hapax_list:
259        print(row)
260    print("\n")
261
