<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://catswhisker.xyz/">
  <id>http://catswhisker.xyz/</id>
  <title>Atom Feed for 'nlp' Articles</title>
  <updated>2022-10-23T17:44:18Z</updated>
  <link rel="alternate" href="http://catswhisker.xyz/" type="text/html"/>
  <link rel="self" href="http://catswhisker.xyz/tags/nlp/atom.xml" type="application/atom+xml"/>
  <author>
    <name>A. Cynic</name>
    <uri>http://catswhisker.xyz/about/</uri>
  </author>
  <entry>
    <id>tag:catswhisker.xyz,2017-09-07:/log/2017/9/7/a_first_excercise_in_natural_language_processing_with_python_counting_hapaxes/</id>
    <title type="html">A First Exercise in Natural Language Processing with Python: Counting Hapaxes</title>
    <published>2017-09-07T18:32:37Z</published>
    <updated>2022-10-23T17:44:18Z</updated>
    <link rel="alternate" href="http://catswhisker.xyz/log/2017/9/7/a_first_excercise_in_natural_language_processing_with_python_counting_hapaxes/" type="text/html"/>
    <content type="html">&lt;div id="toc" class="toc"&gt;
&lt;div id="toctitle"&gt;Table of Contents&lt;/div&gt;
&lt;ul class="sectlevel1"&gt;
&lt;li&gt;&lt;a href="#_a_first_exercise"&gt;A first exercise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#_natural_language_processing_with_python"&gt;Natural language processing with Python&lt;/a&gt;
&lt;ul class="sectlevel2"&gt;
&lt;li&gt;&lt;a href="#_installation"&gt;Installation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#_optional_dependency_on_python_modules"&gt;Optional dependency on Python modules&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#_tokenization"&gt;Tokenization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#_counting_word_forms"&gt;Counting word forms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#_stemming_and_lemmatization"&gt;Stemming and Lemmatization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#_lemmatization_with_nltk"&gt;Lemmatization with NLTK&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#_make_it_a_script"&gt;Make it a script&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#_hapaxes_py_listing"&gt;hapaxes.py listing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="sect1"&gt;
&lt;h2 id="_a_first_exercise"&gt;A first exercise&lt;/h2&gt;
&lt;div class="sectionbody"&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;Counting &lt;a href="https://en.wikipedia.org/wiki/Hapax_legomenon"&gt;hapaxes&lt;/a&gt; (words which occur only once in a text or corpus) is an easy enough problem that makes use of both simple data structures and some fundamental tasks of natural language processing (NLP): tokenization (dividing a text into words), stemming, and part-of-speech tagging for lemmatization. For that reason it makes a good exercise to get started with NLP in a new language or library.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;As a first exercise in implementing NLP tasks with Python, then, we&amp;#8217;ll write a script which outputs the count and a list of the hapaxes in the following paragraph (our script can also be run on an arbitrary input file). You can follow along, or try it yourself and then compare your solution to mine.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre&gt;Cory Linguist, a cautious corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link. Now, if Cory Linguist, a careful corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link, see that YOU, in creating a corpus of courtship correspondence, corrupt not a crucial link.&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;To keep things simple, ignore punctuation and case. To make things complex, count hapaxes in all three of word form, stemmed form, and lemma form. The final program (&lt;a href="hapaxes.py"&gt;hapaxes.py&lt;/a&gt;) is listed at the end of this post. The sections below walk through it in detail for the beginning NLP/Python programmer.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="sect1"&gt;
&lt;h2 id="_natural_language_processing_with_python"&gt;Natural language processing with Python&lt;/h2&gt;
&lt;div class="sectionbody"&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;There are several NLP packages available to the Python programmer. The best known is the &lt;a href="http://www.nltk.org/"&gt;Natural Language Toolkit (NLTK)&lt;/a&gt;, which is the subject of the popular book &lt;a href="http://www.nltk.org/book/"&gt;&lt;em&gt;Natural Language Processing with Python&lt;/em&gt;&lt;/a&gt; by Bird et al. NLTK focuses on education and research, and has a rather sprawling API. &lt;a href="https://github.com/clips/pattern"&gt;Pattern&lt;/a&gt; is a Python package for data mining the web which includes submodules for language processing and machine learning. &lt;a href="http://polyglot.readthedocs.io/en/latest/"&gt;Polyglot&lt;/a&gt; is an NLP library focusing on &amp;#8220;massive multilingual applications.&amp;#8221; Many of its features support over 100 languages, but it doesn&amp;#8217;t seem to have a stemmer or lemmatizer built in. And there is Matthew Honnibal&amp;#8217;s &lt;a href="https://spacy.io/"&gt;spaCy&lt;/a&gt;, an &amp;#8220;industrial strength&amp;#8221; NLP library focused on performance and integration with machine learning models.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;If you don&amp;#8217;t already know which library you want to use, I recommend starting with NLTK because there are so many online resources available for it. The program presented below actually implements several solutions to counting hapaxes, using both plain Python and the NLTK library:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="ulist"&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Word forms - counts unique spellings (normalized for case). This uses plain Python (no NLP packages required)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;NLTK stems - counts unique stems using a stemmer provided by NLTK&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;NLTK lemmas - counts unique lemma forms using NLTK&amp;#8217;s part of speech tagger and
interface to the WordNet lemmatizer&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="sect2"&gt;
&lt;h3 id="_installation"&gt;Installation&lt;/h3&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;This tutorial assumes you already have Python installed on your system and have some experience using the interpreter. I recommend referring to each package&amp;#8217;s project page for installation instructions, but here is one way using &lt;a href="https://pypi.python.org/pypi/pip"&gt;pip&lt;/a&gt;. As explained below, each of the NLP packages are optional; feel free to install only the ones you&amp;#8217;re interested in playing with.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="sh"&gt;# Install NLTK:
$ pip install nltk

# Download required NLTK data packages
$ python -c 'import nltk; nltk.download(&amp;quot;wordnet&amp;quot;); nltk.download(&amp;quot;averaged_perceptron_tagger_eng&amp;quot;); nltk.download(&amp;quot;omw-1.4&amp;quot;)'&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="sect2"&gt;
&lt;h3 id="_optional_dependency_on_python_modules"&gt;Optional dependency on Python modules&lt;/h3&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;It would be nice if our script didn&amp;#8217;t depend on any particular NLP package, so that it could still run when one or more of them is not installed, using only the functionality provided by whichever packages are available.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;One way to implement a script with optional package dependencies in Python is to try to import a module, and if we get an &lt;code&gt;ImportError&lt;/code&gt; &lt;a href="https://docs.python.org/3/tutorial/errors.html#exceptions"&gt;exception&lt;/a&gt; we mark the package as uninstalled (by setting a variable with the module&amp;#8217;s name to &lt;code&gt;None&lt;/code&gt;) which we can check for later in our code:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock wide"&gt;
&lt;div class="title"&gt; [hapaxes.py: 59-88]&lt;/div&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&lt;span class="comment"&gt;### Imports&lt;/span&gt;
&lt;span class="comment"&gt;#&lt;/span&gt;
&lt;span class="comment"&gt;# Import some Python 3 features to use in Python 2&lt;/span&gt;
&lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;__future__&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;print_function&lt;/span&gt;
&lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;__future__&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;unicode_literals&lt;/span&gt;

&lt;span class="comment"&gt;# gives us access to command-line arguments&lt;/span&gt;
&lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;sys&lt;/span&gt;

&lt;span class="comment"&gt;# The Counter collection is a convenient layer on top of&lt;/span&gt;
&lt;span class="comment"&gt;# python's standard dictionary type for counting iterables.&lt;/span&gt;
&lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;collections&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;Counter&lt;/span&gt;

&lt;span class="comment"&gt;# The standard python regular expression module:&lt;/span&gt;
&lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;re&lt;/span&gt;

&lt;span class="keyword"&gt;try&lt;/span&gt;:
    &lt;span class="comment"&gt;# Import NLTK if it is installed&lt;/span&gt;
    &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;nltk&lt;/span&gt;

    &lt;span class="comment"&gt;# This imports NLTK's implementation of the Snowball&lt;/span&gt;
    &lt;span class="comment"&gt;# stemmer algorithm&lt;/span&gt;
    &lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;nltk.stem.snowball&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;SnowballStemmer&lt;/span&gt;

    &lt;span class="comment"&gt;# NLTK's interface to the WordNet lemmatizer&lt;/span&gt;
    &lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;nltk.stem.wordnet&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;WordNetLemmatizer&lt;/span&gt;
&lt;span class="keyword"&gt;except&lt;/span&gt; &lt;span class="exception"&gt;ImportError&lt;/span&gt;:
    nltk = &lt;span class="predefined-constant"&gt;None&lt;/span&gt;
    print(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;NLTK is not installed, so we won't use it.&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="sect2"&gt;
&lt;h3 id="_tokenization"&gt;Tokenization&lt;/h3&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)"&gt;Tokenization&lt;/a&gt; is the process of splitting a string into lexical &amp;#8216;tokens&amp;#8217;&amp;#8201;&amp;#8212;&amp;#8201;usually words or sentences.  In languages with space-separated words, satisfactory tokenization can often be accomplished with a few simple rules, though ambiguous punctuation can cause errors (such as mistaking a period after an abbreviation as the end of a sentence). Some tokenizers use statistical inference (trained on a corpus with known token boundaries) to recognize tokens.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;In our case we need to break the text into a list of words in order to find the hapaxes. But since we are not interested in punctuation or capitalization, we can make tokenization very simple by first normalizing the text to lower case and stripping out every punctuation symbol:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock wide"&gt;
&lt;div class="title"&gt; [hapaxes.py: 90-109]&lt;/div&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&lt;span class="keyword"&gt;def&lt;/span&gt; &lt;span class="function"&gt;normalize_tokenize&lt;/span&gt;(string):
    &lt;span class="docstring"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    Takes a string, normalizes it (makes it lowercase and&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    removes punctuation), and then splits it into a list of&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    words.&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    Note that everything in this function is plain Python&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    without using NLTK (although as noted below, NLTK provides&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    some more sophisticated tokenizers we could have used).&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    &lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;
    &lt;span class="comment"&gt;# make lowercase&lt;/span&gt;
    norm = string.lower()

    &lt;span class="comment"&gt;# remove punctuation&lt;/span&gt;
    norm = re.sub(&lt;span class="string"&gt;&lt;span class="modifier"&gt;r&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;(?u)[^&lt;/span&gt;&lt;span class="content"&gt;\w&lt;/span&gt;&lt;span class="content"&gt;\s&lt;/span&gt;&lt;span class="content"&gt;]&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, norm) &lt;i class="conum" data-value="1"&gt;&lt;/i&gt;&lt;b&gt;(1)&lt;/b&gt;

    &lt;span class="comment"&gt;# split into words&lt;/span&gt;
    tokens = norm.split()

    &lt;span class="keyword"&gt;return&lt;/span&gt; tokens&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="colist arabic"&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i class="conum" data-value="1"&gt;&lt;/i&gt;&lt;b&gt;1&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Remove punctuation by replacing everything that is not a word character (&lt;code&gt;\w&lt;/code&gt;) or whitespace (&lt;code&gt;\s&lt;/code&gt;) with an empty string. The &lt;code&gt;(?u)&lt;/code&gt; flag at the beginning of the regex enables Unicode matching for the &lt;code&gt;\w&lt;/code&gt; and &lt;code&gt;\s&lt;/code&gt; character classes in Python 2 (Unicode is the default in Python 3).&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
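&lt;div class="paragraph"&gt;
&lt;p&gt;The punctuation-stripping substitution can be sanity-checked on its own at the interpreter (a quick illustration, not part of the final script):&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;import re

# Everything that is neither a word character (\w) nor
# whitespace (\s) is replaced with the empty string:
print(re.sub(r'(?u)[^\w\s]', '', 'A white-space, separated (test).'))
# A whitespace separated test&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;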
&lt;div class="paragraph"&gt;
&lt;p&gt;Our tokenizer produces output like this:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&amp;gt;&amp;gt;&amp;gt; normalize_tokenize(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;This is a test sentence of white-space separated words.&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;)
[&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;this&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;is&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;a&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;test&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;sentence&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;of&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;whitespace&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;separated&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;words&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;Instead of simply removing punctuation and then splitting words on whitespace, we could have used one of &lt;a href="http://www.nltk.org/api/nltk.tokenize.html"&gt;the tokenizers provided by NLTK&lt;/a&gt;. In particular, the &lt;code&gt;word_tokenize()&lt;/code&gt; function first splits the text into sentences using a pre-trained English sentence tokenizer (&lt;code&gt;sent_tokenize&lt;/code&gt;), and then finds words within each sentence using regular expressions in the style of the Penn Treebank tokenizer.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&lt;span class="comment"&gt;# We could have done it this way (requires the&lt;/span&gt;
&lt;span class="comment"&gt;# 'punkt' data package):&lt;/span&gt;
&lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;nltk.tokenize&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;word_tokenize&lt;/span&gt;
tokens = word_tokenize(norm)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;The main advantage of &lt;code&gt;word_tokenize()&lt;/code&gt; is that it will turn contractions into separate tokens. But using Python&amp;#8217;s standard &lt;code&gt;split()&lt;/code&gt; is good enough for our purposes.&lt;/p&gt;
&lt;/div&gt;
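&lt;div class="paragraph"&gt;
&lt;p&gt;Contractions show the difference concretely: our simple tokenizer just deletes the apostrophe. (This is an illustration only; the &lt;code&gt;word_tokenize&lt;/code&gt; result shown in the final comment is indicative and assumes NLTK and its &lt;code&gt;punkt&lt;/code&gt; data are installed.)&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;import re

def normalize_tokenize(string):
    # The same normalization as above: lowercase, strip
    # punctuation, split on whitespace.
    norm = re.sub(r'(?u)[^\w\s]', '', string.lower())
    return norm.split()

print(normalize_tokenize(&amp;quot;Don't panic!&amp;quot;))
# ['dont', 'panic']

# With NLTK available, word_tokenize would instead yield
# something like: ['Do', &amp;quot;n't&amp;quot;, 'panic', '!']&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;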
&lt;/div&gt;
&lt;div class="sect2"&gt;
&lt;h3 id="_counting_word_forms"&gt;Counting word forms&lt;/h3&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;We can use the tokenizer defined above to get a list of words from any string, so now we need a way to count how many times each word occurs. Those that occur only once are our word-form hapaxes.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock wide"&gt;
&lt;div class="title"&gt; [hapaxes.py: 111-125]&lt;/div&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&lt;span class="keyword"&gt;def&lt;/span&gt; &lt;span class="function"&gt;word_form_hapaxes&lt;/span&gt;(tokens):
    &lt;span class="docstring"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    Takes a list of tokens and returns a list of the&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    wordform hapaxes (those wordforms that only appear once)&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    For wordforms this is simple enough to do in plain&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    Python without an NLP package, especially using the Counter&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    type from the collections module (part of the Python&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    standard library).&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    &lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;

    counts = Counter(tokens) &lt;i class="conum" data-value="1"&gt;&lt;/i&gt;&lt;b&gt;(1)&lt;/b&gt;
    hapaxes = [word &lt;span class="keyword"&gt;for&lt;/span&gt; word &lt;span class="keyword"&gt;in&lt;/span&gt; counts &lt;span class="keyword"&gt;if&lt;/span&gt; counts[word] == &lt;span class="integer"&gt;1&lt;/span&gt;] &lt;i class="conum" data-value="2"&gt;&lt;/i&gt;&lt;b&gt;(2)&lt;/b&gt;

    &lt;span class="keyword"&gt;return&lt;/span&gt; hapaxes&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="colist arabic"&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i class="conum" data-value="1"&gt;&lt;/i&gt;&lt;b&gt;1&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Use the convenient &lt;code&gt;&lt;a href="https://docs.python.org/3/library/collections.html#collections.Counter"&gt;Counter&lt;/a&gt;&lt;/code&gt; class from Python&amp;#8217;s standard library to count the occurrences of each token. &lt;code&gt;Counter&lt;/code&gt; is a subclass of the standard &lt;code&gt;dict&lt;/code&gt; type; its constructor takes a list of items from which it builds a dictionary whose keys are elements from the list and whose values are the number of times each element appeared in the list.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i class="conum" data-value="2"&gt;&lt;/i&gt;&lt;b&gt;2&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;This &lt;a href="https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions"&gt;list comprehension&lt;/a&gt; creates a list from the Counter dictionary containing only the dictionary keys that have a count of 1. These are our hapaxes.&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
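&lt;div class="paragraph"&gt;
&lt;p&gt;For example, here is the same &lt;code&gt;Counter&lt;/code&gt;-plus-comprehension pattern applied to a tiny token list (an illustration, not part of the final script):&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;from collections import Counter

tokens = ['to', 'be', 'or', 'not', 'to', 'be']
counts = Counter(tokens)
print(counts)
# Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})

# Keep only the tokens that occurred exactly once:
print([word for word in counts if counts[word] == 1])
# ['or', 'not']&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;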
&lt;/div&gt;
&lt;div class="sect2"&gt;
&lt;h3 id="_stemming_and_lemmatization"&gt;Stemming and Lemmatization&lt;/h3&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;If we use our two functions to first tokenize and then find the hapaxes in our example text, we get this output:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&amp;gt;&amp;gt;&amp;gt; text = &lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;Cory Linguist, a cautious corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link. Now, if Cory Linguist, a careful corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link, see that YOU, in creating a corpus of courtship correspondence, corrupt not a crucial link.&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;
&amp;gt;&amp;gt;&amp;gt; tokens = normalize_tokenize(text)
&amp;gt;&amp;gt;&amp;gt; word_form_hapaxes(tokens)
[&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;now&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;not&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;that&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;see&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;if&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;corrupt&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;you&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;careful&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;cautious&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;Notice that &amp;#8216;corrupt&amp;#8217; is counted as a hapax even though the text also includes two instances of the word &amp;#8216;corrupted&amp;#8217;. That is expected because &amp;#8216;corrupt&amp;#8217; and &amp;#8216;corrupted&amp;#8217; are different word-forms, but if we want to count word roots regardless of their inflections we must process our tokens further. There are two main methods we can try:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="ulist"&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Stemming"&gt;Stemming&lt;/a&gt; uses an algorithm (and/or a lookup table) to remove the suffix of tokens so that words with the same base but different inflections are reduced to the same form. For example: &amp;#8216;argued&amp;#8217; and &amp;#8216;arguing&amp;#8217; are both stemmed to &amp;#8216;argu&amp;#8217;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Lemmatisation"&gt;Lemmatization&lt;/a&gt; reduces tokens to their lemmas, their canonical dictionary form. For example, &amp;#8216;argued&amp;#8217; and &amp;#8216;arguing&amp;#8217; are both lemmatized to &amp;#8216;argue&amp;#8217;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
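&lt;div class="paragraph"&gt;
&lt;p&gt;To get a feel for what stemming does, here is a deliberately naive suffix-stripper in plain Python (a toy sketch only; the real Porter/Snowball algorithm used below applies ordered rules with conditions on what remains after a suffix is removed):&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;def toy_stem(word):
    # Naively chop off a few common English suffixes, keeping
    # at least three characters of the word.
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) &amp;gt;= 3:
            return word[:-len(suffix)]
    return word

print(toy_stem('argued'), toy_stem('arguing'), toy_stem('corrupted'))
# argu argu corrupt&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;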
&lt;div class="sect3"&gt;
&lt;h4 id="_stemming_with_nltk"&gt;Stemming with NLTK&lt;/h4&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;In 1980 Martin Porter published &lt;a href="https://tartarus.org/martin/PorterStemmer/index.html"&gt;a stemming algorithm&lt;/a&gt; which has become a standard way to stem English words. His algorithm was implemented so many times, and with so many errors, that he later created &lt;a href="https://snowballstem.org/"&gt;a programming language called Snowball&lt;/a&gt; to help clearly and exactly define stemmers. NLTK includes a Python port of the Snowball implementation of an improved version of Porter&amp;#8217;s original stemmer:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock wide"&gt;
&lt;div class="title"&gt; [hapaxes.py: 127-143]&lt;/div&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&lt;span class="keyword"&gt;def&lt;/span&gt; &lt;span class="function"&gt;nltk_stem_hapaxes&lt;/span&gt;(tokens):
    &lt;span class="docstring"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    Takes a list of tokens and returns a list of the word&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    stem hapaxes.&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    &lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;
    &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="keyword"&gt;not&lt;/span&gt; nltk: &lt;i class="conum" data-value="1"&gt;&lt;/i&gt;&lt;b&gt;(1)&lt;/b&gt;
        &lt;span class="comment"&gt;# Only run if NLTK is loaded&lt;/span&gt;
        &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="predefined-constant"&gt;None&lt;/span&gt;

    &lt;span class="comment"&gt;# Apply NLTK's Snowball stemmer algorithm to tokens:&lt;/span&gt;
    stemmer = SnowballStemmer(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;english&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;)
    stems = [stemmer.stem(token) &lt;span class="keyword"&gt;for&lt;/span&gt; token &lt;span class="keyword"&gt;in&lt;/span&gt; tokens]

    &lt;span class="comment"&gt;# Filter down to hapaxes:&lt;/span&gt;
    counts = nltk.FreqDist(stems) &lt;i class="conum" data-value="2"&gt;&lt;/i&gt;&lt;b&gt;(2)&lt;/b&gt;
    hapaxes = counts.hapaxes() &lt;i class="conum" data-value="3"&gt;&lt;/i&gt;&lt;b&gt;(3)&lt;/b&gt;
    &lt;span class="keyword"&gt;return&lt;/span&gt; hapaxes&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="colist arabic"&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i class="conum" data-value="1"&gt;&lt;/i&gt;&lt;b&gt;1&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Here we check if the &lt;code&gt;nltk&lt;/code&gt; module was loaded; if it was not (presumably because it is not installed), we return without trying to run the stemmer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i class="conum" data-value="2"&gt;&lt;/i&gt;&lt;b&gt;2&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;NLTK&amp;#8217;s &lt;code&gt;&lt;a href="http://www.nltk.org/_modules/nltk/probability.html"&gt;FreqDist&lt;/a&gt;&lt;/code&gt; class subclasses the &lt;code&gt;Counter&lt;/code&gt; container type we used above to count word-forms. It adds some methods useful for calculating frequency distributions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i class="conum" data-value="3"&gt;&lt;/i&gt;&lt;b&gt;3&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;The &lt;code&gt;FreqDist&lt;/code&gt; class also adds a &lt;code&gt;hapaxes()&lt;/code&gt; method, which is implemented exactly like the list comprehension we used to count word-form hapaxes.&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;Running &lt;code&gt;nltk_stem_hapaxes()&lt;/code&gt; on our tokenized example text produces this list of stem hapaxes:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&amp;gt;&amp;gt;&amp;gt; nltk_stem_hapaxes(tokens)
[&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;now&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;cautious&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;that&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;not&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;see&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;you&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;care&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;if&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;Notice that &amp;#8216;corrupt&amp;#8217; is no longer counted as a hapax (since it shares a stem with &amp;#8216;corrupted&amp;#8217;), and &amp;#8216;careful&amp;#8217; has been stemmed to &amp;#8216;care&amp;#8217;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="sect2"&gt;
&lt;h3 id="_lemmatization_with_nltk"&gt;Lemmatization with NLTK&lt;/h3&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;NLTK provides a lemmatizer (the &lt;code&gt;WordNetLemmatizer&lt;/code&gt; class in &lt;a href="http://www.nltk.org/_modules/nltk/stem/wordnet.html"&gt;nltk.stem.wordnet&lt;/a&gt;) which tries to find a word&amp;#8217;s lemma with help from the &lt;a href="https://wordnet.princeton.edu/"&gt;WordNet&lt;/a&gt; corpus (which can be downloaded by running &lt;code&gt;nltk.download()&lt;/code&gt; from an interactive Python prompt&amp;#8201;&amp;#8212;&amp;#8201;refer to &lt;a href="http://www.nltk.org/data.html"&gt;&amp;#8220;Installing NLTK Data&amp;#8221;&lt;/a&gt; for general instructions).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;In order to resolve ambiguous cases, lemmatization usually requires tokens to be accompanied by part-of-speech tags. For example, the lemma for &lt;em&gt;rose&lt;/em&gt; depends on whether it is used as a noun or a verb:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&amp;gt;&amp;gt;&amp;gt; lemmer = WordNetLemmatizer()
&amp;gt;&amp;gt;&amp;gt; lemmer.lemmatize(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;rose&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;n&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;) &lt;span class="comment"&gt;# tag as noun&lt;/span&gt;
&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;rose&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;
&amp;gt;&amp;gt;&amp;gt; lemmer.lemmatize(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;rose&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;v&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;) &lt;span class="comment"&gt;# tag as verb&lt;/span&gt;
&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;rise&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;Since we are operating on untagged tokens, we&amp;#8217;ll first run them through an automated part-of-speech tagger provided by NLTK (it uses a pre-trained perceptron tagger originally by Matthew Honnibal: &lt;a href="https://explosion.ai/blog/part-of-speech-pos-tagger-in-python"&gt;&amp;#8220;A Good Part-of-Speech Tagger in about 200 Lines of Python&amp;#8221;&lt;/a&gt;). The tagger requires the training data in the 'averaged_perceptron_tagger.pickle' file, which can be downloaded by running &lt;code&gt;nltk.download()&lt;/code&gt; from an interactive Python prompt.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock wide"&gt;
&lt;div class="title"&gt; [hapaxes.py: 145-166]&lt;/div&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&lt;span class="keyword"&gt;def&lt;/span&gt; &lt;span class="function"&gt;nltk_lemma_hapaxes&lt;/span&gt;(tokens):
    &lt;span class="docstring"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    Takes a list of tokens and returns a list of the lemma&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    hapaxes.&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    &lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;
    &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="keyword"&gt;not&lt;/span&gt; nltk:
        &lt;span class="comment"&gt;# Only run if NLTK is loaded&lt;/span&gt;
        &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="predefined-constant"&gt;None&lt;/span&gt;

    &lt;span class="comment"&gt;# Tag tokens with part-of-speech:&lt;/span&gt;
    tagged = nltk.pos_tag(tokens) &lt;i class="conum" data-value="1"&gt;&lt;/i&gt;&lt;b&gt;(1)&lt;/b&gt;

    &lt;span class="comment"&gt;# Convert our Treebank-style tags to WordNet-style tags.&lt;/span&gt;
    tagged = [(word, pt_to_wn(tag))
                     &lt;span class="keyword"&gt;for&lt;/span&gt; (word, tag) &lt;span class="keyword"&gt;in&lt;/span&gt; tagged] &lt;i class="conum" data-value="2"&gt;&lt;/i&gt;&lt;b&gt;(2)&lt;/b&gt;

    &lt;span class="comment"&gt;# Lemmatize:&lt;/span&gt;
    lemmer = WordNetLemmatizer()
    lemmas = [lemmer.lemmatize(token, pos)
                     &lt;span class="keyword"&gt;for&lt;/span&gt; (token, pos) &lt;span class="keyword"&gt;in&lt;/span&gt; tagged] &lt;i class="conum" data-value="3"&gt;&lt;/i&gt;&lt;b&gt;(3)&lt;/b&gt;

    &lt;span class="keyword"&gt;return&lt;/span&gt; nltk_stem_hapaxes(lemmas) &lt;i class="conum" data-value="4"&gt;&lt;/i&gt;&lt;b&gt;(4)&lt;/b&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="colist arabic"&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i class="conum" data-value="1"&gt;&lt;/i&gt;&lt;b&gt;1&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;This turns our list of tokens into a list of 2-tuples: &lt;code&gt;[(token1, tag1), (token2, tag2)&amp;#8230;&amp;#8203;]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i class="conum" data-value="2"&gt;&lt;/i&gt;&lt;b&gt;2&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;We must convert between the tags returned by &lt;code&gt;pos_tag()&lt;/code&gt; and the tags expected by the WordNet lemmatizer. This is done by applying the &lt;code&gt;pt_to_wn()&lt;/code&gt; function (defined below) to each tag.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i class="conum" data-value="3"&gt;&lt;/i&gt;&lt;b&gt;3&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Pass each token and POS tag to the WordNet lemmatizer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i class="conum" data-value="4"&gt;&lt;/i&gt;&lt;b&gt;4&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;If a lemma is not found for a token, then it is returned from &lt;code&gt;lemmatize()&lt;/code&gt; unchanged. To ensure these unhandled words don&amp;#8217;t contribute spurious hapaxes, we pass our lemmatized tokens through the word stemmer for good measure (which also filters the list down to only hapaxes).&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;As noted above, the tags returned by &lt;code&gt;pos_tag()&lt;/code&gt; are &lt;a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html"&gt;Penn Treebank style tags&lt;/a&gt; while the WordNet lemmatizer uses its own tag set (defined in the &lt;code&gt;nltk.corpus.reader.wordnet&lt;/code&gt; module, though that is not very clear from the NLTK documentation). The &lt;code&gt;pt_to_wn()&lt;/code&gt; function converts Treebank tags to the tags required for lemmatization:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock wide"&gt;
&lt;div class="title"&gt; [hapaxes.py: 168-199]&lt;/div&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&lt;span class="keyword"&gt;def&lt;/span&gt; &lt;span class="function"&gt;pt_to_wn&lt;/span&gt;(pos):
    &lt;span class="docstring"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    Takes a Penn Treebank tag and converts it to an&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    appropriate WordNet equivalent for lemmatization.&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    A list of Penn Treebank tags is available at:&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html&lt;/span&gt;&lt;span class="content"&gt;
&lt;/span&gt;&lt;span class="content"&gt;    &lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;

    &lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;nltk.corpus.reader.wordnet&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;NOUN&lt;/span&gt;, &lt;span class="include"&gt;VERB&lt;/span&gt;, &lt;span class="include"&gt;ADJ&lt;/span&gt;, &lt;span class="include"&gt;ADV&lt;/span&gt;

    pos = pos.lower()

    &lt;span class="keyword"&gt;if&lt;/span&gt; pos.startswith(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;jj&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;):
        tag = ADJ
    &lt;span class="keyword"&gt;elif&lt;/span&gt; pos == &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;md&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;:
        &lt;span class="comment"&gt;# Modal auxiliary verbs&lt;/span&gt;
        tag = VERB
    &lt;span class="keyword"&gt;elif&lt;/span&gt; pos.startswith(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;rb&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;):
        tag = ADV
    &lt;span class="keyword"&gt;elif&lt;/span&gt; pos.startswith(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;vb&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;):
        tag = VERB
    &lt;span class="keyword"&gt;elif&lt;/span&gt; pos == &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;wrb&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;:
        &lt;span class="comment"&gt;# Wh-adverb (how, however, whence, whenever...)&lt;/span&gt;
        tag = ADV
    &lt;span class="keyword"&gt;else&lt;/span&gt;:
        &lt;span class="comment"&gt;# default to NOUN&lt;/span&gt;
        &lt;span class="comment"&gt;# This is not strictly correct, but it is good&lt;/span&gt;
        &lt;span class="comment"&gt;# enough for lemmatization.&lt;/span&gt;
        tag = NOUN

    &lt;span class="keyword"&gt;return&lt;/span&gt; tag&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
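&lt;div class="paragraph"&gt;
&lt;p&gt;The branching above amounts to a small prefix-lookup table. As a sketch, here is an equivalent standalone version that uses WordNet&amp;#8217;s single-letter tag values directly (&lt;code&gt;NOUN&lt;/code&gt;, &lt;code&gt;VERB&lt;/code&gt;, &lt;code&gt;ADJ&lt;/code&gt; and &lt;code&gt;ADV&lt;/code&gt; are just the strings &lt;code&gt;'n'&lt;/code&gt;, &lt;code&gt;'v'&lt;/code&gt;, &lt;code&gt;'a'&lt;/code&gt; and &lt;code&gt;'r'&lt;/code&gt;), so it runs even without NLTK installed:&lt;/p&gt;
&lt;/div&gt;

```python
# Map Penn Treebank tag prefixes to WordNet POS letters.
# 'n' = noun, 'v' = verb, 'a' = adjective, 'r' = adverb.
# Equivalent to pt_to_wn() above for the Treebank tag set.
_PT_TO_WN = [
    ('jj', 'a'),   # adjectives: JJ, JJR, JJS
    ('md', 'v'),   # modal auxiliary verbs
    ('vb', 'v'),   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    ('wrb', 'r'),  # wh-adverbs: how, however, whenever...
    ('rb', 'r'),   # adverbs: RB, RBR, RBS
]

def pt_to_wn(pos):
    """Convert a Penn Treebank tag to a WordNet POS letter,
    defaulting to noun like the original function."""
    pos = pos.lower()
    for prefix, wn_tag in _PT_TO_WN:
        if pos.startswith(prefix):
            return wn_tag
    return 'n'
```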
&lt;/div&gt;
&lt;div class="sect2"&gt;
&lt;h3 id="_make_it_a_script"&gt;Make it a script&lt;/h3&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;You can play with the functions we&amp;#8217;ve defined above by typing (or copy-and-pasting) them into an interactive Python session. If we save them all to a file, then that file is a Python module that we can &lt;code&gt;import&lt;/code&gt; and use in a Python script. To use a single file as both a module and a script, our file can include a construct like this:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&lt;span class="keyword"&gt;if&lt;/span&gt; __name__ == &lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;__main__&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;:
    &lt;span class="comment"&gt;# our script logic here&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;This works because when the Python interpreter executes a script (as opposed to importing a module), it sets the top-level variable &lt;code&gt;__name__&lt;/code&gt; equal to the string &lt;code&gt;"__main__"&lt;/code&gt; (see also: &lt;a href="https://stackoverflow.com/questions/419163/what-does-if-name-main-do"&gt;What does if __name__ == “__main__”: do?&lt;/a&gt;).&lt;/p&gt;
&lt;/div&gt;
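&lt;div class="paragraph"&gt;
&lt;p&gt;A minimal sketch of the dual-use pattern (&lt;code&gt;greet()&lt;/code&gt; is just a hypothetical stand-in for your module&amp;#8217;s real functions):&lt;/p&gt;
&lt;/div&gt;

```python
# A file that works both as a module and as a script.
def greet(name):
    """Reusable logic, available to anyone who imports this file."""
    return "Hello, %s!" % name

if __name__ == "__main__":
    # Runs only when executed directly (python thisfile.py),
    # not when this file is imported as a module.
    print(greet("world"))
```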
&lt;div class="paragraph"&gt;
&lt;p&gt;In our case, the script logic consists of reading an input file if one is given, running each of our hapax functions, and then collecting and displaying the output. To see how it is done, scroll down to the full program listing below.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="sect3"&gt;
&lt;h4 id="_running_it"&gt;Running it&lt;/h4&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;To run the script, first download and save &lt;a href="hapaxes.py"&gt;hapaxes.py&lt;/a&gt;. Then:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="sh"&gt;$ python hapaxes.py&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;Depending on which NLP packages you have installed, you should see output like:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre&gt;               Count
     Wordforms   9
    NLTK-stems   8
   NLTK-lemmas   8

-- Hapaxes --
Wordforms:    careful, cautious, corrupt, if, not, now, see, that, you
NLTK-stems:   care, cautious, if, not, now, see, that, you
NLTK-lemmas:  care, cautious, if, not, now, see, that, you&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
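&lt;div class="paragraph"&gt;
&lt;p&gt;The Wordforms row of that output can be reproduced in a few lines of plain Python. Here is a simplified sketch, with a throwaway regex tokenizer standing in for the script&amp;#8217;s &lt;code&gt;normalize_tokenize()&lt;/code&gt;:&lt;/p&gt;
&lt;/div&gt;

```python
import re
from collections import Counter

def wordform_hapaxes(text):
    # Normalize for case, split into word tokens, then keep
    # only the tokens that occur exactly once.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return sorted(tok for tok, n in counts.items() if n == 1)

TEST_TEXT = (
    "Cory Linguist, a cautious corpus linguist, in creating a corpus of "
    "courtship correspondence, corrupted a crucial link. Now, if Cory "
    "Linguist, a careful corpus linguist, in creating a corpus of courtship "
    "correspondence, corrupted a crucial link, see that YOU, in creating a "
    "corpus of courtship correspondence, corrupt not a crucial link."
)

print(", ".join(wordform_hapaxes(TEST_TEXT)))
# careful, cautious, corrupt, if, not, now, see, that, you
```

&lt;div class="paragraph"&gt;
&lt;p&gt;This prints the same nine wordform hapaxes listed above.&lt;/p&gt;
&lt;/div&gt;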
&lt;div class="paragraph"&gt;
&lt;p&gt;Try also running the script on an arbitrary input file:&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock"&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="sh"&gt;$ python hapaxes.py somefilename

# run it on itself and note that
# source code doesn't give great results:
$ python hapaxes.py hapaxes.py&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="sect1"&gt;
&lt;h2 id="_hapaxes_py_listing"&gt;hapaxes.py listing&lt;/h2&gt;
&lt;div class="sectionbody"&gt;
&lt;div class="paragraph"&gt;
&lt;p&gt;The entire script is listed below and available at &lt;a href="hapaxes.py"&gt;hapaxes.py&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="listingblock wide"&gt;
&lt;div class="title"&gt;hapaxes.py&lt;/div&gt;
&lt;div class="content"&gt;
&lt;pre class="CodeRay highlight"&gt;&lt;code data-lang="python"&gt;&lt;span class="line-numbers"&gt;  1&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;  2&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;A sample script/module which demonstrates how to count hapaxes (tokens which&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;  3&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;appear only once) in an untagged text corpus using plain python and NLTK.&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;  4&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;It counts and lists hapaxes in five different ways:&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;  5&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;  6&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    * Wordforms - counts unique spellings (normalized for case). This uses&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;  7&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    plain Python (no NLTK required)&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;  8&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;  9&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    * NLTK stems - counts unique stems using a stemmer provided by NLTK&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 10&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 11&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    * NLTK lemmas - counts unique lemma forms using NLTK's part of speech&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 12&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    * tagger and interface to the WordNet lemmatizer.&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 13&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 14&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;The nltk module is optional. If it is not installed, only the plain python code&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 15&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;will be run.&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 16&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 17&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;Usage:&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 18&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 19&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    python hapaxes.py [file]&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 20&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 21&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;If 'file' is given, its contents are read and used as the text in which to&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 22&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;find hapaxes. If 'file' is omitted, then a test text will be used.&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 23&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 24&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;Example:&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 25&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 26&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;Running this script with no arguments:&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 27&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 28&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    python hapaxes.py&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 29&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 30&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;Will process this text:&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 31&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 32&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    Cory Linguist, a cautious corpus linguist, in creating a corpus of&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 33&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    courtship correspondence, corrupted a crucial link. Now, if Cory Linguist,&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 34&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    a careful corpus linguist, in creating a corpus of courtship&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 35&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    correspondence, corrupted a crucial link, see that YOU, in creating a&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 36&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    corpus of courtship correspondence, corrupt not a crucial link.&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 37&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 38&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;And produce this output:&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 39&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 40&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;                Count&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 41&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;         Wordforms   9&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 42&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;             Stems   8&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 43&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;            Lemmas   8&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 44&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 45&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    -- Hapaxes --&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 46&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    Wordforms:    careful, cautious, corrupt, if, not, now, see, that, you&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 47&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    NLTK-stems:   care, cautious, if, not, now, see, that, you&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 48&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    NLTK-lemmas:  care, cautious, if, not, now, see, that, you&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 49&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 50&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 51&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;Notice that the stems and lemmas methods do not count &amp;quot;corrupt&amp;quot; as a hapax&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 52&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;because it also occurs as &amp;quot;corrupted&amp;quot;. Notice also that &amp;quot;Linguist&amp;quot; is not&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 53&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;counted as the text is normalized for case.&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 54&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 55&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;See also the Wikipedia entry on &amp;quot;Hapex legomenon&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 56&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;(https://en.wikipedia.org/wiki/Hapax_legomenon)&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 57&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 58&lt;/span&gt;
&lt;span class="line-numbers"&gt; 59&lt;/span&gt;&lt;span class="comment"&gt;### Imports&lt;/span&gt;
&lt;span class="line-numbers"&gt; 60&lt;/span&gt;&lt;span class="comment"&gt;#&lt;/span&gt;
&lt;span class="line-numbers"&gt; 61&lt;/span&gt;&lt;span class="comment"&gt;# Import some Python 3 features to use in Python 2&lt;/span&gt;
&lt;span class="line-numbers"&gt; 62&lt;/span&gt;&lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;__future__&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;print_function&lt;/span&gt;
&lt;span class="line-numbers"&gt; 63&lt;/span&gt;&lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;__future__&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;unicode_literals&lt;/span&gt;
&lt;span class="line-numbers"&gt; 64&lt;/span&gt;
&lt;span class="line-numbers"&gt; 65&lt;/span&gt;&lt;span class="comment"&gt;# gives us access to command-line arguments&lt;/span&gt;
&lt;span class="line-numbers"&gt; 66&lt;/span&gt;&lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;sys&lt;/span&gt;
&lt;span class="line-numbers"&gt; 67&lt;/span&gt;
&lt;span class="line-numbers"&gt; 68&lt;/span&gt;&lt;span class="comment"&gt;# The Counter collection is a convenient layer on top of&lt;/span&gt;
&lt;span class="line-numbers"&gt; 69&lt;/span&gt;&lt;span class="comment"&gt;# python's standard dictionary type for counting iterables.&lt;/span&gt;
&lt;span class="line-numbers"&gt; 70&lt;/span&gt;&lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;collections&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;Counter&lt;/span&gt;
&lt;span class="line-numbers"&gt; 71&lt;/span&gt;
&lt;span class="line-numbers"&gt; 72&lt;/span&gt;&lt;span class="comment"&gt;# The standard python regular expression module:&lt;/span&gt;
&lt;span class="line-numbers"&gt; 73&lt;/span&gt;&lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;re&lt;/span&gt;
&lt;span class="line-numbers"&gt; 74&lt;/span&gt;
&lt;span class="line-numbers"&gt; 75&lt;/span&gt;&lt;span class="keyword"&gt;try&lt;/span&gt;:
&lt;span class="line-numbers"&gt; 76&lt;/span&gt;    &lt;span class="comment"&gt;# Import NLTK if it is installed&lt;/span&gt;
&lt;span class="line-numbers"&gt; 77&lt;/span&gt;    &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;nltk&lt;/span&gt;
&lt;span class="line-numbers"&gt; 78&lt;/span&gt;
&lt;span class="line-numbers"&gt; 79&lt;/span&gt;    &lt;span class="comment"&gt;# This imports NLTK's implementation of the Snowball&lt;/span&gt;
&lt;span class="line-numbers"&gt; 80&lt;/span&gt;    &lt;span class="comment"&gt;# stemmer algorithm&lt;/span&gt;
&lt;span class="line-numbers"&gt; 81&lt;/span&gt;    &lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;nltk.stem.snowball&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;SnowballStemmer&lt;/span&gt;
&lt;span class="line-numbers"&gt; 82&lt;/span&gt;
&lt;span class="line-numbers"&gt; 83&lt;/span&gt;    &lt;span class="comment"&gt;# NLTK's interface to the WordNet lemmatizer&lt;/span&gt;
&lt;span class="line-numbers"&gt; 84&lt;/span&gt;    &lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;nltk.stem.wordnet&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;WordNetLemmatizer&lt;/span&gt;
&lt;span class="line-numbers"&gt; 85&lt;/span&gt;&lt;span class="keyword"&gt;except&lt;/span&gt; &lt;span class="exception"&gt;ImportError&lt;/span&gt;:
&lt;span class="line-numbers"&gt; 86&lt;/span&gt;    nltk = &lt;span class="predefined-constant"&gt;None&lt;/span&gt;
&lt;span class="line-numbers"&gt; 87&lt;/span&gt;    print(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;NLTK is not installed, so we won't use it.&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;)
&lt;span class="line-numbers"&gt; 88&lt;/span&gt;
&lt;span class="line-numbers"&gt; 89&lt;/span&gt;
&lt;span class="line-numbers"&gt; 90&lt;/span&gt;&lt;span class="keyword"&gt;def&lt;/span&gt; &lt;span class="function"&gt;normalize_tokenize&lt;/span&gt;(string):
&lt;span class="line-numbers"&gt; 91&lt;/span&gt;    &lt;span class="docstring"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 92&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    Takes a string, normalizes it (makes it lowercase and&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 93&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    removes punctuation), and then splits it into a list of&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 94&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    words.&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 95&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 96&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    Note that everything in this function is plain Python&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 97&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    without using NLTK (although as noted below, NLTK provides&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 98&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    some more sophisticated tokenizers we could have used).&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt; 99&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    &lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;100&lt;/span&gt;    &lt;span class="comment"&gt;# make lowercase&lt;/span&gt;
&lt;span class="line-numbers"&gt;101&lt;/span&gt;    norm = string.lower()
&lt;span class="line-numbers"&gt;102&lt;/span&gt;
&lt;span class="line-numbers"&gt;103&lt;/span&gt;    &lt;span class="comment"&gt;# remove punctuation&lt;/span&gt;
&lt;span class="line-numbers"&gt;104&lt;/span&gt;    norm = re.sub(&lt;span class="string"&gt;&lt;span class="modifier"&gt;r&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;(?u)[^&lt;/span&gt;&lt;span class="content"&gt;\w&lt;/span&gt;&lt;span class="content"&gt;\s&lt;/span&gt;&lt;span class="content"&gt;]&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;, norm) &lt;span class="comment"&gt;# &amp;lt;1&amp;gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;105&lt;/span&gt;
&lt;span class="line-numbers"&gt;106&lt;/span&gt;    &lt;span class="comment"&gt;# split into words&lt;/span&gt;
&lt;span class="line-numbers"&gt;107&lt;/span&gt;    tokens = norm.split()
&lt;span class="line-numbers"&gt;108&lt;/span&gt;
&lt;span class="line-numbers"&gt;109&lt;/span&gt;    &lt;span class="keyword"&gt;return&lt;/span&gt; tokens
&lt;span class="line-numbers"&gt;110&lt;/span&gt;
&lt;span class="line-numbers"&gt;111&lt;/span&gt;&lt;span class="keyword"&gt;def&lt;/span&gt; &lt;span class="function"&gt;word_form_hapaxes&lt;/span&gt;(tokens):
&lt;span class="line-numbers"&gt;112&lt;/span&gt;    &lt;span class="docstring"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;113&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    Takes a list of tokens and returns a list of the&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;114&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    wordform hapaxes (those wordforms that only appear once)&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;115&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;116&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    For wordforms this is simple enough to do in plain&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;117&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    Python without an NLP package, especially using the Counter&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;118&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    type from the collections module (part of the Python&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;119&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    standard library).&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;120&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    &lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;121&lt;/span&gt;
&lt;span class="line-numbers"&gt;122&lt;/span&gt;    counts = Counter(tokens) &lt;span class="comment"&gt;# &amp;lt;1&amp;gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;123&lt;/span&gt;    hapaxes = [word &lt;span class="keyword"&gt;for&lt;/span&gt; word &lt;span class="keyword"&gt;in&lt;/span&gt; counts &lt;span class="keyword"&gt;if&lt;/span&gt; counts[word] == &lt;span class="integer"&gt;1&lt;/span&gt;] &lt;span class="comment"&gt;# &amp;lt;2&amp;gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;124&lt;/span&gt;
&lt;span class="line-numbers"&gt;125&lt;/span&gt;    &lt;span class="keyword"&gt;return&lt;/span&gt; hapaxes
&lt;span class="line-numbers"&gt;126&lt;/span&gt;
&lt;span class="line-numbers"&gt;127&lt;/span&gt;&lt;span class="keyword"&gt;def&lt;/span&gt; &lt;span class="function"&gt;nltk_stem_hapaxes&lt;/span&gt;(tokens):
&lt;span class="line-numbers"&gt;128&lt;/span&gt;    &lt;span class="docstring"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;129&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    Takes a list of tokens and returns a list of the word&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;130&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    stem hapaxes.&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;131&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    &lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;132&lt;/span&gt;    &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="keyword"&gt;not&lt;/span&gt; nltk: &lt;span class="comment"&gt;# &amp;lt;1&amp;gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;133&lt;/span&gt;        &lt;span class="comment"&gt;# Only run if NLTK is loaded&lt;/span&gt;
&lt;span class="line-numbers"&gt;134&lt;/span&gt;        &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="predefined-constant"&gt;None&lt;/span&gt;
&lt;span class="line-numbers"&gt;135&lt;/span&gt;
&lt;span class="line-numbers"&gt;136&lt;/span&gt;    &lt;span class="comment"&gt;# Apply NLTK's Snowball stemmer algorithm to tokens:&lt;/span&gt;
&lt;span class="line-numbers"&gt;137&lt;/span&gt;    stemmer = SnowballStemmer(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;english&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;)
&lt;span class="line-numbers"&gt;138&lt;/span&gt;    stems = [stemmer.stem(token) &lt;span class="keyword"&gt;for&lt;/span&gt; token &lt;span class="keyword"&gt;in&lt;/span&gt; tokens]
&lt;span class="line-numbers"&gt;139&lt;/span&gt;
&lt;span class="line-numbers"&gt;140&lt;/span&gt;    &lt;span class="comment"&gt;# Filter down to hapaxes:&lt;/span&gt;
&lt;span class="line-numbers"&gt;141&lt;/span&gt;    counts = nltk.FreqDist(stems) &lt;span class="comment"&gt;# &amp;lt;2&amp;gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;142&lt;/span&gt;    hapaxes = counts.hapaxes() &lt;span class="comment"&gt;# &amp;lt;3&amp;gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;143&lt;/span&gt;    &lt;span class="keyword"&gt;return&lt;/span&gt; hapaxes
&lt;span class="line-numbers"&gt;144&lt;/span&gt;
&lt;span class="line-numbers"&gt;145&lt;/span&gt;&lt;span class="keyword"&gt;def&lt;/span&gt; &lt;span class="function"&gt;nltk_lemma_hapaxes&lt;/span&gt;(tokens):
&lt;span class="line-numbers"&gt;146&lt;/span&gt;    &lt;span class="docstring"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;147&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    Takes a list of tokens and returns a list of the lemma&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;148&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    hapaxes.&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;149&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    &lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;150&lt;/span&gt;    &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="keyword"&gt;not&lt;/span&gt; nltk:
&lt;span class="line-numbers"&gt;151&lt;/span&gt;        &lt;span class="comment"&gt;# Only run if NLTK is loaded&lt;/span&gt;
&lt;span class="line-numbers"&gt;152&lt;/span&gt;        &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="predefined-constant"&gt;None&lt;/span&gt;
&lt;span class="line-numbers"&gt;153&lt;/span&gt;
&lt;span class="line-numbers"&gt;154&lt;/span&gt;    &lt;span class="comment"&gt;# Tag tokens with part-of-speech:&lt;/span&gt;
&lt;span class="line-numbers"&gt;155&lt;/span&gt;    tagged = nltk.pos_tag(tokens) &lt;span class="comment"&gt;# &amp;lt;1&amp;gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;156&lt;/span&gt;
&lt;span class="line-numbers"&gt;157&lt;/span&gt;    &lt;span class="comment"&gt;# Convert our Treebank-style tags to WordNet-style tags.&lt;/span&gt;
&lt;span class="line-numbers"&gt;158&lt;/span&gt;    tagged = [(word, pt_to_wn(tag))
&lt;span class="line-numbers"&gt;159&lt;/span&gt;                     &lt;span class="keyword"&gt;for&lt;/span&gt; (word, tag) &lt;span class="keyword"&gt;in&lt;/span&gt; tagged] &lt;span class="comment"&gt;# &amp;lt;2&amp;gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;160&lt;/span&gt;
&lt;span class="line-numbers"&gt;161&lt;/span&gt;    &lt;span class="comment"&gt;# Lemmatize:&lt;/span&gt;
&lt;span class="line-numbers"&gt;162&lt;/span&gt;    lemmer = WordNetLemmatizer()
&lt;span class="line-numbers"&gt;163&lt;/span&gt;    lemmas = [lemmer.lemmatize(token, pos)
&lt;span class="line-numbers"&gt;164&lt;/span&gt;                     &lt;span class="keyword"&gt;for&lt;/span&gt; (token, pos) &lt;span class="keyword"&gt;in&lt;/span&gt; tagged] &lt;span class="comment"&gt;# &amp;lt;3&amp;gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;165&lt;/span&gt;
&lt;span class="line-numbers"&gt;166&lt;/span&gt;    &lt;span class="keyword"&gt;return&lt;/span&gt; nltk_stem_hapaxes(lemmas) &lt;span class="comment"&gt;# &amp;lt;4&amp;gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;167&lt;/span&gt;
&lt;span class="line-numbers"&gt;168&lt;/span&gt;&lt;span class="keyword"&gt;def&lt;/span&gt; &lt;span class="function"&gt;pt_to_wn&lt;/span&gt;(pos):
&lt;span class="line-numbers"&gt;169&lt;/span&gt;    &lt;span class="docstring"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;170&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    Takes a Penn Treebank tag and converts it to an&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;171&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    appropriate WordNet equivalent for lemmatization.&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;172&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;173&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    A list of Penn Treebank tags is available at:&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;174&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;175&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    &lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;176&lt;/span&gt;
&lt;span class="line-numbers"&gt;177&lt;/span&gt;    &lt;span class="keyword"&gt;from&lt;/span&gt; &lt;span class="include"&gt;nltk.corpus.reader.wordnet&lt;/span&gt; &lt;span class="keyword"&gt;import&lt;/span&gt; &lt;span class="include"&gt;NOUN&lt;/span&gt;, &lt;span class="include"&gt;VERB&lt;/span&gt;, &lt;span class="include"&gt;ADJ&lt;/span&gt;, &lt;span class="include"&gt;ADV&lt;/span&gt;
&lt;span class="line-numbers"&gt;178&lt;/span&gt;
&lt;span class="line-numbers"&gt;179&lt;/span&gt;    pos = pos.lower()
&lt;span class="line-numbers"&gt;180&lt;/span&gt;
&lt;span class="line-numbers"&gt;181&lt;/span&gt;    &lt;span class="keyword"&gt;if&lt;/span&gt; pos.startswith(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;jj&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;):
&lt;span class="line-numbers"&gt;182&lt;/span&gt;        tag = ADJ
&lt;span class="line-numbers"&gt;183&lt;/span&gt;    &lt;span class="keyword"&gt;elif&lt;/span&gt; pos == &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;md&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;:
&lt;span class="line-numbers"&gt;184&lt;/span&gt;        &lt;span class="comment"&gt;# Modal auxiliary verbs&lt;/span&gt;
&lt;span class="line-numbers"&gt;185&lt;/span&gt;        tag = VERB
&lt;span class="line-numbers"&gt;186&lt;/span&gt;    &lt;span class="keyword"&gt;elif&lt;/span&gt; pos.startswith(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;rb&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;):
&lt;span class="line-numbers"&gt;187&lt;/span&gt;        tag = ADV
&lt;span class="line-numbers"&gt;188&lt;/span&gt;    &lt;span class="keyword"&gt;elif&lt;/span&gt; pos.startswith(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;vb&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;):
&lt;span class="line-numbers"&gt;189&lt;/span&gt;        tag = VERB
&lt;span class="line-numbers"&gt;190&lt;/span&gt;    &lt;span class="keyword"&gt;elif&lt;/span&gt; pos == &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;wrb&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;:
&lt;span class="line-numbers"&gt;191&lt;/span&gt;        &lt;span class="comment"&gt;# Wh-adverb (how, however, whence, whenever...)&lt;/span&gt;
&lt;span class="line-numbers"&gt;192&lt;/span&gt;        tag = ADV
&lt;span class="line-numbers"&gt;193&lt;/span&gt;    &lt;span class="keyword"&gt;else&lt;/span&gt;:
&lt;span class="line-numbers"&gt;194&lt;/span&gt;        &lt;span class="comment"&gt;# default to NOUN&lt;/span&gt;
&lt;span class="line-numbers"&gt;195&lt;/span&gt;        &lt;span class="comment"&gt;# This is not strictly correct, but it is good&lt;/span&gt;
&lt;span class="line-numbers"&gt;196&lt;/span&gt;        &lt;span class="comment"&gt;# enough for lemmatization.&lt;/span&gt;
&lt;span class="line-numbers"&gt;197&lt;/span&gt;        tag = NOUN
&lt;span class="line-numbers"&gt;198&lt;/span&gt;
&lt;span class="line-numbers"&gt;199&lt;/span&gt;    &lt;span class="keyword"&gt;return&lt;/span&gt; tag
&lt;span class="line-numbers"&gt;200&lt;/span&gt;
&lt;span class="line-numbers"&gt;201&lt;/span&gt;&lt;span class="keyword"&gt;if&lt;/span&gt; __name__ == &lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;__main__&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;:
&lt;span class="line-numbers"&gt;202&lt;/span&gt;    &lt;span class="docstring"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;203&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    The code in this block is run when this file is executed as a script (but&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;204&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    not if it is imported as a module by another Python script).&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;205&lt;/span&gt;&lt;span class="docstring"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    &lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;206&lt;/span&gt;
&lt;span class="line-numbers"&gt;207&lt;/span&gt;    &lt;span class="comment"&gt;# If no file is provided, then use this sample text:&lt;/span&gt;
&lt;span class="line-numbers"&gt;208&lt;/span&gt;    text = &lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;Cory Linguist, a cautious corpus linguist, in creating a&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;209&lt;/span&gt;&lt;span class="string"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    corpus of courtship correspondence, corrupted a crucial link. Now, if Cory&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;210&lt;/span&gt;&lt;span class="string"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    Linguist, a careful corpus linguist, in creating a corpus of courtship&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;211&lt;/span&gt;&lt;span class="string"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    correspondence, corrupted a crucial link, see that YOU, in creating a&lt;/span&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;212&lt;/span&gt;&lt;span class="string"&gt;&lt;span class="content"&gt;&lt;/span&gt;&lt;span class="content"&gt;    corpus of courtship correspondence, corrupt not a crucial link.&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;213&lt;/span&gt;
&lt;span class="line-numbers"&gt;214&lt;/span&gt;    &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="predefined"&gt;len&lt;/span&gt;(sys.argv) &amp;gt; &lt;span class="integer"&gt;1&lt;/span&gt;:
&lt;span class="line-numbers"&gt;215&lt;/span&gt;        &lt;span class="comment"&gt;# We got at least one command-line argument. We'll ignore all but the&lt;/span&gt;
&lt;span class="line-numbers"&gt;216&lt;/span&gt;        &lt;span class="comment"&gt;# first.&lt;/span&gt;
&lt;span class="line-numbers"&gt;217&lt;/span&gt;        &lt;span class="keyword"&gt;with&lt;/span&gt; &lt;span class="predefined"&gt;open&lt;/span&gt;(sys.argv[&lt;span class="integer"&gt;1&lt;/span&gt;], &lt;span class="string"&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;span class="content"&gt;r&lt;/span&gt;&lt;span class="delimiter"&gt;'&lt;/span&gt;&lt;/span&gt;) &lt;span class="keyword"&gt;as&lt;/span&gt; &lt;span class="predefined"&gt;file&lt;/span&gt;:
&lt;span class="line-numbers"&gt;218&lt;/span&gt;            text = &lt;span class="predefined"&gt;file&lt;/span&gt;.read()
&lt;span class="line-numbers"&gt;219&lt;/span&gt;            &lt;span class="keyword"&gt;try&lt;/span&gt;:
&lt;span class="line-numbers"&gt;220&lt;/span&gt;                &lt;span class="comment"&gt;# in Python 2 we need a unicode string&lt;/span&gt;
&lt;span class="line-numbers"&gt;221&lt;/span&gt;                text = &lt;span class="predefined"&gt;unicode&lt;/span&gt;(text)
&lt;span class="line-numbers"&gt;222&lt;/span&gt;            &lt;span class="keyword"&gt;except&lt;/span&gt;:
&lt;span class="line-numbers"&gt;223&lt;/span&gt;                &lt;span class="comment"&gt;# in Python 3 'unicode()' is not defined&lt;/span&gt;
&lt;span class="line-numbers"&gt;224&lt;/span&gt;                &lt;span class="comment"&gt;# we don't have to do anything&lt;/span&gt;
&lt;span class="line-numbers"&gt;225&lt;/span&gt;                &lt;span class="keyword"&gt;pass&lt;/span&gt;
&lt;span class="line-numbers"&gt;226&lt;/span&gt;
&lt;span class="line-numbers"&gt;227&lt;/span&gt;    &lt;span class="comment"&gt;# tokenize the text (break into words)&lt;/span&gt;
&lt;span class="line-numbers"&gt;228&lt;/span&gt;    tokens = normalize_tokenize(text)
&lt;span class="line-numbers"&gt;229&lt;/span&gt;
&lt;span class="line-numbers"&gt;230&lt;/span&gt;    &lt;span class="comment"&gt;# Get hapaxes based on wordforms, stems, and lemmas:&lt;/span&gt;
&lt;span class="line-numbers"&gt;231&lt;/span&gt;    wfs = word_form_hapaxes(tokens)
&lt;span class="line-numbers"&gt;232&lt;/span&gt;    stems = nltk_stem_hapaxes(tokens)
&lt;span class="line-numbers"&gt;233&lt;/span&gt;    lemmas = nltk_lemma_hapaxes(tokens)
&lt;span class="line-numbers"&gt;234&lt;/span&gt;
&lt;span class="line-numbers"&gt;235&lt;/span&gt;    &lt;span class="comment"&gt;# Print count table and list of hapaxes:&lt;/span&gt;
&lt;span class="line-numbers"&gt;236&lt;/span&gt;    row_labels = [&lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;Wordforms&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;]
&lt;span class="line-numbers"&gt;237&lt;/span&gt;    row_data = [wfs]
&lt;span class="line-numbers"&gt;238&lt;/span&gt;
&lt;span class="line-numbers"&gt;239&lt;/span&gt;    &lt;span class="comment"&gt;# only add NLTK data if it is installed&lt;/span&gt;
&lt;span class="line-numbers"&gt;240&lt;/span&gt;    &lt;span class="keyword"&gt;if&lt;/span&gt; nltk:
&lt;span class="line-numbers"&gt;241&lt;/span&gt;        row_labels.extend([&lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;NLTK-stems&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;NLTK-lemmas&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;])
&lt;span class="line-numbers"&gt;242&lt;/span&gt;        row_data.extend([stems, lemmas])
&lt;span class="line-numbers"&gt;243&lt;/span&gt;
&lt;span class="line-numbers"&gt;244&lt;/span&gt;    &lt;span class="comment"&gt;# sort happaxes for display&lt;/span&gt;
&lt;span class="line-numbers"&gt;245&lt;/span&gt;    row_date = [row.sort() &lt;span class="keyword"&gt;for&lt;/span&gt; row &lt;span class="keyword"&gt;in&lt;/span&gt; row_data]
&lt;span class="line-numbers"&gt;246&lt;/span&gt;
&lt;span class="line-numbers"&gt;247&lt;/span&gt;    &lt;span class="comment"&gt;# format and print output&lt;/span&gt;
&lt;span class="line-numbers"&gt;248&lt;/span&gt;    rows = &lt;span class="predefined"&gt;zip&lt;/span&gt;(row_labels, row_data)
&lt;span class="line-numbers"&gt;249&lt;/span&gt;    row_fmt = &lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;{:&amp;gt;14}{:^8}&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class="line-numbers"&gt;250&lt;/span&gt;    print(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="char"&gt;\n&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;)
&lt;span class="line-numbers"&gt;251&lt;/span&gt;    print(row_fmt.format(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;Count&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;))
&lt;span class="line-numbers"&gt;252&lt;/span&gt;    hapax_list = []
&lt;span class="line-numbers"&gt;253&lt;/span&gt;    &lt;span class="keyword"&gt;for&lt;/span&gt; row &lt;span class="keyword"&gt;in&lt;/span&gt; rows:
&lt;span class="line-numbers"&gt;254&lt;/span&gt;        print(row_fmt.format(row[&lt;span class="integer"&gt;0&lt;/span&gt;], &lt;span class="predefined"&gt;len&lt;/span&gt;(row[&lt;span class="integer"&gt;1&lt;/span&gt;])))
&lt;span class="line-numbers"&gt;255&lt;/span&gt;        hapax_list += [&lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;{:&amp;lt;14}{:&amp;lt;68}&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;.format(row[&lt;span class="integer"&gt;0&lt;/span&gt;] + &lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;:&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;, &lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="content"&gt;, &lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;.join(row[&lt;span class="integer"&gt;1&lt;/span&gt;]))]
&lt;span class="line-numbers"&gt;256&lt;/span&gt;
&lt;span class="line-numbers"&gt;257&lt;/span&gt;    print(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="char"&gt;\n&lt;/span&gt;&lt;span class="content"&gt;-- Hapaxes --&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;)
&lt;span class="line-numbers"&gt;258&lt;/span&gt;    &lt;span class="keyword"&gt;for&lt;/span&gt; row &lt;span class="keyword"&gt;in&lt;/span&gt; hapax_list:
&lt;span class="line-numbers"&gt;259&lt;/span&gt;        print(row)
&lt;span class="line-numbers"&gt;260&lt;/span&gt;    print(&lt;span class="string"&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;span class="char"&gt;\n&lt;/span&gt;&lt;span class="delimiter"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;)
&lt;span class="line-numbers"&gt;261&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;</content>
    <summary type="html">A tutorial on simple NLP tasks with Python which serves as an introduction to the NLTK library.</summary>
  </entry>
</feed>