http://catswhisker.xyz/Atom Feed for 'programming' Articles2022-10-23T17:44:18ZA. Cynichttp://catswhisker.xyz/about/tag:catswhisker.xyz,2021-08-23:/log/2021/8/23/quit_your_job_like_andrew_kelley/Quit Your Job Like Andrew Kelley2021-08-23T14:11:59Z2021-08-23T14:11:59Z<div class="paragraph">
<p>The CoRecursive podcast has a great recent interview with Andrew Kelley, the creator of the <a href="https://en.wikipedia.org/wiki/Zig_(programming_language)">Zig programming language</a>, where he talks about quitting his job and working on open source full time (<a href="https://corecursive.com/067-zig-with-andrew-kelley/">Episode #67</a>):</p>
</div>
<div class="quoteblock">
<blockquote>
<div class="paragraph">
<p><strong>Adam</strong>: Once you did quit, was it everything that you thought it would be?</p>
</div>
<div class="paragraph">
<p><strong>Andrew</strong>: First of all, I’ve never been happier. Second of all, I realized that the freedom that I have has allowed me to open my mind up to just other, even just different politics and ways of thinking about society and how the world works. It’s harder to think about maybe more radical ways that society could run when you have to play the game, and you’re spending 40 or plus hours per week clocked in and just like doing the labor.</p>
</div>
<div class="paragraph">
<p>Not only was it everything I thought I would be, but once I tasted this freedom, I know I will never have a boss again. I will go start a farm if I have to. My sense of self-worth has just skyrocketed and I just, I don’t even want to be subject to another person’s domain anymore. I want everyone to feel this way. I want everyone to feel they get to decide what they do with their life and no one’s going to tell them what they have to do.</p>
</div>
<div class="paragraph">
<p><strong>Adam</strong>: Did you have a really bad boss?</p>
</div>
<div class="paragraph">
<p><strong>Andrew</strong>: Actually, no. […​] I think that’s why I realized that I never want to have a boss again is that I had a good one and I still really hated it.</p>
</div>
</blockquote>
</div>An excerpt from a recent interview with Andrew Kelley, creator of the Zig programming language, about quitting his job.tag:catswhisker.xyz,2021-08-22:/log/2021/8/22/magento_sucks/I Hate Magento2021-08-22T13:30:18Z2021-09-07T18:15:55Z<div class="quoteblock">
<blockquote>
Sometimes I think I hate Magento for being so overly complex, slow, resource-intensive, and feature-poor yet bloat-rich that only Adobe could be interested in it. Then I think of the poor people paying me to maintain and write extensions for it.
</blockquote>
<div class="attribution">
— Comment seen on Hacker News
</div>
</div>
<div class="paragraph">
<p>I’ve inherited the maintenance duties for a retail shop’s website which is powered by some open-source PHP software called <a href="https://en.wikipedia.org/wiki/Magento">Magento</a>. (Magento was purchased by Adobe a few years ago and they offer value-added commercial and hosted versions called <a href="https://magento.com/products/magento-commerce">Adobe Commerce</a>).
According to the marketing copy, “Magento is a feature-rich eCommerce platform solution that offers merchants complete flexibility and control over the functionality of their online channel. Magento’s search engine optimization, catalog management, and powerful marketing tools give merchants the ability to create sites that provide an unrivaled shopping experience for their customers.”
And according to Wikipedia, “More than 100,000 online stores have been created on this platform. The platform code has been downloaded more than 2.5 million times, and $155 billion worth of goods have been sold through Magento-based systems in 2019. Two years ago, Magento accounted for about 30% of the total market share.”</p>
</div>
<div class="paragraph">
<p>I’ve now had time to work with this Magento site from several perspectives over the span of a few years (including a less-than-smooth migration from Magento 1.9 to Magento 2.3) — as a consultant managing product data via the Admin interface, as a sysadmin deploying and hosting a Magento-based site, as a developer making minor modifications and fixes as needed by my client, as a user of the customer-facing site — and it has continually impressed me: I’ve never before used software that provides such a poor experience to administrators, developers, and users alike.</p>
</div>
<div class="paragraph">
<p>Where I’d expect an eCommerce framework to provide a simple frontend interface that implements basic shopping cart functionality as a starting point to build on, Magento provides a default theme with a product description page which requests over 4MB of HTML/CSS/JS including 200+ JavaScript files (really). Modifying the theme, like all Magento customization (see below), is cumbersome to an unreasonable degree. As far as I can tell, the only sane way to have a performant and maintainable Magento website would be to write a complete frontend using a modern framework and communicate with Magento solely through its REST API. In fact the general rule of successful Magento deployment seems to be to use as little of Magento as possible.</p>
</div>
<div class="paragraph">
<p>Where I’d expect a catalog management interface with an emphasis on importing and exporting product data, Magento provides a painfully slow product browser and editor for torturing copywriters and an anemic and convoluted product export tool unsuitable for any real reporting or feed generation. My solution lately has been to write programs which get data from the REST API to update reports in Google Sheets rather than trying to use or extend the Admin panel. (Magento’s product import tool is fine.)</p>
</div>
<div class="paragraph">
<p>Where I’d expect a “feature-rich eCommerce platform solution that offers merchants complete flexibility and control over the functionality of their online channel” to have ergonomic and well-documented extension interfaces, Magento provides an over-engineered and convoluted plugin system wired together with XML files (with little reference documentation) and PHP code generated from classes you provide which interact with the core system’s classes (which have no reference documentation).</p>
</div>
<div class="paragraph">
<p>Where I’d expect to find at the morally murky nexus of a commercial online retail platform that barely works out of the box, preoccupation with marketing and “SEO”, and the extraction of labour from programmers in developing countries (including under the guise of “open source”) an ecosystem of commercial plugins that are both expensive and risky to install, Magento delivers. So if the base install of Magento seems too stable, secure, and inexpensive to you, you could always head over to the <a href="https://marketplace.magento.com/">Magento Marketplace</a> and find any extension you need written by software developers questioning their career choices.</p>
</div>
<div class="paragraph">
<p>I don’t mean to denigrate the hard work of the open-source contributors who have helped create Magento.
In fact while I don’t understand what motivates them, I admire, in some ways, the sheer tenacity and self-denial it must take to continue to spend time on such a project.
From what I can tell by browsing GitHub and the <a href="https://magento.stackexchange.com/">Magento StackExchange</a> (which — and I know this is going to sound hyperbolic — is probably the lowest quality stackexchange site I’ve seen), many developers are Indian or otherwise work outside of North America, so I’m guessing the fragility of Magento has created some demand for affordable PHP developers.
And I know it is some sort of broken windows fallacy, but if Magento’s convoluted architecture can provide some paying gigs or job security to those developers I guess that’s something positive, at least.</p>
</div>
<div class="paragraph">
<p>I think most of Magento’s issues — including its poor performance, poor security record, and high cost to customize and maintain — stem from two core defects: a lack of documentation and a fundamentally flawed software architecture.</p>
</div>
<div class="paragraph">
<p>If Magento had good documentation, then almost no matter how terrible its design, developers would be able to figure out how to make it do what they need. Now, the <em>organization</em> of its documentation has improved since Adobe took over (see <a href="https://devdocs.magento.com/" class="bare">https://devdocs.magento.com/</a>). But from the perspective of a PHP developer, <em>what</em> is documented is severely lacking: lots of mostly useless high-level descriptions and a few code examples, but no real documentation of the Magento source code and the classes/interfaces it provides.</p>
</div>
<div class="paragraph">
<p>Unlike the documentation, which Adobe <em>could</em> improve if they cared (but if they did, then I think they would have by now), the architecture of Magento cannot be fixed at all.
The entire framework is based around convoluted XML configuration files (which makes the lack of documentation hit even harder) and a dynamic <a href="https://devdocs.magento.com/guides/v2.4/extension-dev-guide/depend-inj.html">dependency injection</a> system.
The way it works is that plugins declare dependencies on PHP interfaces and then the Magento code generator (the <code>bin/magento setup:di:compile</code> command) generates the actual plugin code with the dependencies instantiated.
It’s the sort of overkill system that makes large enterprise Java applications, developed by teams using a statically typed language with excellent tooling, difficult to reason about and maintain; to adopt that architecture for a PHP shopping cart application is utter madness.</p>
</div>
<div class="paragraph">
<p>I have not taken the time to investigate its performance bottlenecks, and I hope I never do, but the generated PHP code which supports such a dynamic plugin system no doubt contributes to how slow Magento is.
And Magento <strong>is</strong> slow. Behind an asynchronous reverse proxy on an over-sized ec2 instance, I’m pretty sure my client’s website could be DoS’d by any mischievous kid with a cable modem if it weren’t for Cloudflare. Even with Cloudflare’s firewall a single malicious bot can bring the site to a crawl.</p>
</div>
<div class="paragraph">
<p>The ‘solution’ Magento offers to its performance problems is more and more layers of cache. First, the administrator must generate static files (CSS, etc.) with the <a href="https://devdocs.magento.com/guides/v2.4/config-guide/cli/config-cli-subcommands-static-view.html">bin/magento setup:static-content:deploy</a> command. Then there is <a href="https://devdocs.magento.com/guides/v2.4/config-guide/cli/config-cli-subcommands-cache.html">a cache system</a> for various bits of data Magento calculates and which often, inexplicably, needs to be manually refreshed. But it is still slow, so it is usually recommended that you also run Magento behind a caching reverse-proxy like Varnish and/or a dedicated key-value cache based on Redis.
It would be funny if it weren’t real software that I have to maintain.</p>
</div>
<div class="paragraph">
<p>Worse, the unnecessary complexity makes it more difficult both to understand code execution paths and to make changes to the code base: a recipe for <a href="https://www.cvedetails.com/product/31613/Magento-Magento.html?vendor_id=15393">security vulnerabilities</a>.</p>
</div>
<div class="paragraph">
<p>There are some good things about Magento, of course.
Its one saving grace, in my opinion, is its Swagger-based REST API which makes it possible to implement most required functionality outside of Magento itself. And Swagger/OpenAPI is self-documenting, so the Magento devs are not able to make it as difficult as they’ve made the PHP API.
But even that has been a source of suffering. In fact my main use-case for it, reporting on current inventory quantity for all products, is not even possible by default because the <code>/V1/products</code> endpoint does not return stock information despite what the documentation claims.
That bug has been reported several times (e.g. <a href="https://github.com/magento/magento2/issues/24418">#24418</a>), but the response from the maintainers is that the correct way to get stock information is to make a call to <code>/V1/products</code> and then make an additional HTTP request for each returned product (thousands or tens of thousands in my case).
(Luckily there are some workarounds. I wrote <a href="https://github.com/cristoper/mage_qtyext">cristoper/mage_qtyext</a>, a very simple plugin which adds a “qty” field to the product results; I also found <a href="https://github.com/menacoders/Stock-Info-API-searchCriteria">menacoders/Stock-Info-API-searchCriteria</a> which adds all stock information to the results.)
There are also SOAP and GraphQL APIs which I’ve not investigated (except to find out that the GraphQL API by default also does not offer a way to get stock information for the entire inventory).</p>
</div>
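<div class="paragraph">
<p>For a sense of scale, here is roughly what that maintainer-recommended approach looks like. This is only a sketch in Python: the base URL and token are placeholders, and it fetches just the first page of products before making one extra request per product to the per-SKU stock endpoint (<code>/V1/stockItems/{sku}</code>):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">import requests

BASE = "https://shop.example.com/rest"  # placeholder store URL
HEADERS = {"Authorization": "Bearer PLACEHOLDER-TOKEN"}

# one request for a page of products...
items = requests.get(
    BASE + "/V1/products",
    params={"searchCriteria[pageSize]": 100},
    headers=HEADERS,
).json()["items"]

# ...then one additional request *per product* for its stock item
for item in items:
    stock = requests.get(
        BASE + "/V1/stockItems/" + item["sku"], headers=HEADERS
    ).json()
    print(item["sku"], stock["qty"])</code></pre>
</div>
</div>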
<div class="paragraph">
<p>Personally, the prospect of developing solutions for my Magento-bound client confronts me as a depressing time sink. I’m convinced Magento will never be anything but expensive to maintain, slow, and insecure. I don’t recommend it for new projects.</p>
</div>
<hr>
<div class="paragraph">
<p>I found this 2015 article from the Magento 1 days which complains about mostly the same things: <a href="https://medium.com/@salvoadriano/magento-why-complex-doesn-t-mean-good-1f15992202de">Magento: why complex doesn’t mean good</a></p>
</div>
<div class="paragraph">
<p>See also for Spotify users: <a href="https://open.spotify.com/playlist/5YMtRCWJxAw2MBxQBrBtlF?si=163a29ca0dc14a5b">Magento 2 Rage Tracks</a></p>
</div>A rant about an open-source ecommerce platform.tag:catswhisker.xyz,2020-12-07:/log/2020/12/7/deranged_sinterklaas/Deranged Sinterklaas: The Math and Algorithms of Secret Santa2020-12-07T07:00:00Z2021-02-04T23:01:43Z<div class="imageblock">
<div class="content">
<img src="/log/2020/12/7/deranged_sinterklaas/XmasStory.png" alt="XmasStory">
</div>
</div>
<div class="sect1">
<h2 id="_secret_santa">Secret Santa</h2>
<div class="sectionbody">
<div class="paragraph">
<p><a href="https://en.wikipedia.org/wiki/Secret_Santa">Secret Santa</a> is a traditional Christmas gift exchanging scheme in which each member of a group is randomly and anonymously assigned another member to give a Christmas gift to (usually by drawing names from a container). It is not valid for a person to be assigned to themself (if someone were to draw their own name, for example, all the names should be returned to the jar and the drawing process restarted).</p>
</div>
<div class="paragraph">
<p>Given a group of a certain size, how many different ways are there to make valid assignments? What is the probability that at least one person will draw their own name? What is the probability that two people will draw each other’s names? What is a good way to have a computer make the assignments while guaranteeing they are generated with equal probability among all possible assignments?</p>
</div>
<div class="paragraph">
<p>It turns out that these questions about secret santa present good motivation for exploring some of the fundamental concepts in combinatorics (the math of counting).
In the sections below we will take a look at a bit of that math and algorithms that allow us to answer the questions we posed above.
The final section presents a simple command-line <a href="https://github.com/cristoper/sinterbot">program</a> that allows generating and anonymously sending secret santa assignments via email so that we no longer need to go through the tedious ordeal of drawing names from a hat.</p>
</div>
<div id="toc" class="toc">
<div id="toctitle" class="title">Table of Contents</div>
<ul class="sectlevel1">
<li><a href="#_secret_santa">Secret Santa</a></li>
<li><a href="#_math">Math</a>
<ul class="sectlevel2">
<li><a href="#_permutations">Permutations</a></li>
<li><a href="#_derangements">Derangements</a></li>
</ul>
</li>
<li><a href="#_algorithms">Algorithms</a>
<ul class="sectlevel2">
<li><a href="#_utilities">Utilities</a></li>
<li><a href="#_how_not_to_generate_derangements">How not to generate derangements</a></li>
<li><a href="#_how_to_generate_derangements">How to generate derangements</a></li>
</ul>
</li>
<li><a href="#software">Sinterbot2020</a>
<ul class="sectlevel2">
<li><a href="#_installation">Installation</a></li>
<li><a href="#_usage">Usage</a></li>
</ul>
</li>
</ul>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_math">Math</h2>
<div class="sectionbody">
<div class="sect2">
<h3 id="_permutations">Permutations</h3>
<div class="paragraph">
<p>As an example let’s take a group of five friends who we will represent by the first initial of their names as a set: \(\{\mathrm{\,S, C, A, L, M\,}\}\). The elements of this set of people, like the elements of any set, can be arranged in different orders. For example, the order of the elements as we just happened to write them \((\mathrm{S\; C\; A\; L\; M})\) is one arrangement, and we can also shuffle them around to get a different arrangement like \((\mathrm{C\; A\; M\; L\; S})\).</p>
</div>
<div class="paragraph">
<p>Each ordered arrangement is called a permutation of the set. How many permutations can be made from a set with \(n\) elements? It is straightforward to count. We can choose any element of the set to be in the first position of the permutation, so there are \(n\) choices, which leaves \(n-1\) choices for the second position, \(n-2\) choices for the third position, and so on until for the \(n\text{th}\) (and therefore last) position of the permutation there is only one element of the set remaining to choose from.</p>
</div>
<div class="paragraph">
<p>Multiplying the number of choices for each position of the permutation gives the total number of possible permutations: \(n(n-1)(n-2)\dots(1).\) In other words, the product of all positive integers less than or equal to \(n\). That product is known as the <a href="https://en.wikipedia.org/wiki/Factorial">factorial</a> of \(n\) and is written \(n!\):</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{align*}
n! &= n(n-1)(n-2)\cdots (n-(n-1))\\
&= 1 \cdot 2 \cdots n\\
\end{align*}\]
</div>
</div>
<div class="paragraph">
<p>So there are \(5! = 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 = 120\) ways to permute a set of five friends. Writing any two of those permutations in <a href="http://groupprops.subwiki.org/wiki/Two-line_notation_for_permutations">two-line notation</a> with one above the other allows us to read off a secret santa assignment for the group:</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation}
\label{eq:perm}
\begin{pmatrix}
\mathrm{S} & \mathrm{C} & \mathrm{A} & \mathrm{L} & \mathrm{M} \\
\mathrm{C} & \mathrm{A} & \mathrm{M} & \mathrm{L} & \mathrm{S} \\
\end{pmatrix}
\end{equation}\]
</div>
</div>
<div class="paragraph">
<p>If we read the top line as the list of gift givers and the bottom line as the gift recipients, then each santa is assigned to give a gift to the person in the bottom line directly beneath them. So \(\mathrm{S}\) gives a gift to \(\mathrm{C}\) who gives a gift to \(\mathrm{A}\) and so forth.</p>
</div>
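<div class="paragraph">
<p>In Python terms (a trivial sketch), the assignment is just the mapping from the top line to the bottom line:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">givers     = ["S", "C", "A", "L", "M"]
recipients = ["C", "A", "M", "L", "S"]
dict(zip(givers, recipients))
>> {'S': 'C', 'C': 'A', 'A': 'M', 'L': 'L', 'M': 'S'}</code></pre>
</div>
</div>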
<div id="permgraph" class="imageblock">
<div class="content">
<a class="image" href="dotperm.svg"><img src="/log/2020/12/7/deranged_sinterklaas/dotperm.png" alt="TODO Figure <<permgraph>> illustrates the cycles."></a>
</div>
<div class="title">Figure 1. Graphical representation of the cycles of the permutation \(\eqref{eq:perm}\). Arrows point from santas to gift recipients.</div>
</div>
<div class="paragraph">
<p>A permutation of santas (or anything else) can be represented as a directed graph, as in <a href="#permgraph">Figure 1</a>, or more compactly by listing its cycles: \((\mathrm{S\; C\; A\; M})(\mathrm{L})\). To see that the cycle notation is equivalent to the graph, read each cycle from left to right and insert the implied arrow from the last element back to the first: \((\mathrm{S \to C \to A \to M \to S})(\mathrm{L \to L})\). Note that 1-cycles are usually left implied when writing a permutation in cycle notation, so an equivalent way to write our example permutation is simply \((\mathrm{S\; C\; A\; M})\).</p>
</div>
<div class="paragraph">
<p>But note also that a permutation containing any 1-cycles defines an invalid secret santa assignment! The example permutation above has \(\mathrm{L}\) giving a gift to herself, which is against the rules.</p>
</div>
</div>
<div class="sect2">
<h3 id="_derangements">Derangements</h3>
<div class="paragraph">
<p>A permutation with no 1-cycles — in other words, a permutation in which no element is left in its original position so that the entire set has been de-arranged — is called a <a href="https://en.wikipedia.org/wiki/Derangement">derangement</a>. One way to derange our example group of secret santas is</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation}
\label{eq:der}
\begin{pmatrix}
\mathrm{S} & \mathrm{C} & \mathrm{A} & \mathrm{L} & \mathrm{M} \\
\mathrm{C} & \mathrm{A} & \mathrm{S} & \mathrm{M} & \mathrm{L} \\
\end{pmatrix}
\end{equation}\]
</div>
</div>
<div class="paragraph">
<p>Or equivalently, decomposed into its cycles, \((\mathrm{S\; C\; A})(\mathrm{M\; L})\).</p>
</div>
<div class="paragraph">
<p>So the problem of generating a valid secret santa assignment is equivalent to generating a derangement. Some algorithms for uniformly generating random derangements are presented in the next section. But first we need a way to calculate \(D_n\), the number of derangements that can be made from a set with \(n\) elements.</p>
</div>
<div class="paragraph">
<p>Counting derangements is trickier than counting unrestricted permutations. We proceed by counting the permutations with at least one 1-cycle, the <em>non</em>-derangements. First we’ll define the subsets \(H_p\) to contain all of the permutations of the \(n\) elements with the \(p^{\text{th}}\) element fixed in its original position (an element that stays in its original position in a permutation is called a “fixed point”). This gives \(n\) such subsets \(\mathrm{(}H_1 \ldots H_n\mathrm{)}\) each containing \((n-1)!\) permutations (because with one element fixed, there are \((n-1)!\) ways to permute the remaining elements).</p>
</div>
<div class="paragraph">
<p>We know that the subsets \(H_p\) contain only non-derangements since every member has a fixed point. And since every non-derangement has at least one fixed point, say the \(p^{\text{th}}\) element, and so belongs to the corresponding \(H_p\), we know that together the \(H_p\) contain <em>all</em> possible non-derangements.
That means to find \(D_n\) we just need to subtract the size of the union of all the \(H_p\) subsets from the total number of permutations (which we know is \(n!\)):</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation}
\label{eq:up}
D_n = n! - \left\lvert \bigcup H_p \right\rvert \\
\end{equation}\]
</div>
</div>
<div class="paragraph">
<p>But finding the size of \(\cup H_p\) is not straightforward.
If we simply multiply the number of subsets by their size, \(n\cdot(n-1)!\), we overcount because the subsets \(H_p\) are not disjoint: some non-derangements belong to more than one subset. Specifically, every pair of \(H_p\) subsets shares \((n-2)!\) permutations with at least two fixed points; every 3-tuple of subsets shares \((n-3)!\) permutations with at least three fixed points; and so on.</p>
</div>
<div class="paragraph">
<p>To visualize this, it helps to draw out each \(H_p\) for a small set. The table below shows each \(H_p\) subset for the set \(\{\,a, b, c, d\,\}\):</p>
</div>
<table class="tableblock frame-all grid-all stretch">
<colgroup>
<col style="width: 25%;">
<col style="width: 25%;">
<col style="width: 25%;">
<col style="width: 25%;">
</colgroup>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">\(H_1\)</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">\(H_2\)</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">\(H_3\)</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">\(H_4\)</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">abcd</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">abcd</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">abcd</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">abcd</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">abdc</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">abdc</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">adcb</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">acbd</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">acbd</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">cbad</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">bacd</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">bacd</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">acdb</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">cbda</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">bdca</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">bcad</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">adbc</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">dbac</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">dacb</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">cabd</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">adcb</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">dbca</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">dbca</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">cbad</p></td>
</tr>
</tbody>
</table>
<div class="paragraph">
<p>Notice that the first column, \(H_1\), has "a" fixed in the first position; the second column has "b" fixed in the second position; etc. Note also that the \(H_1\) and \(H_2\) columns share every permutation with the first and second position fixed ("abcd" and "abdc").</p>
</div>
<div class="paragraph">
<p>To weed out the duplicates, we need to subtract the number of permutations with at least two fixed points multiplied by the number of pairs of \(H_p\) subsets. But that will leave us with an <em>under</em> count because it will result in some permutations with three or more fixed points being excluded, so we must add those back in. We need to continue this <a href="https://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle">inclusion-exclusion</a> process until we’ve considered the permutations which fix all \(n\) elements (the \(H_p\) taken \(n\) at a time):</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation*}
\label{eq:binom}
\left\lvert \bigcup H_p \right\rvert = \binom{n}{1}(n-1)! - \binom{n}{2}(n-2)! + \binom{n}{3}(n-3)! - \cdots + (-1)^{n+1} \binom{n}{n}(n-n)!
\end{equation*}\]
</div>
</div>
<div class="paragraph">
<p>where \(\binom{n}{k}\) gives the <a href="https://en.wikipedia.org/wiki/Binomial_coefficient">binomial coefficients</a> which you may remember from math class can be interpreted as the number of ways to choose \(k\) objects from a set of \(n\) objects when order doesn’t matter. It can be written in terms of factorials:</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation*}
\binom{n}{k} = \frac{n!}{(n-k)!k!}
\end{equation*}\]
</div>
</div>
<div class="paragraph">
<p>Now we can calculate the number of possible <em>non</em>-deranged permutations. To get \(D_n\), we just subtract it from the total number of possible permutations.
When we substitute the expression for \(\lvert \bigcup H_p \rvert\) into equation \eqref{eq:up} then expand the binomial coefficients and factorials, this becomes:</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{eqnarray}
\notag
D_n & = & n! - \frac{n!}{1!} + \frac{n!}{2!} - \frac{n!}{3!} + \cdots + (-1)^n \\
\notag
& = & n!\left(1 - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!} + \cdots + (-1)^n \frac{1}{n!}\right) \\
\label{eq:dn}
& = & n!\sum_{k=0}^n(-1)^k\frac{1}{k!}
\end{eqnarray}\]
</div>
</div>
<div class="paragraph">
<p>And we’ve answered our first question: <strong>Given a group of size \(n\), there are \(D_n =n!\sum_{k=0}^n(-1)^k\frac{1}{k!}\) ways to make a valid secret santa assignment.</strong> To calculate the number of valid assignments between our five example friends, then, we have</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{eqnarray*}
D_5 & = & 5!\left(1 - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!} + \frac{1}{4!} - \frac{1}{5!}\right) \\
& = & 120 \left(1 - 1 + \frac{1}{2} - \frac{1}{6} + \frac{1}{24} - \frac{1}{120}\right) \\
& = & 120 \left(\frac{44}{120}\right) \\
& = & 44
\end{eqnarray*}\]
</div>
</div>
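<div class="paragraph">
<p>We can double-check that count by brute force in Python, counting the permutations of five elements that leave no element in place:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">import itertools
# count permutations of 5 elements with no fixed point
sum(all(p[i] != i for i in range(5))
    for p in itertools.permutations(range(5)))
>> 44</code></pre>
</div>
</div>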
<div class="paragraph">
<p>The first nine values of \(D_n\) are listed in the table below.</p>
</div>
<table class="tableblock frame-all grid-all stretch">
<colgroup>
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top">\(n\)</th>
<th class="tableblock halign-left valign-top">1</th>
<th class="tableblock halign-left valign-top">2</th>
<th class="tableblock halign-left valign-top">3</th>
<th class="tableblock halign-left valign-top">4</th>
<th class="tableblock halign-left valign-top">5</th>
<th class="tableblock halign-left valign-top">6</th>
<th class="tableblock halign-left valign-top">7</th>
<th class="tableblock halign-left valign-top">8</th>
<th class="tableblock halign-left valign-top">9</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><strong>\(D_n\)</strong></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">0</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">1</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">2</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">9</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">44</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">265</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">1,854</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">14,833</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">133,496</p></td>
</tr>
</tbody>
</table>
<div class="paragraph">
<p>This is OEIS sequence <a href="https://oeis.org/A000166">A000166</a>.
The number \(D_n\) is known as the <a href="http://mathworld.wolfram.com/Subfactorial.html">subfactorial</a> of \(n\) (usually written \(!n\)). It is also a special case of the <a href="https://en.wikipedia.org/wiki/Rencontres_numbers">rencontres numbers</a> which enumerate partial derangements (derangements with specified numbers of 1-cycles).</p>
</div>
<div class="paragraph">
<p>Notice that the summation in equation \(\eqref{eq:dn}\) is the \(n^{\text{th}}\) partial sum of the <a href="https://en.wikipedia.org/wiki/Taylor_series">Maclaurin series</a> for \(e^{-1}\), so that</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation*}
\lim_{n \to \infty} \frac{D_n}{n!} = \frac{1}{e}
\end{equation*}\]
</div>
</div>
<div class="paragraph">
<p>Because the series converges rather quickly, \(\frac{n!}{e}\) is a good approximation of \(D_n\) even for small values of \(n\).</p>
</div>
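<div class="paragraph">
<p>A quick check in Python (float precision is fine for small \(n\)): rounding \(n!/e\) to the nearest integer reproduces the values in the table above.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">import math
[round(math.factorial(n) / math.e) for n in range(1, 10)]
>> [0, 1, 2, 9, 44, 265, 1854, 14833, 133496]</code></pre>
</div>
</div>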
<div class="paragraph">
<p>The probability that a permutation is a derangement is \(D_n\) divided by the number of all possible permutations, \(n!\). This answers the second question asked in the introduction: <strong>The probability that at least one secret santa participant will draw their own name is</strong> \(1 - \frac{D_n}{n!} \approx 1 - \frac{1}{e} \approx 63\%\). That may seem high, but the number of attempts needed follows a <a href="https://en.wikipedia.org/wiki/Geometric_distribution">geometric distribution</a> with success probability \(\approx \frac{1}{e}\) and therefore mean \(e \approx 2.7\), so you can expect to draw a valid derangement after 2 or 3 attempts (restarting each time someone draws their own name). There is a nearly 99% chance of having drawn a derangement after 10 attempts \((1 - (1 - \frac{1}{e})^{10} \approx 98.9\%)\).</p>
</div>
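<div class="paragraph">
<p>Here is a quick simulation of the drawing process (a sketch; the sample average will vary a little from run to run):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">import random

def attempts(n):
    # count drawings until nobody holds their own name
    count = 0
    while True:
        count += 1
        perm = list(range(n))
        random.shuffle(perm)
        if all(p != i for i, p in enumerate(perm)):
            return count

# the average should be close to 6!/D_6 = 720/265 ≈ 2.717
sum(attempts(6) for _ in range(10000)) / 10000</code></pre>
</div>
</div>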
<div class="paragraph">
<p>Beyond counting mere derangements there are more elaborate constraints and questions we could consider, the sorts of things investigated by <a href="https://en.wikipedia.org/wiki/Random_permutation_statistics">statistics of random permutations</a> and <a href="https://en.wikipedia.org/wiki/Generating_function">generating functions</a>.
We haven’t even answered the third question from the introduction yet (“What is the probability that two people will draw each other’s names?”).
I hope to return to this article in the future when I have more time and a better grasp of combinatoric tools to look into some of those questions.</p>
</div>
<div class="paragraph">
<p>If my explanation above of how to derive \(D_n\) was not clear, don’t worry. Counting derangements is frequently used as an example application of the inclusion-exclusion principle, so better explanations can be found on the web and in almost any introductory combinatorics textbook. See, for example, Professor Howard Haber’s <a href="http://scipp.ucsc.edu/~haber/ph116C/InclusionExclusion.pdf">handout on the inclusion-exclusion principle [PDF]</a>. There are also several other methods for deriving and proving the formula for \(D_n\), including those that first derive the recurrence relation \(D_n = (n-1)(D_{n-1} + D_{n-2})\) and then solve it by iteration or by the method of generating functions. For a solution via generating functions see Jean Pierre Mutanguha’s <a href="http://euler.genepeer.com/the-power-of-generating-functions/">“The Power of Generating Functions”</a>.</p>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_algorithms">Algorithms</h2>
<div class="sectionbody">
<div class="quoteblock">
<blockquote>
[A]lmost as many algorithms have been published for unsorting as for sorting!
</blockquote>
<div class="attribution">
— Donald Knuth
</div>
</div>
<div class="paragraph">
<p>We’ll use Python to explore some algorithms for generating random derangements.
The functions given below are sometimes simplified to get the main ideas across; the complete versions can be found in <a href="https://github.com/cristoper/sinterbot/blob/master/sinterbot/algorithms.py">algorithms.py in the GitHub repository</a>.
To keep things simple, all of the functions operate only on permutations of the set of integers from 0 to n-1. If we’d like to permute a set of some other n objects (like santas) we can then use those integers as indexes into the list of our other objects.</p>
</div>
<div class="sect2">
<h3 id="_utilities">Utilities</h3>
<div class="paragraph">
<p>There are a few utility functions that we might want while exploring and debugging our algorithms. First of all, the ability to calculate \(D_n\), the number of derangements in a set of size \(n\). Here is a straightforward translation of \(\eqref{eq:dn}\) to Python:</p>
</div>
<div class="listingblock">
<div class="title">Dn()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">import</span> <span class="include">math</span>
<span class="keyword">from</span> <span class="include">decimal</span> <span class="keyword">import</span> <span class="include">Decimal</span>
<span class="keyword">def</span> <span class="function">Dn</span>(n: <span class="predefined">int</span>):
<span class="comment"># Use Decimal to handle large n accurately</span>
<span class="comment"># (by large, I mean n>13 or so...</span>
<span class="comment"># factorials get big fast!)</span>
s = <span class="integer">0</span>
<span class="keyword">for</span> k <span class="keyword">in</span> <span class="predefined">range</span>(n+<span class="integer">1</span>):
s += (-<span class="integer">1</span>)**k/Decimal(math.factorial(k))
result = math.factorial(n) * s
<span class="keyword">return</span> Decimal.to_integral_exact(result)
[<span class="predefined">int</span>(Dn(i)) <span class="keyword">for</span> i <span class="keyword">in</span> <span class="predefined">range</span>(<span class="integer">9</span>)]
>> [<span class="integer">1</span>, <span class="integer">0</span>, <span class="integer">1</span>, <span class="integer">2</span>, <span class="integer">9</span>, <span class="integer">44</span>, <span class="integer">265</span>, <span class="integer">1854</span>, <span class="integer">14833</span>]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Next up is a way to generate all \(n!\) permutations of a set.
Several algorithms for generating permutations are well-known. Most classic is an algorithm which produces permutations in lexicographical order described by Knuth in <a href="https://www.kcats.org/csci/464/doc/knuth/fascicles/fasc2b.pdf">7.2.1.2 Algorithm L</a>. Other techniques produce all permutations by only swapping one pair of elements at a time (see the <a href="https://en.wikipedia.org/wiki/Steinhaus%E2%80%93Johnson%E2%80%93Trotter_algorithm">Steinhaus-Johnson-Trotter algorithm</a> which Knuth gives as Algorithm P).</p>
</div>
<div class="paragraph">
<p>But in our case the Python standard library provides a function for generating permutations (in lexicographical order) so we’ll just use that:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">import</span> <span class="include">itertools</span>
<span class="predefined">list</span>(itertools.permutations([<span class="integer">0</span>,<span class="integer">1</span>,<span class="integer">2</span>]))
>> [(<span class="integer">0</span>, <span class="integer">1</span>, <span class="integer">2</span>), (<span class="integer">0</span>, <span class="integer">2</span>, <span class="integer">1</span>), (<span class="integer">1</span>, <span class="integer">0</span>, <span class="integer">2</span>), (<span class="integer">1</span>, <span class="integer">2</span>, <span class="integer">0</span>), (<span class="integer">2</span>, <span class="integer">0</span>, <span class="integer">1</span>), (<span class="integer">2</span>, <span class="integer">1</span>, <span class="integer">0</span>)]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Another helpful function would be a way to decompose permutations into their cycles to make them easier to visualize (taking \((0, 1, 2,\cdots, n-1)\) to be the identity permutation).
To find the cycles in a permutation, start with the first element, then visit the element it maps to (the value at its position in the identity permutation), then the element that one maps to, and so on until we get back to the first element. That completes a cycle containing each of the elements visited; add it to the list. Now start over with the first unvisited element, and repeat until there are no more unvisited elements.</p>
</div>
<div class="paragraph">
<p>Below is an implementation of that algorithm.
It requires storage for the list of cycles (the <code>cycles</code> variable), a way to keep track of unvisited elements (the <code>unvisited</code> variable, which starts as a copy of the input but has elements removed as they are visited), and a way to keep track of the first element in a cycle so that we know when we’ve returned to it (the variable called <code>first</code> below):</p>
</div>
<div class="listingblock">
<div class="title">decompose()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">decompose</span>(perm):
    cycles = []
    unvisited = <span class="predefined">list</span>(perm)
    <span class="keyword">while</span> <span class="predefined">len</span>(unvisited):
        first = unvisited.pop(<span class="integer">0</span>)
        cur = [first]
        nextval = perm[first]
        <span class="keyword">while</span> nextval != first:
            cur.append(nextval)
            <span class="comment"># Remove each element from unvisited</span>
            <span class="comment"># once we visit it</span>
            unvisited.pop(unvisited.index(nextval))
            nextval = perm[nextval]
        cycles.append(cur)
    <span class="keyword">return</span> cycles</code></pre>
</div>
</div>
<div class="paragraph">
<p>As an example, let’s see the cycles in \((1, 2, 4, 3, 0)\mathrm{:}\)</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">decompose([<span class="integer">1</span>,<span class="integer">2</span>,<span class="integer">4</span>,<span class="integer">3</span>,<span class="integer">0</span>])
>> [[<span class="integer">1</span>, <span class="integer">2</span>, <span class="integer">4</span>, <span class="integer">0</span>], [<span class="integer">3</span>]]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Notice this agrees with the cycles we found way back in our first example permutation \(\eqref{eq:perm}\) (where \(\mathrm{S}=0, \mathrm{C}=1, \mathrm{A}=2, \mathrm{L}=3, \mathrm{M}=4\)).</p>
</div>
<div class="paragraph">
<p>Finally, it will be handy to have a function that can test whether a permutation is a derangement or not.
One way to do that would be to call <code>decompose()</code> on the permutation and then check if there are any 1-cycles in the decomposition.
The nice thing about that method is that it generalizes so we could use it to check if the permutation contains any cycles \(\leq m\) for any \(m\).</p>
</div>
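<div class="paragraph">
<p>Such a generalized check might look like this sketch (the name <code>check_min_cycle</code> is mine, not from the repository):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">def check_min_cycle(perm, m):
    # valid only if every cycle involves more than m elements
    return all(len(c) > m for c in decompose(perm))</code></pre>
</div>
</div>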
<div class="paragraph">
<p>But if we only care about derangements (the case where \(m=1\)), it is simpler (and faster) to just iterate over the elements of the permutation and check if they are in their original position. If any are, we can immediately return <code>False</code>, the permutation is not a derangement.</p>
</div>
<div class="listingblock">
<div class="title">check_deranged()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">check_deranged</span>(perm):
<span class="keyword">for</span> i, el <span class="keyword">in</span> <span class="predefined">enumerate</span>(perm):
<span class="keyword">if</span> el == i: <span class="keyword">return</span> <span class="predefined-constant">False</span>
<span class="keyword">return</span> <span class="predefined-constant">True</span>
check_deranged([<span class="integer">1</span>,<span class="integer">2</span>,<span class="integer">4</span>,<span class="integer">3</span>,<span class="integer">0</span>])
>> <span class="predefined-constant">False</span>
decompose([<span class="integer">1</span>,<span class="integer">3</span>,<span class="integer">4</span>,<span class="integer">2</span>,<span class="integer">0</span>])
>> [[<span class="integer">1</span>, <span class="integer">3</span>, <span class="integer">2</span>, <span class="integer">4</span>, <span class="integer">0</span>]] <span class="comment"># Notice no 1-cycles</span>
check_deranged([<span class="integer">1</span>,<span class="integer">3</span>,<span class="integer">4</span>,<span class="integer">2</span>,<span class="integer">0</span>])
>> <span class="predefined-constant">True</span></code></pre>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_how_not_to_generate_derangements">How not to generate derangements</h3>
<div class="paragraph">
<p>The first time I sat down to write a secret santa algorithm, my instinct was to try a <a href="https://en.wikipedia.org/wiki/Backtracking">backtracking</a> approach, and I ended up with something like the code below. The idea behind the backtracker is to iteratively build a derangement by randomly selecting an element from the identity arrangement and then checking if the resulting partial permutation is a derangement. If it is not, undo (backtrack) the last choice and try again. If it is, randomly choose one of the remaining elements and check again. Repeat until you’ve deranged all \(n\) elements:</p>
</div>
<div class="listingblock">
<div class="title">backtracker()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">import</span> <span class="include">random</span>
<span class="keyword">def</span> <span class="function">backtracker</span>(n):
<span class="keyword">if</span> n == <span class="integer">0</span>: <span class="keyword">return</span> []
remaining = <span class="predefined">list</span>(<span class="predefined">range</span>(n))
perm = []
<span class="comment"># backtrack until solution</span>
<span class="keyword">while</span> <span class="predefined">len</span>(perm) < n:
perm.append(random.choice(remaining))
<span class="keyword">if</span> <span class="keyword">not</span> check_deranged(perm):
<span class="keyword">if</span> <span class="predefined">len</span>(remaining) == <span class="integer">1</span>:
<span class="comment"># we're down to the last two elements</span>
<span class="comment"># just swap them to get a derangement</span>
perm[-<span class="integer">1</span>], perm[-<span class="integer">2</span>] = perm[-<span class="integer">2</span>], perm[-<span class="integer">1</span>]
<span class="keyword">return</span> perm
<span class="comment"># undo last choice</span>
perm.pop(-<span class="integer">1</span>)
<span class="keyword">else</span>:
remaining.remove(perm[-<span class="integer">1</span>])
<span class="keyword">return</span> perm
<span class="comment"># Use it to generate a derangement and view the cycles:</span>
perm = backtracker(<span class="integer">5</span>)
perm, decompose(perm)
>> ([<span class="integer">1</span>, <span class="integer">2</span>, <span class="integer">0</span>, <span class="integer">4</span>, <span class="integer">3</span>], [[<span class="integer">1</span>, <span class="integer">2</span>, <span class="integer">0</span>], [<span class="integer">4</span>, <span class="integer">3</span>]])</code></pre>
</div>
</div>
<div class="paragraph">
<p>As written, <code>backtracker()</code> is fast and will produce any possible derangement (and in only ~20 lines of Python), so it <em>could</em> be used for secret santa. However, as the decades fly by your friends might begin to suspect that the same assignments seem to be ‘randomly’ generated fairly often.
They would be right: <code>backtracker()</code> does not produce derangements with uniform probability.
Even though each element of the derangement is chosen from the remaining elements of the input set with uniform probability, the number of valid choices at each step depends on which numbers happen to have been chosen first.</p>
</div>
<div class="paragraph">
<p>For example, let’s look at the probability that <code>backtracker()</code> will produce \((5, 0, 1, 2, 3, 4).\)
The first number can be anything but 0, so there are \(6-1=5\) ways to choose that.
The second number can be any of the remaining 5 numbers except 1, so there are 4 ways to choose that.
The third number can be any of the remaining 4 numbers except 2 which leaves 3 possibilities.
The fourth element can be any of the remaining numbers except for 3, which leaves 2 possibilities.
The fifth element cannot be 4, so there is only one way to derange the last two elements.
If we take the product of those probabilities we get \(\frac{1}{5} \cdot \frac{1}{4} \cdot \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{120}.\)</p>
</div>
<div class="paragraph">
<p>If you do a similar calculation for the probability that <code>backtracker()</code> would produce \((2, 3, 5, 4, 0, 1)\) you should get \(\frac{1}{5} \cdot \frac{1}{4} \cdot \frac{1}{4} \cdot \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{480}.\)
Not only are the probabilities for generating those two derangements significantly different from each other but they also both differ from the expected probability of \(\frac{1}{265}\) if every one of the \(D_6\) derangements had an equal probability of being generated.
I generated 10,000 derangements of length 6 with <code>backtracker()</code>, and sure enough \((5, 0, 1, 2, 3, 4)\) was generated 94 times while \((2, 3, 5, 4, 0, 1)\) was generated only 20 times.
The graph below shows a plot of counts for every 6-derangement over the 10,000 runs:</p>
</div>
<div class="imageblock">
<div class="content">
<img src="/log/2020/12/7/deranged_sinterklaas/generate_backtrack.png" alt="A bar chart showing the frequency of each derangement produced in 10,000 trials. The backtracker algorithm is clearly not choosing derangements uniformly.">
</div>
<div class="title">Figure 2. Count of each derangement produced after running backtracker() 10,000 times with \(n = 6\). It is clearly not choosing derangements uniformly. The grey line shows the expected count if each derangement were generated with uniform probability (\(1/D_6\cdot 10000 \approx 37.7\))</div>
</div>
<div class="paragraph">
<p>Instead of building derangements by randomly selecting elements and checking if the result is a derangement, we could simply generate all possible permutations, filter out the non-derangements, and then randomly select one of the derangements to return.
The nice thing about that approach is that we could enforce any other constraints we want in the filter step (maybe we want a minimum cycle length or have a “blacklist” of people who should not be assigned to each other) and we can still be confident we would select a valid secret santa assignment with uniform probability (since we have generated all of them it is easy to select one at random).</p>
</div>
<div class="listingblock">
<div class="title">generate_all()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">import</span> <span class="include">random</span>
<span class="keyword">def</span> <span class="function">generate_all</span>(n):
potential = []
perms = itertools.permutations(<span class="predefined">range</span>(n))
<span class="keyword">for</span> p <span class="keyword">in</span> perms:
<span class="keyword">if</span> check_constraints(p, m, bl):
potential.append(p)
<span class="keyword">return</span> random.choice(potential)</code></pre>
</div>
</div>
<div class="paragraph">
<p>Below you can see a bar graph of the counts after producing 10,000 derangements of length 6 with <code>generate_all()</code>.
It <em>looks</em> much more uniform than <code>backtracker()</code> at least.
One tool we can use to gauge how closely our counts match what we should expect from a uniform distribution is the <a href="https://en.wikipedia.org/wiki/Chi-squared_test">chi-squared statistic</a>:</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation*}
\chi^2 = \sum\frac{(O_i - E_i)^2}{E_i}
\end{equation*}\]
</div>
</div>
<div class="paragraph">
<p>where \(O_i\) are our observed counts and \(E_i\) are the expected counts for each derangement (which in our case is \(1/D_6\cdot 10000 \approx 37.7\)).
For my data I calculated \(\chi^2 \approx 261.75\).
If we check that against the chi-squared cumulative distribution function with k-1 degrees of freedom (where k is the number of categories, i.e. the 265 distinct derangements in this case), we get a p-value of about 0.53.
The p-value is the probability that our \(\chi^2\) value would be at least 261.75 if our counts were uniformly distributed.
Usually if p<0.05 it would be prudent to question whether the data fits a uniform distribution.
On the other hand if p>0.99 or so we could be confident it is uniform, but we might question whether it is random.
A p-value of 0.53 should leave us confident that <code>generate_all()</code> randomly generates derangements with uniform or very nearly uniform probability.</p>
</div>
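<div class="paragraph">
<p>For reference, that p-value can be computed with the chi-squared survival function, assuming SciPy is available:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">from scipy.stats import chi2
# probability of a statistic of at least 261.75 under uniformity,
# with 265 - 1 = 264 degrees of freedom
round(chi2.sf(261.75, df=264), 2)
>> 0.53</code></pre>
</div>
</div>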
<div class="imageblock">
<div class="content">
<img src="/log/2020/12/7/deranged_sinterklaas/generate_all.png" alt="A bar chart showing the frequency of each derangement produced in 10,000 trials. The generate_all algorithm appears to be uniform">
</div>
<div class="title">Figure 3. Count of each derangement produced after running generate_all() 10,000 times with \(n = 6\). \(\chi^2 \approx 261.75\) and the chi-squared test p-value ≈ 0.53. The grey line shows the expected count if each derangement were generated with uniform probability (\(\approx 37.7\))</div>
</div>
<div class="paragraph">
<p>But there are two major problems with <code>generate_all()</code>: it is slow (because we have to generate all \(n!\) permutations), and it uses a lot of memory (because we have to store all \(D_n\) derangements).
\(D_{12} = 176,214,841\), for example, so even if we implemented our permutations in some memory efficient way (say an array of one byte per element), we would need over 1GB of memory just to store all of the derangements before returning one.
Running <code>generate_all()</code> with \(n>11\) runs my desktop out of RAM after about a minute and crashes the Python interpreter.
And in the grand scheme of things 12 is not such a huge number.</p>
</div>
</div>
<div class="sect2">
<h3 id="_how_to_generate_derangements">How to generate derangements</h3>
<div class="paragraph">
<p>We can do better than the <code>backtracker</code> and <code>generate_all</code> algorithms above by combining the best aspects of each: generate a single random permutation and check if it is a derangement. If it is, return it; otherwise, try again by generating another random permutation.
That should be much more efficient than generating and storing all possible derangements, and as long as we can generate permutations with uniform probability we will also generate derangements with uniform probability.</p>
</div>
<div class="paragraph">
<p>A well-known algorithm for creating a random permutation by shuffling a given arrangement is to simply select one of the elements at random, set that as the leftmost element of the permutation, and then repeat by selecting one of the remaining elements at random until all of the elements have been selected.
This is known as the <a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle">Fisher-Yates shuffle</a> named for two statisticians who described a paper-and-pen method for shuffling a sequence in 1938.
The computer version of the algorithm — popularized in Chapter 3 of Knuth’s <em>The Art of Computer Programming</em> (“Algorithm P (Shuffling)”) — usually shuffles an array in place.
It does this by iterating through the array from left to right swapping the element at the index with a random element to the right of the index.
Once an element has been swapped left it is in its position in the generated permutation.
Repeat to the end.
For a good visualization of Fisher-Yates (and how it compares to less efficient algorithms) see Mike Bostock’s <a href="https://bost.ocks.org/mike/shuffle/">Fisher-Yates Shuffle</a>.</p>
</div>
<div class="paragraph">
<p>The <code>shuffle_rejection()</code> algorithm below repeatedly shuffles a list using Fisher-Yates until the resulting permutation is a derangement:</p>
</div>
<div class="listingblock">
<div class="title">shuffle_rejection()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">shuffle_rejection</span>(n):
    perm = <span class="predefined">list</span>(<span class="predefined">range</span>(n))
    <span class="keyword">while</span> <span class="keyword">not</span> check_deranged(perm):
        <span class="comment"># Fisher-Yates shuffle:</span>
        <span class="keyword">for</span> i <span class="keyword">in</span> <span class="predefined">range</span>(n):
            k = random.randrange(n-i)+i <span class="comment"># i <= k < n</span>
            perm[i], perm[k] = perm[k], perm[i]
    <span class="keyword">return</span> perm</code></pre>
</div>
</div>
<div class="paragraph">
<p>Notice that in the Fisher-Yates algorithm the range of the random index <code>k</code> includes the index of the current element <code>i</code>. In other words, elements can swap with themselves creating a 1-cycle.
That is necessary, of course, to generate all possible permutations.
If the algorithm is changed so that <code>k</code> ranges only from \(i < k < n\) so that it does not include <code>i</code>, then the algorithm will produce only permutations with a single n-cycle.
This is known as <a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Sattolo’s_algorithm">Sattolo’s algorithm</a>, and it generates n-cycles with uniform probability.</p>
</div>
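<div class="paragraph">
<p>Since this post is about derangements, it is worth noting that every single n-cycle is automatically a derangement (no element can map to itself in a cycle of length \(n \ge 2\)), but not every derangement is a single n-cycle, so Sattolo&#8217;s algorithm alone does not sample derangements uniformly. For reference, here is a minimal sketch of Sattolo&#8217;s variant written in the same left-to-right style as <code>shuffle_rejection()</code> above (assuming the same <code>import random</code>; the function name is mine):</p>
</div>
<div class="listingblock">
<div class="title">sattolo() (sketch)</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">def sattolo(n):
    perm = list(range(n))
    for i in range(n - 1):
        # Unlike Fisher-Yates, k ranges over i < k < n, so an
        # element can never be swapped with itself.
        k = random.randrange(i + 1, n)
        perm[i], perm[k] = perm[k], perm[i]
    return perm</code></pre>
</div>
</div>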
<div class="paragraph">
<p>The bar graph below summarizes the results of generating 10,000 derangements of size 6 with <code>shuffle_rejection()</code>.
The distribution appears to be uniform, as expected. The algorithm is also fast and simple, which makes it perfectly suitable for generating secret santa assignments (and is, in fact, what I use in <a href="#_software">sinterbot</a>, my secret santa tool).</p>
</div>
<div class="imageblock">
<div class="content">
<img src="/log/2020/12/7/deranged_sinterklaas/generate_rejection.png" alt="A bar chart showing the frequency of each derangement produced in 10,000 trials. shuffle_rejection algorithm appears to be uniform">
</div>
<div class="title">Figure 4. Count of each derangement produced after running shuffle_rejection() 10,000 times with \(n = 6\). \(\chi^2 \approx 278.9\) and the chi-squared test p-value ≈ 0.25.</div>
</div>
<div class="paragraph">
<p>One inelegance of <code>shuffle_rejection()</code> is that it may generate many random permutations just to throw them away.
In principle it could keep generating and rejecting non-deranged permutations all day without ever returning.
In practice derangements are common enough that this is not a concern (the function finds close to 20,000 derangements of length 6 per second on my old desktop).
But is there a way to directly generate derangements with uniform probability, without needing to backtrack or reject non-deranged permutations?</p>
</div>
<div class="paragraph">
<p>Yes. In 2008 Martínez et al. published one such algorithm (<a href="https://doc.lagout.org/science/0_Computer%20Science/2_Algorithms/Proceedings%20of%20the%20Tenth%20Workshop%20on%20Algorithm%20Engineering%20and%20Experiments%20and%20the%20Fifth%20Workshop%20on%20Analytic%20Algorithmics%20and%20Combinatorics%20%5BMunro%20et%20al.%202008-05-30%5D.pdf">“Generating Random Derangements,”</a> 234-240).
It is similar to Sattolo’s algorithm, but instead of joining every element into a single n-cycle, it randomly closes cycles with a probability chosen to ensure that derangements are generated uniformly.
<a href="https://www.cs.upc.edu/~conrado/research/talks/analco08.pdf">Here is a nice set of slides</a> that goes through their algorithm step by step.</p>
</div>
<div class="paragraph">
<p>Jörg Arndt provides an easier-to-follow (in my opinion) version of the algorithm in his 2010 thesis <a href="https://maths-people.anu.edu.au/~brent/pd/Arndt-thesis.pdf"><em>Generating Random Permutations</em></a>. (It’s a short book that includes several useful algorithms.)
This Python implementation more closely follows his version:</p>
</div>
<div class="listingblock">
<div class="title">rand_derangement()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">rand_derangement</span>(n):
perm = <span class="predefined">list</span>(<span class="predefined">range</span>(n))
remaining = <span class="predefined">list</span>(perm)
<span class="keyword">while</span> (<span class="predefined">len</span>(remaining)><span class="integer">1</span>):
<span class="comment"># random index < last:</span>
rand_i = random.randrange(<span class="predefined">len</span>(remaining)-<span class="integer">1</span>)
rand = remaining[rand_i]
last = remaining[-<span class="integer">1</span>]
<span class="comment"># swap to join cycles</span>
perm[last], perm[rand] = perm[rand], perm[last]
<span class="comment"># remove last from remaining</span>
remaining.pop(-<span class="integer">1</span>)
p = random.random() <span class="comment"># uniform [0, 1)</span>
r = <span class="predefined">len</span>(remaining)
prob = r * Dn(r-<span class="integer">1</span>)/Dn(r+<span class="integer">1</span>)
<span class="keyword">if</span> p < prob:
<span class="comment"># Close the cycle</span>
remaining.pop(rand_i)
<span class="keyword">return</span> perm</code></pre>
</div>
</div>
<div class="imageblock">
<div class="content">
<img src="/log/2020/12/7/deranged_sinterklaas/rand_derangement.png" alt="A bar chart showing the frequency of each derangement produced in 10,000 trials. The rand_derangement() algorithm appears to be uniform.">
</div>
<div class="title">Figure 5. Count of each derangement produced after running rand_derangement() 10,000 times with \(n = 6\). \(\chi^2 \approx 261.1\) and the chi-squared test p-value ≈ 0.54.</div>
</div>
<div class="paragraph">
<p>Arndt’s implementation is in C, with a precomputed lookup table for the ratio calculated on the <code>prob = r * Dn(r-1)/Dn(r+1)</code> line. Even then, he reports it is only slightly faster than the rejection method. This Python implementation actually runs at about half the speed of the rejection method in my tests.</p>
</div>
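<div class="paragraph">
<p>In Python, much of the benefit of Arndt&#8217;s lookup table can be recovered by memoizing the <code>Dn()</code> derangement-count helper so that each value is computed only once. Here is a minimal sketch of one way to implement and cache such a helper, assuming the standard recurrence \(D_n = (n-1)(D_{n-1} + D_{n-2})\) with \(D_0 = 1\) and \(D_1 = 0\):</p>
</div>
<div class="listingblock">
<div class="title">Memoized Dn() (sketch)</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">from functools import lru_cache

@lru_cache(maxsize=None)
def Dn(n):
    # Count of derangements of n elements via the recurrence
    # D(0) = 1, D(1) = 0, D(n) = (n - 1) * (D(n - 1) + D(n - 2))
    if n == 0:
        return 1
    if n == 1:
        return 0
    return (n - 1) * (Dn(n - 1) + Dn(n - 2))</code></pre>
</div>
</div>
<div class="paragraph">
<p>With the cache populated, the <code>prob = r * Dn(r-1)/Dn(r+1)</code> line costs only a couple of dictionary lookups.</p>
</div>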
<div class="paragraph">
<p>But one advantage of generating derangements directly, as in <code>rand_derangement()</code>, is that the method can be generalized to produce derangements with a minimum cycle length. Arndt shows how that can be done in his thesis.</p>
</div>
<div class="paragraph">
<p>There are other ways to generate random derangements that I’ve not covered in this post.
Earlier this year J. Ricardo G. Mendonça published two new algorithms for [almost-]uniformly generating random derangements: <a href="https://arxiv.org/pdf/1809.04571.pdf">“Efficient generation of random derangements with the expected distribution of cycle lengths,”</a> <em>Computational and Applied Mathematics</em> 39, no. 3 (2020): 1-15.</p>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="software">Sinterbot2020</h2>
<div class="sectionbody">
<div class="paragraph">
<p><code>sinterbot</code> is a little command line program (Python 3.5+) that helps to manage secret santa assignments. With <code>sinterbot</code> you can generate a valid secret santa assignment for a list of people and email each person their assigned gift recipient without ever revealing to anybody (including the operator of <code>sinterbot</code>) the full secret list of assignments.</p>
</div>
<div class="paragraph">
<p>Source code and more usage instructions: <a href="https://github.com/cristoper/sinterbot" class="bare">https://github.com/cristoper/sinterbot</a></p>
</div>
<div class="paragraph">
<p><code>sinterbot</code> allows specifying some extra constraints such as minimum cycle length or a blacklist of people who should not be assigned to each other.</p>
</div>
<div class="sect2">
<h3 id="_installation">Installation</h3>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh">pip install sinterbot</code></pre>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_usage">Usage</h3>
<div class="paragraph">
<p>First create a config file with a list of participants' names and email addresses. The config file may also specify constraints for minimum cycle length and a blacklist. See <a href="https://github.com/cristoper/sinterbot/blob/master/sample.conf">sample.conf</a> for a full example:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xmas2020.conf"># xmas2020.conf
Santa A: user1@email.tld
Santa B: user2@email.tld
Santa C: user3@email.tld
Santa D: user4@email.tld
Santa E: user5@email.tld</code></pre>
</div>
</div>
<div class="paragraph">
<p>The format is <code>Name: emailaddress</code>. Only the email addresses need to be unique.</p>
</div>
<div class="paragraph">
<p>Then run <code>sinterbot derange</code> to compute a valid assignment and save it to the config file:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh">$ sinterbot derange xmas2020.conf
Derangement info successfully added to config file.
Use `sinterbot send xmas2020.conf -c smtp.conf` to send emails!</code></pre>
</div>
</div>
<div class="paragraph">
<p><code>sinterbot</code> will not allow you to re-derange a config file without passing the <code>--force</code> flag.</p>
</div>
<div class="paragraph">
<p>Now, if you want, you can view the secret santa assignments with <code>sinterbot view xmas2020.conf</code>. However, if you’re a participant, that would ruin the surprise for you! Instead you can email each person their assignment without ever seeing the assignments yourself.</p>
</div>
<div class="paragraph">
<p>First create a file to specify your SMTP credentials:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh"># smtp.conf
SMTPEmail: yourname@gmail.com
SMTPPass: yourgmailpassword
SMTPServer: smtp.gmail.com
SMTPPort: 587</code></pre>
</div>
</div>
<div class="paragraph">
<p>(If you do not know what SMTP server to use but you have a gmail account, you can <a href="https://www.digitalocean.com/community/tutorials/how-to-use-google-s-smtp-server">use gmail’s SMTP server</a> with values like those shown above.)</p>
</div>
<div class="paragraph">
<p>Then run the <code>sinterbot send</code> command, giving it the smtp credentials file with the <code>-c</code> option, to send the emails:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh">$ sinterbot send xmas2020.conf -c smtp.conf
Send message to user1@email.tld!
Send message to user2@email.tld!
Send message to user3@email.tld!
Send message to user4@email.tld!
Send message to user5@email.tld!</code></pre>
</div>
</div>
</div>
</div>
</div>How to generate uniformly random derangements for secret santa purposes.tag:catswhisker.xyz,2017-09-07:/log/2017/9/7/a_first_excercise_in_natural_language_processing_with_python_counting_hapaxes/A First Exercise in Natural Language Processing with Python: Counting Hapaxes2017-09-07T18:32:37Z2022-10-23T17:44:18Z<div id="toc" class="toc">
<div id="toctitle">Table of Contents</div>
<ul class="sectlevel1">
<li><a href="#_a_first_exercise">A first exercise</a></li>
<li><a href="#_natural_language_processing_with_python">Natural language processing with Python</a>
<ul class="sectlevel2">
<li><a href="#_installation">Installation</a></li>
<li><a href="#_optional_dependency_on_python_modules">Optional dependency on Python modules</a></li>
<li><a href="#_tokenization">Tokenization</a></li>
<li><a href="#_counting_word_forms">Counting word forms</a></li>
<li><a href="#_stemming_and_lemmatization">Stemming and Lemmatization</a></li>
<li><a href="#_lemmatization_with_nltk">Lemmatization with NLTK</a></li>
<li><a href="#_finding_hapaxes_with_spacy">Finding hapaxes with spaCy</a></li>
<li><a href="#_make_it_a_script">Make it a script</a></li>
</ul>
</li>
<li><a href="#_hapaxes_py_listing">hapaxes.py listing</a></li>
</ul>
</div>
<div class="sect1">
<h2 id="_a_first_exercise">A first exercise</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Counting <a href="https://en.wikipedia.org/wiki/Hapax_legomenon">hapaxes</a> (words which occur only once in a text or corpus) is an easy enough problem that makes use of both simple data structures and some fundamental tasks of natural language processing (NLP): tokenization (dividing a text into words), stemming, and part-of-speech tagging for lemmatization. For that reason it makes a good exercise to get started with NLP in a new language or library.</p>
</div>
<div class="paragraph">
<p>As a first exercise in implementing NLP tasks with Python, then, we’ll write a script which outputs the count and a list of the hapaxes in the following paragraph (our script can also be run on an arbitrary input file). You can follow along, or try it yourself and then compare your solution to mine.</p>
</div>
<div class="listingblock">
<div class="content">
<pre>Cory Linguist, a cautious corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link. Now, if Cory Linguist, a careful corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link, see that YOU, in creating a corpus of courtship correspondence, corrupt not a crucial link.</pre>
</div>
</div>
<div class="paragraph">
<p>To keep things simple, ignore punctuation and case. To make things complex, count hapaxes in all three of word form, stemmed form, and lemma form. The final program (<a href="hapaxes.py">hapaxes.py</a>) is listed at the end of this post. The sections below walk through it in detail for the beginning NLP/Python programmer.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_natural_language_processing_with_python">Natural language processing with Python</h2>
<div class="sectionbody">
<div class="paragraph">
<p>There are several NLP packages available to the Python programmer. The most well-known is the <a href="http://www.nltk.org/">Natural Language Toolkit (NLTK)</a>, which is the subject of the popular book <a href="http://www.nltk.org/book/"><em>Natural Language Processing with Python</em></a> by Bird et al. NLTK has a focus on education/research with a rather sprawling API. <a href="https://github.com/clips/pattern">Pattern</a> is a Python package for data mining the WWW that includes submodules for language processing and machine learning. <a href="http://polyglot.readthedocs.io/en/latest/">Polyglot</a> is a language library focusing on “massive multilingual applications.” Many of its features support over 100 languages (but it doesn’t seem to have a stemmer or lemmatizer built in). And there is Matthew Honnibal’s <a href="https://spacy.io/">spaCy</a>, an “industrial strength” NLP library focused on performance and integration with machine learning models.</p>
</div>
<div class="paragraph">
<p>If you don’t already know which library you want to use, I recommend starting with NLTK because there are so many online resources available for it. The program presented below implements four solutions to counting hapaxes, which will hopefully give you a feel for a few of the libraries mentioned above:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Word forms - counts unique spellings (normalized for case). This uses plain Python (no NLP packages required)</p>
</li>
<li>
<p>NLTK stems - counts unique stems using a stemmer provided by NLTK</p>
</li>
<li>
<p>NLTK lemmas - counts unique lemma forms using NLTK’s part of speech tagger and
interface to the WordNet lemmatizer</p>
</li>
<li>
<p>spaCy lemmas - counts unique lemma forms using the spaCy NLP package</p>
</li>
</ul>
</div>
<div class="sect2">
<h3 id="_installation">Installation</h3>
<div class="paragraph">
<p>This tutorial assumes you already have Python installed on your system and have some experience using the interpreter. I recommend referring to each package’s project page for installation instructions, but here is one way using <a href="https://pypi.python.org/pypi/pip">pip</a>. As explained below, each of the NLP packages are optional; feel free to install only the ones you’re interested in playing with.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh"># Install NLTK:
$ pip install nltk
# Download required NLTK data packages
$ python -c 'import nltk; nltk.download("wordnet"); nltk.download("averaged_perceptron_tagger"); nltk.download("omw-1.4")'
# install spaCy:
$ pip install spacy
# install spaCy en model:
$ python -m spacy download en_core_web_sm</code></pre>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_optional_dependency_on_python_modules">Optional dependency on Python modules</h3>
<div class="paragraph">
<p>It would be nice if our script didn’t depend on any particular NLP package so that it could still run even if one or more of them were not installed (using only the functionality provided by whichever packages are installed).</p>
</div>
<div class="paragraph">
<p>One way to implement a script with optional package dependencies in Python is to try to import a module, and if we get an <code>ImportError</code> <a href="https://docs.python.org/3/tutorial/errors.html#exceptions">exception</a> we mark the package as uninstalled (by setting a variable with the module’s name to <code>None</code>) which we can check for later in our code:</p>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 63-98]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="comment">### Imports</span>
<span class="comment">#</span>
<span class="comment"># Import some Python 3 features to use in Python 2</span>
<span class="keyword">from</span> <span class="include">__future__</span> <span class="keyword">import</span> <span class="include">print_function</span>
<span class="keyword">from</span> <span class="include">__future__</span> <span class="keyword">import</span> <span class="include">unicode_literals</span>
<span class="comment"># gives us access to command-line arguments</span>
<span class="keyword">import</span> <span class="include">sys</span>
<span class="comment"># The Counter collection is a convenient layer on top of</span>
<span class="comment"># python's standard dictionary type for counting iterables.</span>
<span class="keyword">from</span> <span class="include">collections</span> <span class="keyword">import</span> <span class="include">Counter</span>
<span class="comment"># The standard python regular expression module:</span>
<span class="keyword">import</span> <span class="include">re</span>
<span class="keyword">try</span>:
<span class="comment"># Import NLTK if it is installed</span>
<span class="keyword">import</span> <span class="include">nltk</span>
<span class="comment"># This imports NLTK's implementation of the Snowball</span>
<span class="comment"># stemmer algorithm</span>
<span class="keyword">from</span> <span class="include">nltk.stem.snowball</span> <span class="keyword">import</span> <span class="include">SnowballStemmer</span>
<span class="comment"># NLTK's interface to the WordNet lemmatizer</span>
<span class="keyword">from</span> <span class="include">nltk.stem.wordnet</span> <span class="keyword">import</span> <span class="include">WordNetLemmatizer</span>
<span class="keyword">except</span> <span class="exception">ImportError</span>:
nltk = <span class="predefined-constant">None</span>
print(<span class="string"><span class="delimiter">"</span><span class="content">NLTK is not installed, so we won't use it.</span><span class="delimiter">"</span></span>)
<span class="keyword">try</span>:
<span class="comment"># Import spaCy if it is installed</span>
<span class="keyword">import</span> <span class="include">spacy</span>
<span class="keyword">except</span> <span class="exception">ImportError</span>:
spacy = <span class="predefined-constant">None</span>
print(<span class="string"><span class="delimiter">"</span><span class="content">spaCy is not installed, so we won't use it.</span><span class="delimiter">"</span></span>)</code></pre>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_tokenization">Tokenization</h3>
<div class="paragraph">
<p><a href="https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)">Tokenization</a> is the process of splitting a string into lexical ‘tokens’ — usually words or sentences. In languages with space-separated words, satisfactory tokenization can often be accomplished with a few simple rules, though ambiguous punctuation can cause errors (such as mistaking a period after an abbreviation as the end of a sentence). Some tokenizers use statistical inference (trained on a corpus with known token boundaries) to recognize tokens.</p>
</div>
<div class="paragraph">
<p>In our case we need to break the text into a list of words in order to find the hapaxes. But since we are not interested in punctuation or capitalization, we can make tokenization very simple by first normalizing the text to lower case and stripping out every punctuation symbol:</p>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 100-119]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">normalize_tokenize</span>(string):
<span class="docstring"><span class="delimiter">"""</span><span class="content">
</span><span class="content"> Takes a string, normalizes it (makes it lowercase and</span><span class="content">
</span><span class="content"> removes punctuation), and then splits it into a list of</span><span class="content">
</span><span class="content"> words.</span><span class="content">
</span><span class="content">
</span><span class="content"> Note that everything in this function is plain Python</span><span class="content">
</span><span class="content"> without using NLTK (although as noted below, NLTK provides</span><span class="content">
</span><span class="content"> some more sophisticated tokenizers we could have used).</span><span class="content">
</span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="comment"># make lowercase</span>
norm = string.lower()
<span class="comment"># remove punctuation</span>
norm = re.sub(<span class="string"><span class="modifier">r</span><span class="delimiter">'</span><span class="content">(?u)[^</span><span class="content">\w</span><span class="content">\s</span><span class="content">]</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="delimiter">'</span></span>, norm) <i class="conum" data-value="1"></i><b>(1)</b>
<span class="comment"># split into words</span>
tokens = norm.split()
<span class="keyword">return</span> tokens</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Remove punctuation by replacing everything that is not a word (<code>\w</code>) or whitespace (<code>\s</code>) with an empty string. The (<code>?u</code>) flag at the beginning of the regex enables unicode matching for the \w and \s character classes in Python 2 (unicode is the default with Python 3).</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>Our tokenizer produces output like this:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> normalize_tokenize(<span class="string"><span class="delimiter">"</span><span class="content">This is a test sentence of white-space separated words.</span><span class="delimiter">"</span></span>)
[<span class="string"><span class="delimiter">'</span><span class="content">this</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">is</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">a</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">test</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">sentence</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">of</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">whitespace</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">separated</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">words</span><span class="delimiter">'</span></span>]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Instead of simply removing punctuation and then splitting words on whitespace, we could have used one of <a href="http://www.nltk.org/api/nltk.tokenize.html">the tokenizers provided by NLTK</a>. Specifically, we could have used the <code>word_tokenize()</code> method, which first splits the text into sentences using a pre-trained English sentence tokenizer (<code>sent_tokenize</code>) and then finds words using regular expressions in the style of Penn Treebank tokenization.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="comment"># We could have done it this way (requires the</span>
<span class="comment"># 'punkt' data package):</span>
<span class="keyword">from</span> <span class="include">nltk.tokenize</span> <span class="keyword">import</span> <span class="include">word_tokenize</span>
tokens = word_tokenize(norm)</code></pre>
</div>
</div>
<div class="paragraph">
<p>The main advantage of <code>word_tokenize()</code> is that it will turn contractions into separate tokens. But using Python’s standard <code>split()</code> is good enough for our purposes.</p>
</div>
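<div class="paragraph">
<p>For example, <code>word_tokenize()</code> splits a contraction into two tokens, where our simple tokenizer just strips the apostrophe (interactive session, assuming the <code>punkt</code> data package has been downloaded):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("don't")
['do', "n't"]
>>> normalize_tokenize("don't")
['dont']</code></pre>
</div>
</div>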
</div>
<div class="sect2">
<h3 id="_counting_word_forms">Counting word forms</h3>
<div class="paragraph">
<p>We can use the tokenizer defined above to get a list of words from any string, so now we need a way to count how many times each word occurs. Those that occur only once are our word-form hapaxes.</p>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 121-135]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">word_form_hapaxes</span>(tokens):
<span class="docstring"><span class="delimiter">"""</span><span class="content">
</span><span class="content"> Takes a list of tokens and returns a list of the</span><span class="content">
</span><span class="content"> wordform hapaxes (those wordforms that only appear once)</span><span class="content">
</span><span class="content">
</span><span class="content"> For wordforms this is simple enough to do in plain</span><span class="content">
</span><span class="content"> Python without an NLP package, especially using the Counter</span><span class="content">
</span><span class="content"> type from the collections module (part of the Python</span><span class="content">
</span><span class="content"> standard library).</span><span class="content">
</span><span class="content"> </span><span class="delimiter">"""</span></span>
counts = Counter(tokens) <i class="conum" data-value="1"></i><b>(1)</b>
hapaxes = [word <span class="keyword">for</span> word <span class="keyword">in</span> counts <span class="keyword">if</span> counts[word] == <span class="integer">1</span>] <i class="conum" data-value="2"></i><b>(2)</b>
<span class="keyword">return</span> hapaxes</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Use the convenient <code><a href="https://docs.python.org/3/library/collections.html#collections.Counter">Counter</a></code> class from Python’s standard library to count the occurrences of each token. <code>Counter</code> is a subclass of the standard <code>dict</code> type; its constructor takes a list of items from which it builds a dictionary whose keys are elements from the list and whose values are the number of times each element appeared in the list.</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>This <a href="https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions">list comprehension</a> creates a list from the Counter dictionary containing only the dictionary keys that have a count of 1. These are our hapaxes.</td>
</tr>
</table>
</div>
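<div class="paragraph">
<p>To see concretely what <code>Counter</code> and the hapax filter do, here is a quick interactive example (the repr shown is from a recent Python 3, where <code>Counter</code> displays entries in decreasing-count order):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> from collections import Counter
>>> counts = Counter(['to', 'be', 'or', 'not', 'to', 'be'])
>>> counts
Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
>>> [word for word in counts if counts[word] == 1]
['or', 'not']</code></pre>
</div>
</div>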
</div>
<div class="sect2">
<h3 id="_stemming_and_lemmatization">Stemming and Lemmatization</h3>
<div class="paragraph">
<p>If we use our two functions to first tokenize and then find the hapaxes in our example text, we get this output:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> text = <span class="string"><span class="delimiter">"</span><span class="content">Cory Linguist, a cautious corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link. Now, if Cory Linguist, a careful corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link, see that YOU, in creating a corpus of courtship correspondence, corrupt not a crucial link.</span><span class="delimiter">"</span></span>
>>> tokens = normalize_tokenize(text)
>>> word_form_hapaxes(tokens)
[<span class="string"><span class="delimiter">'</span><span class="content">now</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">not</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">that</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">see</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">if</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">corrupt</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">you</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">careful</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">cautious</span><span class="delimiter">'</span></span>]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Notice that ‘corrupt’ is counted as a hapax even though the text also includes two instances of the word ‘corrupted’. That is expected because ‘corrupt’ and ‘corrupted’ are different word-forms, but if we want to count word roots regardless of their inflections we must process our tokens further. There are two main methods we can try:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><a href="https://en.wikipedia.org/wiki/Stemming">Stemming</a> uses an algorithm (and/or a lookup table) to remove the suffix of tokens so that words with the same base but different inflections are reduced to the same form. For example: ‘argued’ and ‘arguing’ are both stemmed to ‘argu’.</p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Lemmatisation">Lemmatization</a> reduces tokens to their lemmas, their canonical dictionary form. For example, ‘argued’ and ‘arguing’ are both lemmatized to ‘argue’.</p>
</li>
</ul>
</div>
<div class="sect3">
<h4 id="_stemming_with_nltk">Stemming with NLTK</h4>
<div class="paragraph">
<p>In 1980 Martin Porter published <a href="https://tartarus.org/martin/PorterStemmer/index.html">a stemming algorithm</a> which has become a standard way to stem English words. His algorithm was implemented so many times, and with so many errors, that he later created <a href="https://snowballstem.org/">a programming language called Snowball</a> to help clearly and exactly define stemmers. NLTK includes a Python port of the Snowball implementation of an improved version of Porter’s original stemmer:</p>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 137-153]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">nltk_stem_hapaxes</span>(tokens):
<span class="docstring"><span class="delimiter">"""</span><span class="content">
</span><span class="content"> Takes a list of tokens and returns a list of the word</span><span class="content">
</span><span class="content"> stem hapaxes.</span><span class="content">
</span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="keyword">if</span> <span class="keyword">not</span> nltk: <i class="conum" data-value="1"></i><b>(1)</b>
<span class="comment"># Only run if NLTK is loaded</span>
<span class="keyword">return</span> <span class="predefined-constant">None</span>
<span class="comment"># Apply NLTK's Snowball stemmer algorithm to tokens:</span>
stemmer = SnowballStemmer(<span class="string"><span class="delimiter">"</span><span class="content">english</span><span class="delimiter">"</span></span>)
stems = [stemmer.stem(token) <span class="keyword">for</span> token <span class="keyword">in</span> tokens]
<span class="comment"># Filter down to hapaxes:</span>
counts = nltk.FreqDist(stems) <i class="conum" data-value="2"></i><b>(2)</b>
hapaxes = counts.hapaxes() <i class="conum" data-value="3"></i><b>(3)</b>
<span class="keyword">return</span> hapaxes</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Here we check if the <code>nltk</code> module was loaded; if it was not (presumably because it is not installed), we return without trying to run the stemmer.</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>NLTK’s <code><a href="http://www.nltk.org/_modules/nltk/probability.html">FreqDist</a></code> class subclasses the <code>Counter</code> container type we used above to count word-forms. It adds some methods useful for calculating frequency distributions.</td>
</tr>
<tr>
<td><i class="conum" data-value="3"></i><b>3</b></td>
<td>The <code>FreqDist</code> class also adds a <code>hapaxes()</code> method, which is implemented exactly like the list comprehension we used to count word-form hapaxes.</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>Running <code>nltk_stem_hapaxes()</code> on our tokenized example text produces this list of stem hapaxes:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> nltk_stem_hapaxes(tokens)
[<span class="string"><span class="delimiter">'</span><span class="content">now</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">cautious</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">that</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">not</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">see</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">you</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">care</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">if</span><span class="delimiter">'</span></span>]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Notice that ‘corrupt’ is no longer counted as a hapax (since it shares a stem with ‘corrupted’), and ‘careful’ has been stemmed to ‘care’.</p>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_lemmatization_with_nltk">Lemmatization with NLTK</h3>
<div class="paragraph">
<p>NLTK provides a lemmatizer (the <code>WordNetLemmatizer</code> class in <a href="http://www.nltk.org/_modules/nltk/stem/wordnet.html">nltk.stem.wordnet</a>) which tries to find a word’s lemma form with help from the <a href="https://wordnet.princeton.edu/">WordNet</a> corpus (which can be downloaded by running <code>nltk.download()</code> from an interactive python prompt — refer to <a href="http://www.nltk.org/data.html">“Installing NLTK Data”</a> for general instructions).</p>
</div>
<div class="paragraph">
<p>In order to resolve ambiguous cases, lemmatization usually requires tokens to be accompanied by part-of-speech tags. For example, the lemma of the word <em>rose</em> depends on whether it is used as a noun or a verb:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> lemmer = WordNetLemmatizer()
>>> lemmer.lemmatize(<span class="string"><span class="delimiter">'</span><span class="content">rose</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">n</span><span class="delimiter">'</span></span>) <span class="comment"># tag as noun</span>
<span class="string"><span class="delimiter">'</span><span class="content">rose</span><span class="delimiter">'</span></span>
>>> lemmer.lemmatize(<span class="string"><span class="delimiter">'</span><span class="content">rose</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">v</span><span class="delimiter">'</span></span>) <span class="comment"># tag as verb</span>
<span class="string"><span class="delimiter">'</span><span class="content">rise</span><span class="delimiter">'</span></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>Since we are operating on untagged tokens, we’ll first run them through an automated part-of-speech tagger provided by NLTK (it uses a pre-trained perceptron tagger originally by Matthew Honnibal: <a href="https://explosion.ai/blog/part-of-speech-pos-tagger-in-python">“A Good Part-of-Speech Tagger in about 200 Lines of Python”</a>). The tagger requires the training data in the 'averaged_perceptron_tagger.pickle' file, which can be downloaded by running <code>nltk.download()</code> from an interactive Python prompt.</p>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 155-176]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">nltk_lemma_hapaxes</span>(tokens):
<span class="docstring"><span class="delimiter">"""</span><span class="content">
</span><span class="content"> Takes a list of tokens and returns a list of the lemma</span><span class="content">
</span><span class="content"> hapaxes.</span><span class="content">
</span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="keyword">if</span> <span class="keyword">not</span> nltk:
<span class="comment"># Only run if NLTK is loaded</span>
<span class="keyword">return</span> <span class="predefined-constant">None</span>
<span class="comment"># Tag tokens with part-of-speech:</span>
tagged = nltk.pos_tag(tokens) <i class="conum" data-value="1"></i><b>(1)</b>
<span class="comment"># Convert our Treebank-style tags to WordNet-style tags.</span>
tagged = [(word, pt_to_wn(tag))
<span class="keyword">for</span> (word, tag) <span class="keyword">in</span> tagged] <i class="conum" data-value="2"></i><b>(2)</b>
<span class="comment"># Lemmatize:</span>
lemmer = WordNetLemmatizer()
lemmas = [lemmer.lemmatize(token, pos)
<span class="keyword">for</span> (token, pos) <span class="keyword">in</span> tagged] <i class="conum" data-value="3"></i><b>(3)</b>
<span class="keyword">return</span> nltk_stem_hapaxes(lemmas) <i class="conum" data-value="4"></i><b>(4)</b></code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>This turns our list of tokens into a list of 2-tuples: <code>[(token1, tag1), (token2, tag2)…​]</code></td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>We must convert between the tags returned by <code>pos_tag()</code> and the tags expected by the WordNet lemmatizer. This is done by applying the <code>pt_to_wn()</code> function (defined below) to each tag.</td>
</tr>
<tr>
<td><i class="conum" data-value="3"></i><b>3</b></td>
<td>Pass each token and POS tag to the WordNet lemmatizer.</td>
</tr>
<tr>
<td><i class="conum" data-value="4"></i><b>4</b></td>
<td>If a lemma is not found for a token, then it is returned from <code>lemmatize()</code> unchanged. To ensure these unhandled words don’t contribute spurious hapaxes, we pass our lemmatized tokens through the word stemmer for good measure (which also filters the list down to only hapaxes).</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>As noted above, the tags returned by <code>pos_tag()</code> are <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">Penn Treebank style tags</a> while the WordNet lemmatizer uses its own tag set (defined in the <code>nltk.corpus.reader.wordnet</code> module, though that is not very clear from the NLTK documentation). The <code>pt_to_wn()</code> function converts Treebank tags to the tags required for lemmatization:</p>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 178-209]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">pt_to_wn</span>(pos):
<span class="docstring"><span class="delimiter">"""</span><span class="content">
</span><span class="content"> Takes a Penn Treebank tag and converts it to an</span><span class="content">
</span><span class="content"> appropriate WordNet equivalent for lemmatization.</span><span class="content">
</span><span class="content">
</span><span class="content"> A list of Penn Treebank tags is available at:</span><span class="content">
</span><span class="content"> https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html</span><span class="content">
</span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="keyword">from</span> <span class="include">nltk.corpus.reader.wordnet</span> <span class="keyword">import</span> <span class="include">NOUN</span>, <span class="include">VERB</span>, <span class="include">ADJ</span>, <span class="include">ADV</span>
pos = pos.lower()
<span class="keyword">if</span> pos.startswith(<span class="string"><span class="delimiter">'</span><span class="content">jj</span><span class="delimiter">'</span></span>):
tag = ADJ
<span class="keyword">elif</span> pos == <span class="string"><span class="delimiter">'</span><span class="content">md</span><span class="delimiter">'</span></span>:
<span class="comment"># Modal auxiliary verbs</span>
tag = VERB
<span class="keyword">elif</span> pos.startswith(<span class="string"><span class="delimiter">'</span><span class="content">rb</span><span class="delimiter">'</span></span>):
tag = ADV
<span class="keyword">elif</span> pos.startswith(<span class="string"><span class="delimiter">'</span><span class="content">vb</span><span class="delimiter">'</span></span>):
tag = VERB
<span class="keyword">elif</span> pos == <span class="string"><span class="delimiter">'</span><span class="content">wrb</span><span class="delimiter">'</span></span>:
<span class="comment"># Wh-adverb (how, however, whence, whenever...)</span>
tag = ADV
<span class="keyword">else</span>:
<span class="comment"># default to NOUN</span>
<span class="comment"># This is not strictly correct, but it is good</span>
<span class="comment"># enough for lemmatization.</span>
tag = NOUN
<span class="keyword">return</span> tag</code></pre>
</div>
</div>
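<div class="paragraph">
<p>A couple of interactive examples of the conversion (the WordNet tags are just the one-letter strings <code>'n'</code>, <code>'v'</code>, <code>'a'</code>, and <code>'r'</code>):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> pt_to_wn('VBD')  # past-tense verb
'v'
>>> pt_to_wn('JJR')  # comparative adjective
'a'
>>> pt_to_wn('NNS')  # plural noun falls through to the default
'n'</code></pre>
</div>
</div>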
</div>
<div class="sect2">
<h3 id="_finding_hapaxes_with_spacy">Finding hapaxes with spaCy</h3>
<div class="paragraph">
<p>Unlike the NLTK API, spaCy is designed to tokenize, parse, and tag a text all by calling the single function returned by <code>spacy.load()</code>. The spaCy parser returns a ‘document’ object which contains all the tokens, their lemmas, etc. According to the spaCy documentation, “Lemmatization is performed using the WordNet data, but extended to also cover closed-class words such as pronouns.” The function below shows how to find the lemma hapaxes in a spaCy document.</p>
</div>
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
spaCy’s models load quite a bit of data from disk, which can make script startup slow; this makes spaCy more suitable for long-running programs than for one-off scripts like ours.
</td>
</tr>
</table>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 211-234]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">spacy_hapaxes</span>(rawtext):
<span class="docstring"><span class="delimiter">"""</span><span class="content">
</span><span class="content"> Takes plain text and returns a list of lemma hapaxes using</span><span class="content">
</span><span class="content"> the spaCy NLP package.</span><span class="content">
</span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="keyword">if</span> <span class="keyword">not</span> spacy:
<span class="comment"># Only run if spaCy is installed</span>
<span class="keyword">return</span> <span class="predefined-constant">None</span>
<span class="comment"># Load the English spaCy parser</span>
spacy_parse = spacy.load(<span class="string"><span class="delimiter">'</span><span class="content">en_core_web_sm</span><span class="delimiter">'</span></span>)
<span class="comment"># Tokenize, parse, and tag text:</span>
doc = spacy_parse(rawtext)
lemmas = [token.lemma_ <span class="keyword">for</span> token <span class="keyword">in</span> doc
<span class="keyword">if</span> <span class="keyword">not</span> token.is_punct <span class="keyword">and</span> <span class="keyword">not</span> token.is_space] <i class="conum" data-value="1"></i><b>(1)</b>
<span class="comment"># Now we can get a count of every lemma:</span>
counts = Counter(lemmas) <i class="conum" data-value="2"></i><b>(2)</b>
<span class="comment"># We are interested in lemmas which appear only once</span>
hapaxes = [lemma <span class="keyword">for</span> lemma <span class="keyword">in</span> counts <span class="keyword">if</span> counts[lemma] == <span class="integer">1</span>]
<span class="keyword">return</span> hapaxes</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>This list comprehension collects the lemma form (<code>token.lemma_</code>) of all tokens in the spaCy document which are not punctuation (<code>token.is_punct</code>) or whitespace (<code>token.is_space</code>).</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>An alternative way to do this would be to first get a count of lemmas using the <code><a href="https://spacy.io/docs/api/doc#count_by">count_by()</a></code> method of a spaCy document, and then filtering out punctuation if desired: <code>counts = doc.count_by(spacy.attrs.LEMMA)</code> (but then you’d have to map the resulting attributes (integers) back to words by looping over the tokens and checking their <code>orth</code> attribute).</td>
</tr>
</table>
</div>
</div>
<div class="sect2">
<h3 id="_make_it_a_script">Make it a script</h3>
<div class="paragraph">
<p>You can play with the functions we’ve defined above by typing (or copy-and-pasting) them into an interactive Python session. If we save them all to a file, then that file is a Python module which we can <code>import</code> and use in a Python script. To use a single file as both a module and a script, our file can include a construct like this:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">if</span> __name__ == <span class="string"><span class="delimiter">"</span><span class="content">__main__</span><span class="delimiter">"</span></span>:
<span class="comment"># our script logic here</span></code></pre>
</div>
</div>
<div class="paragraph">
<p>This works because when the Python interpreter executes a script (as opposed to importing a module), it sets the top-level variable <code>__name__</code> equal to the string <code>"__main__"</code> (see also: <a href="https://stackoverflow.com/questions/419163/what-does-if-name-main-do">What does if __name__ == “__main__”: do?</a>).</p>
</div>
<div class="paragraph">
<p>In our case, our script logic consists of reading any input files if given, running all of our hapax functions, then collecting and displaying the output. To see how it is done, scroll down to the full program listing below.</p>
</div>
<div class="sect3">
<h4 id="_running_it">Running it</h4>
<div class="paragraph">
<p>To run the script, first download and save <a href="hapaxes.py">hapaxes.py</a>. Then:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh">$ python hapaxes.py</code></pre>
</div>
</div>
<div class="paragraph">
<p>Depending on which NLP packages you have installed, you should see output like:</p>
</div>
<div class="listingblock">
<div class="content">
<pre> Count
Wordforms 9
NLTK-stems 8
NLTK-lemmas 8
spaCy 8
-- Hapaxes --
Wordforms: careful, cautious, corrupt, if, not, now, see, that, you
NLTK-stems: care, cautious, if, not, now, see, that, you
NLTK-lemmas: care, cautious, if, not, now, see, that, you
spaCy: careful, cautious, if, not, now, see, that, you</pre>
</div>
</div>
<div class="paragraph">
<p>Try also running the script on an arbitrary input file:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh">$ python hapaxes.py somefilename
# run it on itself and note that
# source code doesn't give great results:
$ python hapaxes.py hapaxes.py</code></pre>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_hapaxes_py_listing">hapaxes.py listing</h2>
<div class="sectionbody">
<div class="paragraph">
<p>The entire script is listed below and available at <a href="hapaxes.py">hapaxes.py</a>.</p>
</div>
<div class="listingblock wide">
<div class="title">hapaxes.py</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="line-numbers"> 1</span><span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers"> 2</span><span class="docstring"><span class="content"></span><span class="content">A sample script/module which demonstrates how to count hapaxes (tokens which</span><span class="content"></span></span>
<span class="line-numbers"> 3</span><span class="docstring"><span class="content"></span><span class="content">appear only once) in an untagged text corpus using plain python, NLTK, and</span><span class="content"></span></span>
<span class="line-numbers"> 4</span><span class="docstring"><span class="content"></span><span class="content">spaCy. It counts and lists hapaxes in five different ways:</span><span class="content"></span></span>
<span class="line-numbers"> 5</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 6</span><span class="docstring"><span class="content"></span><span class="content"> * Wordforms - counts unique spellings (normalized for case). This uses</span><span class="content"></span></span>
<span class="line-numbers"> 7</span><span class="docstring"><span class="content"></span><span class="content"> plain Python (no NLTK required)</span><span class="content"></span></span>
<span class="line-numbers"> 8</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 9</span><span class="docstring"><span class="content"></span><span class="content"> * NLTK stems - counts unique stems using a stemmer provided by NLTK</span><span class="content"></span></span>
<span class="line-numbers"> 10</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 11</span><span class="docstring"><span class="content"></span><span class="content"> * NLTK lemmas - counts unique lemma forms using NLTK's part of speech</span><span class="content"></span></span>
<span class="line-numbers"> 12</span><span class="docstring"><span class="content"></span><span class="content"> * tagger and interface to the WordNet lemmatizer.</span><span class="content"></span></span>
<span class="line-numbers"> 13</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 14</span><span class="docstring"><span class="content"></span><span class="content"> * spaCy lemmas - counts unique lemma forms using the spaCy NLP module.</span><span class="content"></span></span>
<span class="line-numbers"> 15</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 16</span><span class="docstring"><span class="content"></span><span class="content">Each of the NLP modules (nltk, spaCy) are optional; if one is not</span><span class="content"></span></span>
<span class="line-numbers"> 17</span><span class="docstring"><span class="content"></span><span class="content">installed then its respective hapax-counting method will not be run.</span><span class="content"></span></span>
<span class="line-numbers"> 18</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 19</span><span class="docstring"><span class="content"></span><span class="content">Usage:</span><span class="content"></span></span>
<span class="line-numbers"> 20</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 21</span><span class="docstring"><span class="content"></span><span class="content"> python hapaxes.py [file]</span><span class="content"></span></span>
<span class="line-numbers"> 22</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 23</span><span class="docstring"><span class="content"></span><span class="content">If 'file' is given, its contents are read and used as the text in which to</span><span class="content"></span></span>
<span class="line-numbers"> 24</span><span class="docstring"><span class="content"></span><span class="content">find hapaxes. If 'file' is omitted, then a test text will be used.</span><span class="content"></span></span>
<span class="line-numbers"> 25</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 26</span><span class="docstring"><span class="content"></span><span class="content">Example:</span><span class="content"></span></span>
<span class="line-numbers"> 27</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 28</span><span class="docstring"><span class="content"></span><span class="content">Running this script with no arguments:</span><span class="content"></span></span>
<span class="line-numbers"> 29</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 30</span><span class="docstring"><span class="content"></span><span class="content"> python hapaxes.py</span><span class="content"></span></span>
<span class="line-numbers"> 31</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 32</span><span class="docstring"><span class="content"></span><span class="content">Will process this text:</span><span class="content"></span></span>
<span class="line-numbers"> 33</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 34</span><span class="docstring"><span class="content"></span><span class="content"> Cory Linguist, a cautious corpus linguist, in creating a corpus of</span><span class="content"></span></span>
<span class="line-numbers"> 35</span><span class="docstring"><span class="content"></span><span class="content"> courtship correspondence, corrupted a crucial link. Now, if Cory Linguist,</span><span class="content"></span></span>
<span class="line-numbers"> 36</span><span class="docstring"><span class="content"></span><span class="content"> a careful corpus linguist, in creating a corpus of courtship</span><span class="content"></span></span>
<span class="line-numbers"> 37</span><span class="docstring"><span class="content"></span><span class="content"> correspondence, corrupted a crucial link, see that YOU, in creating a</span><span class="content"></span></span>
<span class="line-numbers"> 38</span><span class="docstring"><span class="content"></span><span class="content"> corpus of courtship correspondence, corrupt not a crucial link.</span><span class="content"></span></span>
<span class="line-numbers"> 39</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 40</span><span class="docstring"><span class="content"></span><span class="content">And produce this output:</span><span class="content"></span></span>
<span class="line-numbers"> 41</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 42</span><span class="docstring"><span class="content"></span><span class="content"> Count</span><span class="content"></span></span>
<span class="line-numbers"> 43</span><span class="docstring"><span class="content"></span><span class="content"> Wordforms 9</span><span class="content"></span></span>
<span class="line-numbers"> 44</span><span class="docstring"><span class="content"></span><span class="content"> Stems 8</span><span class="content"></span></span>
<span class="line-numbers"> 45</span><span class="docstring"><span class="content"></span><span class="content"> Lemmas 8</span><span class="content"></span></span>
<span class="line-numbers"> 46</span><span class="docstring"><span class="content"></span><span class="content"> spaCy 8</span><span class="content"></span></span>
<span class="line-numbers"> 47</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 48</span><span class="docstring"><span class="content"></span><span class="content"> -- Hapaxes --</span><span class="content"></span></span>
<span class="line-numbers"> 49</span><span class="docstring"><span class="content"></span><span class="content"> Wordforms: careful, cautious, corrupt, if, not, now, see, that, you</span><span class="content"></span></span>
<span class="line-numbers"> 50</span><span class="docstring"><span class="content"></span><span class="content"> NLTK-stems: care, cautious, if, not, now, see, that, you</span><span class="content"></span></span>
<span class="line-numbers"> 51</span><span class="docstring"><span class="content"></span><span class="content"> NLTK-lemmas: care, cautious, if, not, now, see, that, you</span><span class="content"></span></span>
<span class="line-numbers"> 52</span><span class="docstring"><span class="content"></span><span class="content"> spaCy: careful, cautious, if, not, now, see, that, you</span><span class="content"></span></span>
<span class="line-numbers"> 53</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 54</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 55</span><span class="docstring"><span class="content"></span><span class="content">Notice that the stems and lemmas methods do not count "corrupt" as a hapax</span><span class="content"></span></span>
<span class="line-numbers"> 56</span><span class="docstring"><span class="content"></span><span class="content">because it also occurs as "corrupted". Notice also that "Linguist" is not</span><span class="content"></span></span>
<span class="line-numbers"> 57</span><span class="docstring"><span class="content"></span><span class="content">counted as the text is normalized for case.</span><span class="content"></span></span>
<span class="line-numbers"> 58</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 59</span><span class="docstring"><span class="content"></span><span class="content">See also the Wikipedia entry on "Hapex legomenon"</span><span class="content"></span></span>
<span class="line-numbers"> 60</span><span class="docstring"><span class="content"></span><span class="content">(https://en.wikipedia.org/wiki/Hapax_legomenon)</span><span class="content"></span></span>
<span class="line-numbers"> 61</span><span class="docstring"><span class="content"></span><span class="delimiter">"""</span></span>
<span class="line-numbers"> 62</span>
<span class="line-numbers"> 63</span><span class="comment">### Imports</span>
<span class="line-numbers"> 64</span><span class="comment">#</span>
<span class="line-numbers"> 65</span><span class="comment"># Import some Python 3 features to use in Python 2</span>
<span class="line-numbers"> 66</span><span class="keyword">from</span> <span class="include">__future__</span> <span class="keyword">import</span> <span class="include">print_function</span>
<span class="line-numbers"> 67</span><span class="keyword">from</span> <span class="include">__future__</span> <span class="keyword">import</span> <span class="include">unicode_literals</span>
<span class="line-numbers"> 68</span>
<span class="line-numbers"> 69</span><span class="comment"># gives us access to command-line arguments</span>
<span class="line-numbers"> 70</span><span class="keyword">import</span> <span class="include">sys</span>
<span class="line-numbers"> 71</span>
<span class="line-numbers"> 72</span><span class="comment"># The Counter collection is a convenient layer on top of</span>
<span class="line-numbers"> 73</span><span class="comment"># python's standard dictionary type for counting iterables.</span>
<span class="line-numbers"> 74</span><span class="keyword">from</span> <span class="include">collections</span> <span class="keyword">import</span> <span class="include">Counter</span>
<span class="line-numbers"> 75</span>
<span class="line-numbers"> 76</span><span class="comment"># The standard python regular expression module:</span>
<span class="line-numbers"> 77</span><span class="keyword">import</span> <span class="include">re</span>
<span class="line-numbers"> 78</span>
<span class="line-numbers"> 79</span><span class="keyword">try</span>:
<span class="line-numbers"> 80</span> <span class="comment"># Import NLTK if it is installed</span>
<span class="line-numbers"> 81</span> <span class="keyword">import</span> <span class="include">nltk</span>
<span class="line-numbers"> 82</span>
<span class="line-numbers"> 83</span> <span class="comment"># This imports NLTK's implementation of the Snowball</span>
<span class="line-numbers"> 84</span> <span class="comment"># stemmer algorithm</span>
<span class="line-numbers"> 85</span> <span class="keyword">from</span> <span class="include">nltk.stem.snowball</span> <span class="keyword">import</span> <span class="include">SnowballStemmer</span>
<span class="line-numbers"> 86</span>
<span class="line-numbers"> 87</span> <span class="comment"># NLTK's interface to the WordNet lemmatizer</span>
<span class="line-numbers"> 88</span> <span class="keyword">from</span> <span class="include">nltk.stem.wordnet</span> <span class="keyword">import</span> <span class="include">WordNetLemmatizer</span>
<span class="line-numbers"> 89</span><span class="keyword">except</span> <span class="exception">ImportError</span>:
<span class="line-numbers"> 90</span> nltk = <span class="predefined-constant">None</span>
<span class="line-numbers"> 91</span> print(<span class="string"><span class="delimiter">"</span><span class="content">NLTK is not installed, so we won't use it.</span><span class="delimiter">"</span></span>)
<span class="line-numbers"> 92</span>
<span class="line-numbers"> 93</span><span class="keyword">try</span>:
<span class="line-numbers"> 94</span> <span class="comment"># Import spaCy if it is installed</span>
<span class="line-numbers"> 95</span> <span class="keyword">import</span> <span class="include">spacy</span>
<span class="line-numbers"> 96</span><span class="keyword">except</span> <span class="exception">ImportError</span>:
<span class="line-numbers"> 97</span> spacy = <span class="predefined-constant">None</span>
<span class="line-numbers"> 98</span> print(<span class="string"><span class="delimiter">"</span><span class="content">spaCy is not installed, so we won't use it.</span><span class="delimiter">"</span></span>)
<span class="line-numbers"> 99</span>
<span class="line-numbers">100</span><span class="keyword">def</span> <span class="function">normalize_tokenize</span>(string):
<span class="line-numbers">101</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">102</span><span class="docstring"><span class="content"></span><span class="content"> Takes a string, normalizes it (makes it lowercase and</span><span class="content"></span></span>
<span class="line-numbers">103</span><span class="docstring"><span class="content"></span><span class="content"> removes punctuation), and then splits it into a list of</span><span class="content"></span></span>
<span class="line-numbers">104</span><span class="docstring"><span class="content"></span><span class="content"> words.</span><span class="content"></span></span>
<span class="line-numbers">105</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers">106</span><span class="docstring"><span class="content"></span><span class="content"> Note that everything in this function is plain Python</span><span class="content"></span></span>
<span class="line-numbers">107</span><span class="docstring"><span class="content"></span><span class="content"> without using NLTK (although as noted below, NLTK provides</span><span class="content"></span></span>
<span class="line-numbers">108</span><span class="docstring"><span class="content"></span><span class="content"> some more sophisticated tokenizers we could have used).</span><span class="content"></span></span>
<span class="line-numbers">109</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">110</span> <span class="comment"># make lowercase</span>
<span class="line-numbers">111</span> norm = string.lower()
<span class="line-numbers">112</span>
<span class="line-numbers">113</span> <span class="comment"># remove punctuation</span>
<span class="line-numbers">114</span> norm = re.sub(<span class="string"><span class="modifier">r</span><span class="delimiter">'</span><span class="content">(?u)[^</span><span class="content">\w</span><span class="content">\s</span><span class="content">]</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="delimiter">'</span></span>, norm) <span class="comment"># <1></span>
<span class="line-numbers">115</span>
<span class="line-numbers">116</span> <span class="comment"># split into words</span>
<span class="line-numbers">117</span> tokens = norm.split()
<span class="line-numbers">118</span>
<span class="line-numbers">119</span> <span class="keyword">return</span> tokens
<span class="line-numbers">120</span>
<span class="line-numbers">121</span><span class="keyword">def</span> <span class="function">word_form_hapaxes</span>(tokens):
<span class="line-numbers">122</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">123</span><span class="docstring"><span class="content"></span><span class="content"> Takes a list of tokens and returns a list of the</span><span class="content"></span></span>
<span class="line-numbers">124</span><span class="docstring"><span class="content"></span><span class="content"> wordform hapaxes (those wordforms that only appear once)</span><span class="content"></span></span>
<span class="line-numbers">125</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers">126</span><span class="docstring"><span class="content"></span><span class="content"> For wordforms this is simple enough to do in plain</span><span class="content"></span></span>
<span class="line-numbers">127</span><span class="docstring"><span class="content"></span><span class="content"> Python without an NLP package, especially using the Counter</span><span class="content"></span></span>
<span class="line-numbers">128</span><span class="docstring"><span class="content"></span><span class="content"> type from the collections module (part of the Python</span><span class="content"></span></span>
<span class="line-numbers">129</span><span class="docstring"><span class="content"></span><span class="content"> standard library).</span><span class="content"></span></span>
<span class="line-numbers">130</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">131</span>
<span class="line-numbers">132</span> counts = Counter(tokens) <span class="comment"># <1></span>
<span class="line-numbers">133</span> hapaxes = [word <span class="keyword">for</span> word <span class="keyword">in</span> counts <span class="keyword">if</span> counts[word] == <span class="integer">1</span>] <span class="comment"># <2></span>
<span class="line-numbers">134</span>
<span class="line-numbers">135</span> <span class="keyword">return</span> hapaxes
<span class="line-numbers">136</span>
<span class="line-numbers">137</span><span class="keyword">def</span> <span class="function">nltk_stem_hapaxes</span>(tokens):
<span class="line-numbers">138</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">139</span><span class="docstring"><span class="content"></span><span class="content"> Takes a list of tokens and returns a list of the word</span><span class="content"></span></span>
<span class="line-numbers">140</span><span class="docstring"><span class="content"></span><span class="content"> stem hapaxes.</span><span class="content"></span></span>
<span class="line-numbers">141</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">142</span> <span class="keyword">if</span> <span class="keyword">not</span> nltk: <span class="comment"># <1></span>
<span class="line-numbers">143</span> <span class="comment"># Only run if NLTK is loaded</span>
<span class="line-numbers">144</span> <span class="keyword">return</span> <span class="predefined-constant">None</span>
<span class="line-numbers">145</span>
<span class="line-numbers">146</span> <span class="comment"># Apply NLTK's Snowball stemmer algorithm to tokens:</span>
<span class="line-numbers">147</span> stemmer = SnowballStemmer(<span class="string"><span class="delimiter">"</span><span class="content">english</span><span class="delimiter">"</span></span>)
<span class="line-numbers">148</span> stems = [stemmer.stem(token) <span class="keyword">for</span> token <span class="keyword">in</span> tokens]
<span class="line-numbers">149</span>
<span class="line-numbers">150</span> <span class="comment"># Filter down to hapaxes:</span>
<span class="line-numbers">151</span> counts = nltk.FreqDist(stems) <span class="comment"># <2></span>
<span class="line-numbers">152</span> hapaxes = counts.hapaxes() <span class="comment"># <3></span>
<span class="line-numbers">153</span> <span class="keyword">return</span> hapaxes
<span class="line-numbers">154</span>
<span class="line-numbers">155</span><span class="keyword">def</span> <span class="function">nltk_lemma_hapaxes</span>(tokens):
<span class="line-numbers">156</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">157</span><span class="docstring"><span class="content"></span><span class="content"> Takes a list of tokens and returns a list of the lemma</span><span class="content"></span></span>
<span class="line-numbers">158</span><span class="docstring"><span class="content"></span><span class="content"> hapaxes.</span><span class="content"></span></span>
<span class="line-numbers">159</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">160</span> <span class="keyword">if</span> <span class="keyword">not</span> nltk:
<span class="line-numbers">161</span> <span class="comment"># Only run if NLTK is loaded</span>
<span class="line-numbers">162</span> <span class="keyword">return</span> <span class="predefined-constant">None</span>
<span class="line-numbers">163</span>
<span class="line-numbers">164</span> <span class="comment"># Tag tokens with part-of-speech:</span>
<span class="line-numbers">165</span> tagged = nltk.pos_tag(tokens) <span class="comment"># <1></span>
<span class="line-numbers">166</span>
<span class="line-numbers">167</span> <span class="comment"># Convert our Treebank-style tags to WordNet-style tags.</span>
<span class="line-numbers">168</span> tagged = [(word, pt_to_wn(tag))
<span class="line-numbers">169</span> <span class="keyword">for</span> (word, tag) <span class="keyword">in</span> tagged] <span class="comment"># <2></span>
<span class="line-numbers">170</span>
<span class="line-numbers">171</span> <span class="comment"># Lemmatize:</span>
<span class="line-numbers">172</span> lemmer = WordNetLemmatizer()
<span class="line-numbers">173</span> lemmas = [lemmer.lemmatize(token, pos)
<span class="line-numbers">174</span> <span class="keyword">for</span> (token, pos) <span class="keyword">in</span> tagged] <span class="comment"># <3></span>
<span class="line-numbers">175</span>
<span class="line-numbers">176</span> <span class="keyword">return</span> nltk_stem_hapaxes(lemmas) <span class="comment"># <4></span>
<span class="line-numbers">177</span>
<span class="line-numbers">178</span><span class="keyword">def</span> <span class="function">pt_to_wn</span>(pos):
<span class="line-numbers">179</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">180</span><span class="docstring"><span class="content"></span><span class="content"> Takes a Penn Treebank tag and converts it to an</span><span class="content"></span></span>
<span class="line-numbers">181</span><span class="docstring"><span class="content"></span><span class="content"> appropriate WordNet equivalent for lemmatization.</span><span class="content"></span></span>
<span class="line-numbers">182</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers">183</span><span class="docstring"><span class="content"></span><span class="content"> A list of Penn Treebank tags is available at:</span><span class="content"></span></span>
<span class="line-numbers">184</span><span class="docstring"><span class="content"></span><span class="content"> https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html</span><span class="content"></span></span>
<span class="line-numbers">185</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">186</span>
<span class="line-numbers">187</span> <span class="keyword">from</span> <span class="include">nltk.corpus.reader.wordnet</span> <span class="keyword">import</span> <span class="include">NOUN</span>, <span class="include">VERB</span>, <span class="include">ADJ</span>, <span class="include">ADV</span>
<span class="line-numbers">188</span>
<span class="line-numbers">189</span> pos = pos.lower()
<span class="line-numbers">190</span>
<span class="line-numbers">191</span> <span class="keyword">if</span> pos.startswith(<span class="string"><span class="delimiter">'</span><span class="content">jj</span><span class="delimiter">'</span></span>):
<span class="line-numbers">192</span> tag = ADJ
<span class="line-numbers">193</span> <span class="keyword">elif</span> pos == <span class="string"><span class="delimiter">'</span><span class="content">md</span><span class="delimiter">'</span></span>:
<span class="line-numbers">194</span> <span class="comment"># Modal auxiliary verbs</span>
<span class="line-numbers">195</span> tag = VERB
<span class="line-numbers">196</span> <span class="keyword">elif</span> pos.startswith(<span class="string"><span class="delimiter">'</span><span class="content">rb</span><span class="delimiter">'</span></span>):
<span class="line-numbers">197</span> tag = ADV
<span class="line-numbers">198</span> <span class="keyword">elif</span> pos.startswith(<span class="string"><span class="delimiter">'</span><span class="content">vb</span><span class="delimiter">'</span></span>):
<span class="line-numbers">199</span> tag = VERB
<span class="line-numbers">200</span> <span class="keyword">elif</span> pos == <span class="string"><span class="delimiter">'</span><span class="content">wrb</span><span class="delimiter">'</span></span>:
<span class="line-numbers">201</span> <span class="comment"># Wh-adverb (how, however, whence, whenever...)</span>
<span class="line-numbers">202</span> tag = ADV
<span class="line-numbers">203</span> <span class="keyword">else</span>:
<span class="line-numbers">204</span> <span class="comment"># default to NOUN</span>
<span class="line-numbers">205</span> <span class="comment"># This is not strictly correct, but it is good</span>
<span class="line-numbers">206</span> <span class="comment"># enough for lemmatization.</span>
<span class="line-numbers">207</span> tag = NOUN
<span class="line-numbers">208</span>
<span class="line-numbers">209</span> <span class="keyword">return</span> tag
<span class="line-numbers">210</span>
<span class="line-numbers">211</span><span class="keyword">def</span> <span class="function">spacy_hapaxes</span>(rawtext):
<span class="line-numbers">212</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">213</span><span class="docstring"><span class="content"></span><span class="content"> Takes plain text and returns a list of lemma hapaxes using</span><span class="content"></span></span>
<span class="line-numbers">214</span><span class="docstring"><span class="content"></span><span class="content"> the spaCy NLP package.</span><span class="content"></span></span>
<span class="line-numbers">215</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">216</span> <span class="keyword">if</span> <span class="keyword">not</span> spacy:
<span class="line-numbers">217</span> <span class="comment"># Only run if spaCy is installed</span>
<span class="line-numbers">218</span> <span class="keyword">return</span> <span class="predefined-constant">None</span>
<span class="line-numbers">219</span>
<span class="line-numbers">220</span> <span class="comment"># Load the English spaCy parser</span>
<span class="line-numbers">221</span> spacy_parse = spacy.load(<span class="string"><span class="delimiter">'</span><span class="content">en_core_web_sm</span><span class="delimiter">'</span></span>)
<span class="line-numbers">222</span>
<span class="line-numbers">223</span> <span class="comment"># Tokenize, parse, and tag text:</span>
<span class="line-numbers">224</span> doc = spacy_parse(rawtext)
<span class="line-numbers">225</span>
<span class="line-numbers">226</span> lemmas = [token.lemma_ <span class="keyword">for</span> token <span class="keyword">in</span> doc
<span class="line-numbers">227</span> <span class="keyword">if</span> <span class="keyword">not</span> token.is_punct <span class="keyword">and</span> <span class="keyword">not</span> token.is_space] <span class="comment"># <1></span>
<span class="line-numbers">228</span>
<span class="line-numbers">229</span> <span class="comment"># Now we can get a count of every lemma:</span>
<span class="line-numbers">230</span> counts = Counter(lemmas) <span class="comment"># <2></span>
<span class="line-numbers">231</span>
<span class="line-numbers">232</span> <span class="comment"># We are interested in lemmas which appear only once</span>
<span class="line-numbers">233</span> hapaxes = [lemma <span class="keyword">for</span> lemma <span class="keyword">in</span> counts <span class="keyword">if</span> counts[lemma] == <span class="integer">1</span>]
<span class="line-numbers">234</span> <span class="keyword">return</span> hapaxes
<span class="line-numbers">235</span>
<span class="line-numbers">236</span><span class="keyword">if</span> __name__ == <span class="string"><span class="delimiter">"</span><span class="content">__main__</span><span class="delimiter">"</span></span>:
<span class="line-numbers">237</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">238</span><span class="docstring"><span class="content"></span><span class="content"> The code in this block is run when this file is executed as a script (but</span><span class="content"></span></span>
<span class="line-numbers">239</span><span class="docstring"><span class="content"></span><span class="content"> not if it is imported as a module by another Python script).</span><span class="content"></span></span>
<span class="line-numbers">240</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">241</span>
<span class="line-numbers">242</span> <span class="comment"># If no file is provided, then use this sample text:</span>
<span class="line-numbers">243</span> text = <span class="string"><span class="delimiter">"""</span><span class="content">Cory Linguist, a cautious corpus linguist, in creating a</span><span class="content"></span></span>
<span class="line-numbers">244</span><span class="string"><span class="content"></span><span class="content"> corpus of courtship correspondence, corrupted a crucial link. Now, if Cory</span><span class="content"></span></span>
<span class="line-numbers">245</span><span class="string"><span class="content"></span><span class="content"> Linguist, a careful corpus linguist, in creating a corpus of courtship</span><span class="content"></span></span>
<span class="line-numbers">246</span><span class="string"><span class="content"></span><span class="content"> correspondence, corrupted a crucial link, see that YOU, in creating a</span><span class="content"></span></span>
<span class="line-numbers">247</span><span class="string"><span class="content"></span><span class="content"> corpus of courtship correspondence, corrupt not a crucial link.</span><span class="delimiter">"""</span></span>
<span class="line-numbers">248</span>
<span class="line-numbers">249</span> <span class="keyword">if</span> <span class="predefined">len</span>(sys.argv) > <span class="integer">1</span>:
<span class="line-numbers">250</span> <span class="comment"># We got at least one command-line argument. We'll ignore all but the</span>
<span class="line-numbers">251</span> <span class="comment"># first.</span>
<span class="line-numbers">252</span> <span class="keyword">with</span> <span class="predefined">open</span>(sys.argv[<span class="integer">1</span>], <span class="string"><span class="delimiter">'</span><span class="content">r</span><span class="delimiter">'</span></span>) <span class="keyword">as</span> <span class="predefined">file</span>:
<span class="line-numbers">253</span> text = <span class="predefined">file</span>.read()
<span class="line-numbers">254</span> <span class="keyword">try</span>:
<span class="line-numbers">255</span> <span class="comment"># in Python 2 we need a unicode string</span>
<span class="line-numbers">256</span> text = <span class="predefined">unicode</span>(text)
<span class="line-numbers">257</span> <span class="keyword">except</span>:
<span class="line-numbers">258</span> <span class="comment"># in Python 3 'unicode()' is not defined</span>
<span class="line-numbers">259</span> <span class="comment"># we don't have to do anything</span>
<span class="line-numbers">260</span> <span class="keyword">pass</span>
<span class="line-numbers">261</span>
<span class="line-numbers">262</span> <span class="comment"># tokenize the text (break into words)</span>
<span class="line-numbers">263</span> tokens = normalize_tokenize(text)
<span class="line-numbers">264</span>
<span class="line-numbers">265</span> <span class="comment"># Get hapaxes based on wordforms, stems, and lemmas:</span>
<span class="line-numbers">266</span> wfs = word_form_hapaxes(tokens)
<span class="line-numbers">267</span> stems = nltk_stem_hapaxes(tokens)
<span class="line-numbers">268</span> lemmas = nltk_lemma_hapaxes(tokens)
<span class="line-numbers">269</span> spacy_lems = spacy_hapaxes(text)
<span class="line-numbers">270</span>
<span class="line-numbers">271</span> <span class="comment"># Print count table and list of hapaxes:</span>
<span class="line-numbers">272</span> row_labels = [<span class="string"><span class="delimiter">"</span><span class="content">Wordforms</span><span class="delimiter">"</span></span>]
<span class="line-numbers">273</span> row_data = [wfs]
<span class="line-numbers">274</span>
<span class="line-numbers">275</span> <span class="comment"># only add NLTK data if it is installed</span>
<span class="line-numbers">276</span> <span class="keyword">if</span> nltk:
<span class="line-numbers">277</span> row_labels.extend([<span class="string"><span class="delimiter">"</span><span class="content">NLTK-stems</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">NLTK-lemmas</span><span class="delimiter">"</span></span>])
<span class="line-numbers">278</span> row_data.extend([stems, lemmas])
<span class="line-numbers">279</span>
<span class="line-numbers">280</span> <span class="comment"># only add spaCy data if it is installed:</span>
<span class="line-numbers">281</span> <span class="keyword">if</span> spacy_lems:
<span class="line-numbers">282</span> row_labels.append(<span class="string"><span class="delimiter">"</span><span class="content">spaCy</span><span class="delimiter">"</span></span>)
<span class="line-numbers">283</span> row_data.append(spacy_lems)
<span class="line-numbers">284</span>
<span class="line-numbers">285</span> <span class="comment"># sort happaxes for display</span>
<span class="line-numbers">286</span> row_date = [row.sort() <span class="keyword">for</span> row <span class="keyword">in</span> row_data]
<span class="line-numbers">287</span>
<span class="line-numbers">288</span> <span class="comment"># format and print output</span>
<span class="line-numbers">289</span> rows = <span class="predefined">zip</span>(row_labels, row_data)
<span class="line-numbers">290</span> row_fmt = <span class="string"><span class="delimiter">"</span><span class="content">{:>14}{:^8}</span><span class="delimiter">"</span></span>
<span class="line-numbers">291</span> print(<span class="string"><span class="delimiter">"</span><span class="char">\n</span><span class="delimiter">"</span></span>)
<span class="line-numbers">292</span> print(row_fmt.format(<span class="string"><span class="delimiter">"</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">Count</span><span class="delimiter">"</span></span>))
<span class="line-numbers">293</span> hapax_list = []
<span class="line-numbers">294</span> <span class="keyword">for</span> row <span class="keyword">in</span> rows:
<span class="line-numbers">295</span> print(row_fmt.format(row[<span class="integer">0</span>], <span class="predefined">len</span>(row[<span class="integer">1</span>])))
<span class="line-numbers">296</span> hapax_list += [<span class="string"><span class="delimiter">"</span><span class="content">{:<14}{:<68}</span><span class="delimiter">"</span></span>.format(row[<span class="integer">0</span>] + <span class="string"><span class="delimiter">"</span><span class="content">:</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">, </span><span class="delimiter">"</span></span>.join(row[<span class="integer">1</span>]))]
<span class="line-numbers">297</span>
<span class="line-numbers">298</span> print(<span class="string"><span class="delimiter">"</span><span class="char">\n</span><span class="content">-- Hapaxes --</span><span class="delimiter">"</span></span>)
<span class="line-numbers">299</span> <span class="keyword">for</span> row <span class="keyword">in</span> hapax_list:
<span class="line-numbers">300</span> print(row)
<span class="line-numbers">301</span> print(<span class="string"><span class="delimiter">"</span><span class="char">\n</span><span class="delimiter">"</span></span>)
<span class="line-numbers">302</span></code></pre>
</div>
</div>
</div>
</div>A tutorial on simple NLP tasks with Python which serves as an introduction to the NLTK and spaCy libraries.