http://catswhisker.xyz/Atom Feed for 'programming' Articles2022-10-23T17:44:18ZA. Cynichttp://catswhisker.xyz/about/tag:catswhisker.xyz,2021-08-23:/log/2021/8/23/quit_your_job_like_andrew_kelley/Quit Your Job Like Andrew Kelley2021-08-23T14:11:59Z2021-08-23T14:11:59Z<div class="paragraph">
<p>The CoRecursive podcast has a great recent interview with Andrew Kelley, the creator of the <a href="https://en.wikipedia.org/wiki/Zig_(programming_language)">Zig programming language</a>, where he talks about quitting his job and working on open source full time (<a href="https://corecursive.com/067-zig-with-andrew-kelley/">Episode #67</a>):</p>
</div>
<div class="quoteblock">
<blockquote>
<div class="paragraph">
<p><strong>Adam</strong>: Once you did quit, was it everything that you thought it would be?</p>
</div>
<div class="paragraph">
<p><strong>Andrew</strong>: First of all, I’ve never been happier. Second of all, I realized that the freedom that I have has allowed me to open my mind up to just other, even just different politics and ways of thinking about society and how the world works. It’s harder to think about maybe more radical ways that society could run when you have to play the game, and you’re spending 40 or plus hours per week clocked in and just like doing the labor.</p>
</div>
<div class="paragraph">
<p>Not only was it everything I thought I would be, but once I tasted this freedom, I know I will never have a boss again. I will go start a farm if I have to. My sense of self-worth has just skyrocketed and I just, I don’t even want to be subject to another person’s domain anymore. I want everyone to feel this way. I want everyone to feel they get to decide what they do with their life and no one’s going to tell them what they have to do.</p>
</div>
<div class="paragraph">
<p><strong>Adam</strong>: Did you have a really bad boss?</p>
</div>
<div class="paragraph">
<p><strong>Andrew</strong>: Actually, no. […​] I think that’s why I realized that I never want to have a boss again is that I had a good one and I still really hated it.</p>
</div>
</blockquote>
</div>An excerpt from a recent interview with Andrew Kelley, creator of the Zig programming language, about quitting his job.tag:catswhisker.xyz,2021-08-22:/log/2021/8/22/magento_sucks/I Hate Magento2021-08-22T13:30:18Z2021-09-07T18:15:55Z<div class="quoteblock">
<blockquote>
Sometimes I think I hate Magento for being so overly complex, slow, resource-intensive, and feature-poor yet bloat-rich that only Adobe could be interested in it. Then I think of the poor people paying me to maintain and write extensions for it.
</blockquote>
<div class="attribution">
— Comment seen on Hacker News
</div>
</div>
<div class="paragraph">
<p>I’ve inherited the maintenance duties for a retail shop’s website which is powered by some open-source PHP software called <a href="https://en.wikipedia.org/wiki/Magento">Magento</a>. (Magento was purchased by Adobe a few years ago and they offer value-added commercial and hosted versions called <a href="https://magento.com/products/magento-commerce">Adobe Commerce</a>).
According to the marketing copy, “Magento is a feature-rich eCommerce platform solution that offers merchants complete flexibility and control over the functionality of their online channel. Magento’s search engine optimization, catalog management, and powerful marketing tools give merchants the ability to create sites that provide an unrivaled shopping experience for their customers.”
And according to Wikipedia, “More than 100,000 online stores have been created on this platform. The platform code has been downloaded more than 2.5 million times, and $155 billion worth of goods have been sold through Magento-based systems in 2019. Two years ago, Magento accounted for about 30% of the total market share.”</p>
</div>
<div class="paragraph">
<p>I’ve now had time to work with this Magento site from several perspectives over the span of a few years (including a less-than-smooth migration from Magento 1.9 to Magento 2.3) — as a consultant managing product data via the Admin interface, as a sysadmin deploying and hosting a Magento-based site, as a developer making minor modifications and fixes as needed by my client, as a user of the customer-facing site — and it has continually impressed me: I’ve never before used software that provides such a poor experience to administrators, developers, and users alike.</p>
</div>
<div class="paragraph">
<p>Where I’d expect an eCommerce framework to provide a simple frontend interface that implements basic shopping cart functionality as a starting point to build on, Magento provides a default theme with a product description page which requests over 4MB of HTML/CSS/JS including 200+ JavaScript files (really). Modifying the theme, like all Magento customization (see below), is cumbersome to an unreasonable degree. As far as I can tell, the only sane way to have a performant and maintainable Magento website would be to write a complete frontend using a modern framework and communicate with Magento solely through its REST API. In fact the general rule of successful Magento deployment seems to be to use as little of Magento as possible.</p>
</div>
<div class="paragraph">
<p>Where I’d expect a catalog management interface with an emphasis on importing and exporting product data, Magento provides a painfully slow product browser and editor for torturing copywriters and an anemic and convoluted product export tool unsuitable for any real reporting or feed generation. My solution lately has been to write programs which get data from the REST API to update reports in Google Sheets rather than trying to use or extend the Admin panel. (Magento’s product import tool is fine.)</p>
</div>
<div class="paragraph">
<p>Where I’d expect a “feature-rich eCommerce platform solution that offers merchants complete flexibility and control over the functionality of their online channel” to have ergonomic and well-documented extension interfaces, Magento provides an over-engineered and convoluted plugin system wired together with XML files (with little reference documentation) and PHP code generated from classes you provide which interact with the core system’s classes (which have no reference documentation).</p>
</div>
<div class="paragraph">
<p>Where I’d expect to find at the morally murky nexus of a commercial online retail platform that barely works out of the box, preoccupation with marketing and “SEO”, and the extraction of labour from programmers in developing countries (including under the guise of “open source”) an ecosystem of commercial plugins that are both expensive and risky to install, Magento delivers. So if the base install of Magento seems too stable, secure, and inexpensive to you, you could always head over to the <a href="https://marketplace.magento.com/">Magento Marketplace</a> and find any extension you need written by software developers questioning their career choices.</p>
</div>
<div class="paragraph">
<p>I don’t mean to denigrate the hard work of the open-source contributors who have helped create Magento.
In fact while I don’t understand what motivates them, I admire, in some ways, the sheer tenacity and self-denial it must take to continue to spend time on such a project.
From what I can tell by browsing GitHub and the <a href="https://magento.stackexchange.com/">Magento StackExchange</a> (which — and I know this is going to sound hyperbolic — is probably the lowest quality stackexchange site I’ve seen), many developers are Indian or otherwise work outside of North America, so I’m guessing the fragility of Magento has created some demand for affordable PHP developers.
And I know it is some sort of broken windows fallacy, but if Magento’s convoluted architecture can provide some paying gigs or job security to those developers I guess that’s something positive, at least.</p>
</div>
<div class="paragraph">
<p>I think most of Magento’s issues — including its poor performance, poor security record, and high cost to customize and maintain — stem from two core defects: a lack of documentation and a fundamentally flawed software architecture.</p>
</div>
<div class="paragraph">
<p>If Magento had good documentation, then almost no matter how terrible its design, developers would be able to figure out how to make it do what they need. Now, the <em>organization</em> of its documentation has improved since Adobe took over (see <a href="https://devdocs.magento.com/" class="bare">https://devdocs.magento.com/</a>). But from the perspective of a PHP developer, <em>what</em> is documented is severely lacking: lots of mostly useless high-level descriptions and a few code examples, but no real documentation of the Magento source code and the classes/interfaces it provides.</p>
</div>
<div class="paragraph">
<p>Unlike the documentation, which Adobe <em>could</em> improve if they cared (but if they did, then I think they would have by now), the architecture of Magento cannot be fixed at all.
The entire framework is based around convoluted XML configuration files (which makes the lack of documentation hit even harder) and a dynamic <a href="https://devdocs.magento.com/guides/v2.4/extension-dev-guide/depend-inj.html">dependency injection</a> system.
The way it works is that plugins declare dependencies on PHP interfaces and then the Magento code generator (the <code>bin/magento setup:di:compile</code> command) generates the actual plugin code with the dependencies instantiated.
It’s the sort of overkill system that makes large enterprise Java applications, developed by teams using a statically typed language with excellent tooling, difficult to reason about and maintain; to adopt that architecture for a PHP shopping cart application is utter madness.</p>
</div>
<div class="paragraph">
<p>I have not taken the time to investigate its performance bottlenecks, and I hope I never do, but the generated PHP code which supports such a dynamic plugin system no doubt contributes to how slow Magento is.
And Magento <strong>is</strong> slow. Behind an asynchronous reverse proxy on an over-sized ec2 instance, I’m pretty sure my client’s website could be DoS’d by any mischievous kid with a cable modem if it weren’t for Cloudflare. Even with Cloudflare’s firewall a single malicious bot can bring the site to a crawl.</p>
</div>
<div class="paragraph">
<p>The ‘solution’ Magento offers to its performance problems is more and more layers of cache. First, the administrator must generate static files (CSS, etc.) with the <a href="https://devdocs.magento.com/guides/v2.4/config-guide/cli/config-cli-subcommands-static-view.html">bin/magento setup:static-content:deploy</a> command. Then there is <a href="https://devdocs.magento.com/guides/v2.4/config-guide/cli/config-cli-subcommands-cache.html">a cache system</a> for various bits of data Magento calculates and which often, inexplicably, needs to be manually refreshed. But it is still slow, so it is usually recommended that you also run Magento behind a caching reverse-proxy like Varnish and/or a dedicated key-value cache based on Redis.
It would be funny if it weren’t real software that I have to maintain.</p>
</div>
<div class="paragraph">
<p>Worse, the unnecessary complexity makes it more difficult both to understand code execution paths and to make changes to the code base: a recipe for <a href="https://www.cvedetails.com/product/31613/Magento-Magento.html?vendor_id=15393">security vulnerabilities</a>.</p>
</div>
<div class="paragraph">
<p>There are some good things about Magento, of course.
Its one saving grace, in my opinion, is its Swagger-based REST API which makes it possible to implement most required functionality outside of Magento itself. And Swagger/OpenAPI is self-documenting, so the Magento devs are not able to make it as difficult as they’ve made the PHP API.
But even that has been a source of suffering. In fact my main use-case for it, reporting on current inventory quantity for all products, is not even possible by default because the <code>/V1/products</code> endpoint does not return stock information despite what the documentation claims.
That bug has been reported several times (e.g. <a href="https://github.com/magento/magento2/issues/24418">#24418</a>), but the response from the maintainers is that the correct way to get stock information is to make a call to <code>/V1/products</code> and then make an additional HTTP request for each returned product (thousands or tens of thousands in my case).
(Luckily there are some workarounds. I wrote <a href="https://github.com/cristoper/mage_qtyext">cristoper/mage_qtyext</a>, a very simple plugin which adds a “qty” field to the product results; I also found <a href="https://github.com/menacoders/Stock-Info-API-searchCriteria">menacoders/Stock-Info-API-searchCriteria</a> which adds all stock information to the results.)
There are also SOAP and GraphQL APIs which I’ve not investigated (except to find out that the GraphQL API by default also does not offer a way to get stock information for the entire inventory).</p>
</div>
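<div class="paragraph">
<p>For a sense of scale, here is roughly what that maintainer-recommended approach looks like. This is only a sketch in Python: the base URL and token are placeholders, and it fetches just the first page of products before making one extra request per product to the per-SKU stock endpoint (<code>/V1/stockItems/{sku}</code>):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">import requests

BASE = "https://shop.example.com/rest"  # placeholder store URL
HEADERS = {"Authorization": "Bearer PLACEHOLDER-TOKEN"}

# one request for a page of products...
items = requests.get(
    BASE + "/V1/products",
    params={"searchCriteria[pageSize]": 100},
    headers=HEADERS,
).json()["items"]

# ...then one additional request *per product* for its stock item
for item in items:
    stock = requests.get(
        BASE + "/V1/stockItems/" + item["sku"], headers=HEADERS
    ).json()
    print(item["sku"], stock["qty"])</code></pre>
</div>
</div>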
<div class="paragraph">
<p>Personally, the prospect of developing solutions for my Magento-bound client confronts me as a depressing time sink. I’m convinced Magento will never be anything but expensive to maintain, slow, and insecure. I don’t recommend it for new projects.</p>
</div>
<hr>
<div class="paragraph">
<p>I found this 2015 article from the Magento 1 days which complains about mostly the same things: <a href="https://medium.com/@salvoadriano/magento-why-complex-doesn-t-mean-good-1f15992202de">Magento: why complex doesn’t mean good</a></p>
</div>
<div class="paragraph">
<p>See also for Spotify users: <a href="https://open.spotify.com/playlist/5YMtRCWJxAw2MBxQBrBtlF?si=163a29ca0dc14a5b">Magento 2 Rage Tracks</a></p>
</div>A rant about an open-source ecommerce platform.tag:catswhisker.xyz,2020-12-07:/log/2020/12/7/deranged_sinterklaas/Deranged Sinterklaas: The Math and Algorithms of Secret Santa2020-12-07T07:00:00Z2021-02-04T23:01:43Z<div class="imageblock">
<div class="content">
<img src="/log/2020/12/7/deranged_sinterklaas/XmasStory.png" alt="XmasStory">
</div>
</div>
<div class="sect1">
<h2 id="_secret_santa">Secret Santa</h2>
<div class="sectionbody">
<div class="paragraph">
<p><a href="https://en.wikipedia.org/wiki/Secret_Santa">Secret Santa</a> is a traditional Christmas gift exchanging scheme in which each member of a group is randomly and anonymously assigned another member to give a Christmas gift to (usually by drawing names from a container). It is not valid for a person to be assigned to themself (if someone were to draw their own name, for example, all the names should be returned to the jar and the drawing process restarted).</p>
</div>
<div class="paragraph">
<p>Given a group of a certain size, how many different ways are there to make valid assignments? What is the probability that at least one person will draw their own name? What is the probability that two people will draw each other’s names? What is a good way to have a computer make the assignments while guaranteeing they are generated with equal probability among all possible assignments?</p>
</div>
<div class="paragraph">
<p>It turns out that these questions about secret santa present good motivation for exploring some of the fundamental concepts in combinatorics (the math of counting).
In the sections below we will take a look at a bit of that math and algorithms that allow us to answer the questions we posed above.
The final section presents a simple command-line <a href="https://github.com/cristoper/sinterbot">program</a> that allows generating and anonymously sending secret santa assignments via email so that we no longer need to go through the tedious ordeal of drawing names from a hat.</p>
</div>
<div id="toc" class="toc">
<div id="toctitle" class="title">Table of Contents</div>
<ul class="sectlevel1">
<li><a href="#_secret_santa">Secret Santa</a></li>
<li><a href="#_math">Math</a>
<ul class="sectlevel2">
<li><a href="#_permutations">Permutations</a></li>
<li><a href="#_derangements">Derangements</a></li>
</ul>
</li>
<li><a href="#_algorithms">Algorithms</a>
<ul class="sectlevel2">
<li><a href="#_utilities">Utilities</a></li>
<li><a href="#_how_not_to_generate_derangements">How not to generate derangements</a></li>
<li><a href="#_how_to_generate_derangements">How to generate derangements</a></li>
</ul>
</li>
<li><a href="#software">Sinterbot2020</a>
<ul class="sectlevel2">
<li><a href="#_installation">Installation</a></li>
<li><a href="#_usage">Usage</a></li>
</ul>
</li>
</ul>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_math">Math</h2>
<div class="sectionbody">
<div class="sect2">
<h3 id="_permutations">Permutations</h3>
<div class="paragraph">
<p>As an example let’s take a group of five friends who we will represent by the first initial of their names as a set: \(\{\mathrm{\,S, C, A, L, M\,}\}\). The elements of this set of people, like the elements of any set, can be arranged in different orders. For example, the order of the elements as we just happened to write them \((\mathrm{S\; C\; A\; L\; M})\) is one arrangement, and we can also shuffle them around to get a different arrangement like \((\mathrm{C\; A\; M\; L\; S})\).</p>
</div>
<div class="paragraph">
<p>Each ordered arrangement is called a permutation of the set. How many permutations can be made from a set with \(n\) elements? It is straightforward to count. We can choose any element of the set to be in the first position of the permutation, so there are \(n\) choices, which leaves \(n-1\) choices for the second position, \(n-2\) choices for the third position, and so on until for the \(n\text{th}\) (and therefore last) position of the permutation there is only one element of the set remaining to choose from.</p>
</div>
<div class="paragraph">
<p>Multiplying the number of choices for each position of the permutation gives the total number of possible permutations: \(n(n-1)(n-2)\dots(1).\) In other words, the product of all positive integers less than or equal to \(n\). That product is known as the <a href="https://en.wikipedia.org/wiki/Factorial">factorial</a> of \(n\) and is written \(n!\):</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{align*}
n! &= n(n-1)(n-2)\cdots (n-(n-1))\\
&= 1 \cdot 2 \cdots n\\
\end{align*}\]
</div>
</div>
<div class="paragraph">
<p>So there are \(5! = 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 = 120\) ways to permute a set of five friends. Writing any two of those permutations in <a href="http://groupprops.subwiki.org/wiki/Two-line_notation_for_permutations">two-line notation</a> with one above the other allows us to read off a secret santa assignment for the group:</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation}
\label{eq:perm}
\begin{pmatrix}
\mathrm{S} & \mathrm{C} & \mathrm{A} & \mathrm{L} & \mathrm{M} \\
\mathrm{C} & \mathrm{A} & \mathrm{M} & \mathrm{L} & \mathrm{S} \\
\end{pmatrix}
\end{equation}\]
</div>
</div>
<div class="paragraph">
<p>If we read the top line as the list of gift givers and the bottom line as the gift recipients, then each santa is assigned to give a gift to the person in the bottom line directly beneath them. So \(\mathrm{S}\) gives a gift to \(\mathrm{C}\) who gives a gift to \(\mathrm{A}\) and so forth.</p>
</div>
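<div class="paragraph">
<p>In Python terms (a trivial sketch), the assignment is just the mapping from the top line to the bottom line:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">givers     = ["S", "C", "A", "L", "M"]
recipients = ["C", "A", "M", "L", "S"]
dict(zip(givers, recipients))
>> {'S': 'C', 'C': 'A', 'A': 'M', 'L': 'L', 'M': 'S'}</code></pre>
</div>
</div>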
<div id="permgraph" class="imageblock">
<div class="content">
<a class="image" href="dotperm.svg"><img src="/log/2020/12/7/deranged_sinterklaas/dotperm.png" alt="TODO Figure <<permgraph>> illustrates the cycles."></a>
</div>
<div class="title">Figure 1. Graphical representation of the cycles of the permutation \(\eqref{eq:perm}\). Arrows point from santas to gift recipients.</div>
</div>
<div class="paragraph">
<p>A permutation of santas (or anything else) can be represented as a directed graph, as in <a href="#permgraph">Figure 1</a>, or more compactly by listing its cycles: \((\mathrm{S\; C\; A\; M})(\mathrm{L})\). To see that the cycle notation is equivalent to the graph, read each cycle from left to right and insert the implied arrow from the last element back to the first: \((\mathrm{S \to C \to A \to M \to S})(\mathrm{L \to L})\). Note that 1-cycles are usually left implied when writing a permutation in cycle notation, so an equivalent way to write our example permutation is simply \((\mathrm{S\; C\; A\; M})\).</p>
</div>
<div class="paragraph">
<p>But note also that a permutation containing any 1-cycles defines an invalid secret santa assignment! The example permutation above has \(\mathrm{L}\) giving a gift to herself, which is against the rules.</p>
</div>
</div>
<div class="sect2">
<h3 id="_derangements">Derangements</h3>
<div class="paragraph">
<p>A permutation with no 1-cycles — in other words, a permutation in which no element is left in its original position so that the entire set has been de-arranged — is called a <a href="https://en.wikipedia.org/wiki/Derangement">derangement</a>. One way to derange our example group of secret santas is</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation}
\label{eq:der}
\begin{pmatrix}
\mathrm{S} & \mathrm{C} & \mathrm{A} & \mathrm{L} & \mathrm{M} \\
\mathrm{C} & \mathrm{A} & \mathrm{S} & \mathrm{M} & \mathrm{L} \\
\end{pmatrix}
\end{equation}\]
</div>
</div>
<div class="paragraph">
<p>Or equivalently, decomposed into its cycles, \((\mathrm{S\; C\; A})(\mathrm{M\; L})\).</p>
</div>
<div class="paragraph">
<p>So the problem of generating a valid secret santa assignment is equivalent to generating a derangement. Some algorithms for uniformly generating random derangements are presented in the next section. But first we need a way to calculate \(D_n\), the number of derangements that can be made from a set with \(n\) elements.</p>
</div>
<div class="paragraph">
<p>Counting derangements is trickier than counting unrestricted permutations. We proceed by counting the permutations with at least one 1-cycle, the <em>non</em>-derangements. First we’ll define the subsets \(H_p\) to contain all of the permutations of the \(n\) elements with the \(p^{\text{th}}\) element fixed in its original position (an element that stays in its original position in a permutation is called a “fixed point”). This gives \(n\) such subsets \(\mathrm{(}H_1 \ldots H_n\mathrm{)}\) each containing \((n-1)!\) permutations (because with one element fixed, there are \((n-1)!\) ways to permute the remaining elements).</p>
</div>
<div class="paragraph">
<p>We know that the subsets \(H_p\) contain only non-derangements since every member has a fixed point. And since every non-derangement has at least one fixed point, say the \(p^{\text{th}}\) element, and so belongs to the corresponding \(H_p\), we know that together the \(H_p\) contain <em>all</em> possible non-derangements.
That means to find \(D_n\) we just need to subtract the size of the union of all the \(H_p\) subsets from the total number of permutations (which we know is \(n!\)):</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation}
\label{eq:up}
D_n = n! - \left\lvert \bigcup H_p \right\rvert \\
\end{equation}\]
</div>
</div>
<div class="paragraph">
<p>But finding the size of \(\cup H_p\) is not straightforward.
If we simply multiply the number of subsets by their size, \(n\cdot(n-1)!\), we overcount because the subsets \(H_p\) are not disjoint: some non-derangements belong to more than one subset. Specifically, every pair of \(H_p\) subsets shares \((n-2)!\) permutations with at least two fixed points; every 3-tuple of subsets shares \((n-3)!\) permutations with at least three fixed points; and so on.</p>
</div>
<div class="paragraph">
<p>To visualize this, it helps to draw out each \(H_p\) for a small set. The table below shows each \(H_p\) subset for the set \(\{\,a, b, c, d\,\}\):</p>
</div>
<table class="tableblock frame-all grid-all stretch">
<colgroup>
<col style="width: 25%;">
<col style="width: 25%;">
<col style="width: 25%;">
<col style="width: 25%;">
</colgroup>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">\(H_1\)</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">\(H_2\)</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">\(H_3\)</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">\(H_4\)</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">abcd</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">abcd</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">abcd</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">abcd</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">abdc</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">abdc</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">adcb</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">acbd</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">acbd</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">cbad</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">bacd</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">bacd</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">acdb</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">cbda</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">bdca</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">bcad</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">adbc</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">dbac</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">dacb</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">cabd</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">adcb</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">dbca</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">dbca</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">cbad</p></td>
</tr>
</tbody>
</table>
<div class="paragraph">
<p>Notice that the first column, \(H_1\), has "a" fixed in the first position; the second column has "b" fixed in the second position; etc. Note also that the \(H_1\) and \(H_2\) columns share every permutation with the first and second position fixed ("abcd" and "abdc").</p>
</div>
<div class="paragraph">
<p>To weed out the duplicates, we need to subtract the number of permutations with at least two fixed points multiplied by the number of pairs of \(H_p\) subsets. But that will leave us with an <em>under</em> count because it will result in some permutations with three or more fixed points being excluded, so we must add those back in. We need to continue this <a href="https://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle">inclusion-exclusion</a> process until we’ve considered the permutations which fix all \(n\) elements (the \(H_p\) taken \(n\) at a time):</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation*}
\label{eq:binom}
\left\lvert \bigcup H_p \right\rvert = \binom{n}{1}(n-1)! - \binom{n}{2}(n-2)! + \binom{n}{3}(n-3)! - \cdots + (-1)^{n+1} \binom{n}{n}(n-n)!
\end{equation*}\]
</div>
</div>
<div class="paragraph">
<p>where \(\binom{n}{k}\) gives the <a href="https://en.wikipedia.org/wiki/Binomial_coefficient">binomial coefficients</a> which you may remember from math class can be interpreted as the number of ways to choose \(k\) objects from a set of \(n\) objects when order doesn’t matter. It can be written in terms of factorials:</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation*}
\binom{n}{k} = \frac{n!}{(n-k)!k!}
\end{equation*}\]
</div>
</div>
<div class="paragraph">
<p>Now we can calculate the number of possible <em>non</em>-deranged permutations. To get \(D_n\), we just subtract it from the total number of possible permutations.
When we substitute the expression for \(\lvert \bigcup H_p \rvert\) into equation \eqref{eq:up} then expand the binomial coefficients and factorials, this becomes:</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{eqnarray}
\notag
D_n & = & n! - \frac{n!}{1!} + \frac{n!}{2!} - \frac{n!}{3!} + \cdots + (-1)^n \\
\notag
& = & n!\left(1 - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!} + \cdots + (-1)^n \frac{1}{n!}\right) \\
\label{eq:dn}
& = & n!\sum_{k=0}^n(-1)^k\frac{1}{k!}
\end{eqnarray}\]
</div>
</div>
<div class="paragraph">
<p>And we’ve answered our first question: <strong>Given a group of size \(n\), there are \(D_n =n!\sum_{k=0}^n(-1)^k\frac{1}{k!}\) ways to make a valid secret santa assignment.</strong> To calculate the number of valid assignments between our five example friends, then, we have</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{eqnarray*}
D_5 & = & 5!\left(1 - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!} + \frac{1}{4!} - \frac{1}{5!}\right) \\
& = & 120 \left(1 - 1 + \frac{1}{2} - \frac{1}{6} + \frac{1}{24} - \frac{1}{120}\right) \\
& = & 120 \left(\frac{44}{120}\right) \\
& = & 44
\end{eqnarray*}\]
</div>
</div>
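<div class="paragraph">
<p>We can double-check that count by brute force in Python, counting the permutations of five elements that leave no element in place:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">import itertools
# count permutations of 5 elements with no fixed point
sum(all(p[i] != i for i in range(5))
    for p in itertools.permutations(range(5)))
>> 44</code></pre>
</div>
</div>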
<div class="paragraph">
<p>The first nine values of \(D_n\) are listed in the table below.</p>
</div>
<table class="tableblock frame-all grid-all stretch">
<colgroup>
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
<col style="width: 10%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top">\(n\)</th>
<th class="tableblock halign-left valign-top">1</th>
<th class="tableblock halign-left valign-top">2</th>
<th class="tableblock halign-left valign-top">3</th>
<th class="tableblock halign-left valign-top">4</th>
<th class="tableblock halign-left valign-top">5</th>
<th class="tableblock halign-left valign-top">6</th>
<th class="tableblock halign-left valign-top">7</th>
<th class="tableblock halign-left valign-top">8</th>
<th class="tableblock halign-left valign-top">9</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><strong>\(D_n\)</strong></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">0</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">1</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">2</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">9</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">44</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">265</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">1,854</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">14,833</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">133,496</p></td>
</tr>
</tbody>
</table>
<div class="paragraph">
<p>This is OEIS sequence <a href="https://oeis.org/A000166">A000166</a>.
The number \(D_n\) is known as the <a href="http://mathworld.wolfram.com/Subfactorial.html">subfactorial</a> of \(n\) (usually written \(!n\)). It is also a special case of the <a href="https://en.wikipedia.org/wiki/Rencontres_numbers">rencontres numbers</a> which enumerate partial derangements (derangements with specified numbers of 1-cycles).</p>
</div>
<div class="paragraph">
<p>Notice that the summation in equation \(\eqref{eq:dn}\) is the \(n^{\text{th}}\) partial sum of the <a href="https://en.wikipedia.org/wiki/Taylor_series">Maclaurin series</a> for \(e^{-1}\), so that</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation*}
\lim_{n \to \infty} \frac{D_n}{n!} = \frac{1}{e}
\end{equation*}\]
</div>
</div>
<div class="paragraph">
<p>Because the series converges rather quickly, \(\frac{n!}{e}\) is a good approximation of \(D_n\) even for small values of \(n\).</p>
</div>
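<div class="paragraph">
<p>A quick check in Python (float precision is fine for small \(n\)): rounding \(n!/e\) to the nearest integer reproduces the values in the table above.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">import math
[round(math.factorial(n) / math.e) for n in range(1, 10)]
>> [0, 1, 2, 9, 44, 265, 1854, 14833, 133496]</code></pre>
</div>
</div>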
<div class="paragraph">
<p>The probability that a permutation is a derangement is \(D_n\) divided by the number of all possible permutations, \(n!\). This answers the second question asked in the introduction: <strong>The probability that at least one secret santa participant will draw their own name is</strong> \(1 - \frac{D_n}{n!} \approx 1 - \frac{1}{e} \approx 63\%\). That may seem high, but the number of attempts needed follows a <a href="https://en.wikipedia.org/wiki/Geometric_distribution">geometric distribution</a> with success probability \(\approx \frac{1}{e}\) and therefore mean \(e \approx 2.7\), so you can expect to draw a valid derangement after 2 or 3 attempts (restarting each time someone draws their own name). There is a nearly 99% chance of having drawn a derangement after 10 attempts \((1 - (1 - \frac{1}{e})^{10} \approx 98.9\%)\).</p>
</div>
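<div class="paragraph">
<p>Here is a quick simulation of the drawing process (a sketch; the sample average will vary a little from run to run):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">import random

def attempts(n):
    # count drawings until nobody holds their own name
    count = 0
    while True:
        count += 1
        perm = list(range(n))
        random.shuffle(perm)
        if all(p != i for i, p in enumerate(perm)):
            return count

# the average should be close to 6!/D_6 = 720/265 ≈ 2.717
sum(attempts(6) for _ in range(10000)) / 10000</code></pre>
</div>
</div>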
<div class="paragraph">
<p>Beyond counting mere derangements there are more elaborate constraints and questions we could consider, the sorts of things investigated by <a href="https://en.wikipedia.org/wiki/Random_permutation_statistics">statistics of random permutations</a> and <a href="https://en.wikipedia.org/wiki/Generating_function">generating functions</a>.
We haven’t even answered the third question from the introduction yet (“What is the probability that two people will draw each other’s names?”).
I hope to return to this article in the future when I have more time and a better grasp of combinatoric tools to look into some of those questions.</p>
</div>
<div class="paragraph">
<p>If my explanation above of how to derive \(D_n\) was not clear, don’t worry. Counting derangements is frequently used as an example application of the inclusion-exclusion principle, so better explanations can be found on the web and in almost any introductory combinatorics textbook. See, for example, Professor Howard Haber’s <a href="http://scipp.ucsc.edu/~haber/ph116C/InclusionExclusion.pdf">handout on the inclusion-exclusion principle [PDF]</a>. There are also several other methods for deriving and proving the formula for \(D_n\), including those that first derive the recurrence relation \(D_n = (n-1)(D_{n-1} + D_{n-2})\) and then solve it by iteration or by the method of generating functions. For a solution via generating functions see Jean Pierre Mutanguha’s <a href="http://euler.genepeer.com/the-power-of-generating-functions/">“The Power of Generating Functions”</a>.</p>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_algorithms">Algorithms</h2>
<div class="sectionbody">
<div class="quoteblock">
<blockquote>
[A]lmost as many algorithms have been published for unsorting as for sorting!
</blockquote>
<div class="attribution">
— Donald Knuth
</div>
</div>
<div class="paragraph">
<p>We’ll use Python to explore some algorithms for generating random derangements.
The functions given below are sometimes simplified to get the main ideas across; the complete versions can be found in <a href="https://github.com/cristoper/sinterbot/blob/master/sinterbot/algorithms.py">algorithms.py in the GitHub repository</a>.
To keep things simple, all of the functions operate only on permutations of the set of integers from 0 to n-1. If we’d like to permute a set of some other n objects (like santas) we can then use those integers as indexes into the list of our other objects.</p>
</div>
<div class="sect2">
<h3 id="_utilities">Utilities</h3>
<div class="paragraph">
<p>There are a few utility functions that we might want while exploring and debugging our algorithms. First of all, the ability to calculate \(D_n\), the number of derangements in a set of size \(n\). Here is a straightforward translation of \(\eqref{eq:dn}\) to Python:</p>
</div>
<div class="listingblock">
<div class="title">Dn()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">import</span> <span class="include">math</span>
<span class="keyword">from</span> <span class="include">decimal</span> <span class="keyword">import</span> <span class="include">Decimal</span>
<span class="keyword">def</span> <span class="function">Dn</span>(n: <span class="predefined">int</span>):
<span class="comment"># Use Decimal to handle large n accurately</span>
<span class="comment"># (by large, I mean n>13 or so...</span>
<span class="comment"># factorials get big fast!)</span>
s = <span class="integer">0</span>
<span class="keyword">for</span> k <span class="keyword">in</span> <span class="predefined">range</span>(n+<span class="integer">1</span>):
s += (-<span class="integer">1</span>)**k/Decimal(math.factorial(k))
result = math.factorial(n) * s
<span class="keyword">return</span> Decimal.to_integral_exact(result)
[<span class="predefined">int</span>(Dn(i)) <span class="keyword">for</span> i <span class="keyword">in</span> <span class="predefined">range</span>(<span class="integer">9</span>)]
>> [<span class="integer">1</span>, <span class="integer">0</span>, <span class="integer">1</span>, <span class="integer">2</span>, <span class="integer">9</span>, <span class="integer">44</span>, <span class="integer">265</span>, <span class="integer">1854</span>, <span class="integer">14833</span>]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Next up is a way to generate all \(n!\) permutations of a set.
Several algorithms for generating permutations are well-known. Most classic is an algorithm which produces permutations in lexicographical order described by Knuth in <a href="https://www.kcats.org/csci/464/doc/knuth/fascicles/fasc2b.pdf">7.2.1.2 Algorithm L</a>. Other techniques produce all permutations by only swapping one pair of elements at a time (see the <a href="https://en.wikipedia.org/wiki/Steinhaus%E2%80%93Johnson%E2%80%93Trotter_algorithm">Steinhaus-Johnson-Trotter algorithm</a> which Knuth gives as Algorithm P).</p>
</div>
<div class="paragraph">
<p>But in our case the Python standard library provides a function for generating permutations (in lexicographical order) so we’ll just use that:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">import</span> <span class="include">itertools</span>
<span class="predefined">list</span>(itertools.permutations([<span class="integer">0</span>,<span class="integer">1</span>,<span class="integer">2</span>]))
>> [(<span class="integer">0</span>, <span class="integer">1</span>, <span class="integer">2</span>), (<span class="integer">0</span>, <span class="integer">2</span>, <span class="integer">1</span>), (<span class="integer">1</span>, <span class="integer">0</span>, <span class="integer">2</span>), (<span class="integer">1</span>, <span class="integer">2</span>, <span class="integer">0</span>), (<span class="integer">2</span>, <span class="integer">0</span>, <span class="integer">1</span>), (<span class="integer">2</span>, <span class="integer">1</span>, <span class="integer">0</span>)]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Another helpful function would be a way to decompose permutations into their cycles to make them easier to visualize (taking \((0, 1, 2,\cdots, n-1)\) to be the identity permutation).
To find the cycles in a permutation, start with the first element, then visit the element it maps to (the value at its position in the identity permutation), then the element that one maps to, and so on until we get back to the first element. That completes a cycle containing each of the elements visited; add it to the list. Now start over with the first unvisited element, and repeat until there are no more unvisited elements.</p>
</div>
<div class="paragraph">
<p>Below is an implementation of that algorithm.
It requires storage for the list of cycles (the <code>cycles</code> variable), a way to keep track of unvisited elements (the <code>unvisited</code> variable, which starts as a copy of the input but has elements removed as they are visited), and a way to keep track of the first element in a cycle so that we know when we’ve returned to it (the variable called <code>first</code> below):</p>
</div>
<div class="listingblock">
<div class="title">decompose()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">decompose</span>(perm):
    cycles = []
    unvisited = <span class="predefined">list</span>(perm)
    <span class="keyword">while</span> <span class="predefined">len</span>(unvisited):
        first = unvisited.pop(<span class="integer">0</span>)
        cur = [first]
        nextval = perm[first]
        <span class="keyword">while</span> nextval != first:
            cur.append(nextval)
            <span class="comment"># Remove each element from unvisited</span>
            <span class="comment"># once we visit it</span>
            unvisited.pop(unvisited.index(nextval))
            nextval = perm[nextval]
        cycles.append(cur)
    <span class="keyword">return</span> cycles</code></pre>
</div>
</div>
<div class="paragraph">
<p>As an example, let’s see the cycles in \((1, 2, 4, 3, 0)\mathrm{:}\)</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">decompose([<span class="integer">1</span>,<span class="integer">2</span>,<span class="integer">4</span>,<span class="integer">3</span>,<span class="integer">0</span>])
>> [[<span class="integer">1</span>, <span class="integer">2</span>, <span class="integer">4</span>, <span class="integer">0</span>], [<span class="integer">3</span>]]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Notice this agrees with the cycles we found way back in our first example permutation \(\eqref{eq:perm}\) (where \(\mathrm{S}=0, \mathrm{C}=1, \mathrm{A}=2, \mathrm{L}=3, \mathrm{M}=4\)).</p>
</div>
<div class="paragraph">
<p>Finally, it will be handy to have a function that can test whether a permutation is a derangement or not.
One way to do that would be to call <code>decompose()</code> on the permutation and then check if there are any 1-cycles in the decomposition.
The nice thing about that method is that it generalizes so we could use it to check if the permutation contains any cycles \(\leq m\) for any \(m\).</p>
</div>
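<div class="paragraph">
<p>Such a generalized check might look like this sketch (the name <code>check_min_cycle</code> is mine, not from the repository):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">def check_min_cycle(perm, m):
    # valid only if every cycle involves more than m elements
    return all(len(c) > m for c in decompose(perm))</code></pre>
</div>
</div>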
<div class="paragraph">
<p>But if we only care about derangements (the case where \(m=1\)), it is simpler (and faster) to just iterate over the elements of the permutation and check if they are in their original position. If any are, we can immediately return <code>False</code>, the permutation is not a derangement.</p>
</div>
<div class="listingblock">
<div class="title">check_deranged()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">check_deranged</span>(perm):
<span class="keyword">for</span> i, el <span class="keyword">in</span> <span class="predefined">enumerate</span>(perm):
<span class="keyword">if</span> el == i: <span class="keyword">return</span> <span class="predefined-constant">False</span>
<span class="keyword">return</span> <span class="predefined-constant">True</span>
check_deranged([<span class="integer">1</span>,<span class="integer">2</span>,<span class="integer">4</span>,<span class="integer">3</span>,<span class="integer">0</span>])
>> <span class="predefined-constant">False</span>
decompose([<span class="integer">1</span>,<span class="integer">3</span>,<span class="integer">4</span>,<span class="integer">2</span>,<span class="integer">0</span>])
>> [[<span class="integer">1</span>, <span class="integer">3</span>, <span class="integer">2</span>, <span class="integer">4</span>, <span class="integer">0</span>]] <span class="comment"># Notice no 1-cycles</span>
check_deranged([<span class="integer">1</span>,<span class="integer">3</span>,<span class="integer">4</span>,<span class="integer">2</span>,<span class="integer">0</span>])
>> <span class="predefined-constant">True</span></code></pre>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_how_not_to_generate_derangements">How not to generate derangements</h3>
<div class="paragraph">
<p>The first time I sat down to write a secret santa algorithm, my instinct was to try a <a href="https://en.wikipedia.org/wiki/Backtracking">backtracking</a> approach, and I ended up with something like the code below. The idea behind the backtracker is to iteratively build a derangement by randomly selecting an element from the identity arrangement and then checking if the resulting partial permutation is a derangement. If it is not, undo (backtrack) the last choice and try again. If it is, randomly choose one of the remaining elements and check again. Repeat until you’ve deranged all \(n\) elements:</p>
</div>
<div class="listingblock">
<div class="title">backtracker()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">import</span> <span class="include">random</span>
<span class="keyword">def</span> <span class="function">backtracker</span>(n):
<span class="keyword">if</span> n == <span class="integer">0</span>: <span class="keyword">return</span> []
remaining = <span class="predefined">list</span>(<span class="predefined">range</span>(n))
perm = []
<span class="comment"># backtrack until solution</span>
<span class="keyword">while</span> <span class="predefined">len</span>(perm) < n:
perm.append(random.choice(remaining))
<span class="keyword">if</span> <span class="keyword">not</span> check_deranged(perm):
<span class="keyword">if</span> <span class="predefined">len</span>(remaining) == <span class="integer">1</span>:
<span class="comment"># we're down to the last two elements</span>
<span class="comment"># just swap them to get a derangement</span>
perm[-<span class="integer">1</span>], perm[-<span class="integer">2</span>] = perm[-<span class="integer">2</span>], perm[-<span class="integer">1</span>]
<span class="keyword">return</span> perm
<span class="comment"># undo last choice</span>
perm.pop(-<span class="integer">1</span>)
<span class="keyword">else</span>:
remaining.remove(perm[-<span class="integer">1</span>])
<span class="keyword">return</span> perm
<span class="comment"># Use it to generate a derangement and view the cycles:</span>
perm = backtracker(<span class="integer">5</span>)
perm, decompose(perm)
>> ([<span class="integer">1</span>, <span class="integer">2</span>, <span class="integer">0</span>, <span class="integer">4</span>, <span class="integer">3</span>], [[<span class="integer">1</span>, <span class="integer">2</span>, <span class="integer">0</span>], [<span class="integer">4</span>, <span class="integer">3</span>]])</code></pre>
</div>
</div>
<div class="paragraph">
<p>As written, <code>backtracker()</code> is fast and will produce any possible derangement (and in only ~20 lines of Python), so it <em>could</em> be used for secret santa. However, as the decades fly by your friends might begin to suspect that the same assignments seem to be ‘randomly’ generated fairly often.
They would be right: <code>backtracker()</code> does not produce derangements with uniform probability.
Even though each element of the derangement is chosen from the remaining elements of the input set with uniform probability, the number of valid choices at each step depends on which numbers happen to have been chosen first.</p>
</div>
<div class="paragraph">
<p>For example, let’s look at the probability that <code>backtracker()</code> will produce \((5, 0, 1, 2, 3, 4).\)
The first number can be anything but 0, so there are \(6-1=5\) ways to choose that.
The second number can be any of the remaining 5 numbers except 1, so there are 4 ways to choose that.
The third number can be any of the remaining 4 numbers except 2 which leaves 3 possibilities.
The fourth element can be any of the remaining numbers except for 3, which leaves 2 possibilities.
The fifth element cannot be 4, so there is only one way to derange the last two elements.
If we take the product of those probabilities we get \(\frac{1}{5} \cdot \frac{1}{4} \cdot \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{120}.\)</p>
</div>
<div class="paragraph">
<p>If you do a similar calculation for the probability that <code>backtracker()</code> would produce \((2, 3, 5, 4, 0, 1)\) you should get \(\frac{1}{5} \cdot \frac{1}{4} \cdot \frac{1}{4} \cdot \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{480}.\)
Not only are the probabilities for generating those two derangements significantly different from each other but they also both differ from the expected probability of \(\frac{1}{265}\) if every one of the \(D_6\) derangements had an equal probability of being generated.
I generated 10,000 derangements of length 6 with <code>backtracker()</code>, and sure enough \((5, 0, 1, 2, 3, 4)\) was generated 94 times while \((2, 3, 5, 4, 0, 1)\) was generated only 20 times.
The graph below shows a plot of counts for every 6-derangement over the 10,000 runs:</p>
</div>
<div class="imageblock">
<div class="content">
<img src="/log/2020/12/7/deranged_sinterklaas/generate_backtrack.png" alt="A bar chart showing the frequency of each derangement produced in 10,000 trials. The backtracker algorithm is clearly not choosing derangements uniformly.">
</div>
<div class="title">Figure 2. Count of each derangement produced after running backtracker() 10,000 times with \(n = 6\). It is clearly not choosing derangements uniformly. The grey line shows the expected count if each derangement were generated with uniform probability (\(1/D_6\cdot 10000 \approx 37.7\))</div>
</div>
<div class="paragraph">
<p>Instead of building derangements by randomly selecting elements and checking if the result is a derangement, we could simply generate all possible permutations, filter out the non-derangements, and then randomly select one of the derangements to return.
The nice thing about that approach is that we could enforce any other constraints we want in the filter step (maybe we want a minimum cycle length or have a “blacklist” of people who should not be assigned to each other) and we can still be confident we would select a valid secret santa assignment with uniform probability (since we have generated all of them it is easy to select one at random).</p>
</div>
<div class="listingblock">
<div class="title">generate_all()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">import</span> <span class="include">random</span>
<span class="keyword">def</span> <span class="function">generate_all</span>(n):
potential = []
perms = itertools.permutations(<span class="predefined">range</span>(n))
<span class="keyword">for</span> p <span class="keyword">in</span> perms:
<span class="keyword">if</span> check_constraints(p, m, bl):
potential.append(p)
<span class="keyword">return</span> random.choice(potential)</code></pre>
</div>
</div>
<div class="paragraph">
<p>Below you can see a bar graph of the counts after producing 10,000 derangements of length 6 with <code>generate_all()</code>.
It <em>looks</em> much more uniform than <code>backtracker()</code> at least.
One tool we can use to gauge how closely our counts match what we should expect from a uniform distribution is the <a href="https://en.wikipedia.org/wiki/Chi-squared_test">chi-squared statistic</a>:</p>
</div>
<div class="stemblock">
<div class="content">
\[\begin{equation*}
\chi^2 = \sum\frac{(O_i - E_i)^2}{E_i}
\end{equation*}\]
</div>
</div>
<div class="paragraph">
<p>where \(O_i\) are our observed counts and \(E_i\) are the expected counts for each derangement (which in our case is \(1/D_6\cdot 10000 \approx 37.7\)).
For my data I calculated \(\chi^2 \approx 261.75\).
If we check that against the chi-squared cumulative distribution function with k-1 degrees of freedom (where k is the number of categories, i.e. the 265 distinct derangements in this case), we get a p-value of about 0.53.
The p-value is the probability that our \(\chi^2\) value would be at least 261.75 if our counts were uniformly distributed.
Usually if p<0.05 it would be prudent to question whether the data fits a uniform distribution.
On the other hand if p>0.99 or so we could be confident it is uniform, but we might question whether it is random.
A p-value of 0.53 should leave us confident that <code>generate_all()</code> randomly generates derangements with uniform or very nearly uniform probability.</p>
</div>
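<div class="paragraph">
<p>For reference, that p-value can be computed with the chi-squared survival function, assuming SciPy is available:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">from scipy.stats import chi2
# probability of a statistic of at least 261.75 under uniformity,
# with 265 - 1 = 264 degrees of freedom
round(chi2.sf(261.75, df=264), 2)
>> 0.53</code></pre>
</div>
</div>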
<div class="imageblock">
<div class="content">
<img src="/log/2020/12/7/deranged_sinterklaas/generate_all.png" alt="A bar chart showing the frequency of each derangement produced in 10,000 trials. The generate_all algorithm appears to be uniform">
</div>
<div class="title">Figure 3. Count of each derangement produced after running generate_all() 10,000 times with \(n = 6\). \(\chi^2 \approx 261.75\) and the chi-squared test p-value ≈ 0.53. The grey line shows the expected count if each derangement were generated with uniform probability (\(\approx 37.7\))</div>
</div>
<div class="paragraph">
<p>But there are two major problems with <code>generate_all()</code>: it is slow (because we have to generate all \(n!\) permutations), and it uses a lot of memory (because we have to store all \(D_n\) derangements).
\(D_{12} = 176,214,841\), for example, so even if we implemented our permutations in some memory efficient way (say an array of one byte per element), we would need over 1GB of memory just to store all of the derangements before returning one.
Running <code>generate_all()</code> with \(n>11\) runs my desktop out of RAM after about a minute and crashes the Python interpreter.
And in the grand scheme of things 12 is not such a huge number.</p>
</div>
</div>
<div class="sect2">
<h3 id="_how_to_generate_derangements">How to generate derangements</h3>
<div class="paragraph">
<p>We can do better than the <code>backtracker</code> and <code>generate_all</code> algorithms above by combining the best aspects of each: generate a single random permutation and check if it is a derangement. If it is, return it; otherwise, try again by generating another random permutation.
That should be much more efficient than generating and storing all possible derangements, and as long as we can generate permutations with uniform probability we will also generate derangements with uniform probability.</p>
</div>
<div class="paragraph">
<p>A well-known algorithm for creating a random permutation by shuffling a given arrangement is to simply select one of the elements at random, set that as the leftmost element of the permutation, and then repeat by selecting one of the remaining elements at random until all of the elements have been selected.
This is known as the <a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle">Fisher-Yates shuffle</a> named for two statisticians who described a paper-and-pen method for shuffling a sequence in 1938.
The computer version of the algorithm — popularized in Chapter 3 of Knuth’s <em>The Art of Computer Programming</em> (“Algorithm P (Shuffling)”) — usually shuffles an array in place.
It does this by iterating through the array from left to right swapping the element at the index with a random element to the right of the index.
Once an element has been swapped left it is in its position in the generated permutation.
Repeat to the end.
For a good visualization of Fisher-Yates (and how it compares to less efficient algorithms) see Mike Bostock’s <a href="https://bost.ocks.org/mike/shuffle/">Fisher-Yates Shuffle</a>.</p>
</div>
<div class="paragraph">
<p>The <code>shuffle_rejection()</code> algorithm below repeatedly shuffles a list using Fisher-Yates until the resulting permutation is a derangement:</p>
</div>
<div class="listingblock">
<div class="title">shuffle_rejection()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">shuffle_rejection</span>(n):
    perm = <span class="predefined">list</span>(<span class="predefined">range</span>(n))
    <span class="keyword">while</span> <span class="keyword">not</span> check_deranged(perm):
        <span class="comment"># Fisher-Yates shuffle:</span>
        <span class="keyword">for</span> i <span class="keyword">in</span> <span class="predefined">range</span>(n):
            k = random.randrange(n-i)+i <span class="comment"># i <= k < n</span>
            perm[i], perm[k] = perm[k], perm[i]
    <span class="keyword">return</span> perm</code></pre>
</div>
</div>
<div class="paragraph">
<p>Notice that in the Fisher-Yates algorithm the range of the random index <code>k</code> includes the index of the current element <code>i</code>. In other words, elements can swap with themselves creating a 1-cycle.
That is necessary, of course, to generate all possible permutations.
If the algorithm is changed so that <code>k</code> ranges only from \(i < k < n\) so that it does not include <code>i</code>, then the algorithm will produce only permutations with a single n-cycle.
This is known as <a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Sattolo’s_algorithm">Sattolo’s algorithm</a>, and it generates n-cycles with uniform probability.</p>
</div>
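<div class="paragraph">
<p>Since this post is about derangements, it is worth noting that every single n-cycle is automatically a derangement (no element can map to itself in a cycle of length \(n \ge 2\)), but not every derangement is a single n-cycle, so Sattolo&#8217;s algorithm alone does not sample derangements uniformly. For reference, here is a minimal sketch of Sattolo&#8217;s variant written in the same left-to-right style as <code>shuffle_rejection()</code> above (assuming the same <code>import random</code>; the function name is mine):</p>
</div>
<div class="listingblock">
<div class="title">sattolo() (sketch)</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">def sattolo(n):
    perm = list(range(n))
    for i in range(n - 1):
        # Unlike Fisher-Yates, k ranges over i < k < n, so an
        # element can never be swapped with itself.
        k = random.randrange(i + 1, n)
        perm[i], perm[k] = perm[k], perm[i]
    return perm</code></pre>
</div>
</div>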
<div class="paragraph">
<p>The bar graph below summarizes the results of generating 10,000 derangements of size 6 with <code>shuffle_rejection()</code>.
The distribution appears to be uniform, as expected. The algorithm is also fast and simple, which makes it perfectly suitable for generating secret santa assignments (and is, in fact, what I use in <a href="#_software">sinterbot</a>, my secret santa tool).</p>
</div>
<div class="imageblock">
<div class="content">
<img src="/log/2020/12/7/deranged_sinterklaas/generate_rejection.png" alt="A bar chart showing the frequency of each derangement produced in 10,000 trials. shuffle_rejection algorithm appears to be uniform">
</div>
<div class="title">Figure 4. Count of each derangement produced after running shuffle_rejection() 10,000 times with \(n = 6\). \(\chi^2 \approx 278.9\) and the chi-squared test p-value ≈ 0.25.</div>
</div>
<div class="paragraph">
<p>One inelegance of <code>shuffle_rejection()</code> is that it may generate many random permutations just to throw them away.
In principle it could keep generating and rejecting non-deranged permutations all day without ever returning.
In practice derangements are common enough that this is not a concern (the function finds close to 20,000 derangements of length 6 per second on my old desktop).
But is there a way to directly generate derangements with uniform probability, without needing to backtrack or reject non-deranged permutations?</p>
</div>
<div class="paragraph">
<p>Yes. In 2008 Martínez et al. published one such algorithm (<a href="https://doc.lagout.org/science/0_Computer%20Science/2_Algorithms/Proceedings%20of%20the%20Tenth%20Workshop%20on%20Algorithm%20Engineering%20and%20Experiments%20and%20the%20Fifth%20Workshop%20on%20Analytic%20Algorithmics%20and%20Combinatorics%20%5BMunro%20et%20al.%202008-05-30%5D.pdf">“Generating Random Derangements,”</a> 234-240).
It is similar to Sattolo’s algorithm, but instead of joining every element into a single n-cycle, it randomly closes cycles with a probability chosen to ensure that derangements are generated uniformly.
<a href="https://www.cs.upc.edu/~conrado/research/talks/analco08.pdf">Here is a nice set of slides</a> that goes through their algorithm step by step.</p>
</div>
<div class="paragraph">
<p>Jörg Arndt provides an easier-to-follow (in my opinion) version of the algorithm in his 2010 thesis <a href="https://maths-people.anu.edu.au/~brent/pd/Arndt-thesis.pdf"><em>Generating Random Permutations</em></a>. (It’s a short book that includes several useful algorithms.)
This Python implementation more closely follows his version:</p>
</div>
<div class="listingblock">
<div class="title">rand_derangement()</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">rand_derangement</span>(n):
perm = <span class="predefined">list</span>(<span class="predefined">range</span>(n))
remaining = <span class="predefined">list</span>(perm)
<span class="keyword">while</span> (<span class="predefined">len</span>(remaining)><span class="integer">1</span>):
<span class="comment"># random index < last:</span>
rand_i = random.randrange(<span class="predefined">len</span>(remaining)-<span class="integer">1</span>)
rand = remaining[rand_i]
last = remaining[-<span class="integer">1</span>]
<span class="comment"># swap to join cycles</span>
perm[last], perm[rand] = perm[rand], perm[last]
<span class="comment"># remove last from remaining</span>
remaining.pop(-<span class="integer">1</span>)
p = random.random() <span class="comment"># uniform [0, 1)</span>
r = <span class="predefined">len</span>(remaining)
prob = r * Dn(r-<span class="integer">1</span>)/Dn(r+<span class="integer">1</span>)
<span class="keyword">if</span> p < prob:
<span class="comment"># Close the cycle</span>
remaining.pop(rand_i)
<span class="keyword">return</span> perm</code></pre>
</div>
</div>
<div class="imageblock">
<div class="content">
<img src="/log/2020/12/7/deranged_sinterklaas/rand_derangement.png" alt="A bar chart showing the frequency of each derangement produced in 10,000 trials. The rand_derangement() algorithm appears to be uniform.">
</div>
<div class="title">Figure 5. Count of each derangement produced after running rand_derangement() 10,000 times with \(n = 6\). \(\chi^2 \approx 261.1\) and the chi-squared test p-value ≈ 0.54.</div>
</div>
<div class="paragraph">
<p>Arndt’s implementation is in C, with a precomputed lookup table for the ratio calculated on the <code>prob = r * Dn(r-1)/Dn(r+1)</code> line. Even then, he reports it is only slightly faster than the rejection method. This Python implementation actually runs at about half the speed of the rejection method in my tests.</p>
</div>
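<div class="paragraph">
<p>In Python, much of the benefit of Arndt&#8217;s lookup table can be recovered by memoizing the <code>Dn()</code> derangement-count helper so that each value is computed only once. Here is a minimal sketch of one way to implement and cache such a helper, assuming the standard recurrence \(D_n = (n-1)(D_{n-1} + D_{n-2})\) with \(D_0 = 1\) and \(D_1 = 0\):</p>
</div>
<div class="listingblock">
<div class="title">Memoized Dn() (sketch)</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">from functools import lru_cache

@lru_cache(maxsize=None)
def Dn(n):
    # Count of derangements of n elements via the recurrence
    # D(0) = 1, D(1) = 0, D(n) = (n - 1) * (D(n - 1) + D(n - 2))
    if n == 0:
        return 1
    if n == 1:
        return 0
    return (n - 1) * (Dn(n - 1) + Dn(n - 2))</code></pre>
</div>
</div>
<div class="paragraph">
<p>With the cache populated, the <code>prob = r * Dn(r-1)/Dn(r+1)</code> line costs only a couple of dictionary lookups.</p>
</div>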
<div class="paragraph">
<p>But one advantage of generating derangements directly, as in <code>rand_derangement()</code>, is that the method can be generalized to produce derangements with a minimum cycle length. Arndt shows how that can be done in his thesis.</p>
</div>
<div class="paragraph">
<p>There are other ways to generate random derangements that I’ve not covered in this post.
Earlier this year J. Ricardo G. Mendonça published two new algorithms for [almost-]uniformly generating random derangements: <a href="https://arxiv.org/pdf/1809.04571.pdf">“Efficient generation of random derangements with the expected distribution of cycle lengths,”</a> <em>Computational and Applied Mathematics</em> 39, no. 3 (2020): 1-15.</p>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="software">Sinterbot2020</h2>
<div class="sectionbody">
<div class="paragraph">
<p><code>sinterbot</code> is a little command line program (Python 3.5+) that helps to manage secret santa assignments. With <code>sinterbot</code> you can generate a valid secret santa assignment for a list of people and email each person their assigned gift recipient without ever revealing to anybody (including the operator of <code>sinterbot</code>) the full secret list of assignments.</p>
</div>
<div class="paragraph">
<p>Source code and more usage instructions: <a href="https://github.com/cristoper/sinterbot" class="bare">https://github.com/cristoper/sinterbot</a></p>
</div>
<div class="paragraph">
<p><code>sinterbot</code> allows specifying some extra constraints such as minimum cycle length or a blacklist of people who should not be assigned to each other.</p>
</div>
<div class="sect2">
<h3 id="_installation">Installation</h3>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh">pip install sinterbot</code></pre>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_usage">Usage</h3>
<div class="paragraph">
<p>First create a config file with a list of participants' names and email addresses. The config file may also specify constraints for minimum cycle length and a blacklist. See <a href="https://github.com/cristoper/sinterbot/blob/master/sample.conf">sample.conf</a> for a full example:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xmas2020.conf"># xmas2020.conf
Santa A: user1@email.tld
Santa B: user2@email.tld
Santa C: user3@email.tld
Santa D: user4@email.tld
Santa E: user5@email.tld</code></pre>
</div>
</div>
<div class="paragraph">
<p>The format is <code>Name: emailaddress</code>. Only the email addresses need to be unique.</p>
</div>
<div class="paragraph">
<p>Then run <code>sinterbot derange</code> to compute a valid assignment and save it to the config file:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh">$ sinterbot derange xmas2020.conf
Derangement info successfully added to config file.
Use `sinterbot send xmas2020.conf -c smtp.conf` to send emails!</code></pre>
</div>
</div>
<div class="paragraph">
<p><code>sinterbot</code> will not allow you to re-derange a config file without passing the <code>--force</code> flag.</p>
</div>
<div class="paragraph">
<p>Now, if you want, you can view the secret santa assignments with <code>sinterbot view xmas2020.conf</code>. However, if you’re a participant, that would ruin the surprise for you! Instead you can email each person their assignment without ever seeing the assignments yourself.</p>
</div>
<div class="paragraph">
<p>First create a file to specify your SMTP credentials:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh"># smtp.conf
SMTPEmail: yourname@gmail.com
SMTPPass: yourgmailpassword
SMTPServer: smtp.gmail.com
SMTPPort: 587</code></pre>
</div>
</div>
<div class="paragraph">
<p>(If you do not know what SMTP server to use but you have a gmail account, you can <a href="https://www.digitalocean.com/community/tutorials/how-to-use-google-s-smtp-server">use gmail’s SMTP server</a> with values like those shown above.)</p>
</div>
<div class="paragraph">
<p>Then run the <code>sinterbot send</code> command, giving it the smtp credentials file with the <code>-c</code> option, to send the emails:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh">$ sinterbot send xmas2020.conf -c smtp.conf
Send message to user1@email.tld!
Send message to user2@email.tld!
Send message to user3@email.tld!
Send message to user4@email.tld!
Send message to user5@email.tld!</code></pre>
</div>
</div>
</div>
</div>
</div>How to generate uniformly random derangements for secret santa purposes.tag:catswhisker.xyz,2017-09-07:/log/2017/9/7/a_first_excercise_in_natural_language_processing_with_python_counting_hapaxes/A First Exercise in Natural Language Processing with Python: Counting Hapaxes2017-09-07T18:32:37Z2022-10-23T17:44:18Z<div id="toc" class="toc">
<div id="toctitle">Table of Contents</div>
<ul class="sectlevel1">
<li><a href="#_a_first_exercise">A first exercise</a></li>
<li><a href="#_natural_language_processing_with_python">Natural language processing with Python</a>
<ul class="sectlevel2">
<li><a href="#_installation">Installation</a></li>
<li><a href="#_optional_dependency_on_python_modules">Optional dependency on Python modules</a></li>
<li><a href="#_tokenization">Tokenization</a></li>
<li><a href="#_counting_word_forms">Counting word forms</a></li>
<li><a href="#_stemming_and_lemmatization">Stemming and Lemmatization</a></li>
<li><a href="#_lemmatization_with_nltk">Lemmatization with NLTK</a></li>
<li><a href="#_finding_hapaxes_with_spacy">Finding hapaxes with spaCy</a></li>
<li><a href="#_make_it_a_script">Make it a script</a></li>
</ul>
</li>
<li><a href="#_hapaxes_py_listing">hapaxes.py listing</a></li>
</ul>
</div>
<div class="sect1">
<h2 id="_a_first_exercise">A first exercise</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Counting <a href="https://en.wikipedia.org/wiki/Hapax_legomenon">hapaxes</a> (words which occur only once in a text or corpus) is an easy enough problem that makes use of both simple data structures and some fundamental tasks of natural language processing (NLP): tokenization (dividing a text into words), stemming, and part-of-speech tagging for lemmatization. For that reason it makes a good exercise to get started with NLP in a new language or library.</p>
</div>
<div class="paragraph">
<p>As a first exercise in implementing NLP tasks with Python, then, we’ll write a script which outputs the count and a list of the hapaxes in the following paragraph (our script can also be run on an arbitrary input file). You can follow along, or try it yourself and then compare your solution to mine.</p>
</div>
<div class="listingblock">
<div class="content">
<pre>Cory Linguist, a cautious corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link. Now, if Cory Linguist, a careful corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link, see that YOU, in creating a corpus of courtship correspondence, corrupt not a crucial link.</pre>
</div>
</div>
<div class="paragraph">
<p>To keep things simple, ignore punctuation and case. To make things complex, count hapaxes in all three of word form, stemmed form, and lemma form. The final program (<a href="hapaxes.py">hapaxes.py</a>) is listed at the end of this post. The sections below walk through it in detail for the beginning NLP/Python programmer.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_natural_language_processing_with_python">Natural language processing with Python</h2>
<div class="sectionbody">
<div class="paragraph">
<p>There are several NLP packages available to the Python programmer. The most well-known is the <a href="http://www.nltk.org/">Natural Language Toolkit (NLTK)</a>, which is the subject of the popular book <a href="http://www.nltk.org/book/"><em>Natural Language Processing with Python</em></a> by Bird et al. NLTK has a focus on education/research with a rather sprawling API. <a href="https://github.com/clips/pattern">Pattern</a> is a Python package for data mining the WWW that includes submodules for language processing and machine learning. <a href="http://polyglot.readthedocs.io/en/latest/">Polyglot</a> is a language library focusing on “massive multilingual applications.” Many of its features support over 100 languages (but it doesn’t seem to have a stemmer or lemmatizer built in). And there is Matthew Honnibal’s <a href="https://spacy.io/">spaCy</a>, an “industrial strength” NLP library focused on performance and integration with machine learning models.</p>
</div>
<div class="paragraph">
<p>If you don’t already know which library you want to use, I recommend starting with NLTK because there are so many online resources available for it. The program presented below implements four solutions to counting hapaxes, which will hopefully give you a feel for a few of the libraries mentioned above:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Word forms - counts unique spellings (normalized for case). This uses plain Python (no NLP packages required)</p>
</li>
<li>
<p>NLTK stems - counts unique stems using a stemmer provided by NLTK</p>
</li>
<li>
<p>NLTK lemmas - counts unique lemma forms using NLTK’s part of speech tagger and
interface to the WordNet lemmatizer</p>
</li>
<li>
<p>spaCy lemmas - counts unique lemma forms using the spaCy NLP package</p>
</li>
</ul>
</div>
<div class="sect2">
<h3 id="_installation">Installation</h3>
<div class="paragraph">
<p>This tutorial assumes you already have Python installed on your system and have some experience using the interpreter. I recommend referring to each package’s project page for installation instructions, but here is one way using <a href="https://pypi.python.org/pypi/pip">pip</a>. As explained below, each of the NLP packages are optional; feel free to install only the ones you’re interested in playing with.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh"># Install NLTK:
$ pip install nltk
# Download required NLTK data packages
$ python -c 'import nltk; nltk.download("wordnet"); nltk.download("averaged_perceptron_tagger"); nltk.download("omw-1.4")'
# install spaCy:
$ pip install spacy
# install spaCy en model:
$ python -m spacy download en_core_web_sm</code></pre>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_optional_dependency_on_python_modules">Optional dependency on Python modules</h3>
<div class="paragraph">
<p>It would be nice if our script didn’t depend on any particular NLP package so that it could still run even if one or more of them were not installed (using only the functionality provided by whichever packages are installed).</p>
</div>
<div class="paragraph">
<p>One way to implement a script with optional package dependencies in Python is to try to import a module, and if we get an <code>ImportError</code> <a href="https://docs.python.org/3/tutorial/errors.html#exceptions">exception</a> we mark the package as uninstalled (by setting a variable with the module’s name to <code>None</code>) which we can check for later in our code:</p>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 63-98]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="comment">### Imports</span>
<span class="comment">#</span>
<span class="comment"># Import some Python 3 features to use in Python 2</span>
<span class="keyword">from</span> <span class="include">__future__</span> <span class="keyword">import</span> <span class="include">print_function</span>
<span class="keyword">from</span> <span class="include">__future__</span> <span class="keyword">import</span> <span class="include">unicode_literals</span>
<span class="comment"># gives us access to command-line arguments</span>
<span class="keyword">import</span> <span class="include">sys</span>
<span class="comment"># The Counter collection is a convenient layer on top of</span>
<span class="comment"># python's standard dictionary type for counting iterables.</span>
<span class="keyword">from</span> <span class="include">collections</span> <span class="keyword">import</span> <span class="include">Counter</span>
<span class="comment"># The standard python regular expression module:</span>
<span class="keyword">import</span> <span class="include">re</span>
<span class="keyword">try</span>:
<span class="comment"># Import NLTK if it is installed</span>
<span class="keyword">import</span> <span class="include">nltk</span>
<span class="comment"># This imports NLTK's implementation of the Snowball</span>
<span class="comment"># stemmer algorithm</span>
<span class="keyword">from</span> <span class="include">nltk.stem.snowball</span> <span class="keyword">import</span> <span class="include">SnowballStemmer</span>
<span class="comment"># NLTK's interface to the WordNet lemmatizer</span>
<span class="keyword">from</span> <span class="include">nltk.stem.wordnet</span> <span class="keyword">import</span> <span class="include">WordNetLemmatizer</span>
<span class="keyword">except</span> <span class="exception">ImportError</span>:
nltk = <span class="predefined-constant">None</span>
print(<span class="string"><span class="delimiter">"</span><span class="content">NLTK is not installed, so we won't use it.</span><span class="delimiter">"</span></span>)
<span class="keyword">try</span>:
<span class="comment"># Import spaCy if it is installed</span>
<span class="keyword">import</span> <span class="include">spacy</span>
<span class="keyword">except</span> <span class="exception">ImportError</span>:
spacy = <span class="predefined-constant">None</span>
print(<span class="string"><span class="delimiter">"</span><span class="content">spaCy is not installed, so we won't use it.</span><span class="delimiter">"</span></span>)</code></pre>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_tokenization">Tokenization</h3>
<div class="paragraph">
<p><a href="https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)">Tokenization</a> is the process of splitting a string into lexical ‘tokens’ — usually words or sentences. In languages with space-separated words, satisfactory tokenization can often be accomplished with a few simple rules, though ambiguous punctuation can cause errors (such as mistaking a period after an abbreviation as the end of a sentence). Some tokenizers use statistical inference (trained on a corpus with known token boundaries) to recognize tokens.</p>
</div>
<div class="paragraph">
<p>In our case we need to break the text into a list of words in order to find the hapaxes. But since we are not interested in punctuation or capitalization, we can make tokenization very simple by first normalizing the text to lower case and stripping out every punctuation symbol:</p>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 100-119]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">normalize_tokenize</span>(string):
<span class="docstring"><span class="delimiter">"""</span><span class="content">
</span><span class="content"> Takes a string, normalizes it (makes it lowercase and</span><span class="content">
</span><span class="content"> removes punctuation), and then splits it into a list of</span><span class="content">
</span><span class="content"> words.</span><span class="content">
</span><span class="content">
</span><span class="content"> Note that everything in this function is plain Python</span><span class="content">
</span><span class="content"> without using NLTK (although as noted below, NLTK provides</span><span class="content">
</span><span class="content"> some more sophisticated tokenizers we could have used).</span><span class="content">
</span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="comment"># make lowercase</span>
norm = string.lower()
<span class="comment"># remove punctuation</span>
norm = re.sub(<span class="string"><span class="modifier">r</span><span class="delimiter">'</span><span class="content">(?u)[^</span><span class="content">\w</span><span class="content">\s</span><span class="content">]</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="delimiter">'</span></span>, norm) <i class="conum" data-value="1"></i><b>(1)</b>
<span class="comment"># split into words</span>
tokens = norm.split()
<span class="keyword">return</span> tokens</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Remove punctuation by replacing everything that is not a word (<code>\w</code>) or whitespace (<code>\s</code>) with an empty string. The (<code>?u</code>) flag at the beginning of the regex enables unicode matching for the \w and \s character classes in Python 2 (unicode is the default with Python 3).</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>Our tokenizer produces output like this:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> normalize_tokenize(<span class="string"><span class="delimiter">"</span><span class="content">This is a test sentence of white-space separated words.</span><span class="delimiter">"</span></span>)
[<span class="string"><span class="delimiter">'</span><span class="content">this</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">is</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">a</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">test</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">sentence</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">of</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">whitespace</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">separated</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">words</span><span class="delimiter">'</span></span>]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Instead of simply removing punctuation and then splitting words on whitespace, we could have used one of <a href="http://www.nltk.org/api/nltk.tokenize.html">the tokenizers provided by NLTK</a>. Specifically, we could have used the <code>word_tokenize()</code> method, which first splits the text into sentences using a pre-trained English sentence tokenizer (<code>sent_tokenize</code>) and then finds words using regular expressions in the style of Penn Treebank tokenization.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="comment"># We could have done it this way (requires the</span>
<span class="comment"># 'punkt' data package):</span>
<span class="keyword">from</span> <span class="include">nltk.tokenize</span> <span class="keyword">import</span> <span class="include">word_tokenize</span>
tokens = word_tokenize(norm)</code></pre>
</div>
</div>
<div class="paragraph">
<p>The main advantage of <code>word_tokenize()</code> is that it will turn contractions into separate tokens. But using Python’s standard <code>split()</code> is good enough for our purposes.</p>
</div>
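<div class="paragraph">
<p>For example, <code>word_tokenize()</code> splits a contraction into two tokens, where our simple tokenizer just strips the apostrophe (interactive session, assuming the <code>punkt</code> data package has been downloaded):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("don't")
['do', "n't"]
>>> normalize_tokenize("don't")
['dont']</code></pre>
</div>
</div>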
</div>
<div class="sect2">
<h3 id="_counting_word_forms">Counting word forms</h3>
<div class="paragraph">
<p>We can use the tokenizer defined above to get a list of words from any string, so now we need a way to count how many times each word occurs. Those that occur only once are our word-form hapaxes.</p>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 121-135]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">word_form_hapaxes</span>(tokens):
<span class="docstring"><span class="delimiter">"""</span><span class="content">
</span><span class="content"> Takes a list of tokens and returns a list of the</span><span class="content">
</span><span class="content"> wordform hapaxes (those wordforms that only appear once)</span><span class="content">
</span><span class="content">
</span><span class="content"> For wordforms this is simple enough to do in plain</span><span class="content">
</span><span class="content"> Python without an NLP package, especially using the Counter</span><span class="content">
</span><span class="content"> type from the collections module (part of the Python</span><span class="content">
</span><span class="content"> standard library).</span><span class="content">
</span><span class="content"> </span><span class="delimiter">"""</span></span>
counts = Counter(tokens) <i class="conum" data-value="1"></i><b>(1)</b>
hapaxes = [word <span class="keyword">for</span> word <span class="keyword">in</span> counts <span class="keyword">if</span> counts[word] == <span class="integer">1</span>] <i class="conum" data-value="2"></i><b>(2)</b>
<span class="keyword">return</span> hapaxes</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Use the convenient <code><a href="https://docs.python.org/3/library/collections.html#collections.Counter">Counter</a></code> class from Python’s standard library to count the occurrences of each token. <code>Counter</code> is a subclass of the standard <code>dict</code> type; its constructor takes a list of items from which it builds a dictionary whose keys are elements from the list and whose values are the number of times each element appeared in the list.</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>This <a href="https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions">list comprehension</a> creates a list from the Counter dictionary containing only the dictionary keys that have a count of 1. These are our hapaxes.</td>
</tr>
</table>
</div>
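<div class="paragraph">
<p>To see concretely what <code>Counter</code> and the hapax filter do, here is a quick interactive example (the repr shown is from a recent Python 3, where <code>Counter</code> displays entries in decreasing-count order):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> from collections import Counter
>>> counts = Counter(['to', 'be', 'or', 'not', 'to', 'be'])
>>> counts
Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
>>> [word for word in counts if counts[word] == 1]
['or', 'not']</code></pre>
</div>
</div>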
</div>
<div class="sect2">
<h3 id="_stemming_and_lemmatization">Stemming and Lemmatization</h3>
<div class="paragraph">
<p>If we use our two functions to first tokenize and then find the hapaxes in our example text, we get this output:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> text = <span class="string"><span class="delimiter">"</span><span class="content">Cory Linguist, a cautious corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link. Now, if Cory Linguist, a careful corpus linguist, in creating a corpus of courtship correspondence, corrupted a crucial link, see that YOU, in creating a corpus of courtship correspondence, corrupt not a crucial link.</span><span class="delimiter">"</span></span>
>>> tokens = normalize_tokenize(text)
>>> word_form_hapaxes(tokens)
[<span class="string"><span class="delimiter">'</span><span class="content">now</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">not</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">that</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">see</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">if</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">corrupt</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">you</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">careful</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">cautious</span><span class="delimiter">'</span></span>]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Notice that ‘corrupt’ is counted as a hapax even though the text also includes two instances of the word ‘corrupted’. That is expected because ‘corrupt’ and ‘corrupted’ are different word-forms, but if we want to count word roots regardless of their inflections we must process our tokens further. There are two main methods we can try:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><a href="https://en.wikipedia.org/wiki/Stemming">Stemming</a> uses an algorithm (and/or a lookup table) to remove the suffix of tokens so that words with the same base but different inflections are reduced to the same form. For example: ‘argued’ and ‘arguing’ are both stemmed to ‘argu’.</p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Lemmatisation">Lemmatization</a> reduces tokens to their lemmas, their canonical dictionary form. For example, ‘argued’ and ‘arguing’ are both lemmatized to ‘argue’.</p>
</li>
</ul>
</div>
<div class="sect3">
<h4 id="_stemming_with_nltk">Stemming with NLTK</h4>
<div class="paragraph">
<p>In 1980 Martin Porter published <a href="https://tartarus.org/martin/PorterStemmer/index.html">a stemming algorithm</a> which has become a standard way to stem English words. His algorithm was implemented so many times, and with so many errors, that he later created <a href="https://snowballstem.org/">a programming language called Snowball</a> to help clearly and exactly define stemmers. NLTK includes a Python port of the Snowball implementation of an improved version of Porter’s original stemmer:</p>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 137-153]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">nltk_stem_hapaxes</span>(tokens):
<span class="docstring"><span class="delimiter">"""</span><span class="content">
</span><span class="content"> Takes a list of tokens and returns a list of the word</span><span class="content">
</span><span class="content"> stem hapaxes.</span><span class="content">
</span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="keyword">if</span> <span class="keyword">not</span> nltk: <i class="conum" data-value="1"></i><b>(1)</b>
<span class="comment"># Only run if NLTK is loaded</span>
<span class="keyword">return</span> <span class="predefined-constant">None</span>
<span class="comment"># Apply NLTK's Snowball stemmer algorithm to tokens:</span>
stemmer = SnowballStemmer(<span class="string"><span class="delimiter">"</span><span class="content">english</span><span class="delimiter">"</span></span>)
stems = [stemmer.stem(token) <span class="keyword">for</span> token <span class="keyword">in</span> tokens]
<span class="comment"># Filter down to hapaxes:</span>
counts = nltk.FreqDist(stems) <i class="conum" data-value="2"></i><b>(2)</b>
hapaxes = counts.hapaxes() <i class="conum" data-value="3"></i><b>(3)</b>
<span class="keyword">return</span> hapaxes</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Here we check if the <code>nltk</code> module was loaded; if it was not (presumably because it is not installed), we return without trying to run the stemmer.</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>NLTK’s <code><a href="http://www.nltk.org/_modules/nltk/probability.html">FreqDist</a></code> class subclasses the <code>Counter</code> container type we used above to count word-forms. It adds some methods useful for calculating frequency distributions.</td>
</tr>
<tr>
<td><i class="conum" data-value="3"></i><b>3</b></td>
<td>The <code>FreqDist</code> class also adds a <code>hapaxes()</code> method, which is implemented exactly like the list comprehension we used to count word-form hapaxes.</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>Running <code>nltk_stem_hapaxes()</code> on our tokenized example text produces this list of stem hapaxes:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> nltk_stem_hapaxes(tokens)
[<span class="string"><span class="delimiter">'</span><span class="content">now</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">cautious</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">that</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">not</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">see</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">you</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">care</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">if</span><span class="delimiter">'</span></span>]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Notice that ‘corrupt’ is no longer counted as a hapax (since it shares a stem with ‘corrupted’), and ‘careful’ has been stemmed to ‘care’.</p>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_lemmatization_with_nltk">Lemmatization with NLTK</h3>
<div class="paragraph">
<p>NLTK provides a lemmatizer (the <code>WordNetLemmatizer</code> class in <a href="http://www.nltk.org/_modules/nltk/stem/wordnet.html">nltk.stem.wordnet</a>) which tries to find a word’s lemma form with help from the <a href="https://wordnet.princeton.edu/">WordNet</a> corpus (which can be downloaded by running <code>nltk.download()</code> from an interactive python prompt — refer to <a href="http://www.nltk.org/data.html">“Installing NLTK Data”</a> for general instructions).</p>
</div>
<div class="paragraph">
<p>In order to resolve ambiguous cases, lemmatization usually requires tokens to be accompanied by part-of-speech tags. For example, the lemma of the word <em>rose</em> depends on whether it is used as a noun or a verb:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> lemmer = WordNetLemmatizer()
>>> lemmer.lemmatize(<span class="string"><span class="delimiter">'</span><span class="content">rose</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">n</span><span class="delimiter">'</span></span>) <span class="comment"># tag as noun</span>
<span class="string"><span class="delimiter">'</span><span class="content">rose</span><span class="delimiter">'</span></span>
>>> lemmer.lemmatize(<span class="string"><span class="delimiter">'</span><span class="content">rose</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="content">v</span><span class="delimiter">'</span></span>) <span class="comment"># tag as verb</span>
<span class="string"><span class="delimiter">'</span><span class="content">rise</span><span class="delimiter">'</span></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>Since we are operating on untagged tokens, we’ll first run them through an automated part-of-speech tagger provided by NLTK (it uses a pre-trained perceptron tagger originally by Matthew Honnibal: <a href="https://explosion.ai/blog/part-of-speech-pos-tagger-in-python">“A Good Part-of-Speech Tagger in about 200 Lines of Python”</a>). The tagger requires the training data in the 'averaged_perceptron_tagger.pickle' file, which can be downloaded by running <code>nltk.download()</code> from an interactive Python prompt.</p>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 155-176]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">nltk_lemma_hapaxes</span>(tokens):
<span class="docstring"><span class="delimiter">"""</span><span class="content">
</span><span class="content"> Takes a list of tokens and returns a list of the lemma</span><span class="content">
</span><span class="content"> hapaxes.</span><span class="content">
</span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="keyword">if</span> <span class="keyword">not</span> nltk:
<span class="comment"># Only run if NLTK is loaded</span>
<span class="keyword">return</span> <span class="predefined-constant">None</span>
<span class="comment"># Tag tokens with part-of-speech:</span>
tagged = nltk.pos_tag(tokens) <i class="conum" data-value="1"></i><b>(1)</b>
<span class="comment"># Convert our Treebank-style tags to WordNet-style tags.</span>
tagged = [(word, pt_to_wn(tag))
<span class="keyword">for</span> (word, tag) <span class="keyword">in</span> tagged] <i class="conum" data-value="2"></i><b>(2)</b>
<span class="comment"># Lemmatize:</span>
lemmer = WordNetLemmatizer()
lemmas = [lemmer.lemmatize(token, pos)
<span class="keyword">for</span> (token, pos) <span class="keyword">in</span> tagged] <i class="conum" data-value="3"></i><b>(3)</b>
<span class="keyword">return</span> nltk_stem_hapaxes(lemmas) <i class="conum" data-value="4"></i><b>(4)</b></code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>This turns our list of tokens into a list of 2-tuples: <code>[(token1, tag1), (token2, tag2)…​]</code></td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>We must convert between the tags returned by <code>pos_tag()</code> and the tags expected by the WordNet lemmatizer. This is done by applying the <code>pt_to_wn()</code> function (defined below) to each tag.</td>
</tr>
<tr>
<td><i class="conum" data-value="3"></i><b>3</b></td>
<td>Pass each token and POS tag to the WordNet lemmatizer.</td>
</tr>
<tr>
<td><i class="conum" data-value="4"></i><b>4</b></td>
<td>If a lemma is not found for a token, then it is returned from <code>lemmatize()</code> unchanged. To ensure these unhandled words don’t contribute spurious hapaxes, we pass our lemmatized tokens through the word stemmer for good measure (which also filters the list down to only hapaxes).</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>As noted above, the tags returned by <code>pos_tag()</code> are <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">Penn Treebank style tags</a> while the WordNet lemmatizer uses its own tag set (defined in the <code>nltk.corpus.reader.wordnet</code> module, though that is not very clear from the NLTK documentation). The <code>pt_to_wn()</code> function converts Treebank tags to the tags required for lemmatization:</p>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 178-209]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">pt_to_wn</span>(pos):
<span class="docstring"><span class="delimiter">"""</span><span class="content">
</span><span class="content"> Takes a Penn Treebank tag and converts it to an</span><span class="content">
</span><span class="content"> appropriate WordNet equivalent for lemmatization.</span><span class="content">
</span><span class="content">
</span><span class="content"> A list of Penn Treebank tags is available at:</span><span class="content">
</span><span class="content"> https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html</span><span class="content">
</span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="keyword">from</span> <span class="include">nltk.corpus.reader.wordnet</span> <span class="keyword">import</span> <span class="include">NOUN</span>, <span class="include">VERB</span>, <span class="include">ADJ</span>, <span class="include">ADV</span>
pos = pos.lower()
<span class="keyword">if</span> pos.startswith(<span class="string"><span class="delimiter">'</span><span class="content">jj</span><span class="delimiter">'</span></span>):
tag = ADJ
<span class="keyword">elif</span> pos == <span class="string"><span class="delimiter">'</span><span class="content">md</span><span class="delimiter">'</span></span>:
<span class="comment"># Modal auxiliary verbs</span>
tag = VERB
<span class="keyword">elif</span> pos.startswith(<span class="string"><span class="delimiter">'</span><span class="content">rb</span><span class="delimiter">'</span></span>):
tag = ADV
<span class="keyword">elif</span> pos.startswith(<span class="string"><span class="delimiter">'</span><span class="content">vb</span><span class="delimiter">'</span></span>):
tag = VERB
<span class="keyword">elif</span> pos == <span class="string"><span class="delimiter">'</span><span class="content">wrb</span><span class="delimiter">'</span></span>:
<span class="comment"># Wh-adverb (how, however, whence, whenever...)</span>
tag = ADV
<span class="keyword">else</span>:
<span class="comment"># default to NOUN</span>
<span class="comment"># This is not strictly correct, but it is good</span>
<span class="comment"># enough for lemmatization.</span>
tag = NOUN
<span class="keyword">return</span> tag</code></pre>
</div>
</div>
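<div class="paragraph">
<p>A couple of interactive examples of the conversion (the WordNet tags are just the one-letter strings <code>'n'</code>, <code>'v'</code>, <code>'a'</code>, and <code>'r'</code>):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python">>>> pt_to_wn('VBD')  # past-tense verb
'v'
>>> pt_to_wn('JJR')  # comparative adjective
'a'
>>> pt_to_wn('NNS')  # plural noun falls through to the default
'n'</code></pre>
</div>
</div>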
</div>
<div class="sect2">
<h3 id="_finding_hapaxes_with_spacy">Finding hapaxes with spaCy</h3>
<div class="paragraph">
<p>Unlike the NLTK API, spaCy is designed to tokenize, parse, and tag a text all by calling the single function returned by <code>spacy.load()</code>. The spaCy parser returns a ‘document’ object which contains all the tokens, their lemmas, etc. According to the spaCy documentation, “Lemmatization is performed using the WordNet data, but extended to also cover closed-class words such as pronouns.” The function below shows how to find the lemma hapaxes in a spaCy document.</p>
</div>
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
spaCy’s models load quite a bit of data from disk, which can make script startup slow; this makes spaCy more suitable for long-running programs than for one-off scripts like ours.
</td>
</tr>
</table>
</div>
<div class="listingblock wide">
<div class="title"> [hapaxes.py: 211-234]</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">def</span> <span class="function">spacy_hapaxes</span>(rawtext):
<span class="docstring"><span class="delimiter">"""</span><span class="content">
</span><span class="content"> Takes plain text and returns a list of lemma hapaxes using</span><span class="content">
</span><span class="content"> the spaCy NLP package.</span><span class="content">
</span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="keyword">if</span> <span class="keyword">not</span> spacy:
<span class="comment"># Only run if spaCy is installed</span>
<span class="keyword">return</span> <span class="predefined-constant">None</span>
<span class="comment"># Load the English spaCy parser</span>
spacy_parse = spacy.load(<span class="string"><span class="delimiter">'</span><span class="content">en_core_web_sm</span><span class="delimiter">'</span></span>)
<span class="comment"># Tokenize, parse, and tag text:</span>
doc = spacy_parse(rawtext)
lemmas = [token.lemma_ <span class="keyword">for</span> token <span class="keyword">in</span> doc
<span class="keyword">if</span> <span class="keyword">not</span> token.is_punct <span class="keyword">and</span> <span class="keyword">not</span> token.is_space] <i class="conum" data-value="1"></i><b>(1)</b>
<span class="comment"># Now we can get a count of every lemma:</span>
counts = Counter(lemmas) <i class="conum" data-value="2"></i><b>(2)</b>
<span class="comment"># We are interested in lemmas which appear only once</span>
hapaxes = [lemma <span class="keyword">for</span> lemma <span class="keyword">in</span> counts <span class="keyword">if</span> counts[lemma] == <span class="integer">1</span>]
<span class="keyword">return</span> hapaxes</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>This list comprehension collects the lemma form (<code>token.lemma_</code>) of all tokens in the spaCy document which are not punctuation (<code>token.is_punct</code>) or whitespace (<code>token.is_space</code>).</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>An alternative way to do this would be to first get a count of lemmas using the <code><a href="https://spacy.io/docs/api/doc#count_by">count_by()</a></code> method of a spaCy document, and then filtering out punctuation if desired: <code>counts = doc.count_by(spacy.attrs.LEMMA)</code> (but then you’d have to map the resulting attributes (integers) back to words by looping over the tokens and checking their <code>orth</code> attribute).</td>
</tr>
</table>
</div>
</div>
<div class="sect2">
<h3 id="_make_it_a_script">Make it a script</h3>
<div class="paragraph">
<p>You can play with the functions we’ve defined above by typing (or copy-and-pasting) them into an interactive Python session. If we save them all to a file, then that file is a Python module which we can <code>import</code> and use in a Python script. To use a single file as both a module and a script, our file can include a construct like this:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="keyword">if</span> __name__ == <span class="string"><span class="delimiter">"</span><span class="content">__main__</span><span class="delimiter">"</span></span>:
<span class="comment"># our script logic here</span></code></pre>
</div>
</div>
<div class="paragraph">
<p>This works because when the Python interpreter executes a script (as opposed to importing a module), it sets the top-level variable <code>__name__</code> equal to the string <code>"__main__"</code> (see also: <a href="https://stackoverflow.com/questions/419163/what-does-if-name-main-do">What does if __name__ == “__main__”: do?</a>).</p>
</div>
<div class="paragraph">
<p>In our case, our script logic consists of reading any input files if given, running all of our hapax functions, then collecting and displaying the output. To see how it is done, scroll down to the full program listing below.</p>
</div>
<div class="sect3">
<h4 id="_running_it">Running it</h4>
<div class="paragraph">
<p>To run the script, first download and save <a href="hapaxes.py">hapaxes.py</a>. Then:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh">$ python hapaxes.py</code></pre>
</div>
</div>
<div class="paragraph">
<p>Depending on which NLP packages you have installed, you should see output like:</p>
</div>
<div class="listingblock">
<div class="content">
<pre> Count
Wordforms 9
NLTK-stems 8
NLTK-lemmas 8
spaCy 8
-- Hapaxes --
Wordforms: careful, cautious, corrupt, if, not, now, see, that, you
NLTK-stems: care, cautious, if, not, now, see, that, you
NLTK-lemmas: care, cautious, if, not, now, see, that, you
spaCy: careful, cautious, if, not, now, see, that, you</pre>
</div>
</div>
<div class="paragraph">
<p>Try also running the script on an arbitrary input file:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="sh">$ python hapaxes.py somefilename
# run it on itself and note that
# source code doesn't give great results:
$ python hapaxes.py hapaxes.py</code></pre>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_hapaxes_py_listing">hapaxes.py listing</h2>
<div class="sectionbody">
<div class="paragraph">
<p>The entire script is listed below and available at <a href="hapaxes.py">hapaxes.py</a>.</p>
</div>
<div class="listingblock wide">
<div class="title">hapaxes.py</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="python"><span class="line-numbers"> 1</span><span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers"> 2</span><span class="docstring"><span class="content"></span><span class="content">A sample script/module which demonstrates how to count hapaxes (tokens which</span><span class="content"></span></span>
<span class="line-numbers"> 3</span><span class="docstring"><span class="content"></span><span class="content">appear only once) in an untagged text corpus using plain python, NLTK, and</span><span class="content"></span></span>
<span class="line-numbers"> 4</span><span class="docstring"><span class="content"></span><span class="content">spaCy. It counts and lists hapaxes in five different ways:</span><span class="content"></span></span>
<span class="line-numbers"> 5</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 6</span><span class="docstring"><span class="content"></span><span class="content"> * Wordforms - counts unique spellings (normalized for case). This uses</span><span class="content"></span></span>
<span class="line-numbers"> 7</span><span class="docstring"><span class="content"></span><span class="content"> plain Python (no NLTK required)</span><span class="content"></span></span>
<span class="line-numbers"> 8</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 9</span><span class="docstring"><span class="content"></span><span class="content"> * NLTK stems - counts unique stems using a stemmer provided by NLTK</span><span class="content"></span></span>
<span class="line-numbers"> 10</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 11</span><span class="docstring"><span class="content"></span><span class="content"> * NLTK lemmas - counts unique lemma forms using NLTK's part of speech</span><span class="content"></span></span>
<span class="line-numbers"> 12</span><span class="docstring"><span class="content"></span><span class="content"> * tagger and interface to the WordNet lemmatizer.</span><span class="content"></span></span>
<span class="line-numbers"> 13</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 14</span><span class="docstring"><span class="content"></span><span class="content"> * spaCy lemmas - counts unique lemma forms using the spaCy NLP module.</span><span class="content"></span></span>
<span class="line-numbers"> 15</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 16</span><span class="docstring"><span class="content"></span><span class="content">Each of the NLP modules (nltk, spaCy) are optional; if one is not</span><span class="content"></span></span>
<span class="line-numbers"> 17</span><span class="docstring"><span class="content"></span><span class="content">installed then its respective hapax-counting method will not be run.</span><span class="content"></span></span>
<span class="line-numbers"> 18</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 19</span><span class="docstring"><span class="content"></span><span class="content">Usage:</span><span class="content"></span></span>
<span class="line-numbers"> 20</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 21</span><span class="docstring"><span class="content"></span><span class="content"> python hapaxes.py [file]</span><span class="content"></span></span>
<span class="line-numbers"> 22</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 23</span><span class="docstring"><span class="content"></span><span class="content">If 'file' is given, its contents are read and used as the text in which to</span><span class="content"></span></span>
<span class="line-numbers"> 24</span><span class="docstring"><span class="content"></span><span class="content">find hapaxes. If 'file' is omitted, then a test text will be used.</span><span class="content"></span></span>
<span class="line-numbers"> 25</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 26</span><span class="docstring"><span class="content"></span><span class="content">Example:</span><span class="content"></span></span>
<span class="line-numbers"> 27</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 28</span><span class="docstring"><span class="content"></span><span class="content">Running this script with no arguments:</span><span class="content"></span></span>
<span class="line-numbers"> 29</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 30</span><span class="docstring"><span class="content"></span><span class="content"> python hapaxes.py</span><span class="content"></span></span>
<span class="line-numbers"> 31</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 32</span><span class="docstring"><span class="content"></span><span class="content">Will process this text:</span><span class="content"></span></span>
<span class="line-numbers"> 33</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 34</span><span class="docstring"><span class="content"></span><span class="content"> Cory Linguist, a cautious corpus linguist, in creating a corpus of</span><span class="content"></span></span>
<span class="line-numbers"> 35</span><span class="docstring"><span class="content"></span><span class="content"> courtship correspondence, corrupted a crucial link. Now, if Cory Linguist,</span><span class="content"></span></span>
<span class="line-numbers"> 36</span><span class="docstring"><span class="content"></span><span class="content"> a careful corpus linguist, in creating a corpus of courtship</span><span class="content"></span></span>
<span class="line-numbers"> 37</span><span class="docstring"><span class="content"></span><span class="content"> correspondence, corrupted a crucial link, see that YOU, in creating a</span><span class="content"></span></span>
<span class="line-numbers"> 38</span><span class="docstring"><span class="content"></span><span class="content"> corpus of courtship correspondence, corrupt not a crucial link.</span><span class="content"></span></span>
<span class="line-numbers"> 39</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 40</span><span class="docstring"><span class="content"></span><span class="content">And produce this output:</span><span class="content"></span></span>
<span class="line-numbers"> 41</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 42</span><span class="docstring"><span class="content"></span><span class="content"> Count</span><span class="content"></span></span>
<span class="line-numbers"> 43</span><span class="docstring"><span class="content"></span><span class="content"> Wordforms 9</span><span class="content"></span></span>
<span class="line-numbers"> 44</span><span class="docstring"><span class="content"></span><span class="content"> Stems 8</span><span class="content"></span></span>
<span class="line-numbers"> 45</span><span class="docstring"><span class="content"></span><span class="content"> Lemmas 8</span><span class="content"></span></span>
<span class="line-numbers"> 46</span><span class="docstring"><span class="content"></span><span class="content"> spaCy 8</span><span class="content"></span></span>
<span class="line-numbers"> 47</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 48</span><span class="docstring"><span class="content"></span><span class="content"> -- Hapaxes --</span><span class="content"></span></span>
<span class="line-numbers"> 49</span><span class="docstring"><span class="content"></span><span class="content"> Wordforms: careful, cautious, corrupt, if, not, now, see, that, you</span><span class="content"></span></span>
<span class="line-numbers"> 50</span><span class="docstring"><span class="content"></span><span class="content"> NLTK-stems: care, cautious, if, not, now, see, that, you</span><span class="content"></span></span>
<span class="line-numbers"> 51</span><span class="docstring"><span class="content"></span><span class="content"> NLTK-lemmas: care, cautious, if, not, now, see, that, you</span><span class="content"></span></span>
<span class="line-numbers"> 52</span><span class="docstring"><span class="content"></span><span class="content"> spaCy: careful, cautious, if, not, now, see, that, you</span><span class="content"></span></span>
<span class="line-numbers"> 53</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 54</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 55</span><span class="docstring"><span class="content"></span><span class="content">Notice that the stems and lemmas methods do not count "corrupt" as a hapax</span><span class="content"></span></span>
<span class="line-numbers"> 56</span><span class="docstring"><span class="content"></span><span class="content">because it also occurs as "corrupted". Notice also that "Linguist" is not</span><span class="content"></span></span>
<span class="line-numbers"> 57</span><span class="docstring"><span class="content"></span><span class="content">counted as the text is normalized for case.</span><span class="content"></span></span>
<span class="line-numbers"> 58</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers"> 59</span><span class="docstring"><span class="content"></span><span class="content">See also the Wikipedia entry on "Hapex legomenon"</span><span class="content"></span></span>
<span class="line-numbers"> 60</span><span class="docstring"><span class="content"></span><span class="content">(https://en.wikipedia.org/wiki/Hapax_legomenon)</span><span class="content"></span></span>
<span class="line-numbers"> 61</span><span class="docstring"><span class="content"></span><span class="delimiter">"""</span></span>
<span class="line-numbers"> 62</span>
<span class="line-numbers"> 63</span><span class="comment">### Imports</span>
<span class="line-numbers"> 64</span><span class="comment">#</span>
<span class="line-numbers"> 65</span><span class="comment"># Import some Python 3 features to use in Python 2</span>
<span class="line-numbers"> 66</span><span class="keyword">from</span> <span class="include">__future__</span> <span class="keyword">import</span> <span class="include">print_function</span>
<span class="line-numbers"> 67</span><span class="keyword">from</span> <span class="include">__future__</span> <span class="keyword">import</span> <span class="include">unicode_literals</span>
<span class="line-numbers"> 68</span>
<span class="line-numbers"> 69</span><span class="comment"># gives us access to command-line arguments</span>
<span class="line-numbers"> 70</span><span class="keyword">import</span> <span class="include">sys</span>
<span class="line-numbers"> 71</span>
<span class="line-numbers"> 72</span><span class="comment"># The Counter collection is a convenient layer on top of</span>
<span class="line-numbers"> 73</span><span class="comment"># python's standard dictionary type for counting iterables.</span>
<span class="line-numbers"> 74</span><span class="keyword">from</span> <span class="include">collections</span> <span class="keyword">import</span> <span class="include">Counter</span>
<span class="line-numbers"> 75</span>
<span class="line-numbers"> 76</span><span class="comment"># The standard python regular expression module:</span>
<span class="line-numbers"> 77</span><span class="keyword">import</span> <span class="include">re</span>
<span class="line-numbers"> 78</span>
<span class="line-numbers"> 79</span><span class="keyword">try</span>:
<span class="line-numbers"> 80</span> <span class="comment"># Import NLTK if it is installed</span>
<span class="line-numbers"> 81</span> <span class="keyword">import</span> <span class="include">nltk</span>
<span class="line-numbers"> 82</span>
<span class="line-numbers"> 83</span> <span class="comment"># This imports NLTK's implementation of the Snowball</span>
<span class="line-numbers"> 84</span> <span class="comment"># stemmer algorithm</span>
<span class="line-numbers"> 85</span> <span class="keyword">from</span> <span class="include">nltk.stem.snowball</span> <span class="keyword">import</span> <span class="include">SnowballStemmer</span>
<span class="line-numbers"> 86</span>
<span class="line-numbers"> 87</span> <span class="comment"># NLTK's interface to the WordNet lemmatizer</span>
<span class="line-numbers"> 88</span> <span class="keyword">from</span> <span class="include">nltk.stem.wordnet</span> <span class="keyword">import</span> <span class="include">WordNetLemmatizer</span>
<span class="line-numbers"> 89</span><span class="keyword">except</span> <span class="exception">ImportError</span>:
<span class="line-numbers"> 90</span> nltk = <span class="predefined-constant">None</span>
<span class="line-numbers"> 91</span> print(<span class="string"><span class="delimiter">"</span><span class="content">NLTK is not installed, so we won't use it.</span><span class="delimiter">"</span></span>)
<span class="line-numbers"> 92</span>
<span class="line-numbers"> 93</span><span class="keyword">try</span>:
<span class="line-numbers"> 94</span> <span class="comment"># Import spaCy if it is installed</span>
<span class="line-numbers"> 95</span> <span class="keyword">import</span> <span class="include">spacy</span>
<span class="line-numbers"> 96</span><span class="keyword">except</span> <span class="exception">ImportError</span>:
<span class="line-numbers"> 97</span> spacy = <span class="predefined-constant">None</span>
<span class="line-numbers"> 98</span> print(<span class="string"><span class="delimiter">"</span><span class="content">spaCy is not installed, so we won't use it.</span><span class="delimiter">"</span></span>)
<span class="line-numbers"> 99</span>
<span class="line-numbers">100</span><span class="keyword">def</span> <span class="function">normalize_tokenize</span>(string):
<span class="line-numbers">101</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">102</span><span class="docstring"><span class="content"></span><span class="content"> Takes a string, normalizes it (makes it lowercase and</span><span class="content"></span></span>
<span class="line-numbers">103</span><span class="docstring"><span class="content"></span><span class="content"> removes punctuation), and then splits it into a list of</span><span class="content"></span></span>
<span class="line-numbers">104</span><span class="docstring"><span class="content"></span><span class="content"> words.</span><span class="content"></span></span>
<span class="line-numbers">105</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers">106</span><span class="docstring"><span class="content"></span><span class="content"> Note that everything in this function is plain Python</span><span class="content"></span></span>
<span class="line-numbers">107</span><span class="docstring"><span class="content"></span><span class="content"> without using NLTK (although as noted below, NLTK provides</span><span class="content"></span></span>
<span class="line-numbers">108</span><span class="docstring"><span class="content"></span><span class="content"> some more sophisticated tokenizers we could have used).</span><span class="content"></span></span>
<span class="line-numbers">109</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">110</span> <span class="comment"># make lowercase</span>
<span class="line-numbers">111</span> norm = string.lower()
<span class="line-numbers">112</span>
<span class="line-numbers">113</span> <span class="comment"># remove punctuation</span>
<span class="line-numbers">114</span> norm = re.sub(<span class="string"><span class="modifier">r</span><span class="delimiter">'</span><span class="content">(?u)[^</span><span class="content">\w</span><span class="content">\s</span><span class="content">]</span><span class="delimiter">'</span></span>, <span class="string"><span class="delimiter">'</span><span class="delimiter">'</span></span>, norm) <span class="comment"># <1></span>
<span class="line-numbers">115</span>
<span class="line-numbers">116</span> <span class="comment"># split into words</span>
<span class="line-numbers">117</span> tokens = norm.split()
<span class="line-numbers">118</span>
<span class="line-numbers">119</span> <span class="keyword">return</span> tokens
<span class="line-numbers">120</span>
<span class="line-numbers">121</span><span class="keyword">def</span> <span class="function">word_form_hapaxes</span>(tokens):
<span class="line-numbers">122</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">123</span><span class="docstring"><span class="content"></span><span class="content"> Takes a list of tokens and returns a list of the</span><span class="content"></span></span>
<span class="line-numbers">124</span><span class="docstring"><span class="content"></span><span class="content"> wordform hapaxes (those wordforms that only appear once)</span><span class="content"></span></span>
<span class="line-numbers">125</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers">126</span><span class="docstring"><span class="content"></span><span class="content"> For wordforms this is simple enough to do in plain</span><span class="content"></span></span>
<span class="line-numbers">127</span><span class="docstring"><span class="content"></span><span class="content"> Python without an NLP package, especially using the Counter</span><span class="content"></span></span>
<span class="line-numbers">128</span><span class="docstring"><span class="content"></span><span class="content"> type from the collections module (part of the Python</span><span class="content"></span></span>
<span class="line-numbers">129</span><span class="docstring"><span class="content"></span><span class="content"> standard library).</span><span class="content"></span></span>
<span class="line-numbers">130</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">131</span>
<span class="line-numbers">132</span> counts = Counter(tokens) <span class="comment"># <1></span>
<span class="line-numbers">133</span> hapaxes = [word <span class="keyword">for</span> word <span class="keyword">in</span> counts <span class="keyword">if</span> counts[word] == <span class="integer">1</span>] <span class="comment"># <2></span>
<span class="line-numbers">134</span>
<span class="line-numbers">135</span> <span class="keyword">return</span> hapaxes
<span class="line-numbers">136</span>
<span class="line-numbers">137</span><span class="keyword">def</span> <span class="function">nltk_stem_hapaxes</span>(tokens):
<span class="line-numbers">138</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">139</span><span class="docstring"><span class="content"></span><span class="content"> Takes a list of tokens and returns a list of the word</span><span class="content"></span></span>
<span class="line-numbers">140</span><span class="docstring"><span class="content"></span><span class="content"> stem hapaxes.</span><span class="content"></span></span>
<span class="line-numbers">141</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">142</span> <span class="keyword">if</span> <span class="keyword">not</span> nltk: <span class="comment"># <1></span>
<span class="line-numbers">143</span> <span class="comment"># Only run if NLTK is loaded</span>
<span class="line-numbers">144</span> <span class="keyword">return</span> <span class="predefined-constant">None</span>
<span class="line-numbers">145</span>
<span class="line-numbers">146</span> <span class="comment"># Apply NLTK's Snowball stemmer algorithm to tokens:</span>
<span class="line-numbers">147</span> stemmer = SnowballStemmer(<span class="string"><span class="delimiter">"</span><span class="content">english</span><span class="delimiter">"</span></span>)
<span class="line-numbers">148</span> stems = [stemmer.stem(token) <span class="keyword">for</span> token <span class="keyword">in</span> tokens]
<span class="line-numbers">149</span>
<span class="line-numbers">150</span> <span class="comment"># Filter down to hapaxes:</span>
<span class="line-numbers">151</span> counts = nltk.FreqDist(stems) <span class="comment"># <2></span>
<span class="line-numbers">152</span> hapaxes = counts.hapaxes() <span class="comment"># <3></span>
<span class="line-numbers">153</span> <span class="keyword">return</span> hapaxes
<span class="line-numbers">154</span>
<span class="line-numbers">155</span><span class="keyword">def</span> <span class="function">nltk_lemma_hapaxes</span>(tokens):
<span class="line-numbers">156</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">157</span><span class="docstring"><span class="content"></span><span class="content"> Takes a list of tokens and returns a list of the lemma</span><span class="content"></span></span>
<span class="line-numbers">158</span><span class="docstring"><span class="content"></span><span class="content"> hapaxes.</span><span class="content"></span></span>
<span class="line-numbers">159</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">160</span> <span class="keyword">if</span> <span class="keyword">not</span> nltk:
<span class="line-numbers">161</span> <span class="comment"># Only run if NLTK is loaded</span>
<span class="line-numbers">162</span> <span class="keyword">return</span> <span class="predefined-constant">None</span>
<span class="line-numbers">163</span>
<span class="line-numbers">164</span> <span class="comment"># Tag tokens with part-of-speech:</span>
<span class="line-numbers">165</span> tagged = nltk.pos_tag(tokens) <span class="comment"># <1></span>
<span class="line-numbers">166</span>
<span class="line-numbers">167</span> <span class="comment"># Convert our Treebank-style tags to WordNet-style tags.</span>
<span class="line-numbers">168</span> tagged = [(word, pt_to_wn(tag))
<span class="line-numbers">169</span> <span class="keyword">for</span> (word, tag) <span class="keyword">in</span> tagged] <span class="comment"># <2></span>
<span class="line-numbers">170</span>
<span class="line-numbers">171</span> <span class="comment"># Lemmatize:</span>
<span class="line-numbers">172</span> lemmer = WordNetLemmatizer()
<span class="line-numbers">173</span> lemmas = [lemmer.lemmatize(token, pos)
<span class="line-numbers">174</span> <span class="keyword">for</span> (token, pos) <span class="keyword">in</span> tagged] <span class="comment"># <3></span>
<span class="line-numbers">175</span>
<span class="line-numbers">176</span> <span class="keyword">return</span> nltk_stem_hapaxes(lemmas) <span class="comment"># <4></span>
<span class="line-numbers">177</span>
<span class="line-numbers">178</span><span class="keyword">def</span> <span class="function">pt_to_wn</span>(pos):
<span class="line-numbers">179</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">180</span><span class="docstring"><span class="content"></span><span class="content"> Takes a Penn Treebank tag and converts it to an</span><span class="content"></span></span>
<span class="line-numbers">181</span><span class="docstring"><span class="content"></span><span class="content"> appropriate WordNet equivalent for lemmatization.</span><span class="content"></span></span>
<span class="line-numbers">182</span><span class="docstring"><span class="content"></span><span class="content"></span></span>
<span class="line-numbers">183</span><span class="docstring"><span class="content"></span><span class="content"> A list of Penn Treebank tags is available at:</span><span class="content"></span></span>
<span class="line-numbers">184</span><span class="docstring"><span class="content"></span><span class="content"> https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html</span><span class="content"></span></span>
<span class="line-numbers">185</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">186</span>
<span class="line-numbers">187</span> <span class="keyword">from</span> <span class="include">nltk.corpus.reader.wordnet</span> <span class="keyword">import</span> <span class="include">NOUN</span>, <span class="include">VERB</span>, <span class="include">ADJ</span>, <span class="include">ADV</span>
<span class="line-numbers">188</span>
<span class="line-numbers">189</span> pos = pos.lower()
<span class="line-numbers">190</span>
<span class="line-numbers">191</span> <span class="keyword">if</span> pos.startswith(<span class="string"><span class="delimiter">'</span><span class="content">jj</span><span class="delimiter">'</span></span>):
<span class="line-numbers">192</span> tag = ADJ
<span class="line-numbers">193</span> <span class="keyword">elif</span> pos == <span class="string"><span class="delimiter">'</span><span class="content">md</span><span class="delimiter">'</span></span>:
<span class="line-numbers">194</span> <span class="comment"># Modal auxiliary verbs</span>
<span class="line-numbers">195</span> tag = VERB
<span class="line-numbers">196</span> <span class="keyword">elif</span> pos.startswith(<span class="string"><span class="delimiter">'</span><span class="content">rb</span><span class="delimiter">'</span></span>):
<span class="line-numbers">197</span> tag = ADV
<span class="line-numbers">198</span> <span class="keyword">elif</span> pos.startswith(<span class="string"><span class="delimiter">'</span><span class="content">vb</span><span class="delimiter">'</span></span>):
<span class="line-numbers">199</span> tag = VERB
<span class="line-numbers">200</span> <span class="keyword">elif</span> pos == <span class="string"><span class="delimiter">'</span><span class="content">wrb</span><span class="delimiter">'</span></span>:
<span class="line-numbers">201</span> <span class="comment"># Wh-adverb (how, however, whence, whenever...)</span>
<span class="line-numbers">202</span> tag = ADV
<span class="line-numbers">203</span> <span class="keyword">else</span>:
<span class="line-numbers">204</span> <span class="comment"># default to NOUN</span>
<span class="line-numbers">205</span> <span class="comment"># This is not strictly correct, but it is good</span>
<span class="line-numbers">206</span> <span class="comment"># enough for lemmatization.</span>
<span class="line-numbers">207</span> tag = NOUN
<span class="line-numbers">208</span>
<span class="line-numbers">209</span> <span class="keyword">return</span> tag
<span class="line-numbers">210</span>
<span class="line-numbers">211</span><span class="keyword">def</span> <span class="function">spacy_hapaxes</span>(rawtext):
<span class="line-numbers">212</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">213</span><span class="docstring"><span class="content"></span><span class="content"> Takes plain text and returns a list of lemma hapaxes using</span><span class="content"></span></span>
<span class="line-numbers">214</span><span class="docstring"><span class="content"></span><span class="content"> the spaCy NLP package.</span><span class="content"></span></span>
<span class="line-numbers">215</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">216</span> <span class="keyword">if</span> <span class="keyword">not</span> spacy:
<span class="line-numbers">217</span> <span class="comment"># Only run if spaCy is installed</span>
<span class="line-numbers">218</span> <span class="keyword">return</span> <span class="predefined-constant">None</span>
<span class="line-numbers">219</span>
<span class="line-numbers">220</span> <span class="comment"># Load the English spaCy parser</span>
<span class="line-numbers">221</span> spacy_parse = spacy.load(<span class="string"><span class="delimiter">'</span><span class="content">en_core_web_sm</span><span class="delimiter">'</span></span>)
<span class="line-numbers">222</span>
<span class="line-numbers">223</span> <span class="comment"># Tokenize, parse, and tag text:</span>
<span class="line-numbers">224</span> doc = spacy_parse(rawtext)
<span class="line-numbers">225</span>
<span class="line-numbers">226</span> lemmas = [token.lemma_ <span class="keyword">for</span> token <span class="keyword">in</span> doc
<span class="line-numbers">227</span> <span class="keyword">if</span> <span class="keyword">not</span> token.is_punct <span class="keyword">and</span> <span class="keyword">not</span> token.is_space] <span class="comment"># <1></span>
<span class="line-numbers">228</span>
<span class="line-numbers">229</span> <span class="comment"># Now we can get a count of every lemma:</span>
<span class="line-numbers">230</span> counts = Counter(lemmas) <span class="comment"># <2></span>
<span class="line-numbers">231</span>
<span class="line-numbers">232</span> <span class="comment"># We are interested in lemmas which appear only once</span>
<span class="line-numbers">233</span> hapaxes = [lemma <span class="keyword">for</span> lemma <span class="keyword">in</span> counts <span class="keyword">if</span> counts[lemma] == <span class="integer">1</span>]
<span class="line-numbers">234</span> <span class="keyword">return</span> hapaxes
<span class="line-numbers">235</span>
<span class="line-numbers">236</span><span class="keyword">if</span> __name__ == <span class="string"><span class="delimiter">"</span><span class="content">__main__</span><span class="delimiter">"</span></span>:
<span class="line-numbers">237</span> <span class="docstring"><span class="delimiter">"""</span><span class="content"></span></span>
<span class="line-numbers">238</span><span class="docstring"><span class="content"></span><span class="content"> The code in this block is run when this file is executed as a script (but</span><span class="content"></span></span>
<span class="line-numbers">239</span><span class="docstring"><span class="content"></span><span class="content"> not if it is imported as a module by another Python script).</span><span class="content"></span></span>
<span class="line-numbers">240</span><span class="docstring"><span class="content"></span><span class="content"> </span><span class="delimiter">"""</span></span>
<span class="line-numbers">241</span>
<span class="line-numbers">242</span> <span class="comment"># If no file is provided, then use this sample text:</span>
<span class="line-numbers">243</span> text = <span class="string"><span class="delimiter">"""</span><span class="content">Cory Linguist, a cautious corpus linguist, in creating a</span><span class="content"></span></span>
<span class="line-numbers">244</span><span class="string"><span class="content"></span><span class="content"> corpus of courtship correspondence, corrupted a crucial link. Now, if Cory</span><span class="content"></span></span>
<span class="line-numbers">245</span><span class="string"><span class="content"></span><span class="content"> Linguist, a careful corpus linguist, in creating a corpus of courtship</span><span class="content"></span></span>
<span class="line-numbers">246</span><span class="string"><span class="content"></span><span class="content"> correspondence, corrupted a crucial link, see that YOU, in creating a</span><span class="content"></span></span>
<span class="line-numbers">247</span><span class="string"><span class="content"></span><span class="content"> corpus of courtship correspondence, corrupt not a crucial link.</span><span class="delimiter">"""</span></span>
<span class="line-numbers">248</span>
<span class="line-numbers">249</span> <span class="keyword">if</span> <span class="predefined">len</span>(sys.argv) > <span class="integer">1</span>:
<span class="line-numbers">250</span> <span class="comment"># We got at least one command-line argument. We'll ignore all but the</span>
<span class="line-numbers">251</span> <span class="comment"># first.</span>
<span class="line-numbers">252</span> <span class="keyword">with</span> <span class="predefined">open</span>(sys.argv[<span class="integer">1</span>], <span class="string"><span class="delimiter">'</span><span class="content">r</span><span class="delimiter">'</span></span>) <span class="keyword">as</span> <span class="predefined">file</span>:
<span class="line-numbers">253</span> text = <span class="predefined">file</span>.read()
<span class="line-numbers">254</span> <span class="keyword">try</span>:
<span class="line-numbers">255</span> <span class="comment"># in Python 2 we need a unicode string</span>
<span class="line-numbers">256</span> text = <span class="predefined">unicode</span>(text)
<span class="line-numbers">257</span> <span class="keyword">except</span>:
<span class="line-numbers">258</span> <span class="comment"># in Python 3 'unicode()' is not defined</span>
<span class="line-numbers">259</span> <span class="comment"># we don't have to do anything</span>
<span class="line-numbers">260</span> <span class="keyword">pass</span>
<span class="line-numbers">261</span>
<span class="line-numbers">262</span> <span class="comment"># tokenize the text (break into words)</span>
<span class="line-numbers">263</span> tokens = normalize_tokenize(text)
<span class="line-numbers">264</span>
<span class="line-numbers">265</span> <span class="comment"># Get hapaxes based on wordforms, stems, and lemmas:</span>
<span class="line-numbers">266</span> wfs = word_form_hapaxes(tokens)
<span class="line-numbers">267</span> stems = nltk_stem_hapaxes(tokens)
<span class="line-numbers">268</span> lemmas = nltk_lemma_hapaxes(tokens)
<span class="line-numbers">269</span> spacy_lems = spacy_hapaxes(text)
<span class="line-numbers">270</span>
<span class="line-numbers">271</span> <span class="comment"># Print count table and list of hapaxes:</span>
<span class="line-numbers">272</span> row_labels = [<span class="string"><span class="delimiter">"</span><span class="content">Wordforms</span><span class="delimiter">"</span></span>]
<span class="line-numbers">273</span> row_data = [wfs]
<span class="line-numbers">274</span>
<span class="line-numbers">275</span> <span class="comment"># only add NLTK data if it is installed</span>
<span class="line-numbers">276</span> <span class="keyword">if</span> nltk:
<span class="line-numbers">277</span> row_labels.extend([<span class="string"><span class="delimiter">"</span><span class="content">NLTK-stems</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">NLTK-lemmas</span><span class="delimiter">"</span></span>])
<span class="line-numbers">278</span> row_data.extend([stems, lemmas])
<span class="line-numbers">279</span>
<span class="line-numbers">280</span> <span class="comment"># only add spaCy data if it is installed:</span>
<span class="line-numbers">281</span> <span class="keyword">if</span> spacy_lems:
<span class="line-numbers">282</span> row_labels.append(<span class="string"><span class="delimiter">"</span><span class="content">spaCy</span><span class="delimiter">"</span></span>)
<span class="line-numbers">283</span> row_data.append(spacy_lems)
<span class="line-numbers">284</span>
<span class="line-numbers">285</span> <span class="comment"># sort happaxes for display</span>
<span class="line-numbers">286</span> row_date = [row.sort() <span class="keyword">for</span> row <span class="keyword">in</span> row_data]
<span class="line-numbers">287</span>
<span class="line-numbers">288</span> <span class="comment"># format and print output</span>
<span class="line-numbers">289</span> rows = <span class="predefined">zip</span>(row_labels, row_data)
<span class="line-numbers">290</span> row_fmt = <span class="string"><span class="delimiter">"</span><span class="content">{:>14}{:^8}</span><span class="delimiter">"</span></span>
<span class="line-numbers">291</span> print(<span class="string"><span class="delimiter">"</span><span class="char">\n</span><span class="delimiter">"</span></span>)
<span class="line-numbers">292</span> print(row_fmt.format(<span class="string"><span class="delimiter">"</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">Count</span><span class="delimiter">"</span></span>))
<span class="line-numbers">293</span> hapax_list = []
<span class="line-numbers">294</span> <span class="keyword">for</span> row <span class="keyword">in</span> rows:
<span class="line-numbers">295</span> print(row_fmt.format(row[<span class="integer">0</span>], <span class="predefined">len</span>(row[<span class="integer">1</span>])))
<span class="line-numbers">296</span> hapax_list += [<span class="string"><span class="delimiter">"</span><span class="content">{:<14}{:<68}</span><span class="delimiter">"</span></span>.format(row[<span class="integer">0</span>] + <span class="string"><span class="delimiter">"</span><span class="content">:</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">, </span><span class="delimiter">"</span></span>.join(row[<span class="integer">1</span>]))]
<span class="line-numbers">297</span>
<span class="line-numbers">298</span> print(<span class="string"><span class="delimiter">"</span><span class="char">\n</span><span class="content">-- Hapaxes --</span><span class="delimiter">"</span></span>)
<span class="line-numbers">299</span> <span class="keyword">for</span> row <span class="keyword">in</span> hapax_list:
<span class="line-numbers">300</span> print(row)
<span class="line-numbers">301</span> print(<span class="string"><span class="delimiter">"</span><span class="char">\n</span><span class="delimiter">"</span></span>)
<span class="line-numbers">302</span></code></pre>
</div>
</div>
</div>
</div>A tutorial on simple NLP tasks with Python which serves as an introduction to the NLTK and spaCy libraries.