Using Unicode with Sphinx, reStructuredText, and PDF Output

I’m working on updating my book, and besides actually writing the content, one of the things I have to do is generate new LaTeX files to deliver to the publisher. I’ve written about my toolchain elsewhere, so I won’t repeat all of that information here. The short version is that I use Paver to drive Sphinx, which converts reStructuredText input files to HTML for the website, LaTeX for the compositor at Pearson, and PDF for reviewers. Since the updated version covers Python 3, and one of the key benefits of Python 3 is better Unicode support, I want to include some characters outside of the normal ASCII set in my examples.

When you ask Sphinx’s latex builder to generate LaTeX output, the result is a directory with a *.tex file containing your content and the other files you need to convert that LaTeX to other formats, including a Makefile with instructions for building PDF and DVI. By default, that Makefile uses pdflatex to convert the *.tex files Sphinx writes to PDF. My article for the random module includes an example of shuffling a card deck. The Python 2 version used letters to represent the card suits, but for Python 3 I switched to Unicode symbols like those that appear on real cards. Making that work for HTML was easy, but the PDF output proved trickier.

The input file contains a section like this:

.. code-block:: none

    $ python3 random_shuffle.py

    Initial deck:
     2♥  2♦  2♣  2♠  3♥  3♦  3♣  3♠  4♥  4♦  4♣  4♠  5♥
     5♦  5♣  5♠  6♥  6♦  6♣  6♠  7♥  7♦  7♣  7♠  8♥  8♦
     8♣  8♠  9♥  9♦  9♣  9♠ 10♥ 10♦ 10♣ 10♠  J♥  J♦  J♣
     J♠  Q♥  Q♦  Q♣  Q♠  K♥  K♦  K♣  K♠  A♥  A♦  A♣  A♠
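
For reference, output shaped like the listing above can be produced with a few lines of Python. This is only a sketch and not the actual random_shuffle.py from the book; the names FACE_CARDS, SUITS, build_deck, and format_deck are assumptions made for illustration.

```python
import itertools

# Hypothetical names; the real random_shuffle.py may be organized differently.
FACE_CARDS = ('J', 'Q', 'K', 'A')
SUITS = ('♥', '♦', '♣', '♠')  # U+2665, U+2666, U+2663, U+2660


def build_deck():
    """Return the 52 cards as (rank, suit) tuples, in rank-major order."""
    ranks = [str(n) for n in range(2, 11)] + list(FACE_CARDS)
    return list(itertools.product(ranks, SUITS))


def format_deck(deck, per_line=13):
    """Format the deck as lines of right-aligned, space-separated cards."""
    cards = ['{:>3}'.format(rank + suit) for rank, suit in deck]
    return [
        ' '.join(cards[i:i + per_line])
        for i in range(0, len(cards), per_line)
    ]


if __name__ == '__main__':
    print('Initial deck:')
    for line in format_deck(build_deck()):
        print(line)
```

The suit characters are ordinary string data in Python 3, so nothing special is needed on the Python side; the trouble only starts once they reach LaTeX.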

While processing the *.tex input file, pdflatex gave me the error:

! Package inputenc Error: Unicode char ♥ (U+2665)
(inputenc)                not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...

l.8822 ...3♣  3♠  4♥  4♦  4♣  4♠  5♥

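The problem was those Unicode symbols: ♣, ♠, ♥, and ♦. The “not set up for use” error comes from the inputenc package rather than from the font: with the “utf8” option, inputenc only understands code points that have been declared to it, and nothing loaded in this document declares the card suits. Whether the font has glyphs for those code points does not come into play until that mapping exists.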
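
The U+2665 in the error message is the code point of the heart symbol. As a small sketch, the standard unicodedata module can list the code point and official name for each suit character, which makes it easier to match characters in the input against messages like this one. The helper name describe is an assumption for illustration.

```python
import unicodedata


def describe(chars):
    """Return 'U+XXXX NAME' strings for each character in chars."""
    return [
        'U+{:04X} {}'.format(ord(c), unicodedata.name(c))
        for c in chars
    ]


if __name__ == '__main__':
    for line in describe('♣♠♥♦'):
        print(line)
```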

The two solutions I found after searching for tips on using Unicode with LaTeX were either to use the “utf8x” option with inputenc or to switch to XeTeX for processing the input file. I tried “utf8x” first, adding this to the preamble passed from Sphinx to LaTeX, as Georg Brandl recommended in this Sphinx forum post:

latex_elements = {
    # Raw string, so Python 3 does not read "\u" in \usepackage as the
    # start of a Unicode escape.
    'preamble': r'''
\usepackage{pstricks}  % since the dash is rendered by pstricks!
\usepackage[postscript]{ucs}
\usepackage[utf8x]{inputenc}
''',
}

That produced a different error:

! LaTeX Error: Option clash for package inputenc.

Looking at the top of the *.tex file generated by Sphinx, I saw two sets of instructions for inputenc, on lines 7 and 34:

% Generated by Sphinx.
\def\sphinxdocclass{report}
\documentclass[letterpaper,10pt,english]{sphinxmanual}
\usepackage{iftex}

\ifPDFTeX
  \usepackage[utf8]{inputenc}
\fi
\ifdefined\DeclareUnicodeCharacter
  \DeclareUnicodeCharacter{00A0}{\nobreakspace}
\fi
\usepackage{cmap}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb,amstext}
\usepackage{babel}
\usepackage{times}
\usepackage[Bjarne]{fncychap}
\usepackage{longtable}
\usepackage{sphinx}
\usepackage{multirow}
\usepackage{eqparbox}


\addto\captionsenglish{\renewcommand{\figurename}{Fig.\@ }}
\addto\captionsenglish{\renewcommand{\tablename}{Table }}
\SetupFloatingEnvironment{literal-block}{name=Listing }

\addto\extrasenglish{\def\pageautorefname{page}}

\setcounter{tocdepth}{1}

\usepackage{pstricks}  % since the dash is rendered by pstricks!
\usepackage[postscript]{ucs}
\usepackage[utf8x]{inputenc}
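
Finding the clashing lines by eye gets tedious as the preamble grows. A rough sketch like the following scans LaTeX source for \usepackage commands that load a given package and reports their line numbers and options. The function name find_package_loads and the regular expression are assumptions for illustration; the pattern only handles simple single-line \usepackage commands, and it ignores conditionals such as \ifPDFTeX.

```python
import re

# Matches \usepackage[options]{pkg1,pkg2}; the options part is optional.
USEPACKAGE = re.compile(r'\\usepackage(?:\[([^\]]*)\])?\{([^}]*)\}')


def find_package_loads(tex_source, package):
    """Return (line_number, options) for each line loading package."""
    found = []
    for lineno, line in enumerate(tex_source.splitlines(), start=1):
        # Drop trailing comments (naive: does not handle an escaped \%).
        code = line.split('%', 1)[0]
        for options, names in USEPACKAGE.findall(code):
            if package in [n.strip() for n in names.split(',')]:
                found.append((lineno, options))
    return found


if __name__ == '__main__':
    sample = '\n'.join([
        r'\documentclass{sphinxmanual}',
        r'\usepackage[utf8]{inputenc}',
        r'\usepackage{amsmath,amssymb,amstext}',
        r'\usepackage[utf8x]{inputenc}',
    ])
    for lineno, options in find_package_loads(sample, 'inputenc'):
        print('line {}: [{}]'.format(lineno, options))
```

Two hits with different options for the same package are exactly the situation LaTeX reports as an option clash.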

Setting the inputenc value to an empty string in the Sphinx settings overrides the default, like this:

latex_elements = {
    'preamble': r'''
\usepackage{pstricks}  % since the dash is rendered by pstricks!
\usepackage[postscript]{ucs}
\usepackage[utf8x]{inputenc}
''',
    'inputenc': '',
}

That removed the duplicate instruction, but gave me another new error message:

! Undefined control sequence.
\u-postscript-9829 #1->\ding
                             {"AA}
l.8825 ...3♣  3♠  4♥  4♦  4♣  4♠  5♥

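It looks like pdflatex is still getting tripped up by the input characters: ucs maps ♥ to \ding, a command normally provided by the pifont package, which this preamble does not load.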

I’m running the MacTeX 2016 distribution, which comes with XeTeX, so after several folks online recommended it, I decided to give it a try. I removed the changes to my conf.py and, based on this article and an example referenced there, made a few other changes:

latex_elements = {
    'preamble': r'''
% Enable unicode and use Courier New to ensure the card suit
% characters that are part of the 'random' module examples
% appear properly in the PDF output.
\usepackage{fontspec}
\setmonofont{Courier New}
''',

    # disable font inclusion
    'fontpkg': '',
    'fontenc': '',

    # Fix Unicode handling by disabling the defaults for a few items
    # set by sphinx
    'inputenc': '',
    'utf8extra': '',
}

That gave me this header for the .tex file produced by Sphinx:

% Generated by Sphinx.
\def\sphinxdocclass{report}
\documentclass[letterpaper,10pt,english]{sphinxmanual}
\usepackage{iftex}



\usepackage{cmap}

\usepackage{amsmath,amssymb,amstext}
\usepackage{babel}

\usepackage[Bjarne]{fncychap}
\usepackage{longtable}
\usepackage{sphinx}
\usepackage{multirow}
\usepackage{eqparbox}


\addto\captionsenglish{\renewcommand{\figurename}{Fig.\@ }}
\addto\captionsenglish{\renewcommand{\tablename}{Table }}
\SetupFloatingEnvironment{literal-block}{name=Listing }

\addto\extrasenglish{\def\pageautorefname{page}}

\setcounter{tocdepth}{1}

% Enable unicode and use Courier New to ensure the card suit
% characters that are part of the 'random' module examples
% appear properly in the PDF output.
\usepackage{fontspec}
\setmonofont{Courier New}
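
The blank lines near the top of that header are where the overridden settings used to be. As a simplified model of what is going on (this is not Sphinx's actual code, and the template is reduced to a handful of slots), each latex_elements key fills a slot in a preamble template, and a value from conf.py replaces the default for that slot. The default values below are copied from the first header shown earlier; the names DEFAULTS, TEMPLATE, and render_preamble are hypothetical.

```python
# Simplified model only; Sphinx's real template has many more slots.
DEFAULTS = {
    'inputenc': r'\usepackage[utf8]{inputenc}',
    'fontenc': r'\usepackage[T1]{fontenc}',
    'fontpkg': r'\usepackage{times}',
    'preamble': '',
}

TEMPLATE = '\n'.join([
    '%(inputenc)s',
    r'\usepackage{cmap}',
    '%(fontenc)s',
    '%(fontpkg)s',
    r'\usepackage{sphinx}',
    '%(preamble)s',
])


def render_preamble(latex_elements):
    """Fill the template, letting conf.py values override the defaults."""
    elements = dict(DEFAULTS)
    elements.update(latex_elements)
    return TEMPLATE % elements


if __name__ == '__main__':
    print(render_preamble({
        'inputenc': '',
        'fontenc': '',
        'fontpkg': '',
        'preamble': r'\usepackage{fontspec}',
    }))
```

An empty string does not remove the slot; it fills it with nothing, which is why the generated file keeps an empty line in each of those positions.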

Then I modified the Makefile that Sphinx generated to use “xelatex” instead of “pdflatex”:

PDFLATEX = xelatex

And the PDF build worked!

To avoid having to modify the Makefile myself each time, I changed sphinxcontrib-paverutils so that it passes an environment variable for PDFLATEX to override the setting in the Makefile. I then modified my local pavement.py to set the value to “xelatex”, and now I have repeatable PDF builds again.
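
One way to script that kind of override from Python is sketched below. It is not the code in sphinxcontrib-paverutils; the function names and the all-pdf target are assumptions here. The sketch passes PDFLATEX as a make command-line assignment, because make gives command-line assignments precedence over assignments inside the Makefile, whereas a variable that is only set in the environment loses to the Makefile unless make is run with -e.

```python
import subprocess


def pdf_build_command(engine='xelatex', target='all-pdf'):
    """Build the make command line that overrides PDFLATEX."""
    # VAR=value on the command line beats an assignment in the Makefile.
    return ['make', 'PDFLATEX={}'.format(engine), target]


def build_pdf(latex_dir, engine='xelatex'):
    """Run make in the directory written by Sphinx's latex builder."""
    subprocess.run(pdf_build_command(engine), cwd=latex_dir, check=True)


if __name__ == '__main__':
    # Show the command without running it.
    print(' '.join(pdf_build_command()))
```

Keeping the command construction separate from the subprocess call makes the override easy to inspect without needing a LaTeX installation.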

The line to set the mono-spaced font using \setmonofont is no accident. I tried the default font (which built without error, but dropped the unknown characters from the output), Courier (which rendered the characters as little boxes), and my terminal font Menlo Regular (which worked but may look odd in print), before finally settling on Courier New (which works properly and looks like what one would expect to find in print).