2. The basic structure of a Python package

To start, I will explain the basic structure of a Python package, which provides a skeleton onto which you can add the features that we will discuss later in these notes (documentation, automated testing, etc.). One way to learn about package structure is to peruse Python packages on GitHub, e.g., galpy’s GitHub page. But because these packages have many complex maintenance features implemented and they may use files specific to GitHub integration, this can be a confusing way to start learning the structure of Python packages (however, once you know the basics, looking at other packages is a great way to discover new features that you may want to use in your own package; in general, you can learn a lot by reading other people’s code).

Similarly, there exists a wide variety of package generators, pieces of code that will generate skeleton packages that you can fill in to create your own package. A popular class of package templates consists of those generated using the cookiecutter command-line utility, which allows you to generate package skeletons for many different languages and layouts simply by calling cookiecutter with the URL of a template. For example, astropy provides a cookiecutter package template specific to packages in the astropy eco-system. In general, I shy away from using such templates, certainly for beginning packagers, because these templates come with a confusing number of advanced features that obscure the basic structure of a package and that distract from the basic development of the package (when I generate a package using the astropy template, I don’t even know where to start putting code!). Cookiecutter templates are useful for advanced users creating many packages, but for the purpose of learning about packaging, I think it is better to build the package from the ground up and add each advanced feature individually later.

2.1. Naming your package

The first decision you have to make when creating a package is what to name it. Because it is annoying to rename a package later, this is an important decision to make early on and it is worth spending at least a few moments thinking about a good name. A memorable, catchy name will help your package gain attention. Indeed, I am very happy to have snatched the “galpy” name when it was available and even then I only ended up with “galpy”, because the name I had originally wanted to use was pygd (for “Python galactic dynamics”), but this name was already taken by a project on sourceforge.net.

Besides being catchy, it is also important for a name to be unique, so you’ll want to check that the name is not already in use by another project. For Python packages, it’s essential to check that there is no package of the name you are thinking about available on the Python Package Index (PyPI), by searching their database. This is essential, because eventually you’ll want to be able to install your package using a simple pip install PACKAGE_NAME, because that is the first thing that users will try when they learn that they need to use your package. If pip install PACKAGE_NAME installs a different package, many users will end up being very confused. So while you can have a different PyPI name from your package name, in the case of a conflict it is better to move away from your intended package name and choose one that is available. For a Python package, being available on PyPI is the most important consideration, but you might also want to search sourceforge.net to check more generally against names of open-source projects (not necessarily in Python) and search GitHub (although in the case of GitHub, the most important conflicts would be with packages that actually appear to be used by a wider community). To make sure that the name does not disappear while you are developing, you may want to register your package on PyPI as soon as possible, by publishing a first release.
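When checking a candidate name, it helps to know that PyPI compares names after normalization, so that names differing only in case or in the use of "-", "_", and "." refer to the same project. The following is a minimal sketch of that normalization rule (as specified in PEP 503); the helper names normalize_name and pypi_json_url are my own, and the sketch only builds the URL that you could look up rather than contacting PyPI itself.

```python
import re


def normalize_name(name):
    """Normalize a project name the way PyPI compares names (PEP 503)."""
    # Runs of "-", "_" and "." count as a single "-", and case is ignored
    return re.sub(r"[-_.]+", "-", name).lower()


def pypi_json_url(name):
    """URL of a project's JSON metadata on PyPI (hypothetical helper).

    A 404 response for this URL suggests that the name is not registered.
    """
    return f"https://pypi.org/pypi/{normalize_name(name)}/json"


# These three spellings all map onto the same PyPI project name
candidates = ["gaia_tools", "Gaia-Tools", "gaia.tools"]
normalized = {c: normalize_name(c) for c in candidates}
print(normalized)
print(pypi_json_url("exampy"))
```

So if gaia-tools is taken, gaia_tools is taken as well, which is worth knowing before you settle on a name.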

As for what to choose as a name, tastes differ. Many Python packages choose to end in py to make it clear that they are Python packages (e.g., numpy, scipy, astropy, galpy), but this is not a rule and a package name can be anything (indeed, the number of good, available names ending in “py” is rapidly dwindling). You can choose a name that succinctly describes what your package does (this has long been my own preferred naming convention, leading to such dryly named packages as apogee, mwdust, gaia_tools) or you can choose a clever name or acronym (my own forays in this direction are wendy and kimmy, although nobody ever seems to get them…; also illustrating that you can just end in “y”!). But I would recommend keeping the name of your package relatively short, because even in the age of tab-completion, people using your code will end up typing its name a lot.

So that we do not have to tediously refer to PACKAGE_NAME as the name of our under-construction package, from now on we will use exampy as the example (get it?) package. I will use exampy throughout these notes to illustrate everything that is being discussed. The exampy package is available here on GitHub and here on PyPI.

2.2. Package layout

Once you have decided on a name, it is time to start building your package. Make a directory that will hold your package; I typically give it the name of the package, but this is not required. Later, we will host this entire directory on GitHub and I will refer to it as the “top-level directory”. In this top-level directory, your package will be contained in a sub-directory that has the name of your package; in our example case this is exampy/. This directory will contain all of your code. Other sub-directories of the top-level directory will hold documentation and tests, and further sub-directories will automatically be generated when you build and distribute your code (more on that later). We will be using this example package throughout the rest of these notes to illustrate documentation and testing tools, so you may want to follow along and implement this simple package yourself to be able to keep using it in the next chapters. You might want to add it to GitHub as exampy-GITHUBUSERNAME to distinguish it from the original package.

Files in the top-level directory largely hold meta-information about your package. The top-level directory should have a README file with basic information about the package and it will hold the license file. Eventually it will also contain configuration files for automated documentation generation and for continuous integration of tests (but not yet!) and, if you host the package on GitHub, it may have one or more files specific to GitHub integration. Finally, it will hold a few files related to the installation and distribution of your code, the most important being the setup.py file.

Because we are building the package from the ground up, at first our package will have the following structure

TOP-LEVEL_DIRECTORY/
    exampy/
    setup.py

To make the package into an importable Python module, the package directory needs to contain an __init__.py file, which can simply be an empty file created using touch exampy/__init__.py. So a full-fledged, bare-bones Python package looks like

TOP-LEVEL_DIRECTORY/
    exampy/
        __init__.py
    setup.py

Without writing any further code under exampy/ (but with a basic setup.py file that we will describe below), this example package could be installed and imported in a Python session.
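If you prefer to script this step, the bare-bones layout can be created with a few lines of standard-library Python. This is only a sketch: make_skeleton is a hypothetical helper name, the setup.py it creates is an empty placeholder, and the example builds everything in a throw-away temporary directory rather than in your real working directory.

```python
import tempfile
from pathlib import Path


def make_skeleton(top_level, package_name="exampy"):
    """Create a package directory holding an empty __init__.py, next to an
    (initially empty) setup.py, and return the top-level directory."""
    top = Path(top_level)
    package_dir = top / package_name
    package_dir.mkdir(parents=True, exist_ok=True)
    # Equivalent of `touch exampy/__init__.py`
    (package_dir / "__init__.py").touch()
    # Placeholder only; setup.py still needs its actual contents
    (top / "setup.py").touch()
    return top


# Build the skeleton in a throw-away directory and list what was created
with tempfile.TemporaryDirectory() as tmp:
    top = make_skeleton(Path(tmp) / "TOP-LEVEL_DIRECTORY")
    created = sorted(p.relative_to(top).as_posix() for p in top.rglob("*"))
    print(created)
```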

The __init__.py file contains everything that is imported by import exampy or from exampy import * (which you should never do!). You can put functions and classes directly in the __init__.py file or you can write them in other files (to organize your code more clearly) and import them in __init__.py to make them easily accessible. For example, say that we implement a first set of basic math functions in _math.py and our package now looks like

TOP-LEVEL_DIRECTORY/
    exampy/
        __init__.py
        _math.py
    setup.py

then without adding code to __init__.py we need to from exampy import _math to gain access to the functions in _math.py; import exampy would, for example, not allow access to exampy._math. If you want the functions to be available under import exampy directly, you can import them in the __init__.py as follows:

# __init__.py
from ._math import *

(although it is better to explicitly import each of the functions that you want to expose). If _math.py defines a function def square(x): return x**2, this makes it available as exampy.square after import exampy, or through, e.g., from exampy import square. Alternatively, if you want to retain the “_math” part of the name, you can do

# __init__.py
from . import _math

which makes the square function available as exampy._math.square. In both of these cases, we get the square function using a simple import exampy. I discuss below why I chose to start the _math.py filename with an underscore.
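To see both import styles in action without touching a real installation, the following sketch writes a tiny package to a temporary directory and imports it. The package name exampy_initdemo is hypothetical (chosen so that it cannot clash with an installed exampy), and the __init__.py here uses the explicit from ._math import square form rather than the star import.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package name, chosen to avoid clashing with an installed exampy
PKG = "exampy_initdemo"

# Throw-away directory that plays the role of the top-level directory
tmp = tempfile.mkdtemp()
pkg_dir = Path(tmp) / PKG
pkg_dir.mkdir()
(pkg_dir / "_math.py").write_text("def square(x):\n    return x**2\n")
# Expose the function explicitly and also keep the module itself reachable
(pkg_dir / "__init__.py").write_text(
    "from ._math import square\n"
    "from . import _math\n"
)

# Make the temporary directory importable and import the package by name
sys.path.insert(0, tmp)
importlib.invalidate_caches()
pkg = importlib.import_module(PKG)

print(pkg.square(3))        # function exposed directly on the package
print(pkg._math.square(3))  # same function, reached through the module
```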

When your code grows in complexity, you likely will want to separate functionality into different submodules, such as exampy.integrate, which will contain functions to integrate mathematical functions. As we saw above, such a structure can be generated by having a single file integrate.py under the main exampy/ directory, but to allow for integrate to consist of multiple files, it is better to make a directory integrate under exampy and use an __init__.py file in that directory to make it a submodule. In this case, our example package’s layout becomes

TOP-LEVEL_DIRECTORY/
    exampy/
        integrate/
            __init__.py
        __init__.py
        _math.py
    setup.py

Everything that we have discussed so far for the main exampy/ directory contents holds for this submodule as well: we can either write code in the integrate/__init__.py file or in different files in that directory. For example, imagine that we have a file integrate/_integrate.py that implements a simple Riemann sum def riemann(func,a,b,n=10): return np.sum(func(np.linspace(a,b,n))*(b-a)/n). Then with an empty integrate/__init__.py file we have to import exampy.integrate._integrate to gain access to exampy.integrate._integrate.riemann (or from exampy.integrate import _integrate or similar), or we can again import the riemann function in integrate/__init__.py to make it accessible through a simple from exampy import integrate call.
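The same mechanism can be tried out for a sub-package. The sketch below builds a hypothetical package exampy_subdemo with an integrate/ directory in a temporary location; note that, to keep the sketch free of dependencies, the riemann function here is a pure-Python left Riemann sum rather than the numpy one-liner quoted above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

PKG = "exampy_subdemo"  # hypothetical name, to avoid clashing with exampy

tmp = tempfile.mkdtemp()
sub_dir = Path(tmp) / PKG / "integrate"
sub_dir.mkdir(parents=True)
(Path(tmp) / PKG / "__init__.py").touch()
# Left Riemann sum in pure Python (assumption: no numpy in this sketch)
(sub_dir / "_integrate.py").write_text(
    "def riemann(func, a, b, n=10):\n"
    "    width = (b - a) / n\n"
    "    return sum(func(a + i * width) for i in range(n)) * width\n"
)
# Pull the function into the submodule's namespace
(sub_dir / "__init__.py").write_text("from ._integrate import riemann\n")

sys.path.insert(0, tmp)
importlib.invalidate_caches()
integrate = importlib.import_module(PKG + ".integrate")

# Integrate 2x from 0 to 1; the exact answer is 1
approx = integrate.riemann(lambda x: 2 * x, 0.0, 1.0, n=1000)
print(approx)
```

The user only ever touches integrate.riemann; the _integrate.py file stays an internal detail.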

The convention I personally follow is to define submodules as much as possible through subdirectories rather than as files, pulling all of a (sub)module’s functionality into its __init__.py file to make it accessible to the user. This is why I gave the non-__init__.py files in the example above names that start with an underscore. This indicates in the Python universe that these are internal parts that should not be accessed directly by users; their functionality is exposed to users by importing it into the (sub)module’s __init__.py file. But this is largely a matter of taste, the most important considerations being keeping things simple for the user and keeping the code easily understandable for yourself (in that order!).

The considerations in naming submodules are similar to those discussed in naming the package as a whole above: choose short, descriptive names (not clever ones in this case; great examples are scipy.integrate, scipy.interpolate, which immediately make clear what these submodules do and don’t do).

2.3. The setup.py file

Next, we want to make our package installable using standard Python installation tools. The main tool used for Python packaging is setuptools. To use setuptools, we write a setup.py file that includes all of the information necessary to build, install, and package the code.

Some packages use a setup.cfg configuration file to define the necessary information, but even in that case one still needs to write a setup.py file that ingests the configuration file and hooks it up to setuptools. While this has some advantages, for beginning users I think it is easier to directly write the setup.py file, which is instructive and also allows for extensive customization later. Another downside of using a setup.cfg file is that python se[TAB] no longer auto-completes to python setup.py! Advanced setup.py files can become quite complicated (e.g., take a look at galpy’s setup.py file), so while it is again instructive to look at other packages’ setup.py files, for beginners this is likely to be highly confusing.

The main thing a setup.py file has to do is to call setuptools.setup(), which then takes care of supporting all of the basic installation and packaging tools. For our example package exampy above, a simple, bare-bones setup.py file is the following

# setup.py
import setuptools

setuptools.setup(
    name="exampy",
    version="0.1",
    author="Jo Bovy",
    author_email="bovy@astro.utoronto.ca",
    description="A small example Python package",
    packages=["exampy","exampy.integrate"]
)

This basic setup.py file defines the name of the package, its version, some basic information about the author and the package, and it tells setuptools what the actual package is. If you add this file to the example package, you will now be able to install it, by doing python setup.py install, but see below for more on how to install code.

Because installation proceeds by running the setup.py as a Python script, setup.py can contain arbitrary code to help install your code. Let’s take a look at what other keywords we can provide to the setup() function. We can provide:

  • A long_description: This is a detailed description of what the code does (longer than the description, which should be a single sentence) and what eventually would be published on the package’s PyPI site (e.g., see galpy’s PyPI page). Typically, one takes advantage of the fact that we can run arbitrary code in the setup.py file to read the contents of the README and use it as the long_description, using

    # setup.py
    with open("README.md", "r") as fh:
        long_description = fh.read()
    setuptools.setup(
        ...
        long_description=long_description,
        long_description_content_type="text/markdown",
        ...
    )
    

    in case the README’s format is Markdown, and we specify the format as well.

  • url= with the homepage of the package: typically this is the GitHub site. Additional URLs can be specified as project_urls=.

  • license= with the name of the open-source license (e.g., license='New BSD' or license='MIT').

  • classifiers= which contain meta-data about your project used by PyPI to categorize your package. Commonly-used classifiers concern the development status of your code (e.g., Development Status :: 4 - Beta, Development Status :: 6 - Mature), the intended audience (e.g., Intended Audience :: Science/Research), the license (again) (e.g., License :: OSI Approved :: MIT License), the programming language used (e.g., Programming Language :: Python or more specifically, Programming Language :: Python :: 3.7), and the operating system(s) the code works on (e.g., Operating System :: OS Independent for all). As far as I know, nobody ever uses these classifiers and I find it difficult to remember to update them (e.g., between Python versions, or when the code matures to a higher development status), but it is considered good practice to include them. For example, you could have

    # setup.py
    ...
    setuptools.setup(
        ...
        classifiers=[
            "Development Status :: 6 - Mature",
            "Intended Audience :: Science/Research",
            "License :: OSI Approved :: MIT License",
            "Operating System :: OS Independent",
            "Programming Language :: Python :: 3.5",
            "Programming Language :: Python :: 3.6",
            "Programming Language :: Python :: 3.7",
            "Topic :: Scientific/Engineering :: Astronomy",
            "Topic :: Scientific/Engineering :: Physics"]
    )
    

    for a mature package used in astrophysics that works on recent Python versions on all operating systems. A full list of classifiers is available here.

These are the main descriptive, meta-data keywords used by the setup() function.
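Because setup.py is ordinary Python, the classifiers list does not have to be typed out by hand. As a sketch of one way to make the Python-version classifiers easier to keep up to date, the hypothetical helper python_classifiers below generates them from a single list of supported minor versions (SUPPORTED_MINORS is likewise an illustrative name); the resulting list would then be passed as classifiers=classifiers to setuptools.setup().

```python
def python_classifiers(minor_versions, major=3):
    """Build 'Programming Language :: Python :: X.Y' classifier strings."""
    return [
        f"Programming Language :: Python :: {major}.{minor}"
        for minor in minor_versions
    ]


# Hypothetical list of supported minor versions, kept in one place
SUPPORTED_MINORS = [5, 6, 7]

classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Science/Research",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
] + python_classifiers(SUPPORTED_MINORS)

print("\n".join(classifiers))
```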

Further options of the setup function help setuptools deal with your package’s installation and distribution:

  • packages= lists the modules and submodules included in your package, using dotted module names. For the example above, this would be packages=["exampy","exampy.integrate"]. Rather than listing modules manually, you can use packages=setuptools.find_packages() to find them automatically, making sure to only include your own package by doing something like packages=setuptools.find_packages(include=['exampy','exampy.*']).

  • python_requires= specifies the Python versions supported by your code, mainly for use by the pip installer. If you are not too worried about this, you can omit this, but if you only support Python 3 (very reasonably these days), you can specify python_requires='>=3'.

  • install_requires= lists the basic dependencies of your code, dependencies without which your code cannot run. When users install your code using pip, pip uses this list to install any missing dependencies. For example, to specify that your code requires numpy and scipy, do install_requires=["numpy","scipy"]. You can specify version requirements, such as numpy>=1.7, using the standard pip syntax. If you have a dependency that is not on PyPI (thus, not pip installable), but is, for example, on GitHub, you can specify it in install_requires and give the URL in the dependency_links= keyword, e.g., dependency_links=["http://github.com/jobovy/galpy/tarball/master#egg=galpy"] to link to galpy’s GitHub source (of course, galpy is pip installable). In the example exampy package introduced above, we used numpy in the exampy.integrate.riemann function, so we need to specify install_requires=["numpy"].

  • package_data= is a dictionary with any data files that are part of your package(s) that need to be copied over to the installation directory (only .py files are normally copied to the installation directory) and that will be distributed when the time comes to publish your package. To copy data files to directories outside of the installation directory, use data_files=. To include the README.md and the LICENSE file, do package_data={"": ["README.md","LICENSE"]}.

  • entry_points= gives non-standard entry points to your code. For example, if you are distributing a command-line script, you can install that and make it executable on a user’s PATH, by specifying

    # setup.py
    
    ...
    setuptools.setup(
        ...
        entry_points={
            'console_scripts': [
                'my_script=my_script:main',
            ],
        },
        ...
    )
    

    which makes the main function of my_script an entry point.
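For this to work, my_script needs to be an importable module with a main function that can be called without arguments. A minimal sketch of what such a module could look like is given below; the command-line interface (squaring a number) is purely illustrative and not part of exampy.

```python
# my_script.py (hypothetical module matching 'my_script=my_script:main')
import argparse


def main(argv=None):
    """Entry point: parse command-line arguments and print a square."""
    parser = argparse.ArgumentParser(description="Square a number")
    parser.add_argument("x", type=float, help="number to square")
    # With argv=None, argparse reads sys.argv[1:], which is what happens
    # when the installed console script calls main() without arguments
    args = parser.parse_args(argv)
    print(args.x**2)
    return 0  # the generated wrapper passes the return value to sys.exit


# Demonstration call with an explicit argument list instead of sys.argv
exit_code = main(["3"])
```

Giving main an optional argv argument is a design choice that makes the function easy to call from tests, while still behaving normally when run as an installed command.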

More information on setup()’s keywords can be found on the setuptools documentation page.
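To get a feeling for what automatic package discovery does for the packages= keyword, here is a rough standard-library imitation: it walks a directory and reports every directory that contains an __init__.py, as a dotted name. This is a sketch under my own assumptions (find_packages_sketch is a hypothetical name and this is not setuptools' actual implementation, which supports include and exclude patterns among other things).

```python
import os
import tempfile
from pathlib import Path


def find_packages_sketch(where="."):
    """Return dotted names of all directories under `where` that contain an
    __init__.py and whose parent directories are packages as well."""
    where = Path(where)
    found = []
    for dirpath, dirnames, filenames in os.walk(where):
        current = Path(dirpath)
        if current == where:
            continue  # the top-level directory itself is not a package
        if "__init__.py" not in filenames:
            dirnames[:] = []  # do not descend into non-package directories
            continue
        found.append(".".join(current.relative_to(where).parts))
    return sorted(found)


# Try it on a throw-away copy of the example layout plus a docs/ directory
with tempfile.TemporaryDirectory() as tmp:
    top = Path(tmp)
    (top / "exampy" / "integrate").mkdir(parents=True)
    (top / "exampy" / "__init__.py").touch()
    (top / "exampy" / "integrate" / "__init__.py").touch()
    (top / "docs").mkdir()
    packages = find_packages_sketch(top)
    print(packages)
```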

2.4. Installing your code

Now that we have the basic outline of an example package and we have written the setup.py file, we are ready to install the code! The standard method for installing a package from its source directory (the top-level directory that contains setup.py) is to call

python setup.py install

which installs it in your system’s installation directory (typically under /usr/local on UNIX-style systems). You can specify an alternative installation location using, e.g.,

python setup.py install --user

which installs the code in a directory in your home folder (typically under ~/.local on UNIX-style systems, with modules installed in ~/.local/lib/pythonX.Y/site-packages). You can also directly set a prefix using

python setup.py install --prefix=~/.local

where the prefix chosen here gives the equivalent of the --user option (but --prefix can be any directory). An alternative to directly calling python setup.py is to use pip, which also works for local packages. For example, you can install a local project using

pip install .

However, when you are actively developing a package, installing in the way discussed above means that every time you update the code, you have to re-install it to gain access to any changes you have made. To avoid this, you can install the package in “develop” mode, using

python setup.py develop

or

pip install -e .

if you are using pip. In “develop” mode, the source is not copied to the installation directory; instead, an entry is made in the installation directory that points back to the code in the original directory. This means that any changes you make are immediately available system-wide without requiring a re-installation. Of course, if you have the package already loaded in a Python session, you still have to exit and re-start the session (or use importlib.reload). If your package includes compiled code and you make changes to the source code that need to be compiled, you do have to re-compile the code by running python setup.py develop again.
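The behavior of an already-running session can be illustrated with a small, self-contained sketch. It writes a hypothetical module reload_demo.py to a temporary directory (standing in for a package installed in “develop” mode), edits it on disk, and shows that the change is only picked up after importlib.reload. Bytecode caching is switched off here so that two edits within the same second cannot be confused with each other.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Avoid stale bytecode caches when the file is rewritten within the same second
sys.dont_write_bytecode = True

tmp = tempfile.mkdtemp()
module_path = Path(tmp) / "reload_demo.py"  # hypothetical module name
module_path.write_text("def answer():\n    return 1\n")

sys.path.insert(0, tmp)
importlib.invalidate_caches()
reload_demo = importlib.import_module("reload_demo")

before = reload_demo.answer()

# Edit the source on disk, as you would in your editor during development
module_path.write_text("def answer():\n    return 'updated'\n")

still_old = reload_demo.answer()  # the running session still has the old code
importlib.reload(reload_demo)
after = reload_demo.answer()      # after reloading, the edit is visible

print(before, still_old, after)
```

Keep in mind that importlib.reload only re-executes the module you pass to it; objects created from the old code, or names imported elsewhere with from ... import, keep referring to the old definitions.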

2.5. Code licenses

Before moving on to the next chapter, where I will discuss how to start sharing your code online with others, it is important to briefly discuss code licenses. All code that is shared online should have a license. Without a license specifying the terms of the code’s use and re-distribution, all code is considered to be copyrighted by the author, and re-use or re-distribution is not allowed (code that you put online without a license is not in the public domain; indeed, the opposite is the case). Thus, you should choose a license for your code and put the license file in your code’s top-level directory. If code is on GitHub without a license, the GitHub Terms of Service allow people to view and fork the code, but no modifications or re-distribution are permitted (see the No License GitHub help page). License your code.

There are two main categories of open-source licenses: permissive free software licenses and copy-left licenses.

Permissive licenses, as their name implies, are very generous in their terms. Typically they allow arbitrary use, modification, and re-distribution provided that the original license is retained and the original author is properly credited, while any liability related to any use of the code is explicitly denied. Examples of permissive licenses are the MIT License and the BSD 3-clause License, with the MIT License appearing to be the permissive license of choice for recent projects. Permissive licenses allow the broadest use of your code, because they require very little of people using your code. Most of the major Python projects that you know and love use permissive licensing (e.g., numpy, scipy, astropy).

Copy-left licenses are open-source licenses that, in addition to denying liability and requesting credit for the original author, also require that any modifications of the code be re-distributed under a similar copy-left license. The most widely used example of a copy-left license is the GNU General Public License version 3 (there is also an older version 2, which is somewhat more permissive). Thus, you can only use copy-left-licensed code in packages that are themselves copy-left licensed. In practice, this tends to decrease adoption of such packages, even though the philosophy behind this style of license is laudable (it aims to make sure open-source software remains open-source).

Creative Commons Licenses are not typically used for software, even though they are in heavy use for sharing other creative content such as websites, class materials, scientific papers, blog posts, etc.

The most important thing is that you give your code a license, with the type of license being of secondary importance; any license is better than no license. While it may seem silly to you, explicitly denying liability is an important thing to do when you put code online, to legally protect yourself from mis-use of your code (not that this has ever happened to me, but you never know…). When in doubt, choose a permissive license like the MIT License.