1. Introduction: What makes a good scientific software package?

The purpose of these notes is to introduce you to the main tools available for creating, developing, maintaining, and releasing a Python software package, with a particular emphasis on software packages with an academic, scientific audience. While many scientists these days write and run computer code for much of their working hours, relatively few scientists even now write software packages that gain much use outside of their own group of collaborators, even though many of us write code that is generally useful and all of science would benefit if scientists shared their tools more widely. These notes aim to help move the community of scientists along toward more open-source, open-development, well-tested, and well-documented code releases.

My own field is astrophysics and my experience in writing code for astrophysical applications inevitably colors all opinions expressed in these notes. I have written a few software packages that have found more widespread use, primarily galpy, a Python software package for galactic dynamics, but also various other packages for data handling and analysis and for machine learning. Many of the thoughts expressed in these notes, especially the more opinionated ones, come from my experience developing and maintaining these packages, and I learned many of these lessons the hard way. But once you get the hang of how to package your code in a way that makes it useful to people, it becomes far easier to create new packages. Making the most of the many wonderful automation services that are freely available these days is a crucial part of making code easier to develop and maintain, and these notes therefore go into significant detail on how to use these services.

Before we begin, it is important to mention that these notes are not intended to teach you how to program in Python, neither the basics nor more advanced concepts, and a mature understanding of Python is assumed. Much of what is discussed would generalize to packages written in languages other than Python, but no attempt is made at generalization.

1.1. “Why should I (want to) release my code?”

Even today, many scientists are reluctant to release the code they use, both the more general software packages that they use in many projects and the code specific to the analysis in a single project. The reasons given vary from embarrassment at the state of the code and the fear of being scooped to it being too much of a burden to release code and wanting to keep code private as a competitive advantage. About embarrassment there is little to say except that if you are so embarrassed by your code that you would not want to share it, then perhaps the code is not fit for use in a scientific paper. We should all think about the code we write to do our projects as equal in status to the paper that describes the project and its results, and therefore one should not end up with a code that one is embarrassed about, just like nobody submits a paper that they are embarrassed about. With the increasing complexity of all scientific analyses, actually having access to the code is the only hope we have of being able to reproduce analyses from previous work, since published papers fall woefully short in providing a full description of all steps involved. But if we had at least a snapshot of the code used for each paper, we would have a much better chance of reconstructing what happened (perfect reproducibility is unfortunately a very thorny problem, with ever-changing versions of the many software packages one relies on and even of the underlying operating systems).

(This point was brought home once again in a recent exchange with an author whose work my collaborators and I are relying on in an ongoing project; we asked the author to clarify a calculation in their paper and its relation to a similar calculation in a previous paper and they replied that they could not state whether the calculation was correct or not, because the paper was written seven years ago. If the actual author of a paper cannot reconstruct what they did a few years later, what hope is there for outside readers? But if the code had been available, it would have been easy to check.)

About the fear of being scooped, all I can say is that I do not believe I have ever heard of anybody being scooped by people using their publicly available code.

Because of the need for reproducibility, keeping code private once it has been used in a scientific publication is unethical. Even if you do not make your code publicly available online, interested parties should be able to request code that you used as part of a scientific publication, provided that it can be shared without placing undue burden on the requestee (but it is hard to imagine that a code so precious that one would want to keep it private would be difficult to share).

Thus, it is clear that there are many good reasons from the standpoint of scientific practice and ethics for sharing and releasing your code, but what about positive reasons for releasing your code?

I believe that you should want to release your code because it will make you do better science, cause you to gain collaborators, increase your professional profile, help you share your work with as wide a community as possible, and allow you to help the next generation of scientists bloom.

One of the great advantages of sharing your code and having it be used by other people is that they will find issues in the code. As they say, that’s not a bug, it’s a feature 😀. All code has bugs (this is why we will discuss in detail how to test your code), and the more eyes on the code and the more people using it, the more likely that major bugs will get caught and fixed (and fixed forever using continuous integration). In the end, this leads to better science, because code that’s used and re-used is more likely to be correct.

If you release your code and it gets used by a community of people, this will increase your professional profile, because people who use your code will know and remember you, and they will remember you fondly, because you help them out in their research without asking anything in return. People working with your code will contact you and describe their own use case, and this will invariably lead to new collaborations.

Sharing your code and sharing the results of your research as re-usable code is also a great alternative way to gain exposure for the science that you are doing. People using your code will want to find out what you did with the code and thus they will learn about the research that you are doing. If your code is more generally applicable than the specific science area that you work in, this will attract attention from people who might otherwise not learn about your research. And through your code, you can influence how people do science, for example, by setting defaults to values that you find reasonable.

Finally, an important reason why I release code is to help make science more accessible. Well-documented, well-tested code that is easy to use can provide an important jumping-off point for people new to the field you are in, whether they are students just starting out or more seasoned researchers exploring a different field. By making state-of-the-art tools easily available, we increase the overall level of science done and help researchers explore their ideas. By making code open source and developing in the open, we make this possible no matter whether people are at a prestigious institution or are students in remote areas of the world.

1.2. The dos and don’ts of software package development

There are many ways in which scientific software packages become successful. Your package may solve a problem that is very difficult to solve and/or that requires specialized routines and a level of code optimization not generally known among scientists; in this case, users will likely put up with a lot of inconveniences. Most high-performance-computing codes fall in this category and they are not typically known for being easy to install or use. However, much more common these days is that a code solves a problem that is not easy to solve oneself, but doable by most scientists working in the field with a moderate amount of effort, at least to write a basic version of the code. Examples of these types of problems in astrophysics are (i) reading different data formats, (ii) doing astronomical coordinate conversions, (iii) convolving spectra with photometric bands to generate photometry, etc. But there is an enormous advantage to the community for these tasks to be performed by generally-used software packages, because, through testing and use, packages become very robust against bugs and because much-used packages have an incentive to get all of the subtleties in calculations correct, which would likely be ignored by an individual writing a basic version of a calculation. In astrophysics, this is a key reason why astropy has been such a success, despite in many ways implementing basic astronomical tools that each on their own could be implemented by a graduate student: the combined effort from many individual contributors has made astropy robust and general to such a degree that it far surpasses what any individual could write and it therefore becomes an essential research tool.

In my opinion, the single most important reason behind a code’s success in gaining widespread use is being easy and intuitive to use. This should be the main driver behind any decision regarding details of implementation, documentation, and installation of your code. Always put the user first. Put yourself in the user’s position and ask how you would want to use the code if you had not written it. That’s difficult to do, but it’s important to try. Users want a code to be easy to install: unless a code solves a problem that they need solved and that would take them weeks or months to solve themselves, users who cannot easily install your code will give up and write their own implementation. Functions, classes, methods, and variables in your code should have intuitive names, so it is easy to remember how to use your code without having to constantly look at the (hopefully good and comprehensive!) documentation. The basic functionality of your code should be runnable with a few lines of code; nobody wants to have to start with a hundred-line code block to get any output from your code (that’s okay for advanced use, but basic use should be brief). And whenever you face a choice between simplifying the implementation of your code or simplifying its use, you should go for the latter even if at the expense of the former (simple, intuitive implementation is important from a maintenance perspective, but user experience is more important in my opinion).

To summarize, here is a list of dos and don’ts for developing your package:

  • do allow your code to be installed using standard installation commands (pip install X for PyPI packages, python setup.py build/install/develop for building from source, ./configure, make, make install for compiled code).

  • do have your code and auxiliary files installed in standard locations. Nobody wants to have to use your code from the directory it was downloaded into; code should be able to be installed and be available system-wide if desired.

  • don’t require use of files in non-standard locations: small data files should simply be copied to the code’s directory; the exception to this rule is large data files that are necessary, which may require the user to give a preferred location for these (e.g., through an environment variable or a configuration parameter).

  • do attempt to make your code work on commonly-used operating systems: Linux/UNIX, Mac OS, Windows if possible (Windows can be tricky, although for pure Python code it should be relatively straightforward if one pays attention to some details, like using os.path.join to create file paths rather than writing them out directly).

  • do support the last few minor Python and numpy versions (there is a numpy NEP with guidelines). To be as stable as possible, avoid using new features in the latest Python version, or at least test for Python version in your code so it does not entirely fail on older versions.

  • don’t require too many dependencies: What’s great about Python is the enormous variety of packages available and you may be tempted to include many dependencies that your code can use. But dependencies make code hard to maintain. Most packages you depend on are under ongoing development and will change calling sequences, deprecate features, move code between submodules, etc. and all these changes will break your own code (for example, scipy has moved the logsumexp function twice since I started developing galpy, requiring multiple if statements to deal with different scipy versions that galpy users might have).

  • don’t require hard-to-install dependencies: Dependency-problems are compounded when dependencies are difficult to install. I am old enough to remember when simply installing numpy or scipy was difficult; luckily that is no longer the case, but many packages can still be difficult to install, especially if they require compiling code and linking to external libraries. If a dependency is difficult to install, it is likely that many of your potential users will fail to install the dependencies; if this means that they cannot run your package, they will not end up using it.

  • do make all but the basic dependencies optional: Dependencies can be made optional by enclosing their imports in a try: ... except ImportError: ... statement and only raising an error when your code truly cannot function without the non-imported dependency. This is a great way to make sure that hard-to-install dependencies do not fully block use of your code. If it is likely that users will lack a dependency and this lack would make a simple import YOUR_PACKAGE fail, it is a good idea to make that dependency optional. A good practice is to make optional any dependency that cannot be reliably installed using a simple pip install.

  • do document new features from the moment you write a first, basic version of them. Documentation should not be an after-thought, something that you will “get to once the code is mature”. Keep documentation up-to-date with changes in the code.

  • do keep a changelog, documenting all but the most minor changes of your package’s functionality.

  • do use the standard Python mechanism for reporting errors and exceptions and for raising warnings. Avoid excessive warnings, but err on the side of warning too often rather than too little. Use warnings to point out changes to the code that a user may not be aware of, or to warn about possible unintended usage. Using the standard Python exception/warning syntax allows users to catch and ignore errors and warnings as they see fit.

  • do write a comprehensive test suite for your package and use continuous integration services to run it automatically whenever you update the code.

This is just an incomplete list of dos and don’ts and I will give more tips in the remainder of these notes.
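
To make the item on errors and warnings more concrete, the following is a small sketch that uses Python's standard warnings module and built-in exceptions. The warning class MyPackageWarning, the function scale, and its keyword old_factor are hypothetical names invented for this illustration.

```python
import warnings


class MyPackageWarning(UserWarning):
    """Hypothetical package-specific category, so users can filter on it."""


def scale(values, factor=1.0, old_factor=None):
    """Multiply values by factor; 'old_factor' is a made-up renamed keyword."""
    if old_factor is not None:
        # Warn about a change the user may not be aware of, but keep working
        warnings.warn(
            "'old_factor' was renamed to 'factor'",
            MyPackageWarning,
            stacklevel=2,  # point at the caller's line, not this one
        )
        factor = old_factor
    if factor == 0:
        # Standard exception type, so users can catch it in the usual way
        raise ValueError("factor must be non-zero")
    return [v * factor for v in values]


# A user who does not care about this warning can silence just this category
with warnings.catch_warnings():
    warnings.simplefilter("ignore", MyPackageWarning)
    quiet = scale([1, 2], old_factor=3)
```

Defining a dedicated warning category is a design choice rather than a requirement; it lets users ignore your package's warnings without hiding those from other libraries.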