
University Library, University of Illinois at Urbana-Champaign

Research & Publication in Medicine & Health

About Reproducibility

Guidelines

Rules for Reproducible Data Science

Excerpted From: Creating Reproducible Data Science Projects. Justin Boylan-Toomey, July 25, 2019. https://towardsdatascience.com/creating-reproducible-data-science-projects-1fa446369386

  • Use Version Control
    Use a version control system such as Git, hosted on a service like GitHub or GitLab, to provide a remote backup of your codebase, track changes to your code and collaborate effectively as a team. Try to follow Git best practices, frequently committing small changes that each solve a specific problem.
  • Agree a Common Project Structure
    Consider using tools such as Cookiecutter to generate a standard data science project folder structure for you (a minimal sketch follows this list). If a specific project’s requirements mean you need to use a different structure than your team normally uses, document the new structure in your repository’s README.md file.
  • Use Virtual Environments
    Use conda or Python’s built-in venv environments to keep track of your project’s dependencies and Python version information (a minimal sketch follows this list).
  • Clearly Document Everything
    Clearly documenting your projects and code will save you time if you have to revisit the project at a later date. It will also make it far easier for others to use your code or follow and build on your analysis. At a minimum, include a README.md file at the root level of your repository. The contents will vary between projects but should include a description of the project and an overview of the methodology and techniques used. (see original article for additional information; a docstring sketch follows this list)
  • Use Jupyter Notebooks Wisely
    Consider moving your core logic out of your Jupyter Notebooks and into separate importable Python module files. This will enable the sharing of code across your team, avoiding duplicate and slightly edited versions of core data science code being scattered across your team’s notebooks. Code quality will also improve as you can easily collaborate, run tests and conduct code reviews on your shared modules. (see original article for additional information; a minimal sketch follows this list)
  • Keep Your Code Stylish
    Agree coding standards. Try to write Pythonic code in line with Python’s PEP 8 style guide. Using a fully featured IDE such as PyCharm or Visual Studio Code with built-in linting will highlight any poorly styled code and help identify any syntactic errors in your code. Using an automatic code formatter such as Black will ensure that the code in your team’s projects has a consistent style, improving readability (a minimal sketch follows this list).
  • Test Your Code
    Use a unit testing framework such as pytest to catch any unexpected errors and test that your logic executes as expected (a minimal sketch follows this list). Where appropriate, consider using test-driven development; this helps ensure your code is error-free and satisfies your requirements as you write it. It is also a good idea to use a tool such as Coverage to measure the proportion of your code covered by your unit tests. Python IDEs such as PyCharm have built-in testing and coverage support, even automatically highlighting which lines of code are covered by your tests.
  • Use Continuous Integration
    Consider using continuous integration tools such as Travis CI or CircleCI to automatically test your code when it is merged to your master branch. Not only does this prevent broken code from reaching master, it also simplifies the code review process. You can even use Black with a pre-commit hook to automatically format committed code, removing any debates over code style from the review process and ensuring a standard code style across your repositories.
  • Sharing Data & Models
    For larger or more complex projects, consider using a cloud storage solution such as AWS S3, Azure Blob Storage or locally hosted network storage to store your model and data (a minimal sketch follows this list). This can be combined with DVC, a version control system designed to version the outputs of machine learning projects without pushing your large data and model files to Git.
  • Data Pipeline Management
    Try to make your data pipeline code modular, breaking your pipeline into modules for each discrete process and unit testing each of them. For larger, more complex pipelines, consider using a workflow management tool such as Spotify’s Luigi or Apache Airflow to execute your Python modules as chained batch jobs in a directed acyclic graph (a minimal sketch follows this list).
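
Example Sketches

The sketches below are not part of the original article; they are minimal illustrations of some of the rules above, and every module, file, bucket and column name in them is an assumption chosen for illustration.

Agree a Common Project Structure: Cookiecutter is normally run from the command line, but it also exposes a Python API. This sketch assumes the widely used drivendata/cookiecutter-data-science template, which the excerpt does not name.

# Minimal sketch: generate a standard data science project layout with Cookiecutter.
# Requires `pip install cookiecutter`; the template URL and project name are assumptions.
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/drivendata/cookiecutter-data-science",
    no_input=True,                                  # accept the template defaults
    extra_context={"project_name": "my-analysis"},  # override a single default
)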
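
Use Virtual Environments: venv is usually invoked as python -m venv from the shell, but the standard library also lets you create an environment programmatically. The environment name, packages and file names below are assumptions.

# Minimal sketch: create an isolated environment and pin its dependencies.
import subprocess
import sys
import venv

venv.create("env", with_pip=True)  # equivalent to: python -m venv env

# Use the environment's own pip so packages land inside it, then record exact
# versions in requirements.txt so the environment can be recreated elsewhere.
pip = "env/bin/pip" if sys.platform != "win32" else r"env\Scripts\pip.exe"
subprocess.run([pip, "install", "pandas", "scikit-learn"], check=True)
with open("requirements.txt", "w") as requirements:
    subprocess.run([pip, "freeze"], stdout=requirements, check=True)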
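
Clearly Document Everything: documentation also lives in the code itself. The function, column names and units below are assumptions, not examples from the article.

# Minimal sketch: a small, documented function with a NumPy-style docstring.
import pandas as pd


def add_bmi_column(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of ``df`` with a ``bmi`` column added.

    Parameters
    ----------
    df : pd.DataFrame
        Must contain ``weight_kg`` and ``height_m`` columns.

    Returns
    -------
    pd.DataFrame
        The input data with an additional ``bmi`` column (kg / m**2).
    """
    out = df.copy()
    out["bmi"] = out["weight_kg"] / out["height_m"] ** 2
    return out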
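
Use Jupyter Notebooks Wisely: one way to move core logic out of notebooks is a small importable module that every notebook shares. The module, function and file names are assumptions.

# features.py: shared logic lives in an importable module, not in each notebook.
import pandas as pd


def clean_visits(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate visits and standardise column names."""
    cleaned = df.drop_duplicates().copy()
    cleaned.columns = [c.strip().lower().replace(" ", "_") for c in cleaned.columns]
    return cleaned


# In any notebook cell the same implementation is then simply imported:
#   from features import clean_visits
#   visits = clean_visits(pd.read_csv("data/raw/visits.csv"))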
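
Keep Your Code Stylish: Black is normally run as a command-line tool or editor integration; calling it from Python here is only a way to show, in one runnable snippet, what it does to inconsistently styled code. The line being formatted is an assumption.

# Minimal sketch: let Black reformat an inconsistently styled line of code.
# Requires `pip install black`.
import black

messy = "result=train_model( data,learning_rate=0.01,n_estimators=500,max_depth=3 )\n"
print(black.format_str(messy, mode=black.FileMode()))
# Prints the same call with consistent spacing, e.g.
# result = train_model(data, learning_rate=0.01, n_estimators=500, max_depth=3)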
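
Test Your Code: a minimal pytest sketch against the illustrative clean_visits function from the notebook sketch above; run it with pytest from the project root. All names and data values are assumptions.

# test_features.py: minimal pytest sketch for the illustrative features module.
import pandas as pd

from features import clean_visits


def test_clean_visits_drops_duplicates_and_renames_columns():
    raw = pd.DataFrame(
        {"Patient ID": [1, 1, 2], "Visit Date": ["2019-01-02", "2019-01-02", "2019-01-03"]}
    )
    cleaned = clean_visits(raw)
    assert len(cleaned) == 2
    assert list(cleaned.columns) == ["patient_id", "visit_date"]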
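
Sharing Data & Models: DVC is driven from the command line, so this sketch instead shows the plainer option of pushing artefacts to AWS S3 with boto3. The bucket, keys and local paths are assumptions, and credentials are taken from the standard AWS configuration.

# Minimal sketch: share a trained model and processed data via an S3 bucket.
# Requires `pip install boto3`; bucket and file names are assumptions.
import boto3

s3 = boto3.client("s3")
s3.upload_file("models/model.pkl", "my-team-bucket", "my-analysis/models/model.pkl")
s3.upload_file("data/processed/train.csv", "my-team-bucket", "my-analysis/data/train.csv")

# Colleagues pull the same artefacts back down to reproduce the analysis:
s3.download_file("my-team-bucket", "my-analysis/models/model.pkl", "models/model.pkl")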
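
Data Pipeline Management: a minimal Luigi sketch chaining two tasks into a small directed acyclic graph. The task names, file paths and columns are assumptions, not the article’s pipeline.

# pipeline.py: two chained Luigi tasks forming a tiny directed acyclic graph.
# Requires `pip install luigi pandas`; all paths and column names are assumptions.
import luigi
import pandas as pd


class CleanData(luigi.Task):
    """First stage: de-duplicate the raw extract."""

    def output(self):
        return luigi.LocalTarget("data/interim/clean.csv")

    def run(self):
        df = pd.read_csv("data/raw/visits.csv").drop_duplicates()
        with self.output().open("w") as out:
            df.to_csv(out, index=False)


class BuildFeatures(luigi.Task):
    """Second stage: depends on CleanData and adds a derived column."""

    def requires(self):
        return CleanData()

    def output(self):
        return luigi.LocalTarget("data/processed/features.csv")

    def run(self):
        with self.input().open("r") as src:
            df = pd.read_csv(src)
        df["visit_month"] = pd.to_datetime(df["visit_date"]).dt.month
        with self.output().open("w") as out:
            df.to_csv(out, index=False)


if __name__ == "__main__":
    # Run the whole pipeline locally; Luigi only re-runs tasks whose outputs are missing.
    luigi.build([BuildFeatures()], local_scheduler=True)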