At DWP Digital we use the latest tools, techniques and technologies to make our services better for users. And in our Data Science Community we’ve recently begun using containerisation technology to help us deliver our data projects.
It started with a sandbox
When I started working at the Sheffield hub in November 2017, we were using a data science sandbox environment that had been developed so we could use open source tools such as R and Python with DWP data in a secure way.
Although it was really useful, it was also cut off from the internet – or ‘air-gapped’ – for security reasons. This meant that installing new tools into the environment was difficult.
This was a problem for me because my first project involved Natural Language Processing, and many of the tools I needed weren't installed.
So I was interested to discover that Docker was used to configure the environment. Docker is a containerisation technology used to build, ship and run distributed applications. It’s something that I’d used in my previous role as a Computational Biologist working for Cancer Research UK's Drug Discovery Unit in Manchester.
What is containerisation?
Containerisation technology means that applications and their dependencies can be wrapped up together in a lightweight, isolated environment – a container – and run consistently in a variety of settings, such as a laptop, a computer cluster, or the cloud.
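As a rough illustration, a Docker container image is described by a Dockerfile. The sketch below packages a small Python script together with its dependencies; the file names and versions are made-up examples, not from a real project:

```dockerfile
# Start from a pinned official Python base image
FROM python:3.6-slim

# Install the project's dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the analysis code into the image
COPY analysis.py .

# Run the analysis when the container starts
CMD ["python", "analysis.py"]
```

Building this file (with `docker build`) produces a self-contained image that runs the same way wherever Docker is available.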
In computational biology the data can be very large indeed. Projects such as The Cancer Genome Atlas are generating Whole Genome Sequencing data from the tumours of thousands of cancer patients around the world, and this data is put in the public domain for scientists around the world to use in their research.
However, the data is so huge that it isn't feasible for individual scientists or even institutes to download their own copies. Instead, it is stored in the cloud and researchers develop their pipelines locally, wrap them up in Docker, and send their analysis to the data.
Seeing the potential of Docker
At a Data Practice community meeting I got chatting to Stephen Southern and Pablo Suau who work in the Newcastle Data Science Hub. Stephen had recently been appointed as Platform Lead for Data Science and was taking over control of the sandbox environment and thinking about our requirements more generally. And Pablo was already using Docker to develop repeatable and reproducible analytical pipelines for the analysis of open labour market data.
We all saw the potential of Docker as a key technology to help us deliver our projects and worked together over the next few months to achieve this, greatly aided by the well-timed arrival of Infrastructure Engineer Adrian Stone to our Sheffield team.
We now use Docker to define environments that contain the appropriate set of tools for a given project. We mostly do this initial work on an internet-connected laptop, using either open data or simulated data. Once we have a Docker image that we're happy with, we can deploy it either into our secure sandbox environment, where it has access to DWP private data, or to the public cloud, where it can access open data.
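The build-then-deploy step can be sketched with standard Docker commands. Moving an image into an air-gapped environment is typically done by exporting it with `docker save` and importing it with `docker load`; the image name below is a hypothetical example:

```shell
# On an internet-connected laptop: build the image
docker build -t nlp-pipeline:1.0 .

# Export the image to a tar archive for transfer into the air-gapped sandbox
docker save nlp-pipeline:1.0 -o nlp-pipeline-1.0.tar

# Inside the sandbox: load the archive and run the container
docker load -i nlp-pipeline-1.0.tar
docker run --rm nlp-pipeline:1.0
```

The same image, unchanged, can instead be pushed to a cloud container registry when the project uses open data.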
This gives us huge scope in the range of tools available to data scientists within DWP; if we can build it in a Docker image, then we can apply it on our data!
We can also version our environments to ensure that we use them consistently, and can return to them later if we need to. Data science tools are evolving very rapidly, so we need to be able to ensure that an analysis pipeline we run today can still be run reproducibly a year or two later.
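In practice, much of this reproducibility comes down to pinning versions in the Dockerfile rather than relying on whatever is current. A minimal sketch (the specific versions here are illustrative assumptions):

```dockerfile
# Pin the base image to an exact tag rather than 'latest'
FROM python:3.6.5-slim

# Pin exact package versions so a rebuild a year or two later
# produces the same environment
RUN pip install pandas==0.22.0 scikit-learn==0.19.1
```

Tagging the built image with a version number, and keeping the Dockerfile under version control alongside the analysis code, means the whole environment can be rebuilt or re-run on demand.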
At DWP Digital we’re using data to solve some of the country’s biggest digital challenges. If you like the sound of what we’re doing, why not take a look at some of our current Data Scientist vacancies on our careers site?