We're Infrastructure Site Reliability Engineers (SREs) here at DWP Digital. SRE is a specialised role that focuses on the reliability and maintainability of large networks and systems. It is meant to be disruptive and to make an impact.
A major focus of SREs is the aspiration to never see the same issue twice, often using automation as a resolution. We spend a large amount of time on reducing human labour, sharing knowledge among teams, and creating a blameless environment.
We're currently working on network standardisation, documentation and rationales. This will give us a strong foundation on which to deploy network automation across DWP using Red Hat Ansible, managed through Ansible Tower.
One of the biggest impacts we’re trying to make is in adapting the business and change processes to progress digital transformation into the future. This will help us reduce issues caused by ‘human error’ within the network infrastructure environment by automating tasks, queries and troubleshooting practices through Ansible. And it’ll also help us reduce toil – tasks that tend to be manual, repetitive, automatable and that scale linearly. We’re reporting success and failure using the Ansible Tower platform.
An SRE and DevOps approach
In our teams we’re continually combining our DevOps and SRE knowledge to solve problems and improve the network for the benefit of the department and our users.
We work in multi-disciplinary teams so that our diverse ideas and technical inspiration can be incorporated into our daily work stream, helping us to provide the best service and outcomes. We also attend daily stand-up meetings for individually assigned workflows and application improvements so we can stay flexible and adapt during team absences. By doing this we can ensure we’re aligned with the needs and function of each department platform.
Services we work on
We’re currently developing Cisco device management in Red Hat Ansible in a pre-production environment. We have pre- and post-change capture modules with comparisons, configuration modules which run from a device template, and reachability and network health tasks. These have all been developed and are waiting for the next steps: compiling them and pushing them into live environments.
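To illustrate the idea behind a pre- and post-change comparison, here is a minimal sketch in Python rather than Ansible. It is not our actual module: the function name `compare_captures` and the sample captures are hypothetical, and it assumes each capture is simply the text output collected from a device before and after a change.

```python
import difflib


def compare_captures(pre: str, post: str) -> list[str]:
    """Return only the lines that were removed ("- ") or added ("+ ")
    between a pre-change capture and a post-change capture."""
    diff = difflib.ndiff(pre.splitlines(), post.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Hypothetical captures, used purely for illustration.
    pre_change = "hostname r1\nntp server 10.0.0.1"
    post_change = "hostname r1\nntp server 10.0.0.2"
    for line in compare_captures(pre_change, post_change):
        print(line)
```

An empty result means the two captures are identical, which is one simple way to confirm that a change touched nothing beyond what was intended.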
We envisage a series of Ansible playbooks for each OPH platform which can be run and monitored by the UXC team 24/7.
Network SREs and NetDevOps
Network automation SREs are doing the same daily tasks that network engineers do, but we’re using tools like Ansible to achieve them. This role or process is often known as NetDevOps.
Network implementations and changes are no longer being done manually on the network hardware itself. Instead, the process is being automated to reduce human error and toil.
Most network engineers today configure networks the same way they have historically done; they either use a console cable or a jump box with CLI access, where they will happily configure one device at a time with configuration files saved on their laptops.
The present is changing rapidly. Configuration files are no longer being kept on engineers’ laptops; they’re being version controlled in Git. Changes are being sanctioned with pull requests and implemented via pipelines, with full testing in pre-production before being rolled out and pushed into production environments.
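As a rough illustration of the kind of check a pipeline could run against a version-controlled configuration file before a pull request is approved, here is a small Python sketch. It is an assumption for illustration only, not a description of our pipeline: the function name `missing_required_lines` and the list of required lines are hypothetical.

```python
def missing_required_lines(config_text: str, required: list[str]) -> list[str]:
    """Return the required lines that do not appear in the configuration.

    Lines are compared after stripping leading and trailing whitespace,
    so indentation in the configuration file does not matter.
    """
    present = {line.strip() for line in config_text.splitlines()}
    return [line for line in required if line not in present]


if __name__ == "__main__":
    # Hypothetical baseline and configuration, used purely for illustration.
    baseline = ["service password-encryption", "no ip http server"]
    candidate = "hostname r1\nservice password-encryption\n"
    missing = missing_required_lines(candidate, baseline)
    if missing:
        # A pipeline step would typically fail the build at this point.
        print("Missing required lines:", missing)
```

Running a check like this on every pull request means a missing baseline setting is caught in review rather than on a live device.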
If this sounds like the kind of work you’d enjoy, check out the DevOps and SRE roles we currently have available.