Reconciling DevOps with Enterprise Architecture
Mavericks versus traditionalists? Freedom versus control? Chaos versus order?
Conflict is a signal for change
In a typical enterprise of the last 20 years or more, Enterprise Architecture has been lord of all it surveys. It controlled or at least had massive influence on the tools, the infrastructure, the data, the procurement, the methods and most fundamental decisions within the IT arena.
The role it played satisfied an organisation's natural tendency to square everything away into definable roles, budgets and spheres of responsibility. It aligned things and formed a basis for planning and resource allocation, along with producing wonderfully detailed diagrams of IT assets, roadmaps and deliverables, and the reassurance that "someone" is doing the "security thing".
Then along comes DevOps, the semi-anarchic cowboy demanding that Product Teams have autonomy, that there is no fixed way for all development to happen across the company, that things are fluid, insisting that responding to change is far more important than a 3 year roadmap of deliverables and releases.
"This is crazy!" scream the architects. How can we plan? Who will "own" this box? How can we align? How can we structure things around such madness?
At first glance these seem like two diametrically opposed tribes doomed to civil war. How can we possibly balance the autonomy demanded by the DevOps camp (anarchy from the Enterprise perspective) with the valid concerns of the enterprise (constraint and toil from the DevOps perspective)? Some have even likened these seemingly conflicting forces to the Planned Economy of Communism versus the Free Market of Capitalism!
But really - it boils down to the fact that DevOps could (and should!) impact and change every corner of the organisation - not just be "the thing the software engineers do to be hip." There are no half-measures in a successful digital transformation - and the successful ones are based on total commitment throughout the leadership and organisation to the principles and lessons of DevOps.
That includes re-aligning Enterprise Architecture and, in most cases, freeing EA of responsibilities it should never really have had in the first place.
What is this autonomy we crave?
I'll start by saying that Product Team autonomy at its most fundamental level is not "we'll do what we like."
It is born from the lesson of the last 20+ years that one size does not fit all. In fact it usually doesn't even fit some!
But first we need to break down some terms and principles.
What is TOIL?
Yes, uppercase, because it is so important! Toil actually started as an SRE term but I think it's critical in everything we do in DevOps - even when doing things completely unrelated to technology.
In essence, TOIL is waste. Waste in energy, waste in time, waste in processes, waste in literally anything whilst getting something done. It is typically defined along the lines of manual and/or repetitive work that doesn't contribute directly to the business value the product is trying to deliver.
The key factors of TOIL that I personally look for are anything that slows the iteration speed of a team (see Boyd's Law of Iteration for more information). Typically these are:
Manual & Repetitive Activities
Not only are manual tasks boring, slow and error prone - they also drain the energy of the team.
Not "Oh My God that was Exhausting" but more like slowly chipping away at the energy level of the team. Engineers involved in manual work will be feeling bored by the banality of it, stressed by the risk of mistakes and generally just unproductive. Again, refer to Boyd's Law, though Agile and Lean will tell you the same thing.
Constraint
You don't tell the carpenter you hired to use your cheapo saw instead of their specialised and higher quality professional saw - and you certainly don't tell them to use a screwdriver's handle to hammer in nails.
Whilst hammering in nails with a screwdriver handle will work - and saves needing to buy two tools instead of one... you know the result will not be as nice, the task will take longer and the carpenter will not be happy.
External Dependencies
The entire premise of a Product Team is that they have all the skills and resources they need in the team. They don't need to raise a ticket to another team to get something done, they don't need to wait for the operations team to deploy for them, they don't need to integrate with dozens of external tools (which they likely need to raise tickets to interact with).
For every external dependency (whether they be systems or teams / individuals) you introduce hundreds if not thousands of hours of WASTE in waiting for things to be done and in mistakes due to misinterpretation of instructions or requirements. External dependencies add roadblocks and diminish the team's ability to act.
Oversized Activities
A foundation of both Agile and Lean principles, with many studies behind it, is that the smaller the unit of work, the faster the throughput.
When a unit of work becomes large and complex, with too many problems to solve simultaneously, an engineer easily gets bogged down in parallel thought processes, procrastination or difficulty contextualising all the pieces required. It causes great difficulty in both defining and meeting the "Done" state.
What is a Product Team?
A product team's purpose is to be obsessively hyper-focused on delivering the absolute best single "product." That product can be a micro-service, an API, a business application or even a teddy bear.
The point being that they are allowed to have complete focus and responsibility for delivering the most performant, secure, cost-effective and complete delivery of that single service or application.
By giving the Product Team focus and responsibility - we eliminate huge amounts of TOIL. We give them complete control so that they will not have to deal with external dependencies, units of work are smaller, they can work without constraint and they have both the time and internal expertise to engineer away manual and repetitive work.
Focus
A product team is given focus by not having its responsibilities spread across multiple projects, which increases the team's bandwidth to concentrate on delivering all aspects of their product, not just the code. They gain the time to become specialists in the requirements and wishes of whoever ordered this specific product. They become experts in the very precise, super detailed, super optimised methods to best deliver this. one. thing.
Responsibility
The product team is given responsibility by allowing them to recruit all the roles they need directly into their team, and by giving them control of how to utilise and distribute their resources (including budget!) in the best manner to achieve their goal. That responsibility extends to being completely transparent and measurable, so that they can be judged on the results they achieve and not the tools and processes they used to achieve them.
The product team AND ONLY the product team will, via their specialised knowledge and experience gained in this specific delivery, be able to identify the best tools, technologies and tricks to deliver this product in the best, most flexible and efficient way.
Responsibility versus Mandate
You can't have responsibility for something without the mandate to take your own decisions and act upon them.
Story time!
Some years ago my Web Operations team was given the responsibility by the Network team for making sure that all firewall rules relating to our service were correct, secure and up to date.
Great, we thought! We can totally use this as a chance to optimise, align and really take control of this important area of our service.
But there was a catch!
We were not allowed access to the firewall to view the current rules directly. If we wanted to see the rules then we needed to raise a ticket and our rules would be exported to a spreadsheet (sometime within 1-3 days), and we CERTAINLY would never be allowed to modify and publish our rules directly.
Instead we needed to raise a ticket describing the changes we wanted and someone would interpret those instructions (usually incorrectly), implement them for us (because hey, we couldn't possibly learn and understand how to do something so specialised and complex ourselves!) and then ask us to check the result (which, remember, requires another ticket to get the latest view of the rules!).
So the Network team wanted to give us the RESPONSIBILITY (in other words, the workload and the blame if things were wrong) without the MANDATE (the ability to decide, act and implement ourselves).
This of course was the worst of all worlds - we suddenly had responsibility, little license to act and huge amounts of TOIL added to the load of the team.
Autonomy
So back on topic, what is Autonomy?
Autonomy is having both the RESPONSIBILITY and the MANDATE TO ACT for all aspects of our Software Delivery Lifecycle.
The product team needs to be able to decide how best to plan, design, build, test, secure, deploy, monitor, scale, distribute, support and any other aspect of the DevOps loop without excessive constraint or dependency on other teams.
Autonomy is not...
Working in isolation - of course not! We should learn from others and share our own knowledge. Transparency is key and we shout about both our successes and our failures for other teams to learn from and perhaps they can offer help!
Refusing to align with other teams - again, it's the opposite! We should build inter-team relationships (especially with those closest to us) and generally be good citizens in the enterprise as a whole. We should join SIGs and Guilds whilst delivering in ways that are compatible with the world at large.
Refusing to align with Enterprise Requirements - again, the opposite! In fact, probably the first principles in the design of the product will be about how the service or data or application will be consumed by other Product Teams and help the Enterprise reach its goals!
Re/Inventing everything - It would be TOIL to invent everything or to be different for the sake of being different. We should be different where it adds value to either the TEAM's work or the product.
So where is the conflict?
It's just nature
Most studies of the reaction to change, whether it be a pandemic, a company re-org, a cultural change or any other environmental change, find that the first response is DEFENSE DEFENSE DEFENSE!
It's a natural reaction in humans and animals alike, hard-wired and primeval, so unless you are aware of this fact of life you will behave predictably.
More specifically, in organisational and cultural shifts the most common reactions include insecurity, protection of what you have, immobilisation, denial and refusal.
Bringing it back to what we are talking about here, that is reflected as protection of your role, your job and your sense of purpose. It's difficult to let go of things... it feels like loss!
What is IT Enterprise Architecture?
Let's go back to basics here... pick a definition... there are plenty out there and everyone will have a favourite variation - but I think the following does a decent middle-of-the-road job:
Enterprise architecture (EA) is the practice of analyzing, designing, planning and implementing enterprise analysis to successfully execute on business strategies. EA helps businesses structure IT projects and policies to achieve desired business results and to stay on top of industry trends and disruptions using architecture principles and practices, a process also known as enterprise architectural planning
Now the key parts of this are that EA "helps businesses structure IT projects and policies to achieve desired business results" and "stay on top of industry trends and disruptions".
Comparing EA with Reality
In the last 20+ years, there has been an obvious creep in that definition. Depending on the organisation, the role may now include all those things we listed in the first part of this story. Typically today EA will be:
defining rather than analysing
enforcing instead of recommending
deciding infrastructure, tooling, and architecture
In that same time period we have gone from:
A handful of programming languages to hundreds
Limited tooling to dozens of options for everything
Product and project lifecycles of years to sometimes days!
Fixed physical infrastructure to transient and implied infrastructure!
Development teams of hundreds to two-pizza teams
Specialised and centralised teams to distributed generalists
Here is where they clash
So now we are in a place where the DevOps and Product Team mentality takes a whole bunch of decisions previously "owned" by Enterprise Architecture and hands them to the Product Team - OUCH!!!
There has also been, for a large part of the last 2 decades, a Vendor = Technology relationship. If you wanted virtual machines then you had exactly one enterprise choice, which was VMware. You want a database? Then there was Oracle and MSSQL. If you wanted a supported Application Server for Java you were looking at either IBM WebSphere or Oracle WebLogic.
This meant that strategic (EA) decisions were naturally linked to the selection of vendors and products, and this is where the role creep starts. If EA was identifying a key technology, then it was by default also selecting vendors, which in turn often trickled easily down into the Solution Architecture of individual applications and services, and from there into decisions made all the way down the stack.
But the fact of the matter is that you can no longer say Oracle for DBs, WebSphere for Application Servers and Java is our language of choice. It didn't even work particularly well 10 years ago, and every organisation is full of stories of how what might have seemed like a great efficiency from a centralised perspective turned out to be an exponentially higher cost in compromises, toil and poor optimisation at both the technical and people levels.
So from a purely systems approach - moving responsibility for decision making to the Product Team makes total sense. There is such a huge variation in the products being delivered by a large organisation that there will be a gigantic spread of tools and technologies used to provide tailored architecture and solutions for each and every product.
But of course for those people whose livelihoods and career development have been built on a different model, it will feel like a threat to their role and job if they take the wrong mental perspective.
How can it be resolved?
EA lives in the future
"What will we do now?" should become "Finally we can concentrate on our core purpose".
EA was never supposed to be so deeply involved in the minutiae of individual solutions and designs. Remember the definition: "structure IT projects and policies" and "stay on top of industry trends and disruptions" is the role... not choosing application servers and getting involved in GitHub versus GitLab, or Google GKE versus OpenShift, or Postgres versus MariaDB versus NoSQL alternatives. You can't hope to make generalised product, solution or tooling decisions for the hundreds or thousands of projects in the organisation.
They are supposed to be the far-sighted explorers. What comes after containerisation and FaaS? Is something better than Git being worked on, and how would it improve our source control? What paradigms will result from the ML revolution, and is there really a case for blockchain in our accounting work... etc.
From this work they can inform, with confidence, important strategic decisions such as the standards and protocols that teams should implement in order to be compatible with other teams and the world at large.
The focal length of EA work is supposed to be 3-10 years!
DevOps Product Teams live in the here and now
Product teams have immediate concerns. They have features to deliver in weeks or even days, bugs to fix in hours, visions to deliver with a focal length of 0-18 months. In a truly agile world they are working with needs and designs that could and should change week by week.
But they will do so knowing that they don't need to decide whether to support REST or GraphQL for their API endpoint, because that's already an EA-defined STANDARD for the organisation. THIS IS THE POINT OF ALIGNMENT.
Whether the team uses Golang or Java, Kubernetes or Cloud Run, CloudSQL or Datastore is not important to the enterprise - what matters... truly matters... is the protocols and standards of the endpoint, so that other teams and consumers can interact with it.
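To make that split concrete, here is a minimal sketch in Python of what such a check might look like. Everything in it is an assumption made up for illustration: the EndpointStandard and Service classes, the field names, the choice of REST and JSON as the enterprise standard, and the three example services. The only point it tries to show is that the check reads the consumer-facing fields and never looks at the team's language, runtime or datastore.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointStandard:
    """Hypothetical EA-owned standard: only what consumers can observe."""
    api_style: str
    payload_format: str


@dataclass(frozen=True)
class Service:
    """Hypothetical description of one product team's service."""
    name: str
    # Consumer-facing: governed by the enterprise standard.
    api_style: str
    payload_format: str
    # Team-internal: the product team's own decision, never checked.
    language: str
    runtime: str
    datastore: str


def violations(service: Service, standard: EndpointStandard) -> list[str]:
    """Return a list of mismatches against the standard (empty if compliant)."""
    problems = []
    if service.api_style != standard.api_style:
        problems.append(
            f"{service.name}: api_style is {service.api_style}, "
            f"standard is {standard.api_style}"
        )
    if service.payload_format != standard.payload_format:
        problems.append(
            f"{service.name}: payload_format is {service.payload_format}, "
            f"standard is {standard.payload_format}"
        )
    return problems


# Assumed standard for this sketch; a real organisation picks its own.
STANDARD = EndpointStandard(api_style="REST", payload_format="JSON")

# Two teams with completely different stacks, both aligned at the endpoint.
orders = Service("orders", "REST", "JSON",
                 language="Go", runtime="Kubernetes", datastore="CloudSQL")
pricing = Service("pricing", "REST", "JSON",
                  language="Java", runtime="Cloud Run", datastore="Datastore")
# Same stack as "orders", but it breaks the endpoint standard.
search = Service("search", "GraphQL", "JSON",
                 language="Go", runtime="Kubernetes", datastore="CloudSQL")

for svc in (orders, pricing, search):
    print(svc.name, violations(svc, STANDARD) or "compliant")
```

In this sketch orders and pricing both pass despite sharing nothing below the endpoint, while search fails even though its stack is identical to orders. That is the whole argument in miniature: alignment lives at the interface, not in the tooling.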
Separating the Vendor / Product from the Technology
And this is what it essentially boils down to, right?
Whereas 10 years ago you could perhaps choose a vendor (because there were only one or two to choose from) and perhaps even be fairly confident it would be the best choice for a few years, that is no longer the case.
How many different container orchestrators popped up in recent years? Each one might have been "best" for a period of 6 months at most. Competition is high and binding yourself to a specific solution will always lead to remorse down the line.
But what remained constant was that containers and Kubernetes underpinned most of them, and that's what I as a product team member would hope that Enterprise Architecture would discover for me. Not that we should specifically use OpenShift in all cases, as that would have painted us into a very bad corner.
The second aspect is that it is extremely rare for a specific vendor or product to be available in ALL cloud providers in ALL regions and jurisdictions. I'm kinda screwed if I am pinned to specifically GKE but want to run in China. But if I have stuck to vanilla(ish) K8s I am golden and can run in any cloud as they will all provide the TECHNOLOGY via different providers.
Oh God! Here comes the car analogy
Well, not quite - in fact a Traffic Authority analogy.
Every country needs a traffic authority who will define the general rules of the road.
Drive on the Right
Take Roundabouts counter-clockwise
Red means stop and green means go
Vehicles should have more than X and fewer than Y wheels
Vehicles must have certain safety equipment
The traffic authority may even provide data and input to decisions about roads that need to be built
They do this to provide alignment - to ensure that when we meet, we will not crash into each other and die!
Yet they manage to do it without:
telling drivers which brand of car they must buy
demanding road builders can only use Caterpillar diggers
or insisting that we must use Asphalt Supplier A for the roads
So, my EA friends... If I am a young guy looking to impress girls (a frontend) I will probably use a hatchback. If I am a delivery driver (API) I might use a diesel, and if I am a haulage company (Machine Learning) I am going to need a big old truck!