Bottlenecks in Team Topologies: Overcoming the Inevitable with Satinderpal Sikh
I had an amazing discussion with my good friend and the AVP and Chief Architect of Cornerstone OnDemand, Satinderpal Sikh. We dive deep into four main challenges when converting from monolith to microservices.
Now here is the secret bonus in the talk. It is really not just about how to get to a microservices architecture, but also about how to create an org design where infrastructure services do not bottleneck the groups that rely on them.
Satinder describes how to make team topologies practical. He has thought a lot about, and has successfully implemented, a team that innersources and scales.
00:00 - From Monolith to Microservices.
15:00 - Merging QA/engineers/SRE teams.
16:30 - Creating infrastructure build/frameworks to speed engineering builds.
17:30 - Creating a microservices infra team.
18:15 - Making microservices is hard. Implementing a Devkit to automate common engineering practices.
20:15 - Why build the tools? Accelerating microservice development.
21:30 - Revisiting the journey from monolith to microservices.
24:00 - Scaling the platform team - how to get 10 people to support 600 engineers.
26:45 - What to do when the platform team becomes a bottleneck, and how to scale it.
28:00 - When offering new platform services, how do you develop a roadmap for the platform team?
29:00 - The first challenge: Conflicts within the platform team roadmap. The solution: Use Insourcing to remove the conflict.
31:00 - The second challenge: many people are contributing code. The solution: Use PR approvers. But now, the platform team again becomes a bottleneck.
32:00 - The third challenge: Relieving the second bottleneck. The solution: Outsourcing PR approval.
36:20 - Getting architecture aligned.
37:00 - Developer CoP.
39:00 - A quick summary of what we’ve covered so far.
40:00 - Addressing non-functional requirements from others: quality, data sovereignty, and security.
41:45 - The fourth challenge and the next bottleneck: security and compliance.
43:15 - The solution - Security Champions.
44:30 - Approving, training, and revoking security champions.
47:00 - Preventing IAM policy bottlenecks: breaking it up into two types and automating IAM policy approval.
So we're going to talk about Innersource and how it works, how it should work, and what you've learned both good and bad. Before we get into it, can you tell me a little about Cornerstone, where it was before you started and where you were trying to get to?
Cornerstone provides an enterprise SaaS solution, and has been doing that for the last 21 years. We are an HCM (Human Capital Management) solution provider. We are pioneers in learning management systems.
Many of the Fortune 500 companies are our customers. We deal with massive traffic. There are peaks, right? When the hiring season is going on, or when performance appraisals are happening, or reviews are happening... there are certain customers where hiring is insane throughout the year. So there is a massive load on our system.
We are a global solution. We have traffic in about 196 countries and in about 42 different languages, and we have customers throughout the globe, primarily in the U.S. and in Europe, where we are the pioneers. So that's what Cornerstone does.
When I joined Cornerstone about five years ago, we had a very monolithic architecture. A typical monolithic architecture: web layer, middle tier, and a SQL Server database. But ever since I joined, we introduced concepts of microservices and breaking things down. We knew the kind of traffic we wanted to grow to, and the monolith was not the answer.
It had started to reach a point where we needed to start breaking things down based on how complex certain parts of the monolith are, or how resource-consuming certain parts of the monolith are. This allows us to grow those areas of the system independently.
About three or, I want to say, four years ago, we announced publicly that any net new development that we are going to do, or any major enhancements that we are going to do in our product are all going to be on microservices.
Now, why did we start our journey with microservices? We introduced caching, we introduced messaging. We started leveraging different tech stacks. One thing we started to struggle with was when we started to talk about introducing containers. When we wanted to introduce containers, we're like, wait a minute, we can't end up building every single technology stack within our data center, which is where we started putting a lot of emphasis on moving to cloud and doing everything cloud-first, which would give us a lot of these services managed out-of-the-box, which saves us a lot of time and helps tremendously with our level of productivity.
So we are not spending time waiting on a separate team to spin things up for us. Five years ago, we were separate teams. Operations was a separate team. QA was a separate team. Engineering was a separate team. There was this traditional concept of hand-offs: you do something and then you hand off to another team.
When we started our journey with cloud, we were very clear. We wanted to dramatically increase our developer productivity, which we saw when we started moving to the cloud. So, fast-forward five years, we have started breaking down a lot of the monolith into microservices. We are using almost all of the managed services of AWS.
We are very high on security and compliance. A lot of our customers are in different verticals: call it banking, call it finance, call it European, call it what you may. Even healthcare. We are serving every single vertical.
By being in every single vertical, security and compliance become a big thing; this is where cloud, I believe, is helping us also, because a lot of these services come with high security and compliance.
A lot of controls are pre-applied, which saves us a lot of time when we are doing development or when we are choosing a new tech to be introduced in our stack. So that's where we are right now, breaking things down into microservices.
And so how many developers when you started this process? How many developers now, approximately?
So when we started, when I joined Cornerstone, I want to say that globally, engineering alone was about 300-400 people. And then we went through an acquisition last year, which [inaudible] over 500, and now our global engineering footprint is about 1,200 engineers.
Okay. So you had separate teams: QA, ops, and engineering. When you were starting to make this transition, how did you think about those different groups? Were they organized as just one QA group, one ops group, one engineering group? And how did that look with that number of engineers, if that was the case, or were there different groupings? What was the good news? What was the bad news about all that stuff?
Right. So let's talk about the bad news first. If an engineer wanted to spin something up, or even POC a brand new tech, they were always dependent on the infrastructure team or the tech ops team to provide them a virtual machine, which, because those teams have their own set of priorities, would get delayed. I'm not kidding. It would take anywhere between two and three months to get a machine. A typical problem in a company, and this was not something new. Where I came from (I was with you at Sinema), we had a similar problem.
One way to solve this problem was for the tech ops team to automate everything for us, such that if I want a VM, I can spin one up whenever I want. But I still did not have much control over the size of the VM: how much memory I need, how much CPU I need, what will work, what will not work. And not having access to most of the environment is bad news, right? Developers do not like that.
If they have even an idea in their head, they want to try it out right away. So that was one part of the bad news, on the tech ops side. On the QA side, the bad news is something you know if you have worked in the industry: even though we say QA is responsible for testing, there is still a disconnect between what the developer is thinking while writing the code and what QA is thinking when doing the QA. Very often there are certain use cases left out, because a developer knows in his or her head why that use case was written, but QA may not know about that use case. Invariably, we may end up missing testing that use case, leaving a quality hole in the product that is delivered to production.
But let's talk about the good news in all this. Like I said before, at Cornerstone security and compliance is very, very heavy. So from that perspective, the good news was that, as an engineer, I did not have to worry too much about securing or applying all the compliance controls on any tech stack that I was using. Don't get me wrong. I do not mean to say that an engineer is not responsible for writing secure code; the code that you're writing must still be secure. But security and compliance go beyond writing secure code. It's: who has access to the production environment? How long do they have it? How are we auditing? How are we protecting our infrastructure from a malicious employee or user? All that was taken care of by the technology operations team, including the documentation and the SOPs for that particular technology, which the engineers did not have to worry about. All they had to do was write code.
So that was the good news.
Go ahead, go ahead. Sorry, please go ahead. So an engineer writes code, they deploy it to the servers, and it's magic for them. The tech ops team runs it. Right. But what happens in this old model?
The problem with this old model is that if anything goes wrong, the engineer is the last person to know.
The flip side is that engineers should be the first to know if something went wrong with their code. When something goes wrong, there is a NOC team, there is an SRE team, and they are trying to understand what went wrong. Then they are trying to page like 20 different people, trying to get hold of somebody who can help solve the problem.
Yes, we have on-call engineers, everything. But you wake up an on-call engineer, and that person looks at the code and says, oh wait, this is not our module's problem, it is somebody else's problem. Then the other person says it is somebody else's problem. I will not say that happens as frequently if there is an SRE team that has spent enough time looking at or monitoring the system; they know, not exactly, but they know where the problem areas are.
But still, a lot of the time we have this problem of which on-call engineer I'm waking up. So the problem is, as an engineer, you don't know what went into production, what the configuration of that production box was, which parameters were enabled or not enabled. All those things are not known to an engineer, which creates a big hole, in my opinion.
So I have been a very firm believer: you build it, you own it. Nobody knows better than the engineer how this code should function. And when I say how this code should function, I do not mean in terms of functionality alone, but also in terms of how performant it is, whether it is secure or not, whether static analysis was run, whether dynamic analysis was run, whether my code does proper encoding of the query parameters that go through. Every single thing. If, as an engineer, I have my lens into the production environment, I will be able to solve the problems much better compared to somebody else telling me what the problem is.
Okay. So now you start out with this model. If you start out with separate teams for QA, ops, and engineering, and you want to move to this microservices strategy, this strategy of engineers build it, they own it, talk to me about the steps that you go through to do that.
Right. That was an interesting journey. When we started, our first step was to pick a product. When we decided that we wanted to be on the cloud, we said, okay, we want to see if cloud works for us. And when I say "works for us", I do not mean whether we will be able to deploy or run in the cloud. I mean: in an organization where security and compliance is so heavy, where we have so many separate teams, how will we bring all of them onto the same page when we are running in the cloud? How will we build that culture of ownership and accountability? The culture had to be built right from scratch.
So what we did was say, okay, we are going to pick a product, a real product, and we are going to build that product from scratch in the cloud, using as many managed services as possible, and try to understand what the journey looks like before we go to the rest of the organization. So we started building in the cloud. We started using infrastructure as code. We started using all the good practices, if you want to say, of building in the cloud, following the best DevOps practices. All of those we implemented, and we started seeing the results. Something that we used to release to production every three months, we started delivering every [inaudible]. If there was a bug, we could fix the bug that day itself without anybody knowing.
We started showing this to our product team. Mind you, it's not just the engineering team, the product team also. When we started showing them the power of owning something that you built, the power of building the DevOps way, building the CI/CD pipeline, it was like an eye-opener for a lot of people in the organization. Which is where we started our journey to say: okay, wait a minute. If this is what we want, then the full ownership should be with the team that is building the product, which includes the pipeline, the DevOps aspect, the ownership of NOC, what have you, all of those.
That was our turning point, where we merged three organizations into one engineering organization. We merged our tech operations, we merged our QA, and we merged our engineering. And then within engineering we split engineering into different product groups, which we started calling domains, following Martin Fowler, and we started putting a lot of emphasis on domain-driven design, domain-driven thinking, and domain-driven development. Merging that organization was our first step, our first foray into where we were headed.
Okay. So I get that with a new product. You still have this old monolith, right? And you need to break it apart. Talk to me about how you start thinking about that. I'm interested in how you start breaking the architecture components apart, but also in how you stay unconstrained when you say, hey, I've got this model and now I'm going to build this infrastructure and the microservices on top of it.
How does that infrastructure build not constrain the microservices? So if you could talk about those two things.
Right. So let's talk first about our microservices journey and how we started it. The first thing with microservices, when we started building, was that we cannot have every single team build certain framework aspects.
Like, we wanted to make sure all our libraries go through a circuit breaker pattern. There was the concept of service registry and service discovery. There was a caching library available out of the box, logging available out of the box, instrumentation. If I'm writing a microservice, my microservice must be instrumented from day one; as an engineer, I should not worry about those things.
So first we laid down our microservices infrastructure team, who started building all these pieces together. Back in the day, four, four and a half years ago, we started building what we call a Visual Studio template. We wrote a VS template which would spin up a microservice that comes bundled with all of these things together.
That is caching, service discovery, service registry, and circuit breaker libraries. We also built what we call the domain client, which was responsible for load balancing and for fault tolerance. So we built all of these components and started building the templates, such that any team that wants to write a microservice uses this template.
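To make the circuit breaker idea concrete, here is a minimal sketch in Python. It is not Cornerstone's library; the class name, the thresholds, and the injectable `clock` parameter are all assumptions made for illustration. After a run of consecutive failures the breaker "opens" and skips calls until a cool-down has passed, then lets one trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    # Hypothetical defaults; real values would be tuned per dependency.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

A shared client library could wrap every outbound service call in `breaker.call(...)`, so individual teams would not have to reimplement this behavior in each microservice.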
The template spins up a microservice, and now you can write your code, but you are still responsible for merging the PR, deploying, all of this. So we automated that further with what we called the development kit. Internally we also call it Devkit. Devkit is a tool that we built which allowed engineers to go on the screen and say, okay, I want this particular domain or this product area, this is my Bitbucket repo, and here is the service I want to create.
My service could be just CRUD operations, could be CRUD operations with bulk operations, could be [inaudible]. So we built a bunch of templates: one for CRUD operations on a DynamoDB table, another for CRUD operations on an Aurora table.
The power of that tool was this: as an engineer, when I went to this tool and selected the template I wanted to build, all I had to do was punch in the name of my service and select a few things from the dropdowns, and then I hit create. This tool created a Bitbucket repo for you out of the box.
It took a template which gave you, let's say, CRUD operations out of the box, with a sample Hello World implemented in it. It took that code, pushed it, created the pull request, merged it automatically, and deployed it all the way to the development and integration environments.
And it spit out the Swagger URL for you. So as an engineer, for me, the service is ready to use in that sense. I can start coding, and everything around that infrastructure is taken care of for me for free. I don't have to do anything.
And I'll tell you why we started building this tool. We started to notice that the product team invariably would question it: wait a minute, you want me to build this enhancement in a microservice, but building the same thing in the monolith takes much less time, and in microservices it takes much longer. So we started hitting that area first. How can we accelerate development in microservices, so that nobody is coming to me and telling me, I want to write code in the monolith just because it is faster there?
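The Devkit flow described above can be sketched as an ordered plan of steps. This is only an illustration of the sequence in the talk: the template names, the repo naming scheme, the environment names, and the URL shape are hypothetical placeholders, and no real Bitbucket or AWS API is called.

```python
from dataclasses import dataclass, field

# Hypothetical template catalog; names are illustrative only.
TEMPLATES = {"crud-dynamodb", "crud-aurora"}
ENVIRONMENTS = ["development", "integration"]


@dataclass
class ScaffoldResult:
    repo: str
    steps: list = field(default_factory=list)
    swagger_url: str = ""


def scaffold_service(name: str, domain: str, template: str) -> ScaffoldResult:
    """Return the ordered plan a Devkit-like tool would carry out."""
    if template not in TEMPLATES:
        raise ValueError(f"unknown template: {template}")
    result = ScaffoldResult(repo=f"{domain}/{name}")
    result.steps.append(f"create repo {result.repo}")
    result.steps.append(f"render template {template} with sample Hello World")
    result.steps.append("push code and open pull request")
    result.steps.append("auto-merge pull request")
    for env in ENVIRONMENTS:
        result.steps.append(f"deploy to {env}")
    # Placeholder URL shape; the real one is not described in the talk.
    result.swagger_url = f"https://{name}.{ENVIRONMENTS[-1]}.example.invalid/swagger"
    return result
```

For example, `scaffold_service("assignments", "learning", "crud-dynamodb")` yields a six-step plan ending in the integration deployment, which mirrors the "punch in a name, pick from dropdowns, hit create" experience.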
So once you build a framework, the engineering aspect is easy. Everybody wanted to jump on writing microservices. Everybody wanted to write on AWS. Then came the next aspect: there still has to be a product roadmap which allows me to start carving things out of the monolith into microservices.
So like I said before, to answer the first part of your question: all new enhancements and all new development are on microservices. For the existing monolith, and we still have a monolith, we started picking areas which are very, very painful performance-wise, which take up a lot of CPU, which are very monolithic.
We started breaking those pieces out into microservices. In fact, we did not take a full-blown feature and pull it out in one go. Within that full-blown feature, we started moving small, small pieces into microservices. To give you an example, we have a lot of backend queue processing that happens in the learning software. When a training is assigned to all your people, imagine if the training is to be assigned to a hundred thousand employees at the same time: there is a lot of queue processing, and that is just one customer.
And we have 7,000. So we started taking things out of, let's say, our MSMQ, and started putting those pieces in SQS and SNS. We started leveraging cloud technologies and started writing data streams, which allowed us to start breaking things down piece by piece and start reaping the advantage. Now, that shift will only come if you can show your product team that, even though you may qualify this as a technical project, there is a huge product advantage that goes to the customer.
I have learned that unless you can show customer value, it is very hard for the product team, even for customers, to understand why we are migrating to cloud or why we are modernizing, for that matter.
A customer or a product person may come and say, hey, everything is working just fine, why do you want to change it? These victories are going to help. So even today in our monolith, we focus one by one on areas which are very complicated and break those down into microservices, or we constantly keep identifying areas where performance is bad.
Any outage, if ever it happens, we start looking into it: okay, is there an opportunity to break this down into microservices? So that's how we try to address it.
Okay.
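The training-assignment example (one training fanned out to a hundred thousand employees through a queue) can be sketched as follows. This is a hedged illustration, not Cornerstone's code: the function names and the message body fields are assumptions. The one hard fact used is that the SQS `SendMessageBatch` operation accepts at most 10 entries per request, each with an `Id` and a `MessageBody`. The sketch only builds the batches; actually sending them would need an AWS client, which is left out so the code runs anywhere.

```python
import json
from itertools import islice

SQS_MAX_BATCH = 10  # SendMessageBatch accepts at most 10 entries per request


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def assignment_batches(training_id, employee_ids):
    """Yield lists of SQS-style entries, one message per employee.

    The body fields (training_id, employee_id) are hypothetical.
    """
    for batch_no, chunk in enumerate(batched(employee_ids, SQS_MAX_BATCH)):
        yield [
            {
                "Id": f"{batch_no}-{i}",  # must be unique within a batch
                "MessageBody": json.dumps(
                    {"training_id": training_id, "employee_id": emp}
                ),
            }
            for i, emp in enumerate(chunk)
        ]
```

Because each employee becomes an independent message, consumers can scale out separately from the monolith, which is the kind of piece-by-piece extraction described above.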
So you've got this magic button you can push that creates the deployment platform. The platform team develops it for you, the developers.
How many people did you have on your platform team?
So, you know, it actually varies. Our platform team started, I believe, with about 10. It grew, at one point, to 15-ish, but we also realized we cannot stay in this model. We cannot have our platform team growing infinitely.
You can't do that. So what we started doing was introduce what we call insourcing, where we started leveraging our product development teams to contribute to the components that were otherwise going to be owned by the platform team.
Right. And this is what I want to talk about: you've got 10 or 15 people supporting 600 or 1,200.
The infrastructure team provides leverage, but they're also going to be a bottleneck. Right. So let's talk a bit about that.
Right. So, as you rightly said: number one, yes, they are a bottleneck. Number two, they will have their own priorities that they want to work on. Our components require regular maintenance too.
Then there are different domains who have different needs, with each domain team, in their own right, feeling that the feature they are asking from the platform team is the priority. So first we started building what we call the architecture community of practice, where we brought one or two architects from each domain
into a weekly meeting, which we have held for the last five years and which we call the architecture community of practice.
Let me pause you just for a second. So you've got this platform team. Did this architecture community of practice start simultaneously with the platform team, or did the platform team start first, building this magic push button?
The platform team started building first. First we laid down the platform. Until certain areas are built, we cannot unleash our platform to the entire organization. So we partnered with one domain team, and the platform team started building all the components.
Once the components are ready and one domain team has partnered, we have a car which we can drive. But the challenge comes when there is a second product team doing similar development, and the second product team needs more things out of the platform before they can start leveraging it. The third product team will need even more things.
So as we started to unleash it slowly, the platform team started to become a bottleneck. Like you rightly pointed out: platform team, limited team, bunch of priorities. They have their own features that they want to do, and the product teams have features that they want them to develop. So, a huge challenge, which is where we started forming the architecture community of practice. We said, okay, on a weekly basis we are going to meet, and besides talking about new architectures or new tech stacks
we want to bring into Cornerstone, we are also going to spend time going through the roadmap of our platform team and what the priority is right now.
Sorry. So the architecture community of practice is made up of architects from each team, and they essentially become the product owners for the platform roadmap.
That is right. And in fact, we had architects also joining from the platform team, because they have their own needs. They have to finish the pieces that they started. In that community of practice, we go through our roadmap to see which item is the priority and why. So a product person can come and say, hey, we have this customer with this ARR, or a bunch of customers, asking for this.
We need the platform team to build it. So they have to make a case for why it is a priority. That allowed us, to a degree, to prioritize. But as you can only imagine...
So let me just pause you for a second. You've got this community of practice. You've got the platform team that's going to execute. The community of practice is creating the priorities: do this, do this, do this. Now there's a conflict, which is good.
Exactly. The conflicting demands are where three product teams at the same time are able to justify that their feature is the priority, but we do not have enough people.
Great. So now you're just constrained by the platform team, not having enough bandwidth to serve everybody.
And then what do you do?
Then we go back and ask for more people, which we don't get.
Okay. So what we do is come up with this genius idea as a team. What if we start insourcing? What if we said, wait a minute: you are not less qualified to work on what the platform team is working on, you have your product need, and you are dependent on the platform team. What if we broke that dependency?
What if we allowed product team developers or engineers to contribute to the platform code? The approval of the code will still be with the platform team, but now they are writing their own changes. In the sense that if they need a feature, they can contribute to the platform component, the platform team approves the pull request, and voilà, they have their change.
Okay. Let me pause you for a second. So the platform team built the MVP. People want new features. You have the community of practice. The community of practice says, here's the order you're going to do things in. The platform team is a constraint on doing them. And you say, great, here's all the code.
It's inner source. Anyone can contribute to it. Awesome. To maintain quality, the platform team is going to review the pull requests.
That's right. Now we run into the second challenge. Now we have so many product teams, and everybody is contributing to the platform.
Now the platform team is still a bottleneck, because they are the pull request approvers. A pull request which should ideally be approved within a couple of hours could sometimes take a day or days, because they are busy, and depending on the kind of change that has been made, they want to spend some extra time trying to understand why this change was done. Or, if a brand new component or a brand new extension was written,
they want to assess the challenges that you can run into with this new component or new feature.
So we've moved the bottleneck from days to weeks to get a machine in the original model, to days to get a pull request approved. So it's better, but we still have that bottleneck. Actually, it was first days to get the machine,
then maybe weeks to get the platform team to do something. Now the platform team isn't doing all of that, and it's days to get the pull request approved. So now how do you solve that?
So to solve that problem, we said we are going to outsource even the pull request approval.
Within each product team, we have qualified engineers, architects, and principals. We are going to pick one or two based on how much they have contributed, how much understanding of the code they have shown, and whether they understand what it means to change the centralized components.
Because remember, even if product team A is making the change to the platform component, that same component may be used by product B and product C. So it is still a centralized component. So we identify people within the product team who can be pull request approvers. That means that if engineer number 26, who works in product A, is making a change, the architect of product A can approve the pull request, because that architect is a
platform champion, call it that way if you want. The pull request is approved locally within the product team. So now your entire delivery dependency is contained within your own product team.
So now, if I'm in product A and I'm a developer, and the architect on my team approves it, is that thing that was pulled
available to everybody, or do other products' architects want eyes on it before it's a generally released item?
Right. So it depends. In certain cases, let's say I wrote a template or a piece of code which is only used by my team. In that case, in your Bitbucket repo, we do not add other architects as additional reviewers or approvers. But then there are certain pieces which everybody is using.
For instance, our domain client that we wrote, which is responsible for registering your service, discovering your service, making a call to your service, and implementing a circuit breaker. It's the heartbeat of our system. Anything that goes wrong in that component can bring down the system. So this is an extreme case.
In this case, there will be one PR approver from the platform team, one from the domain team, and another one from another domain, such that multiple pairs of eyes have reviewed the code. And in some cases, there is a product architect from one team and a product architect from another team, and they are the PR approvers.
So it really depends on the component's complexity how many PR approvers we want and who those PR approvers are.
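The approval rules just described (one entrusted local approver for a team-only component; platform, owning domain, and a second domain for a critical shared component) can be written down as a small policy check. This is a sketch under stated assumptions: the `Approval` type, the `"platform"` team label, and the `trusted` set are hypothetical names, and a real setup would more likely be expressed as repository reviewer settings than as code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    approver: str
    team: str  # "platform" or a product/domain team name


def is_approved(component_critical, author_team, approvals, trusted):
    """Decide whether a pull request has enough approvals.

    trusted: names the platform team has entrusted as approvers
    (platform engineers plus the picked product-team champions).
    """
    valid = [a for a in approvals if a.approver in trusted]
    teams = {a.team for a in valid}
    if not component_critical:
        # Component used by one team only: one entrusted approver suffices.
        return len(valid) >= 1
    # Shared, critical component: platform + author's domain + another domain.
    other_domains = teams - {"platform", author_team}
    return "platform" in teams and author_team in teams and bool(other_domains)
```

Approvals from people outside the `trusted` set are simply ignored, which reflects the platform team keeping accountability for who may approve.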
Who gets to decide who approves? So the pull request gets approved by somebody. Who decides who approves the pull request: whether it's a global thing, so multiple people need to do it, or whether it's local and only one person needs to?
How does that decision happen?
So, because the platform team owns the repo, at the end of the day they are accountable. They are the ones deciding. They don't want that component to be approved by somebody whom they don't trust. Trust is a very strong word here, but if they do not trust somebody to be able to take care of certain components, because it is a most critical duty, they only pick a handful of people who are allowed to approve.
Okay. So now I have this community of architects, the architecture community of practice. And I have the platform folks. How many architects to developers, approximately, as a ratio?
We don't have a ratio per developer. We have architects per product.
Okay. Well, here's the basis of my question: how do the architects stay attuned to the needs of the developers, or is it easy for them?
When they sit in the architecture community of practice, do they know what the next thing the platform team does should be, or do they not? And if they don't, then how do you think about that?
Excellent question. That actually leads to another community of practice that we run, which we call the developer community of practice.
So like architect, community of breakfast is the product definer for the platform team. We have a developer community of practice who defines the requirements to the product, to the architect. So what every other week we have this setup of developer community of practice, which is open for all the engineers globally now, because we are a global team.
We run a couple of practice community of practices, one for the local time zone for India and one for Europe and one for us and a New Zealand geography in this demo, the craft community, a developer community. Okay. Any developer can come and present. There are challenges that they are facing with these particular platform competence, right?
In some cases, they come and present an alternate way to solve that problem, because nobody has time. Even with all this, the model that we have put together for ourselves does not mean that we always have time or that we deliver everything on time. There is still going to be the product development priority, and that always supersedes all these things.
But in this developer community of practice, an engineer comes and presents a problem statement with our existing development stack, and how they circumvented the problem or why they are not able to solve it. And it's a huge developer productivity issue. All the architects attend that developer CoP, as we call it, and that feeds into our requirements to the platform team, saying,
"Hey, look, here is a problem that all the engineers are facing that you, as the platform team, don't know about, and we don't know about either. But it surfaces when they are doing the development, or because these enhancements are missing, it takes them forever to do something." So that feeds into our platform roadmap.
Or sometimes the product team says, "Wait a minute, because now we can actively contribute, we are going to keep this in our roadmap. We are going to make this fix and get the pull request approved by the platform team."
Got it. Okay. So let me just take stock again. We went from a tech ops group, where creating VMs took forever.
We had an infrastructure group, and they became a bottleneck. We innersourced the development. The infrastructure group would do the approval of the PRs. That became a bottleneck. Sorry, we had the architecture community of practice decide what the infrastructure group would do.
The infrastructure group became a bottleneck. We innersourced it. We allowed people that we appointed to approve the pull requests. Then we had a problem with the mass of things: what the architects knew wasn't enough. And so we allowed the developers to request directly from the platform group if they couldn't do it themselves.
Great. So now we have alignment to developer needs. That's great. So that's one of our communities that we need to take care of. We've also got this big security problem. You guys operate in multiple countries, with different laws and different security requirements. There are other constituencies that need to make sure quality goes through.
It's not just quality of code. What does that look like? How does that affect what gets built and how you think?
Right. It can be as tricky as you can think, given the world we live in these days, where every country is looking for its own data sovereignty. It's a big pain, especially for companies like us in the human capital management space, where we may have PII.
Sometimes we may have SPI data, right? So data protection is our number one priority within our organization. We spend a lot of energy and effort making sure we build as secure and compliant a product as we can, which means I spend a lot of time talking to our European customers or other customers, trying to understand what problems they are trying to solve.
Understanding that problem and talking to customers allows us to know what our security and compliance roadmap should look like. Based on that roadmap, those requirements are injected either into the product roadmap or into the platform roadmap. But with security and compliance too, because it's a smaller team,
we run into the same challenge as we ran into with the platform team, right, where they become the pull request approvers and they become the bottleneck. So we leverage the same model for our security and compliance team. Say security comes and says, "Hey, we want these areas to be built into our code base."
For example, let me take the example of IAM, right? IAM is identity and access management in AWS. Any change to an IAM policy could mean somebody getting access, somebody not getting access, or somebody's access being revoked. Every IAM policy that we deploy in our cloud infrastructure today is reviewed by our security team.
We want to be very, very sure whose access is going to be revoked, or, for any new policy written, whether this policy is going to have a challenging impact on the changes we are doing. For example, a very simple policy: making an S3 bucket public, right? So for those policies, we have our security and compliance team be the approver of IAM policies.
And now we are back to square one. The security and compliance team becomes the bottleneck because they are very limited in number. IAM policy changes are happening. I'm not saying we are changing policies constantly, but we are deploying constantly, and our policies are at the service level. Each service, each Lambda function gets its own IAM policy.
So lots of reviews, very few people. So we applied the same development model to security. We identified architects or engineers, whom we call security champions. In effect, I'm a security champion in my organization. So I become an authorized approver on pull requests for IAM policies, or for reviewing any architecture through the lens of security and compliance.
From access, disaster recovery, backups, monitoring, alerting, GDPR: all those functions are built and then approved within the product team, so that they are not dependent on anybody else. So one... Go ahead. Go ahead, because it's going to shift.
Yeah. Do you have formal policies for approving someone onto a pull request reviewer team or security reviewer team?
And do you have formal policies, if someone on the team is not performing well, for how to pull them off those teams?
Yes. So first of all, we are very, very picky about who can be security champions. No offense to anybody; everybody understands security quite well, I think, but some understand it better than others.
So we are very picky about who becomes a security champion. Once you are a security champion, so far we have not seen anybody being revoked from being a security champion. That's not to say the thought has not crossed our minds. Every time we want to make somebody a security champion, we don't just say, "Okay, from tomorrow you are a security champion."
We have them approve certain IAM policies over a period of time, while the secondary approver is from security. Once they have spent enough time, security sometimes comes with the recommendation: "Hey, look, I know this person A, who has been approving IAM policy changes. In the last one month, two months, three months, I have not seen a single challenge coming in, or a single policy approved which was not within reasonable limits. Or this person has always identified issues in the policies for the team to go back and fix,
even if it means his or her own team is going to be delayed with the development. They are giving priority to security." Those parameters are assessed, and then that person is made a security champion. So far, we have not found a case where we had to revoke somebody, but if we were to revoke, these would be the same parameters that we would use:
"Look, you are failing to identify the issues," or there was a security incident and now we have to assess why it happened. Was it a miss, or did something new come up that was not identified? So there are parameters, but so far, I would say, knock on wood, we have not had to revoke anybody's security championship.
I had a question, but go ahead. Did you have something you wanted to add from the previous thing?
Yes. What I wanted to add was this. I spoke about IAM and what it is, right? With so many IAM policies, every service having its own IAM policy, it can be a lot. You are making people a bottleneck to approve your IAM policies.
Right? So what we started to do was classify IAM policies into two categories: one which is infrastructure-level IAM policy, and one which is product microservice-level policy. What we started doing, for the Bitbucket repos which were related to product development,
if there is an IAM policy inside, was to automate the approval of IAM policies by injecting our own patterns for identifying what the impact of a certain change could be. For instance, if I work in product A, and product A has a service B, and that service accesses, let's say, a DynamoDB table or an SQS queue or an SNS topic,
the access of that service must be restricted to only this DynamoDB table, this SQS queue, this SNS topic. Once you start looking into IAM policies, you realize there is a pattern. And once we started to identify that pattern, we said, "Wait a minute, we don't want every IAM policy to go to a human reviewer."
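The pattern described here, a service's policy touching only its own table, queue, and topic, can be checked mechanically. The sketch below is an assumption about what such a check might look like, not the automation Cornerstone built. It uses the standard IAM policy JSON shape (`Statement`, `Effect`, `Resource`); the function names, the example ARNs, and the account ID `123456789012` are placeholders.

```python
def _as_list(value):
    """IAM allows a single item or a list for Statement and Resource."""
    return value if isinstance(value, list) else [value]


def out_of_scope_resources(policy: dict, allowed_arns: set) -> list:
    """Return every resource an Allow statement grants outside the allowed set.

    Wildcards such as "*" are reported, because they are not exact matches.
    Statements without a Resource element (for example ones using
    NotResource) are reported too, so they always reach a human.
    """
    violations = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        if "Resource" not in stmt:
            violations.append("<no Resource element>")
            continue
        for resource in _as_list(stmt["Resource"]):
            if resource not in allowed_arns:
                violations.append(resource)
    return violations


def can_auto_approve(policy: dict, allowed_arns: set) -> bool:
    """Auto-approve only when nothing falls outside the service's own resources."""
    return not out_of_scope_resources(policy, allowed_arns)


# Hypothetical resources owned by "service B" of "product A".
SERVICE_B_ARNS = {
    "arn:aws:dynamodb:us-east-1:123456789012:table/service-b",
    "arn:aws:sqs:us-east-1:123456789012:service-b-queue",
    "arn:aws:sns:us-east-1:123456789012:service-b-topic",
}

scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/service-b",
    }],
}

broad_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "sqs:*", "Resource": "*"}],
}
```

Anything the check cannot positively classify as in scope falls through to a human approver, which keeps the automation conservative.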
We are going to automate. So we invested a good amount of time in automating our IAM policy approval. And once we were ready, we did it two-fold. We said, "Okay, a human is going to approve it. Now let me learn: if this were automated, would the automation approve it or deny it? And if it would deny it, what is the basis of the denial?"
So over a period of time, we made that algorithm better and better. And now a lot of our IAM policies are actually approved automatically as part of our pull request approval process.
So the infrastructure team... the automation was done by the infrastructure team, or...?
No, no, the automation was done by security engineering.
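The two-fold rollout mentioned above, where a human still approves while the automation's verdict is only recorded and compared, is often called shadow mode. A minimal sketch of that comparison follows; the record fields and function names are assumptions for illustration, and the talk does not say what threshold, if any, was used before the automation was trusted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowRecord:
    pull_request: str     # hypothetical PR identifier
    human_approved: bool  # the decision that actually took effect
    bot_approved: bool    # what the automation would have decided
    bot_reason: str = ""  # basis for a denial, kept for later tuning


def agreement_rate(records: list) -> float:
    """Fraction of reviews where the automation matched the human."""
    if not records:
        return 0.0
    agree = sum(1 for r in records if r.human_approved == r.bot_approved)
    return agree / len(records)


def disagreements(records: list) -> list:
    """Cases to study when improving the automation's rules."""
    return [r for r in records if r.human_approved != r.bot_approved]


history = [
    ShadowRecord("PR-1", True, True),
    ShadowRecord("PR-2", False, False, "wildcard resource"),
    ShadowRecord("PR-3", True, False, "resource not owned by service"),
    ShadowRecord("PR-4", True, True),
]
```

Reviewing the output of `disagreements(history)` is where the "what is the basis of the denial" question gets answered, one case at a time.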
We have an AppSec team who are responsible for the automation. And then on top of it, while they approved all of this, it was very clear that, being human, you may always miss something somewhere, right? You cannot possibly catch everything. It's not humanly possible.
Right? So we started writing what we call cloud bots. They started as security bots and eventually became cloud bots. These are Lambda functions that sit outside of our main AWS account, in our security and operations account. These Lambda functions run across our entire account ecosystem, looking for anomalies in every possible resource that was modified.
A modification triggers the function. And this Lambda is custom-built for Cornerstone's requirements, married to Cornerstone's compliance policies and attributes. We look for such anomalies, and we have automated the process to a point where, let's say, and I'm making up a hypothetical use case, right,
somebody created a brand new S3 bucket and forgot to mark that bucket as not public. It's not possible in our ecosystem, because we don't let you deploy an S3 bucket with a public access policy, right? But let's say somebody did that. Somebody manually went ahead and did that, not knowing what they were doing.
They thought it was a sandbox environment, but it was production, for whatever reason, right? So this cloud bot is watching for that policy. As soon as it identifies it, it creates a Jira ticket for the respective team. Based on the tags, it finds which team the bucket belongs to, creates the Jira ticket, and automatically assigns it to them.
And actually in this case, because it's an S3 bucket marked public, it will go ahead and set that policy back to private, and then, in the form of a Jira ticket, assign the ticket to the respective team to say, "Hey, look, we found this. Go and fix your bucket policy creation in your CloudFormation template, so that it is never public when you modify it next time."
If it happened accidentally, or if somebody manually modified it, we need to know why it happened. And we do the root cause analysis to try to understand what happened. But that's the framework idea. Okay.
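The cloud bot flow in that hypothetical, detect a public bucket, revert it, find the owning team from tags, and open a ticket, can be sketched without any cloud dependencies. Everything below is illustrative: a real bot would be a Lambda function reacting to resource-change events and calling the AWS and Jira APIs, whereas here the bucket and ticket are plain in-memory objects, and the `team` tag key and the fallback team name are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Bucket:
    name: str
    public: bool
    tags: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Ticket:
    team: str
    summary: str


def handle_bucket_change(bucket: Bucket,
                         fallback_team: str = "security-operations") -> Optional[Ticket]:
    """React to a modified bucket: remediate if public, then notify the owner.

    Returns None when the bucket is compliant and no action is needed.
    """
    if not bucket.public:
        return None

    # Remediate first, so exposure ends before anyone reads the ticket.
    bucket.public = False

    # Route the ticket using the bucket's tags (assumed tag key: "team").
    team = bucket.tags.get("team", fallback_team)
    return Ticket(
        team=team,
        summary=(f"Bucket {bucket.name} was public and has been set back to "
                 "private. Fix the bucket policy in your CloudFormation "
                 "template and explain how the change happened."),
    )
```

For example, `handle_bucket_change(Bucket("reports", True, {"team": "payroll"}))` returns a ticket assigned to `payroll` and leaves the bucket private. Untagged buckets fall back to a default team so no finding goes unowned.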
Well, thank you. This was very interesting: the way you take a large number of developers, empower them and unconstrain them to create infrastructure, and then build teams that can build the infrastructure yet maintain its quality.
I appreciate your time. And for those of you that don't know, Satinder and I have known each other for a long time. If you haven't caught it already, he's an amazingly brilliant guy. And so thank you.
Okay. Thank you, David, for giving me the opportunity to share our experiences. If anybody has any other questions, feel free to reach out to me on LinkedIn or through David.
Happy to share more information. Great. Thank you. Have a good one.