• Ran Isenberg

Cloud Platform Engineering - Supercharge Your Development



This blog discusses how creating a cloud platform engineering ('CPE') team in your organization will accelerate SaaS development, reduce organizational waste and improve knowledge sharing.

As an architect in the cloud platform engineering group at CyberArk, I'll share our journey over the last three years: what we do, how we reduce organizational waste (and turn it around), the challenges we face, and how we overcome them.

 

Developing A SaaS Application - The Initial Problems

Building a SaaS application is not an easy task.

When organizations build multiple SaaS offerings, several teams handle the research and development. The teams face similar challenges and questions regarding cloud infrastructure capabilities.

How do you deploy to the cloud?

How do you handle logging and observability?

What are the security best practices?

How do you maintain tenant isolation? And there are many more.


Each SaaS application requires the same cloud infrastructure capabilities that make it production-ready. These capabilities are unrelated to a specific business domain.


Different teams often develop these cloud infrastructure capabilities concurrently as they continue their journey to a production-ready SaaS application.

This development may lead to multiple solutions, perhaps even different technology stacks, all within the same organization, resulting in gross organizational waste.


What if I told you that you could reduce this waste and turn it into a catalyst for innovation, organizational knowledge sharing, and development acceleration?

 

What Is Cloud Platform Engineering (CPE)

First, in a nutshell, the CPE group's purpose is to reduce the cognitive load on other teams and developers in the organization, become a source of knowledge, and help them focus on their business logic. It provides the other teams with the services, SDKs, knowledge, and tech stack required to start a new SaaS application from scratch with relative ease.

Second, it's the most exciting place to be in a company since you deal with the latest technologies, face many challenges, and get rewarded with vast organizational influence.


Third, it's not a DevOps group. I'd argue that in the SaaS and Serverless world, with Infrastructure as Code deployment capabilities, all developers are also DevOps engineers who need to define their AWS Lambda functions, roles, and databases as part of their regular work. In addition, as you will see below, the CPE group builds Serverless applications; some are external customer-facing and require software engineers, like most standard SaaS applications.


CPE's Responsibilities

The CPE group has both internal customers (in-house developers from other teams) and "regular" external company customers.

The responsibilities include:

  1. Maintain production environments of external customer-oriented SaaS applications that may also serve internal teams (data lakes, shared user interface, customer onboarding, etc.).

  2. Maintain production environments of internal SaaS applications that provide value to other teams, such as authentication, authorization, and observability services.

  3. Create SDK libraries consumed by all internal services - tenant isolation, logging, feature flags, etc.

  4. Provide Infrastructure as Code reusable templates (CDK constructs etc.) that allow internal teams to deploy cloud resources configured with the best security practices.

  5. Create cloud-native best practices guides and SaaS training plans, and provide consultation and guidance for new teams just starting their SaaS journey.
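To make the SDK item more concrete, here is a minimal sketch of what a tenant isolation helper could look like. It is not the CPE SDK: the function name `build_tenant_session_policy`, the table ARN, and the assumption that each item's partition key equals the tenant ID are all hypothetical. It builds an IAM session policy that uses the `dynamodb:LeadingKeys` condition key to limit access to a single tenant's items.

```python
import json


def build_tenant_session_policy(tenant_id: str, table_arn: str) -> str:
    """Return an IAM session policy (as JSON) scoped to one tenant's items.

    Assumption: the DynamoDB partition key of every item is the tenant ID.
    """
    # Reject empty IDs and characters that could widen the policy.
    if not tenant_id or any(ch in tenant_id for ch in "*?$"):
        raise ValueError("tenant_id must be non-empty and free of wildcards")

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "dynamodb:GetItem",
                    "dynamodb:PutItem",
                    "dynamodb:Query",
                ],
                "Resource": [table_arn],
                # Only items whose partition key equals the tenant ID match.
                "Condition": {
                    "ForAllValues:StringEquals": {
                        "dynamodb:LeadingKeys": [tenant_id]
                    }
                },
            }
        ],
    }
    return json.dumps(policy)
```

The returned JSON could be passed as the `Policy` argument when assuming a role through AWS STS, so the temporary credentials can reach only that tenant's items. Keeping this logic in one shared library means every team gets the same isolation behavior without researching it separately.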


As you can infer from the list above, these responsibilities provide solutions to problems that every production-ready SaaS application requires.

These solutions have ONE owner, the CPE group.

Other teams can focus on their business logic and accelerate their road to the SaaS application GA.

 

How Did It Start For Us


CyberArk is a global leader in Identity security and privileged access management.

It was founded in 1999 and employs over 2300 people across the globe at the time of writing.

The company's line of products has long been an on-premises solution.

There has been a global shift to SaaS-based products in recent years, and CyberArk is no stranger to this shift.

Several teams have started to build SaaS products and learn on the go.

They delivered excellent products. However, since there was no CPE group at the time, they built multiple solutions and infrastructures for the same modern SaaS application problems.

These solutions increased organizational waste and led to multiple tech stacks.

Several teams separately researched tenant isolation or developed a feature flags mechanism for AWS Lambda functions.

All in all, it was a waste. The teams should have shared knowledge better, and one owner should have owned these non-business-logic cloud infrastructure capabilities.


Enter Cloud Platform Engineering.

So the CPE group was created, and I joined as soon as I could.

 

Initial Goals

The idea was to find the best solutions as the entire company would depend on us.

Our technology stack consisted of Python-based AWS Lambda functions and AWS Serverless technologies.

In addition, from a relatively early stage of the group, CyberArk decided that some group members would develop a new business SaaS application and consume the CPE services from the get-go.

This approach helps the CPE group understand the real needs and pains of a SaaS application, and it helps persuade other teams in the company that there's value in CPE's work and that they should integrate with it ASAP.

 

Why Is CPE Good For Us (And You!)


Today, when a team starts a new Serverless SaaS application at CyberArk, it faces two main challenges:

  1. A new programming language, Python (they usually come from C++)

  2. Cloud & AWS Serverless development is unfamiliar territory

Instead of learning it all by themselves from scratch, they use CPE's SaaS workshop training and its guidance all along the way.

The workshop covers AWS fundamentals, the CPE's SDK and services usage, best practices for AWS Serverless, and Python tips & tricks.

Near the end of the workshop, the team creates a new GitHub repository for its new service from CPE's Serverless Application template and learns how to write tests, deploy new code to AWS, and debug in the cloud. Once completed, they should have all the tools required to start working in the cloud with the company's tech stack and CPE's tools.


Of course, when questions or dilemmas arise, the team can consult CPE members.

The CPE members must provide help and act as advocates for their utilities and methods. Otherwise, nobody would use them.

Two Clicks For a New Serverless Application

So what is this magical CPE Serverless template?

It's a GitHub template project that you can use to create your Serverless application. However, it's not a standard template: you get a project with Infrastructure as Code (CDK v2) that deploys an AWS API Gateway and several AWS Lambda functions that together form a simple CRUD REST API.

In addition, the AWS Lambda handlers use all the best practices and CPE SDKs such as logging, observability, input validation, and more.
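As a rough illustration of the handler style described above, here is a minimal sketch of a "create" handler with input validation and logging. It uses only the Python standard library so that it stands alone; the actual template relies on CPE SDKs instead. The names `CreateOrderRequest`, `parse_request`, and the `orders` logger are hypothetical, and the event shape assumes an API Gateway proxy integration that passes the request payload in `event["body"]`.

```python
import json
import logging
from dataclasses import dataclass

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)


@dataclass(frozen=True)
class CreateOrderRequest:
    customer_name: str
    item_count: int


def parse_request(body) -> CreateOrderRequest:
    """Validate the raw request body and return a typed request object."""
    data = json.loads(body or "{}")  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("body must be a JSON object")
    name = data.get("customer_name")
    count = data.get("item_count")
    if not isinstance(name, str) or not name.strip():
        raise ValueError("customer_name must be a non-empty string")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(count, bool) or not isinstance(count, int) or count < 1:
        raise ValueError("item_count must be a positive integer")
    return CreateOrderRequest(customer_name=name.strip(), item_count=count)


def lambda_handler(event: dict, context) -> dict:
    """Handle the 'create' call of a simple CRUD API."""
    try:
        request = parse_request(event.get("body"))
    except ValueError as exc:
        logger.warning("rejected request: %s", exc)
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}

    logger.info("creating order for %s", request.customer_name)
    # Business logic (e.g. writing to a database) would go here.
    response = {
        "customer_name": request.customer_name,
        "item_count": request.item_count,
    }
    return {"statusCode": 200, "body": json.dumps(response)}
```

Validating at the edge of the handler keeps malformed input away from the business logic, and returning a 400 with a clear message gives API consumers something actionable.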


In case you missed it, I've created an open-source version of this project.

You can find it here.

You can find a blog describing it here and the official documentation here.


This project gives the new team a massive jumpstart in the AWS Serverless world and ensures they get the basics right. From there, they can focus on their business requirements and develop their application according to CPE guidelines.


 

Sounds Great On Paper, But Does It Work?

Yes!

One team of non-Python developers who had undergone this training and used the template was able to deliver a product blazingly fast.

In just four months, they were able to create a new Serverless application and reach the design partners stage, meaning real customers evaluated their service in production.

In an enterprise as large as CyberArk, this was unheard of!


 

The CPE Challenges

CPE's customers are both internal (CyberArk developers) and external (CyberArk's paying customers). These two audiences bring an extra layer of complexity to the CPE team and to my job as the CPE System Architect.

In addition, the CPE does not offer just one SaaS service. Its service portfolio is complex and unique as it contains several SaaS services, internal SDKs, and shared Infrastructure as Code resources.


The CPE team faces the following challenges:

  1. Build and maintain production-level services. Handle customer cases, bugs, and production errors and improve SaaS resilience. Just like any regular SaaS application team.

  2. Earn the trust of other teams in the organization as an advisor and a source of knowledge. If nobody uses your tools, it's as if you never existed. Provide value or disappear.

  3. Help new teams to get to production and integrate with CPE's services.

  4. Provide an excellent developer experience and a service-oriented mentality for both internal and external customers. The CPE team deals with internal customers, i.e., CyberArk developers, more often than with external ones. It must help as soon as possible (keep a 'firefighter' developer ready for critical issues) and be open to feedback and criticism. The CPE's job is to keep its internal customers happy: happy customers use its products, while unhappy customers branch out and develop competing solutions, contributing to organizational waste.

  5. Avoid becoming a development bottleneck. Since all teams use the same internal CPE SDKs and services, many feature requests and bugs can set back teams until CPE handles them.

  6. Share information and knowledge across the organization in an efficient way. If you develop a new feature, and nobody is using it because they are unaware of it, it has no value to the organization.


How We Overcome These Challenges


CPE challenges demand an internal change

Internal Customer Feedback


Developers don't like feeling left out or that their opinions are not considered.

Feedback is key. Nobody likes to be handed a bunch of solutions and SDKs and be told, "That's how we work; deal with it." Communication needs to go both ways, so that people's opinions are heard and considered when designing a new SDK or service.


In the CPE team, we gather requirements from all parties. We publish a feature design document and ask for feedback. We make sure it answers everybody's needs.

We don't solve only our own issues or a specific internal customer's issue; we produce solutions that the entire organization uses.

The solutions and discussion are written and documented in Confluence for future reference. Documentation can assist in convincing new teams in the future that the design is solid and that CPE took all alternatives into account.

Gathering feedback in the design stage is one thing, but it's even more critical to gather feedback once the feature is released. I do so by asking my fellow architects or by sending an internal questionnaire.

You need to understand whether the customers are using your solutions, why they are not using some of them, and how satisfied they are.



Open Source Mentality

We started to treat our repositories and the code as internal open-source code.

This shows in several ways:

  1. We encourage developers from the entire organization to donate code, open pull requests, and fix bugs that the CPE cannot get to quickly enough. It requires a new organizational culture as managers from other teams need to provide time for their developers to donate code to the CPE. This is done in the belief that they help the entire company and advance not just their service but the organization as a whole. Once the code is merged, it belongs to the CPE, and it is up to the CPE team to maintain it. Therefore, these donations are coordinated with the CPE architects and team leaders.

  2. Improve visibility of changes - each repository now uses GitHub releases to publish new features and bug fixes. Each new release is broadcast on the team's channels made for CPE updates.


GitHub release example

Improve Documentation

You can have the best utilities and services, but nobody will use them unless they are easy to integrate with and well documented. We used to have long README files in our GitHub repositories or scattered guide pages in Confluence, and it was hard for teams to find information. We got that feedback from the teams and made a change.


We have created an internal CPE documentation hub where you can find links to every SDK and service documentation website. Each SDK/service has its own GitHub Pages-based documentation website. If you don't know what GitHub Pages documentation is, check out my Serverless template documentation here. It's straightforward to navigate, has an internal search engine, and should offer a better experience than a long README file.


It's important to note that these GitHub Pages sites don't replace other organizational tools for sharing knowledge (like Confluence) but complement them. Each repository should have a website that explains what it does, shows how to use it with extensive code examples, and links to the Confluence design pages.

CPE Advocates

Improving developer experience is part of my daily responsibilities as a System Architect in the CPE group. Documentation is excellent, and GitHub releases are fantastic too, but in the end, people try to use the code and might encounter difficulties. In such cases, it's essential to have dedicated CPE members that help other teams to integrate the CPE services and SDKs.


 

How Do You Decide That Something Belongs To CPE

If it's a feature that every business SaaS application requires, it belongs to CPE.

In this category, you can find logging, observability, tenant isolation, etc.

If there's a shared use case between several products/teams, it might be a case for a shared library/service. However, sometimes, the decision is more complex.

For example, several services require a mechanism to deploy resources to customer AWS accounts. Each service can develop and maintain such a capability, leading to organizational waste as all services create the same technical solution.

A better solution would be to let the CPE build and maintain a capability that fits all the services' requirements.

To remove CPE as a bottleneck, in some cases, other teams that have a tighter schedule will start to develop the platform service themselves (with guidance from CPE's architects). Once completed, the service moves to CPE's full responsibility.


 

Where Are We Now

Every new SaaS application starts its journey with CPE training, SDK, documentation, and guidance.

The CPE group has received the organization's mark of approval and quality.

We continue to work hard to keep and maintain the trust the organization has given us.

We keep on improving our documentation and reducing the integration effort.

We keep learning new technologies and AWS services and update our recommendations accordingly.


We receive code donations and PRs on a weekly basis. Feature requests and discussions usually develop into a code donation. The contributors know that other teams will benefit from their work, and the best part is they don't need to maintain the code once it's merged.


 

When Should You Start A CPE Group

This section is strictly my own opinion.

The rule of thumb here is that the larger the company is, the sooner it requires a CPE group as the organizational waste gets larger and larger.

  • Small organization/startup - Where everybody does everything, there might be just one product and less waste. However, you can start an ad-hoc dedicated squad that provides a boost on the required infrastructure and SDKs and is later dismantled.

  • Medium/Large company that wishes to start its first SaaS application - create a CPE.

  • Company already in the cloud, with numerous teams developing multiple SaaS applications - create a CPE.


 


Suppose you've decided to start a CPE group in your organization; great!

Heed these final takeaways, as they are the most important:

  1. As a critical player in the organization, it's crucial to keep the highest quality, whether in documentation, research, or implementation. If the CPE breaks, all the SaaS applications will break too.

  2. Feedback, feedback, and feedback again. Make everybody feel like a valued partner of the CPE. Get the design right or nobody will use it.

  3. The cloud is constantly changing. Be on the lookout for new AWS services, improvements, and new SDKs. You can lead a massive change in your company.


Good luck with your CPE endeavors!

For more questions, feel free to use the contact me form.




