Cloud and infrastructure organizations like AWS, Azure, and GCP offer various opportunities for Technical Program Managers (TPMs) across multiple services—Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS). While IaaS roles in compute, storage, and networking are prominent, TPMs can explore PaaS and SaaS services, such as AWS’s fully managed RDS or its SaaS offerings like Desktop as a Service.

TPMs are crucial in areas like data center operations, capacity planning, and product development, ensuring efficiency and driving service growth. Whether supporting internal platform teams or specializing in areas like security and machine learning, TPMs play a key role in launching new services and maintaining essential infrastructure. In many organizations, specialized TPM teams (for example, in security, analytics, or machine learning) work across different stages of development and manage the complexity that comes with that work.

Strategic Cloud Platform Providers

Image from – Gartner 2024 “Strategic Cloud Platform Providers”

The image above is Gartner’s 2024 “Strategic Cloud Platform Providers” chart. It shows how each cloud hyperscaler is positioned, ranking the major cloud service providers by their completeness of vision and ability to execute.

At most cloud organizations, IaaS is the core of the business, but they also offer several PaaS and SaaS products.

See also: Cloud Computing for Managers (TPMs, PMs, SDMs)

IaaS Services

Let’s deep dive into the IaaS side of the business and see how we can break it down. Taking AWS as an example again, Compute, Storage, and Networking are the three main pillars of IaaS. 

There are also additional product development teams like Security, Identity & Compliance, Developer Tools, Management and Governance (CloudFormation), Analytics (Elastic MapReduce), Machine Learning, IoT, Application Integration (SQS/SNS), Media Services, and Game Development (GameLift). AWS has over 200 products, and many TPMs work with these teams.

Product Development Teams

Think of these teams as you would any other team at a product-based organization. Depending on the development team’s size, product maturity, and other factors, a team can have anywhere from a few hundred to several thousand developers, with a corresponding number of TPMs (approximately 1 TPM for every 30 engineers). Internal platform teams are also responsible for metering and billing, automation and orchestration, security and compliance, etc.
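To make the ratio concrete, here is a minimal sketch of the arithmetic. The function name and the team sizes are hypothetical, and the 30-to-1 figure is only the approximate ratio quoted above, not a fixed rule.

```python
import math


def estimated_tpm_count(engineers: int, engineers_per_tpm: int = 30) -> int:
    """Rough TPM headcount for a team, rounded up so no group is left uncovered."""
    if engineers < 0 or engineers_per_tpm <= 0:
        raise ValueError("engineers must be >= 0 and engineers_per_tpm must be > 0")
    return math.ceil(engineers / engineers_per_tpm)


# Hypothetical team sizes, from a few hundred to several thousand developers.
for size in (300, 1000, 3000):
    print(size, "engineers ->", estimated_tpm_count(size), "TPMs")
# 300 engineers -> 10 TPMs
# 1000 engineers -> 34 TPMs
# 3000 engineers -> 100 TPMs
```

In practice the ratio shifts with product maturity and team structure, so treat the output as a starting estimate rather than a staffing plan.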

IaaS Support Teams 

Outside of the IaaS product and service teams, there are a few more teams that have a significantly higher TPM-to-engineer ratio. Some of these teams are very TPM-heavy, with TPMs making up as much as 70% of the total workgroup. Here are some examples:

Data Center Operations

The Data Center Operations TPMs are responsible for operational excellence and maintenance of Data Centers. They own service uptime, efficiency, and security of the physical facilities. Their overall goal is to reduce the number of incidents. 

This group is supported by a large number of TPMs who own all the metrics, communication plans, runbooks for escalation, and interactions with hundreds of product teams to ensure they are operating within the set guidelines. They also own the collection of Root Cause Analyses (RCAs) across multiple organizations, hold the teams accountable for their corrective actions, and run weekly, monthly, and quarterly metric review meetings. The TPMs here also work on setting standard operating procedures for improving inter-team communication.

The DevOps engineers on this team manage the day-to-day operations of data centers worldwide. AWS and most other hyperscalers have an SVP who owns service availability and reliability across all cloud products.

Metering & Billing

The team is responsible for ensuring accurate and reliable measurement of resource usage and for generating the corresponding billing information for customers.

TPMs engage with teams across the organization to ensure new offerings and new products have the right metering and billing parameters.
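As a rough illustration of what "the right metering and billing parameters" can mean, the sketch below turns usage records into per-customer charges and refuses to bill a meter that has no rate defined. The meter names, prices, and record layout are all hypothetical and are not taken from any provider's real billing system.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical price list: meter name -> price per unit of usage.
RATES = {
    "compute.vm_hours": Decimal("0.10"),
    "storage.gb_months": Decimal("0.02"),
}


def bill(usage_records):
    """Aggregate (customer, meter, quantity) records into per-customer charges."""
    totals = defaultdict(Decimal)
    for customer, meter, quantity in usage_records:
        if meter not in RATES:
            # A new service that launched without a billing rate ends up here.
            raise KeyError(f"no billing rate defined for meter {meter!r}")
        totals[customer] += RATES[meter] * Decimal(str(quantity))
    return dict(totals)


records = [
    ("acme", "compute.vm_hours", 100),
    ("acme", "storage.gb_months", 50),
    ("globex", "compute.vm_hours", 10),
]
print(bill(records))
```

The `KeyError` branch is the part a TPM cares about at launch time: a service that emits usage for a meter nobody has priced should fail loudly before it reaches a customer invoice.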

Capacity Planning 

The team is responsible for allocating hardware resources to new and existing data centers (DCs) to ensure optimal performance, scalability, and cost, while forecasting and planning for future customer demand.

This team consists of data analytics engineers, financial planners, and TPMs. There are a large number of TPMs working across multiple product/service teams and the data center build-out team to ensure every service has enough hardware to support customer demand while at the same time ensuring that there is no excess or unused capacity. 

The TPMs here also work with the vendor procurement, customs and shipment teams across several hardware suppliers to ensure that every DC is equipped to meet customer demands. 
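The balance between having enough hardware and avoiding unused capacity can be sketched as a simple gap calculation. Everything here is an assumption for illustration: the service names, the numbers, and the 15% headroom are hypothetical, and real capacity models are far more detailed.

```python
def capacity_plan(services, headroom_pct=15):
    """Return units to add (positive) or surplus units (negative) per service.

    `services` maps a service name to (forecast_demand, installed_capacity),
    both counted in whole hardware units such as servers.
    """
    plan = {}
    for name, (forecast, installed) in services.items():
        # Target = forecast plus headroom, rounded up with integer ceiling division.
        target = -(-forecast * (100 + headroom_pct) // 100)
        plan[name] = target - installed
    return plan


# Hypothetical forecast and installed capacity for two services.
services = {
    "compute": (1000, 1000),  # needs more hardware to keep headroom
    "storage": (400, 600),    # already over-provisioned
}
print(capacity_plan(services))
# {'compute': 150, 'storage': -140}
```

A positive number is a procurement conversation with the hardware suppliers, and a negative number is excess capacity that could be reallocated.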

DC Build Out TPMs

The team is responsible for physically building and hydrating new Data Centers. Depending on how efficient and automated the process is, a DC build-out can take anywhere from 2 to 4 months. Generally, the DC build-out team works on several DCs simultaneously.

This is a TPM-heavy organization. Setting up a new DC is an immense amount of work with many risks and high visibility, and it is considerably labor-intensive and time-consuming. Generally, cloud providers are continuously expanding the number of DCs they have worldwide. For example, AWS has over 100 DCs and continuously seeks to expand its footprint.

After a DC build-out site is selected, there are several steps, starting with an initial infrastructure setup involving power, cooling, and physical security. After this comes the physical build-out of the server cages, server installation, cabling, testing, and DC hydration. The hydration process alone can take a month or two and involves several dozen TPMs working across all the service teams to layer their services onto the DC.

Special Projects / New Services & Products

The team is responsible for getting a new service to launch. A new service needs to be compliant and integrated into the existing cloud platform and ecosystem. 

Every year at re:Invent, AWS announces a slew of new products and services. Once a new product has been envisioned and built, it must pass several requirements checklists before it is customer-ready. Some TPMs specialize in this area, as it requires working across multiple teams, such as DC build-out, Governance, Risk and Compliance, and Metering and Billing, to get a new service customer-ready.

Marketplace 

The team is responsible for working with external partners, vendors, and resellers to sell their software and services to AWS customers.

Non-Commercial Infrastructure Organizations

While IaaS organizations like AWS, GCP, Azure, and OCI sell their services commercially for anyone to use, there are several other non-commercial infrastructure organizations that TPMs need to be aware of. These organizations own and operate data centers for their own use, so that they can optimize the DCs for their specific use cases.

Meta, Apple, and Dropbox are three such organizations. They have a large number of engineers, PMs, and TPMs who work on the internal infrastructure platform and software used to run these large organizations. Meta, for example, has a 1.5-gigawatt cloud and is planning to make a $30+ billion investment in its infrastructure. Meta is said to have over 8,000 engineers in its infrastructure organization, supported by several hundred TPMs.

Internal Platform Infrastructure Teams

As companies started using IaaS providers, they needed a way to standardize, monitor, and control how their developers provisioned services from IaaS providers. Long before these types of teams existed, you had a team of SysAdmins who managed the provisioning of infrastructure resources for an organization’s private DCs. Allowing unrestricted access to an IaaS provider often causes infrastructure costs to skyrocket due to over-provisioning, unused resources, and a lack of standardization across the organization.

But now we have Internal Platform Infrastructure teams. They own and build internal tools that engineers across the organization use to provision and build their services. These teams ensure that the entire organization follows standards to maximize infrastructure use. They also have to balance keeping control against giving engineers full control, so that the speed of development is not affected.
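One way such a team might enforce standards without blocking developers is a lightweight policy check that runs before any resource is provisioned. The sketch below is only an illustration: the approved sizes, required tags, and request shape are hypothetical, not the interface of any real internal tool or cloud provider.

```python
from dataclasses import dataclass, field

# Hypothetical organization-wide standards.
ALLOWED_SIZES = {"small", "medium", "large"}
REQUIRED_TAGS = {"owner", "cost_center"}


@dataclass
class ProvisionRequest:
    service: str
    size: str
    tags: dict = field(default_factory=dict)


def policy_violations(req: ProvisionRequest) -> list:
    """Return human-readable problems; an empty list means the request passes."""
    problems = []
    if req.size not in ALLOWED_SIZES:
        problems.append(f"size {req.size!r} is not an approved size")
    missing = sorted(REQUIRED_TAGS - set(req.tags))
    if missing:
        problems.append("missing required tags: " + ", ".join(missing))
    return problems


ok = ProvisionRequest("checkout-api", "medium", {"owner": "payments", "cost_center": "cc-42"})
bad = ProvisionRequest("experiment", "xxlarge", {"owner": "ml-team"})
print(policy_violations(ok))
print(policy_violations(bad))
```

Returning a list of problems instead of rejecting outright keeps the feedback loop fast: engineers see exactly what to fix, which helps preserve development speed while still controlling cost and standardization.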

The Infrastructure Platform team also owns infrastructure reliability, disaster recovery, hybrid cloud deployments, auto-scaling, capacity planning and reduction, etc.

These types of “Internal Platform Infrastructure Teams” are fairly new and are likely to appear in most organizations that are large consumers of public IaaS services.

Additional Resources

Links to outside resources –

Structuring a Cloud Infrastructure Organization

Start your learning Journey – 

AWS Certified Cloud Practitioner 

Certified Solutions Architect Associate

Conclusion

I hope this gives you a good understanding of the various TPM roles within a global cloud and infrastructure organization. For those looking to dive into the fast-evolving world of cloud and infrastructure, a TPM role offers not only technical challenges but also the opportunity to make a lasting impact on the future of technology.
