Key hands-on experience across core services:
Compute: Deployed and managed controllers in an HA configuration using open-source clustering and load-balancing tools. Configured compute nodes, managed hypervisor resources, and utilized the Placement API for tracking resource inventories (vCPU, memory, disk) to support efficient scheduling.
Block Storage: Modified and configured volume drivers to integrate with various storage backends (NFS/iSCSI). Configured iSCSI initiator/target settings for block device access from compute nodes.
Networking: Extensively configured networking for provider and tenant networks, virtual routers, floating IPs, security groups, and external gateway setups. Worked with ML2 plugins and various SDN backends.
Image Service: Managed VM image lifecycles using QCOW2 and ISO formats, configured backends, and maintained image versioning for deployments.
Bare Metal Provisioning: Used bare metal services extensively for node re-provisioning and reconfiguration, including PXE/iPXE boot configurations, node enrollment, and hardware driver management.
Object Storage: Configured and managed object storage backends and operated a 24-petabyte replicated datacenter, covering ring management, replication policies, and capacity planning.
End-to-End Instance Lifecycle: Managed the full flow involving authentication, scheduling, resource tracking, network allocation, image retrieval, volume attachment, and bare metal paths.
Additionally, delivered large-scale training sessions to engineering and customer communities covering architecture, deployment, and operations.
What specific Compute components have you worked with and what role did each play?
Worked with all core compute components across multiple releases. The API service was deployed behind load balancers for high availability. The scheduler was tuned with filters and weighers for efficient workload distribution across large node counts (e.g., 30-node deployments supporting 600 VMs). The compute service managed the full VM lifecycle through hypervisors on various hardware sets, where per-VM QoS and compression features were validated. The conductor service was used to isolate database access from compute nodes, and the Placement API was central to tracking resource inventories so that containerized workloads and VMs could be co-hosted accurately across shared multi-tenant infrastructure.
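The filter-and-weigh scheduling described above can be illustrated with a minimal sketch. The `Host` and `Request` types, the single RAM-based weigher, and the in-place claim are assumptions made for illustration only; they are not the scheduler's or the Placement API's actual interfaces.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    """Hypothetical per-host inventory (vCPU, memory, disk)."""
    name: str
    free_vcpus: int
    free_ram_mb: int
    free_disk_gb: int


@dataclass
class Request:
    """Hypothetical resource request for one instance."""
    vcpus: int
    ram_mb: int
    disk_gb: int


def passes_filters(host: Host, req: Request) -> bool:
    # Filter step: drop hosts that cannot fit the request at all.
    return (host.free_vcpus >= req.vcpus
            and host.free_ram_mb >= req.ram_mb
            and host.free_disk_gb >= req.disk_gb)


def weigh(host: Host) -> int:
    # Weigh step: prefer the host with the most free RAM (a spreading policy).
    return host.free_ram_mb


def schedule(hosts: List[Host], req: Request) -> Optional[Host]:
    candidates = [h for h in hosts if passes_filters(h, req)]
    if not candidates:
        return None  # no valid host for this request
    best = max(candidates, key=weigh)
    # Claim the resources so the next request sees the updated inventory.
    best.free_vcpus -= req.vcpus
    best.free_ram_mb -= req.ram_mb
    best.free_disk_gb -= req.disk_gb
    return best
```

Swapping `weigh` for a function that returns the negative of free RAM would turn the spreading policy into a packing one, which is the kind of trade-off weigher tuning adjusts.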
How have you used Networking to define network architecture?
Defined network architecture for multi-tenant environments across various deployments. For provider networks, configured direct VM connectivity to physical network infrastructure for specialized applications and exposed storage traffic (iSCSI/NFS) directly on the physical data network to support high IOPS requirements. For tenant networks, validated multi-tenancy by creating multiple isolated projects, each with its own networks, L3 routers, and floating IPs. For routing and connectivity, configured L3 routers, VPNaaS for secure cross-datacenter connectivity, and LBaaS for load balancing. Implemented a high-availability networking architecture using dedicated servers for network services.
How have you configured or enforced security at the network level using Security Groups?
Security group configuration was a consistent practice to isolate workloads and control traffic at the virtual port level.
Multi-tenant Isolation: Used security groups as the primary tool for enforcing isolation between projects, configuring inbound and outbound rules to ensure workloads had no lateral visibility into other networks.
Mixed Workloads: Enforced strict traffic boundaries between different storage protocols and application workloads in environments co-hosting VMs and containers on the same large-scale platform.
Infrastructure Protection: Configured security groups to protect overcloud traffic, specifically isolating application pods from the underlying infrastructure networks.
Compliance: In carrier-grade environments, security group enforcement was essential for isolating proxy servers, compute nodes, and load balancers from tenant-facing networks to meet strict security compliance.
Pipeline Integration: Deployed DevSecOps pipelines integrating vulnerability scans and penetration testing to complement infrastructure-level security.
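The allow-list behavior behind the security group rules above can be sketched as follows. This is a simplified model under stated assumptions: the `Rule` type and `is_allowed` function are hypothetical, traffic is denied unless a rule matches, and connection tracking (statefulness) is ignored.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rule:
    """Hypothetical allow rule; anything not matched is denied."""
    direction: str    # "ingress" or "egress"
    protocol: str     # "tcp", "udp", ...
    port_min: int
    port_max: int
    remote_cidr: str  # peer address range the rule applies to


def is_allowed(rules: Iterable[Rule], direction: str, protocol: str,
               port: int, remote_ip: str) -> bool:
    addr = ipaddress.ip_address(remote_ip)
    for rule in rules:
        if (rule.direction == direction
                and rule.protocol == protocol
                and rule.port_min <= port <= rule.port_max
                and addr in ipaddress.ip_network(rule.remote_cidr)):
            return True  # first matching allow rule wins
    return False  # default deny


# Example: SSH only from a management subnet, HTTPS from anywhere.
web_rules = [
    Rule("ingress", "tcp", 22, 22, "10.0.0.0/24"),
    Rule("ingress", "tcp", 443, 443, "0.0.0.0/0"),
]
```

Because there are no deny rules in this model, isolation between projects comes from never adding a rule that covers another project's address range.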
What types of Image formats have you worked with, and how do you choose between them?
Worked primarily with QCOW2 and ISO formats, and with RAW where performance required it.
QCOW2: The preferred format for most deployments due to copy-on-write capability, thin provisioning, and snapshot support, which aligned well with disaster recovery workflows.
RAW: Preferred when maximum I/O performance was needed with minimal overhead, particularly when working with backends that natively handle RAW images more efficiently to avoid double-layering.
ISO: Used primarily for initial OS provisioning and bare metal related workflows where a bootable image was required before a full cloud image could be deployed.
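When handling a mix of these formats, it helps to confirm what a file actually is before uploading it. The sketch below is an illustrative check, not part of any image service: it relies on the QCOW2 header magic (`QFI\xfb` followed by a big-endian version of 2 or 3) and the ISO 9660 `CD001` signature at byte offset 0x8001. RAW has no signature, so it is treated here as the fallback, which is an assumption of this sketch.

```python
import struct

QCOW_MAGIC = b"QFI\xfb"
ISO_MAGIC = b"CD001"
ISO_MAGIC_OFFSET = 0x8001  # identifier inside the first ISO 9660 volume descriptor


def detect_image_format(path: str) -> str:
    """Return "qcow2", "iso", or "raw" (fallback) based on file signatures."""
    with open(path, "rb") as f:
        header = f.read(8)
        if len(header) == 8:
            magic, version = struct.unpack(">4sI", header)
            # Versions 2 and 3 are both referred to as qcow2.
            if magic == QCOW_MAGIC and version in (2, 3):
                return "qcow2"
        f.seek(ISO_MAGIC_OFFSET)
        if f.read(len(ISO_MAGIC)) == ISO_MAGIC:
            return "iso"
    # No recognizable signature: assume a raw disk image.
    return "raw"
```

A mismatch between the detected format and the declared disk format is a common reason for instances that fail to boot from an otherwise valid image.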
Can you compare when you've used block storage volumes versus ephemeral storage?
The choice was driven by whether the workload required data persistence beyond the instance lifecycle.
Ephemeral Storage: Used for stateless, disposable workloads like stress testing, validation topologies, and CI/CD pipelines where instances were repeatedly torn down. It kept operations simple and fast since data survival wasn't required.
Persistent Volumes: The choice for any workload requiring data persistence, performance guarantees, or backend storage integration. Configured with NFS and iSCSI drivers against various backends for production SaaS applications where data needed to survive restarts. Validated QoS per VM extensively to enforce performance guarantees for mixed workloads co-hosting VMs and containers on the same platform.
Have you deployed or supported workloads using Bare Metal services? What led to a bare metal over VM choice?
Worked extensively with bare metal provisioning, which was fundamental to the deployment process itself: physical nodes were enrolled and managed via PXE/iPXE boot.
Performance & Density: Bare metal was chosen for workloads requiring maximum compute density, direct hardware access, or predictable low latency that virtualization overhead could not guarantee.
High-Throughput Storage: Direct access to NVMe, SSD, and SAN storage without a hypervisor layer was critical for meeting high IOPS benchmarks. Similarly, distributed storage nodes were deployed on bare metal because their replication operations are latency-sensitive and degrade under virtualization overhead.
Network Intensity: Workloads requiring SR-IOV and direct NIC access for high-throughput data plane operations made bare metal the only viable option.
Management: Used for re-provisioning existing nodes with different roles or OS images without manual intervention, using driver integrations for out-of-band management across various hardware vendors.
Describe your experience with scripting at scale. What kind of problems did you solve?
Scripting has been a consistent thread throughout this work, with several languages used to automate infrastructure, test frameworks, and deployment pipelines across hundreds of servers.
Python: Built REST API test frameworks for validating cloud components and object storage, wrote scripts for profiling API performance at scale, and developed automation for large concurrent network topologies. Also automated UI test frameworks and integrated with cloud-native APIs.
Bash/Shell: Primary tools for day-to-day operational automation, including node enrollment, PXE boot sequences, managing storage configurations (iSCSI/NFS), rotating SSL certificates, and writing monitoring scripts for globally distributed datacenters.
Orchestration (Ansible/Saltstack): Transitioned deployments from manual processes to fully repeatable automated pipelines and built continuous delivery models for virtual infrastructure.
Impact: Eliminated manual errors, reduced provisioning time from days to hours, enforced consistent configurations, and built self-healing test infrastructure within CI/CD pipelines.
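A pattern that recurs when scripting against hundreds of servers is fanning one task out concurrently while keeping per-host failures separate from successes. The sketch below shows that pattern with the standard library only; `run_on_hosts` and the injected `task` callable are hypothetical names, and the worker count is an arbitrary default.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def run_on_hosts(hosts: Iterable[str], task: Callable[[str], str],
                 max_workers: int = 32) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run task(host) concurrently; return (results, failures) keyed by host."""
    results: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task, host): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:
                # One bad host must not abort the whole run; record it instead.
                failures[host] = repr(exc)
    return results, failures
```

Returning the failures as data rather than raising lets a pipeline retry only the failed hosts, which is one way to approach the self-healing behavior mentioned above.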
Have you used Jenkins or similar tools to run pipelines?
Yes.
Tell me about your familiarity with PXE booting. Which services were involved and how did you troubleshoot?
PXE booting was foundational to bare metal work, where the entire provisioning chain started with PXE using out-of-band management.
Services: Involved DHCP for IP assignment and for pointing clients at the boot file, TFTP for delivering the initial network bootloader, HTTP for kernel and ramdisk delivery via iPXE, and iSCSI for diskless boot and block storage connectivity.
Preferences: iPXE was preferred at scale because it can fetch boot artifacts over HTTP, avoiding the packet loss and timeout issues common with TFTP over UDP on large switched networks.
Troubleshooting: Typically involved resolving DHCP misconfigurations from VLAN mismatches, TFTP timeouts from MTU issues, driver misconfigurations around hardware credentials, and security/firewall rules blocking provisioning ports.
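The hand-off from PXE firmware to iPXE is usually implemented as a chainloading decision on the DHCP side, and tracing that decision is often the first troubleshooting step. The sketch below models it in plain Python under stated assumptions: the function name and the `boot.ipxe` script name are hypothetical, the bootloader file names are common iPXE build artifacts, and any non-zero client architecture value is treated as UEFI for simplicity.

```python
from typing import Optional, Tuple

BIOS_ARCH = 0  # DHCP option 93 value for legacy BIOS x86 clients


def boot_target(user_class: Optional[str], client_arch: int,
                http_base: str) -> Tuple[str, str]:
    """Decide what a booting client should be told to fetch next.

    user_class  -- DHCP option 77; iPXE identifies itself as "iPXE"
    client_arch -- DHCP option 93 (client system architecture)
    http_base   -- base URL of the HTTP server holding boot artifacts (assumed)
    """
    if user_class == "iPXE":
        # Second pass: the client already runs iPXE, so hand it a script over HTTP.
        return ("http", f"{http_base}/boot.ipxe")
    if client_arch == BIOS_ARCH:
        # First pass, BIOS firmware: chainload iPXE over TFTP.
        return ("tftp", "undionly.kpxe")
    # First pass, anything else is assumed to be UEFI in this sketch.
    return ("tftp", "ipxe.efi")
```

If the user-class check is missing, the client receives the iPXE bootloader again on its second DHCP request and loops, which is a useful symptom to recognize in boot logs.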
What protocols or tools have you used to interact with Baseboard Management Controllers (BMCs)?
Worked with IPMI as the primary protocol for out-of-band management of bare metal nodes, using it for power management, node enrollment, and controlling boot sequences.
Automation: Integrated IPMI drivers to automate power cycles during initial provisioning and reconfiguration, eliminating manual intervention across large node counts.
Use Case: During deployments, handled scenarios where nodes failed to respond to power commands due to stale sessions or credential mismatches, requiring manual session resets and revalidation. Out-of-band access was also critical during on-call incidents where nodes became unresponsive at the OS level and required remote power cycling or console access.
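Power commands that fail on stale sessions are a natural fit for a bounded retry. The sketch below wraps `ipmitool` over the `lanplus` interface; the helper names, the retry count, and the delay are assumptions for illustration. The `-E` flag makes `ipmitool` read the password from the `IPMI_PASSWORD` environment variable so it does not appear on the command line, and the `runner` parameter exists so the logic can be exercised without real hardware.

```python
import subprocess
import time
from typing import Callable, List


def ipmi_command(host: str, user: str, *args: str) -> List[str]:
    """Build an ipmitool invocation; the password comes from IPMI_PASSWORD (-E)."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *args]


def run_with_retry(cmd: List[str], runner: Callable = subprocess.run,
                   attempts: int = 3, delay: float = 2.0) -> str:
    """Run cmd, retrying on a non-zero exit; return stdout or raise after the last try."""
    last_error = ""
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()
        last_error = result.stderr.strip()
        if attempt < attempts:
            time.sleep(delay)  # give the BMC time to drop a stale session
    raise RuntimeError(f"{' '.join(cmd)} failed after {attempts} attempts: {last_error}")


# Example invocations (require a reachable BMC and IPMI_PASSWORD to be set):
# run_with_retry(ipmi_command("10.0.0.20", "admin", "chassis", "bootdev", "pxe"))
# run_with_retry(ipmi_command("10.0.0.20", "admin", "chassis", "power", "cycle"))
```

Retries only help with transient failures such as stale sessions. A credential mismatch fails the same way on every attempt, so the raised error should lead to revalidating credentials rather than retrying further.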