1๏ธโฃ You mentioned youโve worked with Argo CD โ how do you use it?
2๏ธโฃ Thereโs a scenario where someone wants to deploy infrastructure on AWS via Terraform. Which tool will you use and why?
3๏ธโฃ How do you maintain Argo CD and use it to deploy on Kubernetes?
4๏ธโฃ How do you manage rollbacks of an application in Argo CD?
5๏ธโฃ In Kubernetes, how many master and worker nodes you had configure?
6๏ธโฃ Would you use an entirely different AWS account, a separate cluster, or just a namespace to differentiate Dev and Prod?
7๏ธโฃ If you have 1 master node and 1 worker node running an application, how would you upgrade Kubernetes on both?
8๏ธโฃ Do you store your Terraform state file locally or remotely? Why?
9๏ธโฃ If you deployed 12 EC2 instances and 2 were deleted manually, what happens when you run terraform apply again?
๐ How do you update your Terraform state file to match current AWS resources? Which command will you use?
1๏ธโฃ1๏ธโฃ How would you deploy to multiple AWS accounts like Dev, UAT, and Prod?
1๏ธโฃ2๏ธโฃ Can we do it with Terraform workspaces?
1๏ธโฃ3๏ธโฃ What does terraform init do?
1๏ธโฃ4๏ธโฃ If an EC2 instance has an IAM role to access S3 but gets โPermission Denied,โ how do you troubleshoot?
1๏ธโฃ5๏ธโฃ You use ALB for EKS external traffic โ how do you manage traffic inside EKS to different paths?
1๏ธโฃ6๏ธโฃ For cost optimization, how do AWS Savings Plans work?
1๏ธโฃ7๏ธโฃ You use rolling updates for deployments, but what if you want to use a Blue-Green approach with Argo CD inside Kubernetes?
Hereโs what I ask ๐
-
How does DNS resolution work inside a pod? โ And whatโs the first thing you check when a service isnโt reachable by name?
-
Please walk me through the controller managerโs role during a Deployment. โ Not just kubectl rollout status โ I want the reconciliation logic.
-
What happens if a node with local storage gets autoscaled down? โ Hint: this is where most real prod data loss stories begin.
-
Post-deploy, latency spikes for 30% of users. No errors. No logs. No alerts. โ Whatโs your 3-step triage?
-
How do you enforce runtime security in Kubernetes? โ PSP? AppArmor? OPA? Orโฆ just crossing fingers?
-
HPA vs VPA vs Karpenter โ when would you NOT use each? โ Bonus: How do you simulate HPA behavior in staging?
-
Tell me about the last Kubernetes outage you debugged. โ No postmortem? You probably werenโt in the war room.
Round 1 โ Systems at Scale, K8s, Cloud & Linux (45 mins) โข How would you implement fine-grained service discovery across 1000+ microservices using Envoy or Istio? โข Explain how youโd leverage eBPF + Cilium to enforce network security policies at runtime, and what the advantages are over traditional CNIs? โข Netflix runs multi-cloud. Describe your approach to cross-cloud routing, IAM, and secret syncing. โข What happens when systemd units fail intermittently on EKS nodes? How do you detect and heal? โข Your app teams demand custom AMIs. Whatโs your pre-prod vetting strategy at kernel and runtime? โข Walk through advanced kube-probe configurations to detect business logic failures, not just HTTP 200. โข How do you handle DNS-level outages inside a service mesh without a full app redeploy? โข Terraform remote_state backend suddenly times out. Whatโs your recovery and damage containment strategy?
Round 2 โ RCA, Fire Drills, and Netflix-Scale Chaos (75 mins) โข A new Envoy config rollout silently broke mTLS between edge and mesh. Whatโs your RCA trace? โข HPA refuses to scale even though Prometheus shows CPU > 80%. Diagnose with cloud + K8s metrics. โข You introduced a sidecar-based caching layer. Suddenly, tail latency spikes. Whatโs your debug path? โข 504 errors on the playback service. Cloud LB shows healthy, mesh sidecars pass, but users canโt stream. Triage? โข You notice surging NAT costs. No infra changes this week. What could silently trigger that?
Round 3 โ Leadership, Chaos Culture & Engineering Influence (30 mins) โข How do you build an engineering culture where SLOs are owned, not ignored? โข Youโre asked to ship a multi-region failover in 3 weeks no DNS layer allowed. Your plan? โข Describe how youโd test infra chaos and graceful degradation for a Netflix Originals release. โข How do you prove ROI of infra modernization to non-technical execs?
๐ How do you design a DevSecOps pipeline for regulated industries like banking or insurance? ๐ Whatโs your approach to auditing IAM policies and access controls across AWS or Azure in an enterprise setup? โ๏ธ Explain how youโd manage infrastructure provisioning across multiple environments using Terraform. ๐ How do you ensure rollback safety and transaction integrity in multi-step deployments? ๐ How do you approach drift detection and reconciliation in GitOps workflows? ๐ก๏ธ What steps do you take to implement network policies and pod security in Kubernetes clusters? ๐ How would you monitor and alert for critical services in a payment gateway system using Prometheus/Grafana or Splunk? ๐งฉ Have you worked on integrating CI/CD with security scanning tools like Snyk, Aqua, or Trivy? Explain the integration flow. ๐ How do you automate patching and OS hardening in cloud-based systems at scale? ๐ Whatโs your experience with designing HA (high availability) and DR (disaster recovery) architecture for enterprise apps? ๐ง How would you implement secrets rotation for database credentials across Kubernetes and CI/CD pipelines? ๐ฆ Explain the difference between Helm and Kustomize. When would you use one over the other? ๐ Can you walk us through how you set up Blue-Green or Canary deployments in Azure DevOps or ArgoCD? โ๏ธ How do you handle cost optimization in cloud-native infrastructure, especially in a FinTech context? ๐งฉ What are the compliance checks you automate before production deployments (e.g., CIS benchmarks, encryption validation)? ๐ Describe how you would manage and track configuration drifts in a large multi-account AWS environment. ๐ ๏ธ How do you automate SSL/TLS certificate renewal in Kubernetes ingress controllers? ๐ How do you implement Zero Trust principles in internal tooling and CI/CD systems? ๐ Have you implemented centralized logging and monitoring for audit trails and anomaly detection? Which stack did you use? ๐งช How would you use tools like LitmusChaos or Gremlin for testing system resilience in production-like environments?
-
Scalability โ Vertical vs horizontal; default to stateless horizontal scale
-
Latency & Throughput โ Know P50/P95/P99 and tradeoffs
-
Capacity Estimation โ Estimate QPS, storage, bandwidth
-
Networking Basics โ TCP, HTTP/HTTPS, TLS, UDP, DNS
-
SQL vs NoSQL โ Decide by access patterns, joins, consistency
-
Data Modeling โ Structure to optimize hot queries
-
Indexing โ Right indexes help reads, wrong ones hurt writes
-
Normalization โ Normalize for writes, denormalize for reads
-
Caching โ App/DB/CDN; write-through/back, TTLs
-
Cache Invalidation โ TTLs, versioning, stampede protection
-
Load Balancing โ L4/L7, RR, least-connections, hashing
-
CDN & Edge โ Serve static content near users
-
Sharding โ Hash/range/geo; handle hot keys, rebalancing
-
Replication โ Sync/async, leader/follower, read replicas
-
Consistency Models โ Strong, eventual, causal
-
CAP Theorem โ In partition, pick A or C
-
Concurrency โ Locks, MVCC, retries
-
Multithreading โ Pools, contention, switching
-
Idempotency โ Same request, same effect
-
Rate Limiting โ Fairness and protection
-
Queues & Streams โ Kafka, RabbitMQ; smooth spikes
-
Backpressure โ Manage slow consumers/load shedding
-
Delivery Semantics โ At-most/least/exactly-once
-
API Design โ REST vs gRPC; human vs machine optimized
-
API Versioning โ Additive only; never break clients
-
AuthN & AuthZ โ OAuth2, JWT, RBAC/ABAC
-
Resilience โ Circuit breakers, timeouts, retries
-
Observability โ Logs, metrics, traces, SLI/SLO
-
Health Checks โ Detect failures, auto-replace
-
Redundancy โ Avoid SPOFs; use multi-AZ/region
-
Service Discovery โ Dynamic endpoints, central config
-
Deployment โ Canary, blue/green, rollbacks
-
Monolith vs Microservices โ Split only if ops cost justified
-
Distributed Transactions โ Use sagas, avoid 2PC
-
Event Sourcing/CQRS โ Append-only logs, split models
-
Consensus โ Raft, Paxos, quorum reads/writes
-
Data Privacy โ Encryption, GDPR/CCPA
-
Disaster Recovery โ Backups, RPO/RTO, test it
-
Serialization โ JSON, Avro, Proto
-
Real-Time Delivery โ Polling, SSE, WebSockets
โ Tell me about yourself (Self-Intro) โ What source code management tool are you using? โ What CI/CD tools do you use? โ How do you troubleshoot issues in CI/CD pipelines? โ What kind of pipelines have you worked on? โ What is a Jenkinsfile? โ What are your daily DevOps activities? โ Explain Git merge and branching strategies. โ Describe a full CI/CD flow in your project. โ How do you build a Docker image? Explain the steps. โ How do you check CPU usage of a server? โ What operating systems have you worked with? โ What is Ansible? Why did you use it in your project? โ What is Terraform used for? โ What's the difference between Ansible and Terraform? โ Have you worked with Kubernetes? Explain your use case. โ What AWS services are you familiar with? โ Have you used monitoring tools? How and why are they used? โ What do you do if there's a Git issue while merging or pushing? โ What kind of Jenkins plugins have you used? โ How do you trigger an Ansible playbook? Whatโs the command? โ How to run a specific task in Ansible? โ How do you manage Jira tickets or handle tasks?
- Youโve got an app running in one AWS account that needs to access an S3 bucket in another account. How would you set that up securely?
- Can you write a Dockerfile for a Node.js app using multi-stage builds โ just something youโd actually use in a real project?
- Suppose your Terraform state file gets corrupted or out of sync. What steps would you take to recover from that?
- You have an EC2 in a private subnet and canโt use a NAT Gateway. How else can it reach the internet for updates or downloads?
- A container has crashed or exited suddenly โ how would you go about figuring out what went wrong?
- Thereโs an existing AWS VPC created manually, but now you want to manage it through Terraform. How would you import it and make sure nothing breaks?
- How would you roll out blue-green deployments in a Kubernetes cluster? What would that look like in practice?
- When using Terraform, how do you manage sensitive values like secrets or API keys without hardcoding them?
- In a Dockerfile, whatโs the practical difference between using COPY and ADD? When would you prefer one over the other?
- If you had to create resources across two different AWS accounts using Terraform, how would you set that up?
- Imagine you have a PHP app in a Docker container that needs MySQL credentials โ how would you pass those securely?
- If someone manually changed an S3 bucket policy that was originally created with Terraform, how would you deal with that kind of drift?
- How do you enforce rules in Kubernetes to control which pods can talk to each other?
- Could you write a small Python script that backs up all files older than 30 days from a folder?
- If your team is seeing a spike in cloud costs, how would you go about figuring out why โ and cutting cost without hurting performance?
- Say you want to serve users in different countries using AWS. How would you route traffic based on user location?
- Youโre on-call. A production Kubernetes cluster is a mess โ pods arenโt pulling images, some are getting evicted, and users are seeing errors. How do you troubleshoot this, and how would you prevent it next time?
๐๐จ๐ฎ๐ง๐ ๐: ๐๐๐ก๐๐ฏ๐ข๐จ๐ซ๐๐ฅ ๐๐จ๐ฎ๐ง๐
- What would you do if you were asked to work on a tool or technology youโve never used before?
- Can you share a time when you had to deliver something quickly, but didnโt have all the time or resources you needed? How did you manage it?
- Tell me about a mistake or outage you were involved in โ how did you respond, and what did you learn from it?
- Whatโs the hardest technical issue youโve faced so far? How did you go about solving it?
- If you wanted to convince your team or manager to adopt a new tool or process, how would you make your case?
Round 1: Tech Screening (45 mins) โข Explain your entire CI/CD, whereโs the blast radius? โข Difference between ALB and NLB, when does latency matter more than routing logic? โข Terraform taint vs terraform state rm, when would you use either? โข How does CoreDNS handle plugin chain resolution? โข Choose between EBS, EFS, and FSx for different workloads โข Bash: Find all .log files over 500MB modified in the last 48h โข DB migration without downtime? Explain rollback contingency โข Roll back a failed Helm release. Whatโs your exact command chain? โข Secure Vault secrets in GitHub Actions without hardcoding โข Node is Ready, but kubelet logs are silent. What now?
Round 2: Live Scenarios & RCA (75 mins) โข Your autoscaler doubled the infrastructure costs. Logs are normal. What do you check? โข A prod pod restarts 8 times/hour. Logs look fine. Whatโs your triage plan? โข Terraform drift detected. Who gets paged, and whatโs the fix sequence? โข Disaster recovery setup, what does it look like? When was it last tested? โข Simulate a DNS outage with the CoreDNS plugin misorder. how? โข Dockerfile: Multi-stage image with caching and minimized attack surface โข Secure gRPC traffic across K8s namespaces โข S3 bucket deleted by mistake, walk me through infra + IAM prevention โข Helm worked in staging but broke in prod. Explain the failure class. โข App returns 200 OK, but functionality is broken. What did you monitor?
Round 3: Behavioral Grit Check (30 mins) โข How do you onboard juniors without overwhelming them? โข Whatโs the worst production day youโve faced? What broke you? โข Ever shipped under pressure and regretted it? โข How do you protect engineering from non-technical deadlines? โข What does a โno blameโ postmortem look like in your team?
๐ก TL;DR: If you havenโt:
-
Simulated a DNS cache corruption
-
Recovered a Terraform state loss
-
Diagnosed a 200/OK but broken service
-
Debugged CoreDNS plugin chain at midnight
-
How do you design a secure and scalable CI/CD pipeline in Jenkins or Azure DevOps?
-
Explain the difference between build and release pipelines.
-
How do you handle rollback strategies in CI/CD?
-
What are Jenkins Shared Libraries? Why and how do you use them?
-
How would you implement multi-branch pipelines for microservices?
-
How do you handle pipeline-level caching to speed up builds?
-
How do you inject secrets into your pipeline safely?
-
What are manual approval gates, and where do you implement them?
-
Explain pipeline-as-code. How do you implement it in Azure DevOps YAML format?
-
How do you enforce compliance and quality checks in a CI/CD pipeline?
-
What is the Terraform state file? How do you manage it securely in teams?
-
Difference between terraform refresh, plan, and apply.
-
Explain how you modularize Terraform code for reuse.
-
What happens if you manually change infrastructure outside Terraform? How do you detect drift?
-
Have you used Terraform with Azure or AWS? How do you set up backend state in Azure Blob or S3?
-
How do you implement count, for_each, and dynamic blocks in Terraform?
-
What is the lifecycle block in Terraform used for?
-
Write a Dockerfile for a Node.js or Python application.
-
Explain how multi-stage Docker builds help with image optimization.
-
Whatโs the difference between a Docker volume and bind mount?
-
Whatโs the role of kubelet, kube-proxy, and kube-apiserver?
-
How do you perform a blue-green deployment in Kubernetes?
-
What is a StatefulSet vs Deployment in K8s?
-
What are readiness and liveness probes and how do they impact scaling?
-
How do you manage secrets in Kubernetes?
-
What is the difference between a private and service endpoint in Azure?
-
How do you manage access between services using IAM roles or Managed Identity?
-
Whatโs the difference between Azure VNet and AWS VPC?
-
How do you implement VNet peering or inter-region connectivity securely?
-
How do you handle high availability and disaster recovery in cloud deployments?
-
What logging and monitoring solutions have you implemented in Azure or AWS?
-
How do you enforce security baselines in your CI/CD pipelines?
-
Have you used tools like Trivy, Aqua, or Snyk for container scanning?
-
How do you integrate Azure Key Vault (or AWS Secrets Manager) in your pipeline securely?
-
How do you rotate secrets without downtime?
-
How do you set up RBAC in Jenkins, Azure DevOps, or Kubernetes?
-
You deployed to production and a service is down. How do you debug and rollback safely!
-
A junior DevOps engineer deleted a critical Terraform state file โ how do you recover?
-
A Docker container works locally but fails in Jenkins โ how would you troubleshoot it?
-
One pod in Kubernetes is restarting in a CrashLoopBackOff state โ how do you fix it?
๐ฅ๐ผ๐๐ป๐ฑ ๐ญ: ๐ง๐ฒ๐ฐ๐ต๐ป๐ถ๐ฐ๐ฎ๐น ๐ฆ๐ฐ๐ฟ๐ฒ๐ฒ๐ป๐ถ๐ป๐ด
- Explain the CI/CD workflow you follow and the kind of pipeline you use. How do you define and invoke pipelines in Jenkins?
- What are shared libraries in Jenkins, and how are they written and defined?
- What kind of applications do you deploy using Jenkins pipelines, and what deployment tools do you use?
- If the Jenkins pipeline runs but the build doesnโt happen, what possible issues could be causing it?
- What is the purpose of a webhook, and how is it used in a CI/CD pipeline?
- How do you create and manage Kubernetes clusters (using tools like Terraform), and what are the master and worker nodes?
- What are common Kubernetes errors youโve faced (like CrashLoopBackOff, ImagePullError), and how did you resolve them?
- What is the command to access a pod and how can you define or create a Kubernetes class or object?
- Explain the folder structure of a basic Helm chart. What commands do you use to deploy with Helm?
- What are the stages in a Docker image build? Why do we use ENTRYPOINT and CMD instructions?
- How do you manage and connect services like DBs, EC2, EKS, or ECS? Include the command to connect to ECS.
- Which container registry do you use for storing Docker images?
๐ฅ๐ผ๐๐ป๐ฑ ๐ฎ: ๐๐ป-๐ฑ๐ฒ๐ฝ๐๐ต ๐ง๐ฒ๐ฐ๐ต๐ป๐ถ๐ฐ๐ฎ๐น ๐ฆ๐ฐ๐ฟ๐ฒ๐ฒ๐ป๐ถ๐ปg
- What branching strategy do you follow, and how do you handle merges to avoid breaking the release branch? If a bug appears in production, whatโs your approach to resolving it?
- Describe your typical deployment flow and CI/CD workflow. What stages do you define in your Jenkins pipeline, and how do you ensure full quality checks during deployment?
- How do you use Jenkins shared libraries? Explain their typical structure and how they are integrated into your Jenkinsfiles.
- Are you aware of security scanning tools? How do you scan Docker imagesโboth during build and at the registry level? Are you using any extensions or tools for image scanning?
- How do you pass environment variables during Docker build commands? What services do you use for storing Docker images?
- How do you establish a connection with databases in your deployments or infrastructure setup?
- How do you handle authentication for EKS clusters and store secrets securely in your environment?
- How do you create AWS Lambda functions and manage the artifacts for deployment? What options do you use to push artifacts to Lambda?
- What is email signing and Helm chart signing? Which tools do you use to sign Helm charts?
- Explain the control plane components and their functionalities, and how they interact with each other.
- Explain how the kube-proxy component works on the node.
- Explain the CNI and CRI interfaces.
- How to troubleshoot etcd issues.
๐Intermediate Level [Kubernetes ] 5) Explain the core components and benefits of a service mesh like Istio or Linkerd. 6) Describe how traffic management (e.g., canary deployments, A/B testing) is implemented using a service mesh. 7) Discuss the security implications and best practices for implementing mutual TLS (mTLS) in a service mesh. 8) How do you troubleshoot a performance issue originating from the service mesh? 9) Explain the differences between ClusterIP, NodePort, LoadBalancer, and Ingress services. 10) Describe how you would implement a multi-cluster ingress solution. 11) Discuss the challenges of implementing network policies and how you would troubleshoot network connectivity issues within a Kubernetes cluster. 12) Explain how to implement and troubleshoot egress network policies. 13) Explain the differences between PersistentVolumes (PVs), PersistentVolumeClaims (PVCs), and StorageClasses.
๐Advanced Level [ Kubernetes ] 14) Describe how you would implement dynamic provisioning of persistent storage for stateful applications. 15) Discuss the challenges of managing stateful applications in Kubernetes and how you would ensure data consistency and availability. 16) How to take and restore a snapshot of a persistent volume? Security Best Practices: 17) Describe how you would implement role-based access control (RBAC) to restrict access to Kubernetes resources. 18) Discuss the importance of container image security and how you would implement image scanning and vulnerability management. 19) Explain how you would implement secrets management using Kubernetes Secrets or an external secrets management solution (e.g., HashiCorp Vault). 20) How would you implement and enforce pod security policies, and how have pod security standards changed the landscape? 21) Describe how you would implement a comprehensive monitoring and logging solution for a Kubernetes cluster. 22) Discuss the importance of distributed tracing and how you would implement it using tools like Jaeger or Zipkin.
- What happens when your etcd store goes stale or corrupted? โ Talk backup strategies, RAFT consistency, and disaster recovery. This isnโt your bootcamp final project.
๐ฅ 2. How would you debug a Pod thatโs intermittently unreachable across nodes but passes health checks? โ Show me your CNI layer chops. Calico? Cilium? Routing tables? Letโs go.
๐ฅ 3. Whatโs your strategy for zero-downtime upgrades of Ingress Controllers at scale? โ Hint: Rolling updates + traffic shifting + TLS + chaos = test of your soul.
๐ฅ 4. How do you enforce multi-tenant isolation in a shared cluster? โ Namespaces are just the start. Letโs talk about RBAC boundaries, NetworkPolicies, and resource quotas that donโt break prod.
๐ฅ 5. Whatโs your incident response when Karpenter overscales during a sudden traffic spike and cost goes 3x? โ HPA saves your app. But cost visibility saves your job.
๐ฅ 6. Walk me through how youโd rotate secrets across hundreds of microservices without downtime. โ Letโs see your secret manager + CSI driver + rollout design. Donโt say โwe just restart the pods.โ
๐ฅ 7. You just recovered from an outage. Now what? โ Real engineers talk SLO breaches, RCAs, postmortems, chaos testing. Not โitโs working now, all good.โ
๐ญ Still think Kubernetes is โjust a container orchestratorโ? You donโt need more tutorials. You need fire drills.
๐ 1. What is the status of a Pod with initContainers that fail if the main container has a restartPolicy of Never?
๐ 2. In a StatefulSet with three replicas, if you remove replica-1, will replica-2 and replica-3 be renamed to keep their sequential order?
๐ 3. Can a DaemonSet Pod be placed on a master node that has a NoSchedule taint without adding tolerations?
๐ 4. When you update a Deployment's image during an ongoing rolling update, does K8s wait for the current rollout to finish or initiate a new one right away?
๐ 5. How long does it take for Pods to be evicted when a node becomes NotReady, and can this eviction time be customized for each Pod?
๐ 6. Is it possible for multiple containers within a Pod to share the same port on localhost, and what occurs if they attempt to bind at the same time?
๐ 7. If you create a PVC with ReadWriteOnce access mode, can several Pods on the same node access it at the same time?
๐ 8. What happens to the Horizontal Pod Autoscaler with custom metrics if the metrics server goes down during peak load?
๐ 9. Is it possible to use kubectl port-forward to a Pod in a CrashLoopBackOff state, and will it function?
๐ 10. What occurs with the mounted tokens and API access if a ServiceAccount is deleted while its associated Pods are still running?
๐ 11. When implementing anti-affinity rules, can a situation arise where no new Pods are able to be scheduled?
๐ 12. If you have a Job set to parallelism: 3 and one Pod fails with a restartPolicy of Never, will the Job spawn a replacement Pod?
๐ 13. Can you alter a Pod's resource requests after it has been created, and what is the distinction between requests and limits in OOM situations?
๐ 14. In network policies, if egress rules aren't specified, are outbound connections automatically denied?
๐ 15. If a Persistent Volume becomes corrupted, can multiple PVCs linked to it lead to failures in different namespaces?
Write a Terraform script that will provision, set up, and configure a static website1
. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฝ๐ผ๐ฑ๐ --๐ฎ๐น๐น-๐ป๐ฎ๐บ๐ฒ๐๐ฝ๐ฎ๐ฐ๐ฒ๐: Check the status of all pods across namespaces to identify failures. 2. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฒ๐๐ฐ๐ฟ๐ถ๐ฏ๐ฒ ๐ฝ๐ผ๐ฑ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ: Gather detailed information about a failed pod. 3. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐น๐ผ๐ด๐ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ -๐ฐ ๐ฐ๐ผ๐ป๐๐ฎ๐ถ๐ป๐ฒ๐ฟ_๐ป๐ฎ๐บ๐ฒ: View logs of a specific container inside a pod to troubleshoot issues. 4. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฒ๐๐ฒ๐ป๐๐ --๐ฎ๐น๐น-๐ป๐ฎ๐บ๐ฒ๐๐ฝ๐ฎ๐ฐ๐ฒ๐ --๐๐ผ๐ฟ๐-๐ฏ๐='.๐บ๐ฒ๐๐ฎ๐ฑ๐ฎ๐๐ฎ.๐ฐ๐ฟ๐ฒ๐ฎ๐๐ถ๐ผ๐ป๐ง๐ถ๐บ๐ฒ๐๐๐ฎ๐บ๐ฝ': Review recent events for clues on crashes and errors. 5. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ป๐ผ๐ฑ๐ฒ๐: Verify the status of nodes in the cluster, checking for node failures. 6. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฟ๐ฎ๐ถ๐ป ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ --๐ถ๐ด๐ป๐ผ๐ฟ๐ฒ-๐ฑ๐ฎ๐ฒ๐บ๐ผ๐ป๐๐ฒ๐๐: Safely evacuate and cordon a node for recovery operations. 7. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฐ๐ผ๐ฟ๐ฑ๐ผ๐ป ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ: Mark a node as unschedulable to prevent new pods from being scheduled during recovery. 8. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฒ๐น๐ฒ๐๐ฒ ๐ฝ๐ผ๐ฑ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ --๐ด๐ฟ๐ฎ๐ฐ๐ฒ-๐ฝ๐ฒ๐ฟ๐ถ๐ผ๐ฑ=0 --๐ณ๐ผ๐ฟ๐ฐ๐ฒ: Forcefully delete a crashed pod to restart it or clear it for recovery. 9. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฟ๐ผ๐น๐น๐ผ๐๐ ๐๐ป๐ฑ๐ผ ๐ฑ๐ฒ๐ฝ๐น๐ผ๐๐บ๐ฒ๐ป๐ ๐ฑ๐ฒ๐ฝ๐น๐ผ๐๐บ๐ฒ๐ป๐_๐ป๐ฎ๐บ๐ฒ: Roll back a deployment in case a new rollout causes crashes. 10. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฒ๐ ๐ฒ๐ฐ -๐ถ๐ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ -- /๐ฏ๐ถ๐ป/๐๐ต: Access a container to debug and resolve application issues directly inside the pod. 11. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฐ๐ผ๐บ๐ฝ๐ผ๐ป๐ฒ๐ป๐๐๐๐ฎ๐๐๐๐ฒ๐: Check the health of core cluster components like etcd, kube-apiserver, and more. 12. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐๐ผ๐ฝ ๐ป๐ผ๐ฑ๐ฒ๐: Monitor node resource usage to detect resource exhaustion causing crashes. 13. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐๐ผ๐ฝ ๐ฝ๐ผ๐ฑ๐ --๐ฎ๐น๐น-๐ป๐ฎ๐บ๐ฒ๐๐ฝ๐ฎ๐ฐ๐ฒ๐: Check pod resource usage across namespaces, identifying bottlenecks leading to crashes. 14. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฒ๐น๐ฒ๐๐ฒ ๐ป๐ผ๐ฑ๐ฒ ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ: Remove a failed node from the cluster to allow recovery operations. 15. ๐ฒ๐๐ฐ๐ฑ๐ฐ๐๐น --๐ฒ๐ป๐ฑ๐ฝ๐ผ๐ถ๐ป๐๐=๐ต๐๐๐ฝ๐://๐ฒ๐๐ฐ๐ฑ-๐๐ฒ๐ฟ๐๐ฒ๐ฟ:2379 ๐๐ป๐ฎ๐ฝ๐๐ต๐ผ๐ ๐ฟ๐ฒ๐๐๐ผ๐ฟ๐ฒ ๐ฏ๐ฎ๐ฐ๐ธ๐๐ฝ.๐ฑ๐ฏ: Restore etcd from a snapshot in case of etcd failure. 16. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฎ๐ฝ๐ฝ๐น๐ -๐ณ ๐ฏ๐ฎ๐ฐ๐ธ๐๐ฝ.๐๐ฎ๐บ๐น: Reapply configurations from a backup manifest during recovery. 17. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐๐ฎ๐ถ๐ป๐ ๐ป๐ผ๐ฑ๐ฒ๐ ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ ๐ธ๐ฒ๐=๐๐ฎ๐น๐๐ฒ:๐ก๐ผ๐ฆ๐ฐ๐ต๐ฒ๐ฑ๐๐น๐ฒ: Prevent scheduling on a node experiencing issues during recovery. 18. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฒ๐ป๐ฑ๐ฝ๐ผ๐ถ๐ป๐๐ ๐๐ฒ๐ฟ๐๐ถ๐ฐ๐ฒ_๐ป๐ฎ๐บ๐ฒ: Verify service endpoints during recovery to ensure services are resolving correctly.
- What scripting languages are you familiar with?
- What are artifacts in GitLab CI?
- What is a private module registry in Terraform?
- If you delete the local Terraform state file and it's not stored in S3 or DynamoDB, how can you recover it?
- How do you import resources into Terraform?
- What is a dynamic block in Terraform?
- How can you create EC2 instances in two different AWS accounts simultaneously using Terraform?
- How do you handle an error stating that the resource already exists when creating resources with Terraform?
- How does Terraform refresh work?
- How would you upgrade Terraform plugins?
- What are the different types of Kubernetes volumes?
- If a pod is in a crash loop, what might be the reasons, and how can you recover it?
- What is the difference between StatefulSet and DaemonSet?
- What is a sidecar container in Kubernetes, and what are its use cases?
- If pods fail to start during a rolling update, what strategy would you use to identify the issue and rollback?
- How can we enable communication between 500 AWS accounts internally?
- How to configure a solution where a Lambda function triggers on an S3 upload and updates DynamoDB?
- What is the standard port for RDP?
- How do you configure a Windows EC2 instance to join an Active Directory domain?
- How can you copy files from a Linux server to an S3 bucket?
- What permissions do you need to grant for that S3 bucket?
- What are the different types of VPC endpoints and when do you use them?
- How to resolve an image pullback error when using an Alpine image pushed to ECR in a pipeline?
- What is the maximum size of an S3 object?
- What encryption options do we have in S3?
- Can you explain IAM user, IAM role, and IAM group in AWS?
- What is the difference between an IAM role and an IAM policy document?
- What are inline policies and managed policies?
- How can we add a load balancer to Route 53?
- What are A records and CNAME records?
- What is the use of a target group in a load balancer?
- If a target group is unhealthy, what might be the reasons?
- Can you share your screen and write a Jenkins pipeline?
- How do you write parallel jobs in a Jenkins pipeline?
Networking Interview Q & A for Cloud and DevOps
1.What is a subnet? Answer :- A subnet is a smaller section of a larger network. It helps organize and manage devices more effectively, just like cutting a big pizza into smaller slices makes it easier to share.
2.What is a firewall? answer:- A firewall acts like a security guard, monitoring traffic to protect your network from unwanted access.
3.What is NAT (Network Address Translation)? answer:- NAT is like a translator for your network, allowing multiple devices to share one public IP address while keeping them secure.
4.What is DNS (Domain Name System)? answer:- DNS is the internetโs phonebook, translating website names into IP addresses so computers can find each other easily.
5.What is a VPN (Virtual Private Network)? answer:- A VPN creates a secure tunnel for your internet connection, protecting your data from prying eyes, especially on public Wi-Fi.
6.What is the purpose of a load balancer? answer:- A load balancer distributes traffic evenly across servers, preventing any one server from getting overwhelmed.
7.What is the OSI model? answer:- The OSI model is a framework used to understand network interactions in seven layers: Physical, Data Link, Network, Transport, Session, Presentation, and Application. Each layer has specific functions.
8.What is a VPN tunnel? answer:- A VPN tunnel is a secure, encrypted connection between two endpoints over the internet. It ensures that data packets are protected from interception, maintaining privacy and security.
9.What is a public cloud? answer:- A public cloud is like a shared space where multiple users can access resources over the internet, managed by a provider like AWS.
10.What is an API (Application Programming Interface)? answer:- An API is like a waiter, letting different software applications communicate and request data from each other.
11.What is a proxy server? answer:- A proxy server is a middleman for your internet requests, adding security and sometimes speeding up access to websites.
12.What is bandwidth? answer:- Bandwidth is the amount of data that can travel over a network at once, similar to how wide a highway is for cars
Observation 1: The people who got $200k-$500k+ offers talked extensively about 3 topics:
- mentors they learned from
- metrics they measure (and why)
- how they drive revenue for the business
Observation 2: The people who got these offers only talked ~50% of the time.
They asked intentional questions about the
- hiring manager
- company
- culture
- team
- goals
Observation 3: The people who did NOT get offers
- didn't ask questions throughout the call
- didn't talk about business outcomes
- didn't talk about metrics
- didn't give credit to mentors/team
- tried to look completely perfect
If I were interviewing right now, I would
- list the metrics I measure (and why)
- know how I drive revenue for the business
- research the company/interviewers
- ask Qs that can't be answered by Google
- know the value I bring to the company
- make a list of my mentors + send them a note of gratitude for teaching me XYZ
-
Tell me about your project.
-
How does GitLab trigger pipelines automatically when a developer pushes code?
-
How do you declare dependent stages in a CI/CD YAML file?
-
What is
terraform init? -
What is
backend.tf? -
What does
terraform plando? -
What happens if you give the wrong configuration or code in Terraform and run it?
-
Where do you store Docker images?
-
Kubernetes: Describe the
podcommand. -
How do you declare context in Kubernetes?
-
What if you have more contexts and namespaces in Kubernetes?
-
What are the types of services in Kubernetes?
-
NodePort vs. ClusterIP โ What's the difference?
-
Horizontal Pod Scaling vs. Vertical Pod Scaling โ Explain.
-
Limit vs. request in a Kubernetes manifest file.
-
What is AWS Lambda, and what have you done with it?
-
How do you set up CloudWatch alarms, metrics, and alerts?
-
How do you create dashboards on CloudWatch for different resources or services on AWS?
-
Which build tool do you use for Java? How about for Python?
-
Explain the complete structure of GitLab CI/CD with stages.
-
How do you integrate Azure DevOps with AWS for deployments?
-
Write Terraform code for an EC2 instance with security groups using
depends_on. -
What is the difference between public IP and elastic IP in AWS?
-
What is the difference between an application load balancer and a Network Load Balancer?
-
What layer is the Network Load Balancer?
-
What is the difference between Application Gateway and load balancer?
-
๐ง Explain how Linux mechanisms work, especially when the system starts.
-
๐ณ What happens when you run a container in Kubernetes? Explain the internal workings.
-
๐ What is the difference between git merge and git rebase?
-
๐ฏ Why and when would you use the git cherry-pick command?
-
โ๏ธ Key differences between Jenkins Declarative Pipeline and Scripted Pipeline?
-
๐ What is DNS and how does it work?
-
๐ฃ๏ธ Explain how routes work in a network environment.
-
๐ก๏ธ How do you implement network policies in Kubernetes?
-
๐ How do Prometheus and Grafana interact? What is the source of data for Prometheus?
-
๐๏ธ Which database did you use in your recent project, and why?
-
๐ง What is an SQL index, and how does it work?
-
๐ Can you share a situation where you faced a challenge working with a non-technical team, and how you solved it? 13.๐ซ Would you describe yourself as a communicator or a problem-solver?
-
๐ Tell me about a situation where you took on a leadership role.
-
๐ ๏ธ What values do you consider important when working with a team?
-
โฝ What are your hobbies outside of technical work?
-
๐ฆ Why did you choose Docker in your recent project?
-
Explain the orchestration of kubernetes...?
-
What is the difference between state less and state full applications..?
-
What is the difference between Deployment and Statefulset and explain with example..?
-
What is PVC and how you are using this component in your current organization..?
-
How you can configure the ebs volume to a pod..?
-
Write a sample HPA (Horizantal pod Autoscaler) file..?
-
How do you configure Argocd for deployment purpose in k8s..?
-
Default ports of prometheus and grafana..? Prometheus - 9090 , Grafana - 3000
-
Explain Network policies in k8s..?
-
Write a sample jenkins declerative pipeline..?
-
What is the use of jenkins file..?
-
Tell me some jenkins plugins and use cases..?
-
How do you handle user authentication in jenkins..?
-
Explain the terraform commands that can be used to deploying a module ..?
-
What is the use of backend file in terraform..?
-
what is state file..?
-
what is the difference between .tf and state file..?
-
what is sonarlint..?
-
What are common errors that you are seeing in the kubernetes pods..?
-
What is the source for prometheus..?
-
Which database types are using the opensearch and grafana..?
-
Difference between daemonset and statefulset..?
-
๐ด๐ถ๐ ๐ฐ๐ผ๐ป๐ณ๐ถ๐ด --๐ด๐น๐ผ๐ฏ๐ฎ๐น ๐๐๐ฒ๐ฟ.๐ป๐ฎ๐บ๐ฒ "๐๐๐ฒ๐ฟ๐ป๐ฎ๐บ๐ฒ": Set your global Git username to associate commits with your identity.
-
๐ด๐ถ๐ ๐ถ๐ป๐ถ๐: Initialize a new Git repository in the current directory.
-
๐ด๐ถ๐ ๐๐๐ฎ๐๐๐: Check the status of files in your working directory, showing tracked, untracked, and staged files.
-
๐ด๐ถ๐ ๐ฎ๐ฑ๐ฑ ๐ณ๐ถ๐น๐ฒ๐ป๐ฎ๐บ๐ฒ: Stage a specific file for the next commit.
-
๐ด๐ถ๐ ๐ฐ๐ผ๐บ๐บ๐ถ๐ -๐บ "๐บ๐ฒ๐๐๐ฎ๐ด๐ฒ": Commit your staged changes with a descriptive message.
-
๐ด๐ถ๐ ๐ฐ๐น๐ผ๐ป๐ฒ ๐ฟ๐ฒ๐ฝ๐ผ_๐๐ฟ๐น: Clone a remote Git repository to your local machine.
-
๐ด๐ถ๐ ๐ฐ๐ต๐ฒ๐ฐ๐ธ๐ผ๐๐ -๐ฏ ๐ฏ๐ฟ๐ฎ๐ป๐ฐ๐ต_๐ป๐ฎ๐บ๐ฒ: Create and switch to a new branch for development.
-
๐ด๐ถ๐ ๐ฏ๐ฟ๐ฎ๐ป๐ฐ๐ต: List all branches in the repository.
-
๐ด๐ถ๐ ๐๐๐ถ๐๐ฐ๐ต ๐ฏ๐ฟ๐ฎ๐ป๐ฐ๐ต_๐ป๐ฎ๐บ๐ฒ: Switch to a specific branch in your repository.
-
๐ด๐ถ๐ ๐ฝ๐๐๐ต ๐ผ๐ฟ๐ถ๐ด๐ถ๐ป ๐ฏ๐ฟ๐ฎ๐ป๐ฐ๐ต_๐ป๐ฎ๐บ๐ฒ: Push your local commits to the remote repository on a specific branch.
-
๐ด๐ถ๐ ๐ฝ๐๐น๐น: Fetch and merge changes from the remote repository to your local branch.
-
๐ด๐ถ๐ ๐๐ต๐ผ๐: Display information about a specific commit, including changes made and commit details.
Next Level:
- ๐ด๐ถ๐ ๐ฐ๐ผ๐บ๐บ๐ถ๐ --๐ฎ๐บ๐ฒ๐ป๐ฑ: Modify the most recent commit with new changes or a new message.
- ๐ด๐ถ๐ ๐ฟ๐ฒ๐ฏ๐ฎ๐๐ฒ ๐ฏ๐ฟ๐ฎ๐ป๐ฐ๐ต_๐ป๐ฎ๐บ๐ฒ: Move or combine a sequence of commits to a new base commit.
- ๐ด๐ถ๐ ๐บ๐ฒ๐ฟ๐ด๐ฒ ๐ฏ๐ฟ๐ฎ๐ป๐ฐ๐ต_๐ป๐ฎ๐บ๐ฒ: Combine changes from another branch into the current branch.
- ๐ด๐ถ๐ ๐น๐ผ๐ด: Display a detailed log of all commits in the current branch.
- ๐ด๐ถ๐ ๐ฟ๐ฒ๐๐ฒ๐ --๐ต๐ฎ๐ฟ๐ฑ ๐ต๐ฒ๐ฎ๐ฑ~๐ญ: Revert all changes and reset to the previous commit.
- ๐ด๐ถ๐ ๐๐๐ฎ๐๐ต: Temporarily save your changes without committing, allowing you to work on something else.
- ๐ด๐ถ๐ ๐ฐ๐ต๐ฒ๐ฟ๐ฟ๐-๐ฝ๐ถ๐ฐ๐ธ ๐ฐ๐ผ๐บ๐บ๐ถ๐_๐ต๐ฎ๐๐ต: Apply changes from a specific commit to your current branch.
- ๐ด๐ถ๐ ๐ณ๐ฒ๐๐ฐ๐ต: Download changes from a remote repository without applying them to the local branch.
- ๐ด๐ถ๐ ๐ฟ๐ฒ๐ณ๐น๐ผ๐ด: Show a history of where the HEAD pointer has moved, useful for recovering from mistakes.
- ๐ด๐ถ๐ ๐ฏ๐ถ๐๐ฒ๐ฐ๐: Use binary search to find the commit that introduced a bug.
So here are some real-time questions; zero to hero folks do give them a try:
-
Resource Name Constraints: Let's say I want a predefined prefix for all resources in Terraform based on the environment (e.g., dev_ for the development environment and qa_ for the QA environment). How will you plan to do this?
-
Importing Existing Resources: You have existing infrastructure not managed by Terraform. How would you bring these resources under Terraform management?
-
Lifecycle Block: What is it, and why is it used on resources?
-
Toolsets: What are you using to validate and check Terraform code for issues and abnormalities?
-
Validating Policies: How can you define and enforce policies for resources that are created with Terraform?
-How do you handle credentials for a PHP application accessing MySQL or any other secrets in Docker?
-What is the command for running container logs?
-Have you upgraded any Kubernetes clusters?
-How do you deploy an application in a Kubernetes cluster?
-How do you communicate with a Jenkins server and a Kubernetes cluster?
-Which DevOps tools are you proficient with?
-Can you describe the CI/CD workflow in your project?
-How do you handle the continuous delivery (CD) aspect in your projects?
-What methods do you use to check for code vulnerabilities?
-What AWS services are you proficient in?
-How would you access data in an S3 bucket from Account A when your application is running on an EC2 instance in Account B?
-How do you provide access to an S3 bucket, and what permissions need to be set on the bucket side?
-How can Instance 2, with a static IP, communicate with Instance 1, which is in a private subnet and mapped to a multi-AZ load balancer?
-How do you pass arguments to a VPC while using the terraform import command?
-What are the prerequisites before importing a VPC in Terraform?
-If an S3 bucket was created through Terraform but someone manually added a policy to it, how do you handle this situation using IaC?
-For an EC2 instance in a private subnet, how can it verify and download required packages from the internet without using a NAT gateway or bastion host? Are there any other AWS services that can facilitate this?
-What is the typical latency for a load balancer, and if you encounter high latency, what monitoring steps would you take?
-If your application is hosted in S3 and users are in different geographic locations, how can you reduce latency?
-Which services can be integrated with a CDN (Content Delivery Network)?
-How do you dynamically retrieve VPC details from AWS to create an EC2 instance using IaC?
-How do you manage unmanaged AWS resources in Terraform?
-Write BASH Script for Prime numbers.
๐๐๐ง๐ค๐ข๐ง๐ฌ
- How do you integrate unit testing, integration testing, and end-to-end testing into a Jenkins pipeline?
- How do you troubleshoot common Jenkins issues, such as build failures or performance problems?
- How do you configure a Jenkins job to trigger automatically on code changes in a Git repository?
๐๐ข๐ง๐ฎ๐ฑ
- How do you monitor the CPU and memory of the linux systems?
- Explain the concept of system calls and how they interact with the kernel.
- How do you manage file systems using fsck, mount, and umount?
- How do you secure a Linux system using iptables or firewalld?
๐๐ง๐ฌ๐ข๐๐ฅ๐
- What are dynamic inventory in ansible and how do you manage it?
- Create a ansible role to install httpd in 10 servers with tags
- Can you design a playbook with notifiers and handlers for 10 tools installations?
๐๐ฎ๐๐๐ซ๐ง๐๐ญ๐๐ฌ
- Design the kubernetes manifest file to create the CronJob with concurrency as forbid and to run for every 1 minute with nginx container having tag as latest
- What will be the command to add the annotation and the labels for the existing pod?
- Design the kubernetes cluster with Ingress.
- A sudden surge in traffic causes a web application to become unresponsive what will be the steps you will take to mitigate
- Design the deployment of the pod with replica set set as 3 and having apache httpd image running as a container.
๐๐/๐๐
- How do you integrate automated testing into a CI/CD pipeline for ETL jobs?
- How do you manage environment-specific configurations in a CI/CD pipeline?
- How is version control managed for database schemas and ETL scripts in a CI/CD pipeline?
๐๐จ๐๐ค๐๐ซ
- How do you define a multi-container application using Docker Compose?
- How do you ensure the security of Docker images and containers?
- How does Docker handle network communication between containers?
๐ฃ๐๐๐ต๐ผ๐ป
- What are the key differences between Python 2 and Python 3?
- How do you integrate Python with popular DevOps tools like Jenkins and Ansible?
- How do you use Python to interact with cloud platforms like AWS, GCP, and Azure?
- How do you troubleshoot Python scripts in a DevOps environment?
- What are the benefits of using Python for configuration management over other tools?
Make a multi stage declarative jenkinsfile with the following requirements:
- Only allowed to run on nodes labeled 'test-runner'
- First stage - Installs dependencies with the bash command "flutter pub get"
- Second Stage and Third stage run in parallel.
- Second Stage runs a style check with the bash command "flutter analyze".
- Third Stage runs a static analyze check with the bash command "flutter format".
- Make sure all stages store individual logs for each command.
- Archive these logs in Jenkins so they are available to developers.
- Send an alert with Success message if all stages pass. If there are errors, send an alert with the name of the stage that failed.
pipeline {
agent { label 'test-runner' }
stages {
stage('Install Dependencies') {
steps {
script {
def logFile = 'install_dependencies.log'
sh "flutter pub get >
Explain how kubernetes networking works, save this post and share with your community
Pod Networking:
Pod: A group of containers that share the same network namespace. Pod IP: Each pod is assigned a unique IP address within the cluster. CNI (Container Network Interface): A plugin that manages pod networking. It creates network interfaces, assigns IP addresses, and configures routes.
Cluster Networking:
Network Plugin: A component responsible for providing network connectivity between pods and external services. Overlay Networks: These networks create a virtual network on top of the physical network. Routing: The network plugin handles routing between pods in different nodes. Service Discovery: Kubernetes provides a mechanism to discover services within the cluster. This allows pods to communicate with each other without knowing their exact IP addresses.
Service Mesh:
A layer of abstraction: A service mesh provides a layer of abstraction for service-to-service communication. Traffic Management: It handles tasks like load balancing, fault tolerance, and security. Sidecar Proxy: Each pod typically has a sidecar proxy that handles network traffic.
Key Concepts and Components:
CNI (Container Network Interface): A plugin that manages pod networking. Network Plugin: A component responsible for providing network connectivity between pods and external services. Overlay Networks: These networks create a virtual network on top of the physical network. Service Discovery: Kubernetes provides a mechanism to discover services within the cluster. Service Mesh: A layer of abstraction for service-to-service communication. Sidecar Proxy: A proxy that handles network traffic for a pod.
Common Networking Approaches:
Overlay Networks: VXLAN, Calico, Flannel, Weave Net, and Cilium are common overlay network solutions. Host-Port Networking: Pods can be directly exposed to the host network. IP-Per-Pod: Each pod gets its own IP address. Additional Considerations: Network Policies: Kubernetes allows you to define network policies to control traffic flow between pods. Ingress Controllers: Ingress controllers handle incoming traffic to the cluster. Load Balancing: Kubernetes provides built-in load balancing capabilities.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฝ๐ผ๐ฑ๐ --๐ฎ๐น๐น-๐ป๐ฎ๐บ๐ฒ๐๐ฝ๐ฎ๐ฐ๐ฒ๐: Check the status of all pods across namespaces to identify failures.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฒ๐๐ฐ๐ฟ๐ถ๐ฏ๐ฒ ๐ฝ๐ผ๐ฑ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ: Gather detailed information about a failed pod.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐น๐ผ๐ด๐ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ -๐ฐ ๐ฐ๐ผ๐ป๐๐ฎ๐ถ๐ป๐ฒ๐ฟ_๐ป๐ฎ๐บ๐ฒ: View logs of a specific container inside a pod to troubleshoot issues.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฒ๐๐ฒ๐ป๐๐ --๐ฎ๐น๐น-๐ป๐ฎ๐บ๐ฒ๐๐ฝ๐ฎ๐ฐ๐ฒ๐ --๐๐ผ๐ฟ๐-๐ฏ๐='.๐บ๐ฒ๐๐ฎ๐ฑ๐ฎ๐๐ฎ.๐ฐ๐ฟ๐ฒ๐ฎ๐๐ถ๐ผ๐ป๐ง๐ถ๐บ๐ฒ๐๐๐ฎ๐บ๐ฝ': Review recent events for clues on crashes and errors.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ป๐ผ๐ฑ๐ฒ๐: Verify the status of nodes in the cluster, checking for node failures.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฟ๐ฎ๐ถ๐ป ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ --๐ถ๐ด๐ป๐ผ๐ฟ๐ฒ-๐ฑ๐ฎ๐ฒ๐บ๐ผ๐ป๐๐ฒ๐๐: Safely evacuate and cordon a node for recovery operations.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฐ๐ผ๐ฟ๐ฑ๐ผ๐ป ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ: Mark a node as unschedulable to prevent new pods from being scheduled during recovery.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฒ๐น๐ฒ๐๐ฒ ๐ฝ๐ผ๐ฑ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ --๐ด๐ฟ๐ฎ๐ฐ๐ฒ-๐ฝ๐ฒ๐ฟ๐ถ๐ผ๐ฑ=0 --๐ณ๐ผ๐ฟ๐ฐ๐ฒ: Forcefully delete a crashed pod to restart it or clear it for recovery.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฟ๐ผ๐น๐น๐ผ๐๐ ๐๐ป๐ฑ๐ผ ๐ฑ๐ฒ๐ฝ๐น๐ผ๐๐บ๐ฒ๐ป๐ ๐ฑ๐ฒ๐ฝ๐น๐ผ๐๐บ๐ฒ๐ป๐_๐ป๐ฎ๐บ๐ฒ: Roll back a deployment in case a new rollout causes crashes.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฒ๐ ๐ฒ๐ฐ -๐ถ๐ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ -- /๐ฏ๐ถ๐ป/๐๐ต: Access a container to debug and resolve application issues directly inside the pod.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฐ๐ผ๐บ๐ฝ๐ผ๐ป๐ฒ๐ป๐๐๐๐ฎ๐๐๐๐ฒ๐: Check the health of core cluster components like etcd, kube-apiserver, and more.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐๐ผ๐ฝ ๐ป๐ผ๐ฑ๐ฒ๐: Monitor node resource usage to detect resource exhaustion causing crashes.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐๐ผ๐ฝ ๐ฝ๐ผ๐ฑ๐ --๐ฎ๐น๐น-๐ป๐ฎ๐บ๐ฒ๐๐ฝ๐ฎ๐ฐ๐ฒ๐: Check pod resource usage across namespaces, identifying bottlenecks leading to crashes.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฒ๐น๐ฒ๐๐ฒ ๐ป๐ผ๐ฑ๐ฒ ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ: Remove a failed node from the cluster to allow recovery operations.
- ๐ฒ๐๐ฐ๐ฑ๐ฐ๐๐น --๐ฒ๐ป๐ฑ๐ฝ๐ผ๐ถ๐ป๐๐=๐ต๐๐๐ฝ๐://๐ฒ๐๐ฐ๐ฑ-๐๐ฒ๐ฟ๐๐ฒ๐ฟ:2379 ๐๐ป๐ฎ๐ฝ๐๐ต๐ผ๐ ๐ฟ๐ฒ๐๐๐ผ๐ฟ๐ฒ ๐ฏ๐ฎ๐ฐ๐ธ๐๐ฝ.๐ฑ๐ฏ: Restore etcd from a snapshot in case of etcd failure.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฎ๐ฝ๐ฝ๐น๐ -๐ณ ๐ฏ๐ฎ๐ฐ๐ธ๐๐ฝ.๐๐ฎ๐บ๐น: Reapply configurations from a backup manifest during recovery.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐๐ฎ๐ถ๐ป๐ ๐ป๐ผ๐ฑ๐ฒ๐ ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ ๐ธ๐ฒ๐=๐๐ฎ๐น๐๐ฒ:๐ก๐ผ๐ฆ๐ฐ๐ต๐ฒ๐ฑ๐๐น๐ฒ: Prevent scheduling on a node experiencing issues during recovery.
- ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฒ๐ป๐ฑ๐ฝ๐ผ๐ถ๐ป๐๐ ๐๐ฒ๐ฟ๐๐ถ๐ฐ๐ฒ_๐ป๐ฎ๐บ๐ฒ: Verify service endpoints during recovery to ensure services are resolving correctly.
18 - Key Release Management Metrics
- Percent of release success rate This metric simply shows the percentage of planned releases that are deployed on time. There are many factors that can prevent releases from happening on schedule, some of which are difficult to avoid (such as changing priorities). For example, the 2020 State of Code Review survey cites changing requirements as the single biggest reason for missing release deadlines. Another common cause of delays is miscommunication when handing off work between departments, which can be improved by streamlining release processes and strengthening collaboration via culture and tools.
When pinpointing bottlenecks in your release process, itโs helpful to have a quick, all-inclusive view of where you are spending your time. Plutoraโs dashboard below breaks down releases into phases and types, providing a granular view of the average phase and gate duration, bringing to light where constraints are in the release process.
- Percentage of Escaped Defects This measures the defects that make it past production and are found by customers post-release. Naturally itโs impossible (and perhaps even undesirable) to catch all bugs, as that would involve laboriously checking and rechecking features at the cost of velocity. For example, NASA was able to achieve zero defects for its Space Shuttle Software, but at great expense -- thousands of dollars per line of code. Most projects wonโt require that level of scrutiny; however, itโs still an important measure that directly impacts customer satisfaction and maintenance costs. Ultimately, itโs up to the business to find that โsweet spotโ of balancing quality and speed to best serve customers.
Percentage of escaped defects is generally used as one of the key indicators of QA performance, but keep in mind that all bugs are not equal -- some are mission critical, while others are merely cosmetic. Since this metric does not weigh for importance, itโs helpful to discuss with your QA team for context.
- Defect Density The number of known defects per module or X lines of code. Defect density is a useful indicator of quality, and 1 defect per 1000 lines of code is typically seen as the standard for โgoodโ quality.
As always, keep context in mind. If your development team is putting forth substantial effort into optimizing code, defect density could increase even while the organization is improving behind the scenes.
- Number of Releases The number of releases over a given timespan. It gives you insight into whether your release frequency is increasing, inconsistent or decreasing. Similar to the other metrics, context is helpful here. Perhaps in August there were several major releases, as opposed to minor ones in September. Thatโs why this data is best broken down by type or portfolio for additional detail.
Refer to this sample dashboard from Plutora, which gives a quick and simple view of the types releases over time:
-
Deployment Duration Usually measured in hours, this is the time it takes to execute a deployment. Deployment duration is useful for understanding whether deployments are causing delays. Many businesses are turning now to automation, which takes the load off of developers, reduces manual errors, and aids in pushing small changes to production.
-
Release Duration Usually measured in days, this is the time it takes to execute a release. This number measures how quickly a business can serve its customers, and can be useful for roadmap planning. Like other release metrics, this metric can be broken down into type to provide better context.
For instance, have a look at Plutora's release duration dashboard:
release duration over time 7. Number of Release Backouts Also known as a rollback, this counts the number of releases that fail to meet expectations and have to be reversed. If a workaround exists, a Request for Change (RFC) should be created to deploy the fix into production. Tracking the number of release backouts indicates the quality and thoroughness of development and testing.
-
Proportion of Automatic Release Distribution Usually a percentage, this metric refers to the proportion of new releases that are distributed automatically. A high number here indicates that your ITSM environment is operating effectively.
-
Downtime Usually measured in minutes, downtime refers to the cumulative amount of time that users cannot access your software. Downtime is not always avoidable, but should be minimized as much as possible.
-
Number of Outages Caused by a Release The number of outages, or interruptions to your service, that are directly caused by a release. Depending on the length of the outages, these can be very serious -- virtually incapacitating businesses, causing financial loss, and eroding customer trust. This metric helps business stakeholders to get a clear picture of the costs associated with any given release.
-
Number of Incidents Caused by a Release The amount of incidents caused by a release. Incidents are also referred to as defects or bugs, and can be characterized as unexpected deviations in system behavior. This number is often used as a key metric in evaluating release team performance.
Take a look at Plutoraโs breakdown of releases by risk level to understand the risk profile of your releases. This can then be correlated with incidents in production.
releases by risk level 12. Percentage of All Changes Causing Major Incidents Similar to the number of incidents caused by a release, this measures whether the change management team is doing their job effectively.
In the dashboard below, you can see the breakdown of different types of changes over time. Simply overlay incident data over this chart to get a quick sense of where and when incidents are occurring.
changes over time dashboard 13. On-time Delivery This measures the timeliness of delivery. When deadlines are consistently missed, it doesnโt necessarily mean that your team is doing a bad job -- perhaps the problem is a dearth of resources, or the scope of the project needs to be cut.
- Releases Delivered on Schedule by Application Provides a breakdown of timely and successful releases by application. If an application is consistently behind schedule, that impacts the business when it comes to market share, security, and compliance concerns.
For example, this dashboard shows the various releases by portfolio:
releases by portfolio 15. Mean Time to Repair (MTTR) MTTR represents the average time required to repair a failed component or device. It calculates the time from the very beginning of an incident until the moment itโs solved, and provides a strong indication of how fast your organization responds to problems.
To get this number, find out the total time spent on unplanned maintenance for a specific asset and divide that number by the number of failures that happened with that asset over a period of time. In general, a MTTR value of five hours or less is considered to be quite good.
- Average Lead Time Also referred to as Time to Value, this measures the period of time between accepting a work item and pushing it to production. It tells you how long it takes for you to deliver on customers requests. When examining lead time, itโs important to know exactly what youโre measuring. You probably donโt want to include low priority tickets in this metric because it will lead to unreasonably high numbers; keep it limited to high and medium priority tickets.
Average lead time is an essential metric for ensuring business continuity and planning, as it enables the PMs and C-suite to plan accordingly.
lead time dashboard 17. Average Cycle Time Cycle time is how long it takes your team to deliver something after starting work on it. It provides a lot of insight to engineering leaders as to how their team is allocating its time. For instance, if features are sitting for days waiting for QA after developers are done coding, thatโs a useful observation that can generate real change. A good dashboard here is invaluable, as it can help you to drill down deeper and see the exact distribution of how time is spent.
cycle time dashboard 18. Average Cost Per Release Calculate man-hours and salary per release to arrive at this number. This will give you an idea of the ROI of each release, resulting in better business decisions.
Question 7: Problem: Given an array A of N integers, return the maximum product of any two distinct integers in the array.
Input: A = [1, 10, 2, 6, 5, 3] Output: 60 (10 * 6)
Question 8: Problem: Write a function that takes two sorted arrays A and B and returns a new array that contains the elements of both A and B in sorted order.
Input: A = [1, 3, 5], B = [2, 4, 6] Output: [1, 2, 3, 4, 5, 6] Constraints: Try to Solve this without inbuilt functions
Question 9: Problem: Write a function that checks if a string is a palindrome. A palindrome is a string that reads the same backward as forward.
Input: S = "racecar" Output: True Constraints: Ignore case sensitivity and spaces for this problem.
1๏ธโฃ Ping Test:
Start with the ping command to check connectivity to a website or IP address. It helps determine whether your device can communicate with the server.
hashtag#Command ping google.com
- Success: You're connected to the network.
- Failure: There may be a network issue or the destination is unreachable.
2๏ธโฃ DNS Lookup:
If you can ping an IP but not a domain, there may be a DNS issue. Use nslookup or dig to test if DNS is resolving correctly.
hashtag#Command nslookup example.com
This will show the IP address your DNS server resolves for the domain.
3๏ธโฃ Check Network Configuration:
Check your network configuration using ifconfig (Linux/macOS) or ipconfig (Windows) to make sure your device has a proper IP address, gateway, and DNS settings.
hashtag#Command ipconfig
4๏ธโฃ Telnet to Check Open Ports:
If you're trying to access a service but it's not responding, use telnet to check if a specific port is open and accessible.
hashtag#Command telnet example.com 80
If the connection fails, the service might be down or blocked by a firewall.
5๏ธโฃNetstat for Active Connections: Run netstat to view active network connections and see which ports are being used by applications. This is useful for checking open ports and connections on your device.
hashtag#Command netstat -an
It will display all active network connections, which can help identify potential issues.
6๏ธโฃ Traceroute: Use traceroute to map the path data packets take to reach a destination, which helps diagnose where delays or blockages occur in the network.
hashtag#Command traceroute google.com
tcpdump
nc
1.What are the types of pipeline in Jenkins you can write. 2.How do you resolve if the build starts to fail. 3.Master slave Architecture of Jenkins. 4.Architecture of Kubernetes 5.How do you define any yaml file. 6.What are the kinds services in k8s. 7.How do you make your cluster secure. 8.Difference between Stateless and stateful. 9.Tell me about Network Policy in K8s. 10.How do you create EC2 instances using terraform 11.How do you write terraform script, explain with example. 12.Terraform as Cloud Infrastructure. 13.How do you take back up in Jenkins.
โข Understanding HTTP is more important than Nginx vs Apache
- Know how web protocols work
- Master request/response cycles
- Learn load balancing concepts
โข Understanding Linux is more important than Ubuntu vs RHEL
- Know how processes work
- Master file systems
- Learn system troubleshooting
โข Understanding networks is more important than Cisco vs Juniper
- Know how routing works
- Master TCP/IP basics
- Learn network debugging
โข Understanding security is more important than tools vs scanners
- Know how attacks happen
- Master access controls
- Learn threat modeling
โข Understanding storage is more important than EBS vs EFS
- Know how disks work
- Master data persistence
- Learn backup strategies
โข Understanding monitoring is more important than Grafana vs Datadog
- Know what metrics matter
- Master alert design
- Learn troubleshooting
General Questions: 1- Introduce yourself. 2- Describe a challenge you faced as a devops engineer and how you overcame it. 3- Would you like to work individually or in a team. 4 - Tell me about something where you got a chance to learn and implement from scratch. 5- How would you be handling a situation where you are not getting any help from your team members.
AWS Questions: 1- Launch template vs launch configuration. 2- How do you communicate with AWS services privately without exposing to internet. 3- NAT g/w vs NAT instance v/s Egress only IGW v/s IGW 4- Design a 3 tier infrastructure using the AWS services which should be secure and highly available. 5- Statefull v/s stateless firewalls. 6- Differentiate between NLB and ALB. 7- You want to redirect traffic from x.company.in to company.in/x, how to achieve it. 8- Your database initially performs well, but after certain month you face slowness. How to troubleshoot and fix the same.
K8s: 1- K8s architecture 2- Deployment v/s stateful set v/s replica set 3- What is config map 4- Role of etcd in kubernetes. 5- How rolling updates work in a deployment?
Docker: 1- ADD v/s COPY 2- Entrypoint v/s CMD 3- How to remove all unwanted or unused docker objects from system? 4- Multistage docker build. 5- Is docker file immutable
Terraform: 1- What does terraform init do ? 2- How to auto approve the terraform changes? 3- count v/s foreach 4- How to import an existing resource to terraform? 5- Data black in terraform. 6- What are provisioners in terraform? 7- Remote backend.
Linux: 1- How to check the list of installed packages? Ans ) apt --installed 2- Command to check kernel version. Ans) uname -r . More detailed information uname -a 3- How to create a new user and add it as sudo? 4- Command to check if a process called "a" is running or not. If running how stop it. 5- Command to list all files less than 5mb. 6- Hard link v/s soft link. 7- Commands to check disk space usage and free RAM.
CI/CD: 1- Git fetch v/s Git pull 2- Sonarqube quality gate vs quality profile. 3- What is sonar runner. 4- Types of pipeline in Jenkins. 5- Scripted vs Declarative pipeline. 6- Should we prefer artifactory to store artifacts or should we store them in s3. 7- How to upgrade plugins in Jenkins. 8- Usermanagement in Jenkins. 9- Concepts about Gitlab runners. 10- How to upgrade Jenkins.
Ansible: 1- loops in Ansible. 2- Ansible Roles. 3- Is Ansible idempotent? 4- Ansible script to install nginx. 5- Conditionals in Ansible.