Containers

Scaling beyond IPv4: integrating IPv6 Amazon EKS clusters into existing Istio Service Mesh

Organizations are increasingly adopting IPv6 for their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, driven by three key factors: depletion of private IPv4 addresses, the need to streamline or eliminate overlay networks, and improved network security requirements on Amazon Web Services (AWS). In IPv6-enabled EKS clusters, each pod receives a unique IPv6 address from the […]

Centralized Amazon ECS task logging with Amazon OpenSearch

As enterprises continue to adopt containerized workloads, the need for robust and scalable logging solutions has become increasingly important. Logging is a crucial element in monitoring and troubleshooting distributed applications, especially in modern containerized environments such as those deployed on Amazon Elastic Container Service (Amazon ECS). As microservices architectures grow in complexity, managing logs across multiple […]

Deep dive into cluster networking for Amazon EKS Hybrid Nodes

In this post, we dive deep into cluster networking configurations for Amazon EKS Hybrid Nodes, exploring different Container Network Interface (CNI) options and load balancing solutions to meet various networking requirements. The post demonstrates how to implement BGP routing with Cilium CNI, static routing with Calico CNI, and set up both on-premises load balancing using MetalLB and external load balancing using AWS Load Balancer Controller.

Under the hood: Amazon EKS ultra scale clusters

This post was co-authored by Shyam Jeedigunta, Principal Engineer, Amazon EKS; Apoorva Kulkarni, Sr. Specialist Solutions Architect, Containers and Raghav Tripathi, Sr. Software Dev Manager, Amazon EKS. Today, Amazon Elastic Kubernetes Service (Amazon EKS) announced support for clusters with up to 100,000 nodes. With Amazon EC2’s new generation accelerated computing instance types, this translates to […]

Amazon EKS enables ultra scale AI/ML workloads with support for 100K nodes per cluster

We’re excited to announce that Amazon Elastic Kubernetes Service (Amazon EKS) now supports up to 100,000 worker nodes in a single cluster, enabling customers to scale up to 1.6 million AWS Trainium accelerators or 800K NVIDIA GPUs to train and run the largest AI/ML models. This capability empowers customers to pursue their most ambitious AI […]

Improving Amazon ECS deployment consistency with SOCI Index Manifest v2

Seekable OCI (SOCI) helps Amazon Elastic Container Service (Amazon ECS) customers reduce task launch times by starting containers before their images are fully downloaded. For reliable deployments, Amazon ECS software version consistency ensures that the same container image is used throughout an ECS deployment. However, when running ECS tasks with SOCI, there was still […]

Fully Sharded Data Parallel with Ray on Amazon ECS

In this post, we demonstrate how to implement Fully Sharded Data Parallel (FSDP) fine-tuning of the dolly-v2-7b model using Amazon ECS. The solution uses a Ray cluster running on ECS with two services (head and worker) connected to Amazon S3, enabling efficient distributed training across multiple GPUs while abstracting away container orchestration complexities.

Amazon EKS Pod Identity streamlines cross account access

This post was co-authored by Ashok Srirama, Principal Container Specialist SA and George John, Senior Product Manager EKS. Today, we’re excited to announce a significant enhancement to Amazon EKS Pod Identity: streamlined cross-account access for Kubernetes applications. This new feature simplifies the process of granting pods permission to access AWS resources in other accounts. […]

Maximizing GPU Utilization using NVIDIA Run:ai in Amazon EKS

This post was co-authored with Chad Chapman of NVIDIA. In the fast-paced world of artificial intelligence and machine learning, GPU resources are both critical and in high demand. In this blog, we will cover key challenges related to GPU utilization in artificial intelligence and machine learning applications, and how NVIDIA Run:ai fractional GPU technology […]

Deep Dive: Amazon EKS Dashboard for Visibility into Multi-Cluster Operations and Governance

This blog post was jointly authored by Carlos Santana, Sr. Solution Architect, Containers; Sriram Ranganathan, Sr. Product Manager, Kubernetes; Sabari Sawant, Product Marketing Manager, Kubernetes; and Frank Carta, Sr. GTM specialist, Containers. As organizations grow their Kubernetes infrastructure across AWS Regions and accounts, they face increasing challenges in maintaining oversight of their Kubernetes clusters. Without […]