CloudEngineering (AWS)/EKS 2021. 12. 3. 11:32

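I was asked: "We're running on AWS EKS, so the cluster scales out automatically when resources run short, right?"

It doesn't. An EKS cluster does not react on its own to what happens inside the cluster. In the previous post I noted that node groups have a limit on the number of pods. When pods cannot be scheduled because that limit has been exceeded, or when they sit in Pending because the cluster is short on CPU or memory, extra nodes are needed.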

Cluster Autoscaler - Amazon EKS (docs.aws.amazon.com)

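Cluster Autoscaler is the component that handles these situations. It supports multiple cloud providers, and the official AWS documentation describes it as automatically adjusting the number of nodes in a cluster when pods fail or are rescheduled onto other nodes.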

Prerequisites

IAM policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "ec2:DescribeLaunchTemplateVersions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

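This grants the node group permission to autoscale itself. Create the policy above and attach it to the node IAM role used by the node group.

Node group minimum/maximum size

Cluster Autoscaler scales out up to the maximum size configured on the node group, and scales in nodes with low resource usage down to the minimum size, which keeps costs down.

Deploying Cluster Autoscaler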

wget https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

Download cluster-autoscaler-autodiscover.yaml and replace only the <YOUR CLUSTER NAME> placeholder with your cluster name. Cluster Autoscaler finds the Auto Scaling groups it manages through the tags k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<cluster name>, and adjusts their 'Desired capacity'. If your Auto Scaling group does not carry these tags, add them.
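As a rough illustration of the tag requirement above, here is a minimal Python sketch that reports which discovery tags an Auto Scaling group is missing and builds the matching --node-group-auto-discovery value. The function names and the sample tag dictionary are hypothetical; the sketch assumes discovery matches on tag keys only, and it works on a plain dict rather than calling AWS.

```python
from __future__ import annotations


def discovery_tag_keys(cluster_name: str) -> list[str]:
    """Tag keys used by the auto-discovery flag shown in this post."""
    return [
        "k8s.io/cluster-autoscaler/enabled",
        f"k8s.io/cluster-autoscaler/{cluster_name}",
    ]


def missing_discovery_tags(asg_tags: dict[str, str], cluster_name: str) -> list[str]:
    """Return the required tag keys that are absent from an ASG's tags.

    Assumption: only the presence of the key matters, not its value.
    """
    return [key for key in discovery_tag_keys(cluster_name) if key not in asg_tags]


def discovery_flag(cluster_name: str) -> str:
    """Build the flag value that replaces <YOUR CLUSTER NAME> in the manifest."""
    return "--node-group-auto-discovery=asg:tag=" + ",".join(
        discovery_tag_keys(cluster_name)
    )


if __name__ == "__main__":
    # Hypothetical tags copied by hand from an Auto Scaling group.
    tags = {"k8s.io/cluster-autoscaler/enabled": "true", "Name": "demo-node"}
    print(missing_discovery_tags(tags, "demo-cluster"))
    print(discovery_flag("demo-cluster"))
```

If the first call returns a non-empty list, those are the tags to add to the Auto Scaling group before deploying.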

spec:
      containers:
      - command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>
        - --balance-similar-node-groups
        - --skip-nodes-with-system-pods=false

$ kubectl apply -f cluster-autoscaler-autodiscover.yaml
serviceaccount/cluster-autoscaler created
clusterrole.rbac.authorization.k8s.io/cluster-autoscaler created
role.rbac.authorization.k8s.io/cluster-autoscaler created
clusterrolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
rolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
deployment.apps/cluster-autoscaler created

How Cluster Autoscaler works

To see it in action, we need pods that put load on the cluster. The workshop below is a good reference for deploying temporary pods to test with.

EKSworkshop.com (www.eksworkshop.com)

Creating pods that hog the CPU

I created pods that hog the CPU and kept cluster CPU at 96%. You might expect a scale out at this point, but Cluster Autoscaler does not add a node.

Creating additional pods

Now let's create more pods. With no resources left, the cluster cannot schedule them and they stay in Pending. As mentioned above, Cluster Autoscaler watches for pods that could not be scheduled.

Check the Cluster Autoscaler pod logs. They show 1→2 (max: 4), meaning it read the maximum size I configured on the node group and will scale out only up to that maximum.

The node count went from 1 to 2, and the Pending pods moved to Running. Although the pods are now Running, their nature means CPU stays high again. Even so, no further scale out happens at this point.
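The behaviour observed above can be modelled with a small Python sketch. It is a simplified simulation, not Cluster Autoscaler's actual code: it considers CPU requests only, uses first-fit placement, and the names Node, schedule, and scale_out are hypothetical. Note that actual CPU utilization is not an input anywhere, which is the point: only pods left Pending trigger a new node, and only while the group is below its maximum size.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    allocatable_cpu_m: int       # schedulable CPU in millicores
    requested_cpu_m: int = 0     # sum of CPU requests of pods placed on this node


def schedule(pod_requests_m: list[int], nodes: list[Node]) -> list[int]:
    """Place pods first-fit by CPU request; return the requests left Pending."""
    pending = []
    for request in pod_requests_m:
        for node in nodes:
            if node.requested_cpu_m + request <= node.allocatable_cpu_m:
                node.requested_cpu_m += request
                break
        else:
            pending.append(request)  # no node had room for this pod
    return pending


def scale_out(
    pod_requests_m: list[int], nodes: list[Node], node_cpu_m: int, max_nodes: int
) -> list[int]:
    """Add nodes only while pods are Pending and the group is below max size."""
    pending = schedule(pod_requests_m, nodes)
    while pending and len(nodes) < max_nodes:
        if not any(request <= node_cpu_m for request in pending):
            break  # a new node would not help any Pending pod
        nodes.append(Node(allocatable_cpu_m=node_cpu_m))
        pending = schedule(pending, nodes)
    return pending


if __name__ == "__main__":
    group = [Node(allocatable_cpu_m=2000)]
    # Busy pods that all fit: no Pending pods, so the node count stays at 1.
    scale_out([500, 500, 500], group, node_cpu_m=2000, max_nodes=4)
    print(len(group))
    # Two more pods: one no longer fits, so a second node is added.
    scale_out([500, 500], group, node_cpu_m=2000, max_nodes=4)
    print(len(group))
```

Pods still returned by scale_out once the group has reached max_nodes correspond to pods that stay Pending in a real cluster.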

Shrinking the node group

Once the temporary pods finish their work, they move to Completed and stop using CPU. Cluster Autoscaler, which has been monitoring node load, then picks a node to terminate, has its pods rescheduled elsewhere, and scales in.
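The scale-in decision can be sketched the same way. This is a simplified Python model under stated assumptions, not the real implementation: utilization is taken as CPU requests divided by allocatable CPU, the 0.5 threshold and the 10-minute window are what I understand the upstream defaults of --scale-down-utilization-threshold and --scale-down-unneeded-time to be, and a node's pods are treated as one lump that must fit on a single other node. NodeState and pick_scale_in are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

UTILIZATION_THRESHOLD = 0.5   # assumed default: below this a node looks unneeded
UNNEEDED_SECONDS = 10 * 60    # assumed default: must stay unneeded for 10 minutes


@dataclass
class NodeState:
    name: str
    allocatable_cpu_m: int
    requested_cpu_m: int
    low_since: Optional[float] = None   # when the node first looked underutilized


def pick_scale_in(nodes: list[NodeState], now: float, min_nodes: int) -> list[str]:
    """Return the names of nodes this sketch would remove at time `now`."""
    removed = []
    remaining = list(nodes)
    for node in nodes:
        utilization = node.requested_cpu_m / node.allocatable_cpu_m
        if utilization >= UTILIZATION_THRESHOLD:
            node.low_since = None        # busy again: reset the timer
            continue
        if node.low_since is None:
            node.low_since = now         # start the timer
        if now - node.low_since < UNNEEDED_SECONDS:
            continue                     # not underutilized for long enough yet
        if len(remaining) <= min_nodes:
            continue                     # never go below the group's minimum size
        others = [n for n in remaining if n is not node]
        target = max(others, key=lambda n: n.allocatable_cpu_m - n.requested_cpu_m)
        if target.allocatable_cpu_m - target.requested_cpu_m < node.requested_cpu_m:
            continue                     # its pods would have nowhere to go
        target.requested_cpu_m += node.requested_cpu_m   # "reschedule" the pods
        remaining = others
        removed.append(node.name)
    return removed


if __name__ == "__main__":
    a = NodeState("a", allocatable_cpu_m=2000, requested_cpu_m=1500)
    b = NodeState("b", allocatable_cpu_m=2000, requested_cpu_m=0)  # pods Completed
    print(pick_scale_in([a, b], now=0, min_nodes=1))
    print(pick_scale_in([a, b], now=600, min_nodes=1))
```

Calling the function repeatedly with an increasing `now` mimics the periodic check: a node is only removed after it has looked underutilized for the whole window.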

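Conclusion

CA scales out when resources are insufficient, and it monitors nodes with low resource occupancy and reduces the node count toward the configured min value. You might wonder whether "insufficient resources" means node CPU / Memory at around 95%. It does not: CA watches for pods that are Pending because of a lack of resources. The takeaway is that scale out happens when a pod cannot be scheduled, and scale in happens once a node has stayed at low resource occupancy for 10 minutes.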
