
The Complete Guide to Kubernetes Scheduling

A deep dive into how Kubernetes scheduling works: Taint/Toleration, Node Affinity, Pod Affinity



This article is based on Kubernetes 1.28.


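Overview

The Kubernetes scheduler (kube-scheduler) decides which node each Pod runs on. Understanding how it makes that decision is key to operating Kubernetes well. This article walks through the scheduler's workflow and the core placement controls: Taint/Toleration, Node Affinity, and Pod Affinity.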


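1. How the Scheduler Works

Scheduling flow

User submits a Pod
  ↓
API Server persists the Pod to etcd
  ↓
Scheduler watches the API Server and picks up the unscheduled Pod
  ↓
① Filter (formerly "Predicates"): rule out nodes that cannot run the Pod
  ↓
② Score (formerly "Priorities"): rank the remaining nodes
  ↓
③ Bind: bind the Pod to the chosen node
  ↓
Kubelet starts the containers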

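Filter stage (Predicates)

Nodes that cannot satisfy the Pod are filtered out, for example because of:

  • Insufficient resources (CPU, memory, GPU)
  • Host port conflicts
  • Node labels that do not match the Pod's nodeSelector or required node affinity
  • Taints that the Pod does not tolerate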

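Scoring stage (Priorities)

Each node that passes filtering is scored, and the Pod is placed on the highest-scoring node (ties are broken at random):

  • Resource utilization (by default, nodes with more free resources score higher)
  • Affinity scores (preferred node affinity and Pod affinity)
  • Image locality (nodes that already hold the Pod's container images score higher)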
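
The two stages can be sketched in a few lines of Python. This is only an illustration of the filter-then-score idea, not the real kube-scheduler code: the `Node` class, the CPU-only resource model, and the scoring formula are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_millis: int   # unallocated CPU, in millicores
    total_cpu_millis: int  # allocatable CPU, in millicores


def filter_nodes(nodes, requested_cpu_millis):
    """Filter stage: drop nodes that cannot fit the request."""
    return [n for n in nodes if n.free_cpu_millis >= requested_cpu_millis]


def score_node(node, requested_cpu_millis):
    """Score stage: 0-100, higher when more CPU stays free after placement."""
    remaining = node.free_cpu_millis - requested_cpu_millis
    return 100 * remaining // node.total_cpu_millis


def pick_node(nodes, requested_cpu_millis):
    """Return the best node name, or None when every node is filtered out."""
    feasible = filter_nodes(nodes, requested_cpu_millis)
    if not feasible:
        return None  # the Pod would stay Pending
    best = max(feasible, key=lambda n: score_node(n, requested_cpu_millis))
    return best.name


nodes = [
    Node("node-a", free_cpu_millis=500, total_cpu_millis=4000),
    Node("node-b", free_cpu_millis=3000, total_cpu_millis=4000),
    Node("node-c", free_cpu_millis=1500, total_cpu_millis=2000),
]
print(pick_node(nodes, requested_cpu_millis=1000))  # node-b
```

Here node-a is removed by the filter, and node-b wins because it keeps the largest share of its CPU free after placement.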

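2. Taint: Nodes Repel Pods

Concept

A taint marks a node as "special": the node refuses, or prefers not to accept, any Pod that does not tolerate the taint.

Effect              Behavior
NoSchedule          New Pods that do not tolerate the taint are not scheduled onto the node; Pods already running there are unaffected
PreferNoSchedule    The scheduler tries to avoid placing non-tolerating Pods on the node, but may still do so
NoExecute           Non-tolerating Pods are not scheduled onto the node, and those already running there are evicted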

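Command-line usage

# Add a taint
kubectl taint nodes <node-name> <key>=<value>:<effect>

# Example: mark a GPU-only node
kubectl taint nodes gpu-node1 gpu=true:NoSchedule

# Example: mark a test node
kubectl taint nodes test-node1 env=test:PreferNoSchedule

# Example: mark a failed node (evicts every Pod that does not tolerate the taint).
# The node controller normally adds this taint automatically when a node becomes unreachable.
kubectl taint nodes bad-node1 node.kubernetes.io/unreachable:NoExecute

# Remove all taints with the key "gpu"
kubectl taint nodes node1 gpu-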

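Common scenarios

Scenario 1: Control-plane (master) nodes

kubectl describe node master | grep -A1 Taints

Output

Taints: node-role.kubernetes.io/control-plane:NoSchedule
        node-role.kubernetes.io/master:NoSchedule

By default, control-plane nodes run only system components, not regular workloads. kubeadm stopped applying the legacy master taint in 1.25, so on a 1.28 cluster you will typically see only the control-plane taint.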

Scenario 2: GPU-only nodes

# Taint the GPU node
kubectl taint nodes gpu-1 gpu=true:NoSchedule

# Inspect the taints
kubectl describe node gpu-1 | grep -A3 Taint

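Scenario 3: Node maintenance

# Mark the node for maintenance. kubectl taint has no delay option: Pods that do not
# tolerate the taint are evicted immediately. A Pod gets a grace period (e.g. 30 minutes)
# only through tolerationSeconds in its own toleration.
kubectl taint nodes node1 maintenance=true:NoExecute

# Remove the taint
kubectl taint nodes node1 maintenance:NoExecute-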

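3. Toleration: the Pod's Compromise

Concept

A toleration declares that a Pod can accept a node's taint. A node with a NoSchedule or NoExecute taint only accepts Pods that tolerate it. A toleration permits scheduling onto the tainted node but does not force it; combine it with node affinity to pin a Pod to specific nodes.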

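YAML fields explained

spec:
  tolerations:
  - key: "gpu"                    # taint key to tolerate
    operator: "Equal"            # match mode: Equal (default) or Exists
    value: "true"                # taint value (required for Equal; leave empty for Exists)
    effect: "NoSchedule"         # effect to tolerate; omit to match every effect
    # tolerationSeconds: 300     # only allowed with effect NoExecute: how long the Pod stays on the node after the taint is added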

Operators

operator   Meaning                          Example
Equal      Key and value must both match    key=gpu, value=true
Exists     The key only needs to exist      key=gpu, any value
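
The matching rules can be written out as a small Python sketch. It is a simplified model for illustration, not the Kubernetes source: the function names and the plain-dict representation of taints and tolerations are assumptions of this example.

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint."""
    # An empty effect on the toleration matches every effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    # An empty key matches every key (only meaningful with Exists).
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    # Exists ignores the value; Equal (the default) compares it.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(taints, tolerations):
    """Taints that keep the Pod off the node (PreferNoSchedule is only a soft hint)."""
    return [
        t for t in taints
        if t["effect"] in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]


gpu_taint = {"key": "gpu", "value": "true", "effect": "NoSchedule"}
equal_tol = {"key": "gpu", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
exists_tol = {"key": "gpu", "operator": "Exists"}
wrong_value = {"key": "gpu", "operator": "Equal", "value": "false"}

print(tolerates(equal_tol, gpu_taint))    # True
print(tolerates(exists_tol, gpu_taint))   # True
print(tolerates(wrong_value, gpu_taint))  # False
```

A node is schedulable for the Pod only when `untolerated_taints` returns an empty list.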

Examples

Example 1: Tolerate the GPU taint

apiVersion: v1
kind: Pod
metadata:
  name: ai-app
spec:
  containers:
  - name: app
    image: pytorch:latest
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Example 2: Tolerate the control-plane (master) node taints

tolerations:
- key: "node-role.kubernetes.io/control-plane"
  operator: "Exists"
  effect: "NoSchedule"
- key: "node-role.kubernetes.io/master"
  operator: "Exists"
  effect: "NoSchedule"

Example 3: Tolerate an unreachable node (brief network blips)

tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300  # evicted only after the taint has been present for 5 minutes
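
To make the timing concrete, the sketch below computes when a Pod would leave a node once a NoExecute taint appears. It is a simplified model with hypothetical names (`eviction_time`, `tolerated`), and it ignores the Pod's graceful-termination period.

```python
from datetime import datetime, timedelta


def eviction_time(taint_added_at, tolerated, toleration_seconds=None):
    """When a Pod is evicted after a NoExecute taint is added.

    Returns a datetime, or None if the Pod is never evicted.
    """
    if not tolerated:
        return taint_added_at  # no matching toleration: evicted right away
    if toleration_seconds is None:
        return None  # tolerated without a time limit: the Pod stays bound
    # Zero or negative values are treated as immediate eviction.
    return taint_added_at + timedelta(seconds=max(toleration_seconds, 0))


added = datetime(2024, 1, 1, 12, 0, 0)
print(eviction_time(added, tolerated=True, toleration_seconds=300))  # 2024-01-01 12:05:00
```

If the taint is removed before the deadline, for example because the node becomes reachable again, the Pod is not evicted.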

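Example 4: Tolerate every taint (use with caution)

tolerations:
# An empty key with operator Exists matches every taint key. Leaving out effect
# matches every effect; adding effect: "NoSchedule" would limit this to NoSchedule taints.
- operator: "Exists"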

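4. Node Affinity

Concept

Node affinity expresses which nodes a Pod wants to run on, either as a hard requirement or as a soft preference.

Types

Type                                               Meaning
requiredDuringSchedulingIgnoredDuringExecution     Must be satisfied (hard)
preferredDuringSchedulingIgnoredDuringExecution    Should be satisfied if possible (soft)

In both cases, IgnoredDuringExecution means that label changes after scheduling do not evict a running Pod.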

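matchExpressions operators

operator        Meaning
In              Label value is in the list
NotIn           Label value is not in the list
Exists          Label key exists
DoesNotExist    Label key does not exist
Gt              Label value, parsed as an integer, is greater than the given value
Lt              Label value, parsed as an integer, is less than the given value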
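
The operators can be modeled in Python as follows. This is a sketch for building intuition, not the scheduler's implementation; the function names and the dict-based node labels are assumptions of the example. It also encodes the combination rule: entries in nodeSelectorTerms are OR-ed, while the matchExpressions inside one term are AND-ed.

```python
def matches_expression(labels, key, operator, values=()):
    """Evaluate one matchExpressions entry against a node's labels."""
    if operator == "In":
        return key in labels and labels[key] in values
    if operator == "NotIn":
        return key not in labels or labels[key] not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    if operator in ("Gt", "Lt"):
        try:
            actual, bound = int(labels[key]), int(values[0])
        except (KeyError, IndexError, ValueError):
            return False  # a missing label or non-integer value never matches
        return actual > bound if operator == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {operator}")


def matches_node_selector(labels, terms):
    """Terms are OR-ed; expressions inside one term are AND-ed."""
    return any(
        bool(term) and all(matches_expression(labels, **expr) for expr in term)
        for term in terms
    )


node_labels = {"zone": "beijing", "env": "production", "cpu-count": "16"}
terms = [[
    {"key": "zone", "operator": "In", "values": ["beijing", "shanghai"]},
    {"key": "env", "operator": "In", "values": ["production"]},
]]
print(matches_node_selector(node_labels, terms))                  # True
print(matches_expression(node_labels, "cpu-count", "Gt", ["8"]))  # True
print(matches_expression(node_labels, "gpu", "DoesNotExist"))     # True
```

The `cpu-count` label is made up for the Gt example; any label whose value is an integer string behaves the same way.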

Examples

Example 1: Hard requirement (must be satisfied)

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - beijing
            - shanghai
          - key: env
            operator: In
            values:
            - production

Example 2: Soft preference (satisfied when possible)

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: ssd
            operator: In
            values:
            - "true"
      - weight: 50
        preference:
          matchExpressions:
          - key: high-memory
            operator: In
            values:
            - "true"
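
How the weights combine can be shown with a short Python sketch. For each candidate node the scheduler adds up the weights of the preferences that node satisfies; the real scheduler then normalizes this sum and blends it with other scoring plugins, which this example leaves out. The node names and the tuple layout are assumptions.

```python
def preference_score(node_labels, preferences):
    """Sum the weights of the preferences whose label matches the node."""
    return sum(
        weight
        for weight, key, value in preferences
        if node_labels.get(key) == value
    )


# Mirrors the YAML above: weight 100 for ssd=true, weight 50 for high-memory=true.
preferences = [(100, "ssd", "true"), (50, "high-memory", "true")]
nodes = {
    "node-a": {"ssd": "true", "high-memory": "true"},
    "node-b": {"ssd": "true"},
    "node-c": {"high-memory": "true"},
    "node-d": {},
}
scores = {name: preference_score(labels, preferences) for name, labels in nodes.items()}
print(scores)  # {'node-a': 150, 'node-b': 100, 'node-c': 50, 'node-d': 0}
```

A node that satisfies no preference, such as node-d, is still eligible; it simply ranks last on this criterion.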

Example 3: Full configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: env
                operator: In
                values:
                - production
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: ssd
                operator: In
                values:
                - "true"
      containers:
      - name: app
        image: nginx:latest

5. NodeSelector (Simple Selector)

Concept

nodeSelector is the simplest way to pick nodes. It suits plain exact-match label requirements.

Usage

# First, label the node
kubectl label nodes node1 node-type=compute

# Then reference the label in the Pod spec
spec:
  nodeSelector:
    node-type: compute

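Comparison

Feature            NodeSelector                                   Node Affinity
Complexity         Simple                                         More complex
Capabilities       Exact-match labels, all of which must match    Operators, OR-ed terms, weighted preferences
Recommended for    Simple requirements                            Complex requirements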

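6. Pod Affinity / Anti-Affinity

Concept

Pod affinity describes placement relative to other Pods:

  • Pod Affinity: run together with certain Pods
  • Pod Anti-Affinity: stay away from certain Pods

topologyKey

topologyKey names a node label. Nodes that share the same value for that label form one topology domain, and "together" or "apart" is judged per domain.

topologyKey                      Meaning                   Example
kubernetes.io/hostname           Same node                 node1
topology.kubernetes.io/zone      Same availability zone    Beijing Zone A
topology.kubernetes.io/region    Same region               Beijing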
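
The effect of topologyKey on a required anti-affinity rule can be sketched in Python. This is a simplified model with hypothetical names: it checks only the incoming Pod's rule against Pods that are already placed, and it supports only exact-match label selectors.

```python
def breaks_anti_affinity(candidate, node_labels, placed_pods, selector, topology_key):
    """True if placing a Pod on `candidate` would violate required anti-affinity.

    node_labels: node name -> labels; placed_pods: list of (node name, pod labels).
    """
    domain = node_labels[candidate].get(topology_key)
    if domain is None:
        return False  # simplification: a node without the key belongs to no domain
    for node_name, pod_labels in placed_pods:
        same_domain = node_labels[node_name].get(topology_key) == domain
        selected = all(pod_labels.get(k) == v for k, v in selector.items())
        if same_domain and selected:
            return True
    return False


node_labels = {
    "node1": {"kubernetes.io/hostname": "node1", "topology.kubernetes.io/zone": "zone-a"},
    "node2": {"kubernetes.io/hostname": "node2", "topology.kubernetes.io/zone": "zone-a"},
    "node3": {"kubernetes.io/hostname": "node3", "topology.kubernetes.io/zone": "zone-b"},
}
placed = [("node1", {"app": "redis"})]  # one Redis Pod already runs on node1
selector = {"app": "redis"}

for key in ("kubernetes.io/hostname", "topology.kubernetes.io/zone"):
    allowed = [
        n for n in node_labels
        if not breaks_anti_affinity(n, node_labels, placed, selector, key)
    ]
    print(key, allowed)
# kubernetes.io/hostname ['node2', 'node3']
# topology.kubernetes.io/zone ['node3']
```

With the hostname key only node1 is excluded; with the zone key every node in zone-a is excluded, so the next Redis Pod can land only on node3.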

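Examples

Example 1: Spread Pods across nodes (high availability)

# Spread 3 Redis Pods (labeled app=redis) across different nodes; with fewer than 3 eligible nodes the extra replicas stay Pending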
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis
        topologyKey: kubernetes.io/hostname

Example 2: Co-locate Web and Cache

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache
        topologyKey: kubernetes.io/hostname

Example 3: High availability across availability zones

# Prefer to spread Redis Pods across different availability zones
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - redis
          topologyKey: topology.kubernetes.io/zone

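7. Putting It All Together

Scenario

Requirements:
1. Must be scheduled onto a node with a GPU
2. Should preferably land on a node with an SSD
3. Must not share a node with the Web application
4. Tolerates node maintenance (not evicted for the first 30 minutes)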

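Node preparation

# GPU node (with SSD)
kubectl label nodes gpu-1 gpu=true
kubectl label nodes gpu-1 disktype=ssd
kubectl taint nodes gpu-1 gpu=true:NoSchedule

# Regular nodes (running Web). These node labels are optional: podAntiAffinity matches
# Pod labels, not node labels, so the Web Pods themselves must carry app=web
# (normally set under template.metadata.labels in the Web Deployment).
kubectl label nodes node-1 app=web
kubectl label nodes node-2 app=web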

Pod configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-training
  template:
    metadata:
      labels:
        app: ai-training
    spec:
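      # 1. Tolerations
      tolerations:
      # Tolerate the GPU taint
      - key: "gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      # Tolerate the maintenance taint for 30 minutes before eviction
      - key: "maintenance"
        operator: "Equal"
        value: "true"
        effect: "NoExecute"
        tolerationSeconds: 1800
      
      # 2. Affinity
      affinity:
        nodeAffinity:
          # Hard: the node must have a GPU
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "gpu"
                operator: In
                values:
                - "true"
          # Soft: prefer SSD (this field must be nested under nodeAffinity)
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: "disktype"
                operator: In
                values:
                - "ssd"
        
        # Anti-affinity: do not share a node with Web Pods
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname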
      
      containers:
      - name: trainer
        image: pytorch:2.0
        resources:
          limits:
            nvidia.com/gpu: 1

8. Debugging Commands

Inspect nodes

# List nodes
kubectl get nodes

# Show node details (including labels and taints)
kubectl describe node master

# Show node labels
kubectl get nodes --show-labels

Inspect Pod scheduling

# Show which node each Pod runs on
kubectl get pods -o wide

# Show a Pod's scheduling events
kubectl describe pod my-pod | grep -A 20 "Events:"

# Show why scheduling failed
kubectl describe pod my-pod | grep -A 5 "Warning"

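Common errors

# 1. Insufficient resources
# Warning  FailedScheduling  ...  Insufficient cpu

# 2. Node labels do not match (older releases print "node(s) didn't match node selector")
# Warning  FailedScheduling  ...  node(s) didn't match Pod's node affinity/selector

# 3. Untolerated taint (older releases print "node(s) had taints that the pod didn't tolerate")
# Warning  FailedScheduling  ...  node(s) had untolerated taint {gpu: true}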

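9. Summary

Comparison of scheduling concepts

Concept              Role                              Direction
Taint                Node repels Pods                  Node → Pod
Toleration           Pod accepts a taint               Pod ← Node
Node Affinity        Pod selects nodes                 Pod → Node
Pod Affinity         Pod moves toward other Pods       Pod → Pod
Pod Anti-Affinity    Pod stays away from other Pods    Pod ↔ Pod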

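When to use what

Taints/tolerations → dedicated nodes (GPU, databases)
Node affinity → selecting a specific region or hardware configuration
Pod affinity → services that communicate heavily with each other
Pod anti-affinity → spreading replicas for high availability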
