Kubernetes Scheduling: A Complete Guide

This article is based on Kubernetes 1.28.

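Overview

The Kubernetes scheduler (kube-scheduler) decides which node each Pod runs on. Understanding how it works is key to operating Kubernetes well. This article walks through the scheduler's workflow and the core concepts of Taint/Toleration, Node Affinity, and Pod Affinity.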

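1. Scheduler Workflow

Scheduling flow

User submits a Pod
    ↓
API Server persists it to etcd
    ↓
Scheduler watches the API Server and finds the unscheduled Pod
    ↓
① Filter (formerly Predicates) - rule out unsuitable nodes
    ↓
② Score (formerly Priorities) - score the remaining nodes
    ↓
③ Bind - bind the Pod to the chosen node
    ↓
Kubelet starts the containers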

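Filter phase (Predicates)

In Kubernetes 1.28 this phase is implemented by the Filter plugins of the scheduling framework; Predicates is the older name. Nodes that fail any hard constraint are filtered out, for example:

Insufficient resources (CPU, memory, GPU)
Host port conflicts
Node labels that do not match the Pod's nodeSelector or required node affinity
Taints the Pod does not tolerate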

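Score phase (Priorities)

Each node that passes filtering is scored, and the Pod is bound to the highest-scoring node:

Resource utilization (by default, nodes with more free resources score higher)
Affinity scores (preferred node and Pod affinity)
Locality (for example, nodes that already hold the container image)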
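
The two phases above can be sketched in a few lines of Python. This is a toy model rather than kube-scheduler's actual code: the `Node` and `PodRequest` classes, the field names, and the "most free CPU wins" scoring rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:  # hypothetical, simplified view of a node
    name: str
    free_cpu_millis: int
    free_mem_mib: int
    labels: dict = field(default_factory=dict)


@dataclass
class PodRequest:  # hypothetical, simplified view of a Pod
    cpu_millis: int
    mem_mib: int
    node_selector: dict = field(default_factory=dict)


def filter_nodes(pod, nodes):
    """Filter phase: keep only nodes that satisfy every hard constraint."""
    feasible = []
    for node in nodes:
        fits = (node.free_cpu_millis >= pod.cpu_millis
                and node.free_mem_mib >= pod.mem_mib)
        labels_ok = all(node.labels.get(k) == v
                        for k, v in pod.node_selector.items())
        if fits and labels_ok:
            feasible.append(node)
    return feasible


def score_node(pod, node):
    """Score phase (toy rule): more CPU left after placement scores higher."""
    return node.free_cpu_millis - pod.cpu_millis


def schedule(pod, nodes):
    """Return the chosen node name, or None if the Pod would stay Pending."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None
    return max(feasible, key=lambda n: score_node(pod, n)).name
```

The real scheduler runs many Filter and Score plugins and combines their weighted scores; the sketch keeps one of each so the filter-then-score order is easy to see.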

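2. Taint: How Nodes Repel Pods

Concept

A taint marks a node as special: the node repels every Pod that does not tolerate the taint.

Effect	Behavior
NoSchedule	New Pods that do not tolerate the taint are not scheduled to the node (Pods already running are unaffected)
PreferNoSchedule	The scheduler tries to avoid placing non-tolerating Pods on the node, but may still do so
NoExecute	Non-tolerating Pods are not scheduled to the node, and those already running are evicted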
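
The table can be read as a small decision function. The sketch below is only an illustration: `taint_outcome` and its return strings are hypothetical names, and it ignores the time-limited tolerations (tolerationSeconds) covered in the next section.

```python
def taint_outcome(effect, tolerated, running):
    """Describe what one taint means for one Pod.

    effect:    "NoSchedule", "PreferNoSchedule" or "NoExecute"
    tolerated: True if the Pod has a matching toleration
    running:   True if the Pod is already running on the node
    """
    if tolerated:
        return "keep" if running else "allow"
    if effect == "NoSchedule":
        # Only new Pods are blocked; running Pods are left alone.
        return "keep" if running else "reject"
    if effect == "PreferNoSchedule":
        # Soft rule: the scheduler avoids the node but may still use it.
        return "keep" if running else "avoid"
    if effect == "NoExecute":
        # Blocks new Pods and evicts running ones.
        return "evict" if running else "reject"
    raise ValueError(f"unknown taint effect: {effect}")
```

Only NoExecute ever changes the fate of a Pod that is already running.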

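Command-line usage

# Add a taint
kubectl taint nodes <node-name> <key>=<value>:<effect>

# Example: mark a dedicated GPU node
kubectl taint nodes gpu-node1 gpu=true:NoSchedule

# Example: mark a test node
kubectl taint nodes test-node1 env=test:PreferNoSchedule

# Example: mark a faulty node (evicts every Pod that does not tolerate the taint)
# The built-in key is node.kubernetes.io/unreachable; the node controller normally adds it automatically
kubectl taint nodes bad-node1 node.kubernetes.io/unreachable:NoExecute

# Remove all taints with the key "gpu"
kubectl taint nodes node1 gpu-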

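Common scenarios

Scenario 1: control-plane (master) nodes

kubectl describe node master | grep Taints

Output:

Taints: node-role.kubernetes.io/control-plane:NoSchedule

Clusters created with kubeadm before 1.25 may also show node-role.kubernetes.io/master:NoSchedule; that legacy taint is no longer applied in 1.28. Because of this taint, control-plane nodes by default run only system components, not ordinary workloads.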

Scenario 2: dedicated GPU nodes

# Taint the GPU node
kubectl taint nodes gpu-1 gpu=true:NoSchedule

# Inspect the taints
kubectl describe node gpu-1 | grep -A3 Taint

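Scenario 3: node maintenance

# Mark the node for maintenance; Pods without a matching toleration are evicted immediately
# (a taint cannot carry a delay; a grace period comes from the Pod's tolerationSeconds)
kubectl taint nodes node1 maintenance=true:NoExecute

# Remove the taint
kubectl taint nodes node1 maintenance=true:NoExecute-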

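3. Toleration: How Pods Accept Taints

Concept

A toleration lets a Pod accept a node's taint. A Pod can be scheduled onto a node with a NoSchedule or NoExecute taint only if it tolerates that taint. A toleration permits scheduling but does not force it; to pin a Pod to such a node, combine it with node affinity.

YAML fields

spec:
  tolerations:
  - key: "gpu"                    # taint key to tolerate
    operator: "Equal"            # match mode: Equal (default) or Exists
    value: "true"                # taint value (required for Equal)
    effect: "NoSchedule"         # effect to tolerate: NoSchedule, PreferNoSchedule or NoExecute
    # tolerationSeconds: 300     # only allowed with effect NoExecute: seconds the Pod stays after the taint appears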

operator values

operator	Meaning	Example
Equal	Key and value must both match	key=gpu, value=true
Exists	The key only needs to exist	key=gpu, any value
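
The matching rule can be written as a short function. This is a minimal sketch modeled on the documented semantics, not the scheduler's source: `tolerates` is a hypothetical name, and tolerations and taints are represented as plain dicts with the same field names as the YAML.

```python
def tolerates(toleration, taint):
    """Return True if the toleration matches the taint."""
    # An empty effect in the toleration matches every effect.
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    # An empty key (used with Exists) matches every key.
    key = toleration.get("key")
    if key and key != taint["key"]:
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False
```

A Pod passes the taint check for a node only when every NoSchedule and NoExecute taint on that node is matched by at least one of its tolerations.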

Examples

Example 1: tolerate the GPU taint

apiVersion: v1
kind: Pod
metadata:
  name: ai-app
spec:
  containers:
  - name: app
    image: pytorch:latest
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Example 2: tolerate control-plane (master) nodes

tolerations:
- key: "node-role.kubernetes.io/control-plane"
  operator: "Exists"
  effect: "NoSchedule"
# Legacy taint, only needed on clusters older than 1.25
- key: "node-role.kubernetes.io/master"
  operator: "Exists"
  effect: "NoSchedule"

Example 3: tolerate an unreachable node (brief network blips)

tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300  # evicted only after 5 minutes
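
To make the timing concrete, the sketch below computes when a Pod with this kind of toleration would be evicted. `eviction_time` is a hypothetical helper; it assumes the documented behavior that an unset tolerationSeconds means "tolerate forever" and that zero or negative values mean "evict immediately".

```python
from datetime import datetime, timedelta, timezone


def eviction_time(taint_added_at, toleration_seconds):
    """Return when the Pod is evicted, or None if it is never evicted."""
    if toleration_seconds is None:
        return None  # no tolerationSeconds: the taint is tolerated indefinitely
    # Zero and negative values are treated as 0 (evict immediately).
    return taint_added_at + timedelta(seconds=max(toleration_seconds, 0))


added = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(eviction_time(added, 300))   # five minutes after the taint was added
print(eviction_time(added, None))  # None
```

If the taint is removed before the deadline, for example because the node becomes reachable again, the Pod is not evicted.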

Example 4: tolerate every taint (use with caution)

tolerations:
- operator: "Exists"  # no key and no effect: matches every taint

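4. Node Affinity

Concept

Node affinity expresses which nodes a Pod wants to run on, as either hard requirements or soft preferences. IgnoredDuringExecution means that Pods already running are not evicted if node labels change later.

Types

Type	Meaning
requiredDuringSchedulingIgnoredDuringExecution	Must be satisfied (hard)
preferredDuringSchedulingIgnoredDuringExecution	Satisfied if possible (soft)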

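matchExpressions operators

operator	Meaning
In	Label value is in the list
NotIn	Label value is not in the list
Exists	Label key exists
DoesNotExist	Label key does not exist
Gt	Label value is greater than the given integer
Lt	Label value is less than the given integer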
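
The operators can be illustrated with a small evaluator. This is a sketch under stated assumptions: `matches` is a hypothetical name, a requirement is a dict shaped like one matchExpressions entry, and node labels are a plain dict of strings.

```python
def matches(requirement, labels):
    """Evaluate one matchExpressions entry against a node's labels."""
    key = requirement["key"]
    operator = requirement["operator"]
    values = requirement.get("values", [])

    if operator == "In":
        return key in labels and labels[key] in values
    if operator == "NotIn":
        # A node without the label also satisfies NotIn.
        return key not in labels or labels[key] not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    if operator in ("Gt", "Lt"):
        # Gt/Lt compare integers; the single entry in values is the bound.
        if key not in labels:
            return False
        try:
            actual, bound = int(labels[key]), int(values[0])
        except (ValueError, IndexError):
            return False
        return actual > bound if operator == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {operator}")
```

Label values are always strings, which is why Gt and Lt parse them as integers before comparing.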

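Examples

Example 1: hard requirement (must be satisfied)

Expressions inside one matchExpressions list are ANDed, while multiple nodeSelectorTerms are ORed. The node below must therefore have zone set to beijing or shanghai and also have env=production.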

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - beijing
            - shanghai
          - key: env
            operator: In
            values:
            - production

Example 2: soft preference (satisfied if possible)

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: ssd
            operator: In
            values:
            - "true"
      - weight: 50
        preference:
          matchExpressions:
          - key: high-memory
            operator: In
            values:
            - "true"
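
How the weights add up can be shown with a short calculation. The sketch is a simplification: `preference_score` is a hypothetical helper, each preference is reduced to a (weight, key, allowed values) tuple, and the real scheduler additionally normalizes this sum and combines it with the scores of other plugins.

```python
def preference_score(preferences, node_labels):
    """Sum the weights of every preference whose expression matches the node."""
    total = 0
    for weight, key, allowed_values in preferences:
        if node_labels.get(key) in allowed_values:
            total += weight
    return total


# Mirrors the weights in the example above: ssd=true (100), high-memory=true (50).
PREFERENCES = [(100, "ssd", ["true"]), (50, "high-memory", ["true"])]

print(preference_score(PREFERENCES, {"ssd": "true", "high-memory": "true"}))  # 150
print(preference_score(PREFERENCES, {"high-memory": "true"}))                 # 50
```

A node matching neither preference scores 0 here but remains eligible, which is the difference between a soft preference and a hard requirement.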

Example 3: full configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: env
                operator: In
                values:
                - production
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: ssd
                operator: In
                values:
                - "true"
      containers:
      - name: app
        image: nginx:latest

5. NodeSelector

Concept

nodeSelector is the simplest way to choose nodes and suits plain label-matching needs.

Usage

# First label the node
kubectl label nodes node1 node-type=compute

# Then reference the label in the Pod spec
spec:
  nodeSelector:
    node-type: compute

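Comparison

Feature	NodeSelector	Node Affinity
Complexity	Simple	More complex
Capability	Exact label matches, all ANDed	Operators, ORed terms, weighted preferences
Recommended for	Simple needs	Complex needs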
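
One way to see the relationship is that any nodeSelector can be rewritten as a required node affinity rule. The converter below is a sketch for illustration only: `node_selector_to_affinity` is a hypothetical helper, not part of any Kubernetes client library, and it returns a plain dict shaped like the affinity YAML.

```python
def node_selector_to_affinity(node_selector):
    """Rewrite a nodeSelector dict as an equivalent required nodeAffinity."""
    # Each key/value pair becomes an "In" expression with a single value;
    # putting them all in one term keeps the AND semantics of nodeSelector.
    expressions = [
        {"key": key, "operator": "In", "values": [value]}
        for key, value in sorted(node_selector.items())
    ]
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{"matchExpressions": expressions}]
            }
        }
    }


print(node_selector_to_affinity({"node-type": "compute"}))
```

The reverse is not always possible: rules that use NotIn, Gt, ORed terms, or weights have no nodeSelector equivalent.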

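6. Pod Affinity / Anti-Affinity

Concept

Pod affinity expresses placement relative to other Pods:

Pod Affinity: the Pod wants to be placed together with certain Pods
Pod Anti-Affinity: the Pod wants to stay away from certain Pods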

topologyKey

topologyKey	Meaning	Example
kubernetes.io/hostname	Same node	node1
topology.kubernetes.io/zone	Same availability zone	Beijing Zone A
topology.kubernetes.io/region	Same region	Beijing
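
The topologyKey decides what "together" or "apart" means: two nodes are in the same domain when they carry the same value for that label. The sketch below checks a required anti-affinity rule under that definition. The function name and data shapes are assumptions for illustration, and only matchLabels-style selectors are modeled.

```python
def violates_anti_affinity(candidate, nodes, pods, selector, topology_key):
    """Return True if placing a Pod on `candidate` breaks anti-affinity.

    nodes:    dict of node name -> node labels
    pods:     list of (node name, pod labels) for Pods already running
    selector: dict of labels the rule must avoid (like matchLabels)
    """
    domain = nodes[candidate].get(topology_key)
    if domain is None:
        return False  # simplification: a node without the label has no domain
    for node_name, pod_labels in pods:
        same_domain = nodes[node_name].get(topology_key) == domain
        selected = all(pod_labels.get(k) == v for k, v in selector.items())
        if same_domain and selected:
            return True
    return False


NODES = {
    "n1": {"kubernetes.io/hostname": "n1", "topology.kubernetes.io/zone": "a"},
    "n2": {"kubernetes.io/hostname": "n2", "topology.kubernetes.io/zone": "a"},
    "n3": {"kubernetes.io/hostname": "n3", "topology.kubernetes.io/zone": "b"},
}
PODS = [("n1", {"app": "redis"})]  # one Redis Pod already runs on n1
```

With kubernetes.io/hostname as the key only n1 is off limits for a second Redis Pod; with topology.kubernetes.io/zone both n1 and n2 are, because they share zone a.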

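Examples

Example 1: spread Pods for high availability

# Spread 3 Redis Pods across different nodes (the Pods themselves must carry the label app: redis)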
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis
        topologyKey: kubernetes.io/hostname

Example 2: co-locate Web with Cache

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache
        topologyKey: kubernetes.io/hostname

Example 3: high availability across zones

# Prefer spreading Redis Pods across availability zones
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - redis
          topologyKey: topology.kubernetes.io/zone

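7. Putting It All Together

Scenario

Requirements:
1. Must run on a node with a GPU
2. Should preferably run on an SSD node
3. Must not share a node with the Web application
4. Must tolerate node maintenance (no eviction for the first 30 minutes)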

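Node preparation

# GPU node (with SSD)
kubectl label nodes gpu-1 gpu=true
kubectl label nodes gpu-1 disktype=ssd
kubectl taint nodes gpu-1 gpu=true:NoSchedule

# Regular nodes (run Web). podAntiAffinity matches Pod labels, not node labels,
# so the Web Pods themselves must also carry the label app=web
kubectl label nodes node-1 app=web
kubectl label nodes node-2 app=web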

Pod configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-training
  template:
    metadata:
      labels:
        app: ai-training
    spec:
      # 1. Tolerations
      tolerations:
      # Tolerate the GPU taint
      - key: "gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      # Tolerate the maintenance taint for 30 minutes before eviction
      - key: "maintenance"
        operator: "Equal"
        value: "true"
        effect: "NoExecute"
        tolerationSeconds: 1800

      # 2. Affinity
      affinity:
        nodeAffinity:
          # Hard: must have a GPU
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "gpu"
                operator: In
                values:
                - "true"
          # Soft: prefer SSD
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: "disktype"
                operator: In
                values:
                - "ssd"

        # Anti-affinity: do not share a node with Web Pods
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      
      containers:
      - name: trainer
        image: pytorch:2.0
        resources:
          limits:
            nvidia.com/gpu: 1

8. Debugging Commands

Inspect nodes

# List nodes
kubectl get nodes

# Show node details (including labels and taints)
kubectl describe node master

# Show node labels
kubectl get nodes --show-labels

Inspect Pod scheduling

# Show which node each Pod runs on
kubectl get pods -o wide

# Show a Pod's scheduling events
kubectl describe pod my-pod | grep -A 20 "Events:"

# Show why scheduling failed
kubectl describe pod my-pod | grep -A 5 "Warning"

Common errors

# 1. Insufficient resources
# Warning  FailedScheduling  ...  Insufficient cpu

# 2. Node labels do not match
# Warning  FailedScheduling  ...  node(s) didn't match Pod's node affinity/selector

# 3. Untolerated taint
# Warning  FailedScheduling  ...  node(s) had untolerated taint {gpu: true}

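9. Summary

Concept comparison

Concept	Role	Direction
Taint	Node repels Pods	Node → Pod
Toleration	Pod accepts a taint	Pod ← Node
Node Affinity	Pod selects nodes	Pod → Node
Pod Affinity	Pod moves toward other Pods	Pod → Pod
Pod Anti-Affinity	Pod stays away from other Pods	Pod ↔ Pod

When to use what

Taints/tolerations → dedicated nodes (GPU, databases)
Node affinity → choosing a specific region or hardware configuration
Pod affinity → co-locating services that talk to each other
Pod anti-affinity → spreading replicas for high availability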