Skip to content

Local Storage 本地存储

基本介绍

local storage 在 pod 所在的物理机节点分配一块 ephemeral 存储空间。 该空间为临时性存储,Pod 重建、删除或调度时数据会丢失。适用场景为业务需要大容量本地存储,且无需持久化保存数据。

注意:

  • 专属节点池、预约节点池中的工作负载,开启本地存储不会再额外收取费用。

不同规格支持的 local storage 容量上限

目前仅A800,H800,H100支持 local storage。

H800支持的 local storage 上限
规格名称容量上限
bob-eci.h800.5xlarge768 GB
bob-eci.h800.10xlarge1536 GB
bob-eci.h800.20xlarge3072 GB
bob-eci.h800.40xlarge6144 GB
A800支持的 local storage 上限
规格名称容量上限
bob-eci.a800.5large768 GB
bob-eci.a800.10large1536 GB
bob-eci.a800.20large3072 GB
bob-eci.a800.40large6144 GB
bob-eci.a800.5xlarge768 GB
bob-eci.a800.10xlarge1536 GB
bob-eci.a800.20xlarge3072 GB
bob-eci.a800.40xlarge6144 GB
H100支持的 local storage 上限
规格名称容量上限
bob-eci.h100.5xlarge768 GB
bob-eci.h100.10xlarge1536 GB
bob-eci.h100.20xlarge3072 GB
bob-eci.h100.40xlarge6144 GB

基于开发机使用本地存储

具体示例如下:

img_1.png

基于 k8s 工作负责使用本地存储

Storage Class

ephemeral-els-premium

注意:

  • 不可以直接用这个sc创建pvc,需要与pod一同使用。
  • 参考下面的示例。
Deployment 使用本地存储 yaml 示例
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test1
  namespace: default
  labels:
    app: test1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test1
  template:
    metadata:
      labels:
        app: test1
    spec:
      schedulerName: volcano
      terminationGracePeriodSeconds: 1
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.ebtech.com/gpu
                    operator: In
                    values: ["A800_NVLINK_80GB"]
      containers:
        - name: server
          image: registry-cn-huabei1-internal.ebcloud.com/ebsys/pytorch:2.7.0-cuda12.9-python3.12-ubuntu24.04-v01
          imagePullPolicy: IfNotPresent
          command:
            - bash
            - -c
            - |
              # 捕获 SIGTERM 信号并立即退出
              trap 'echo "Received SIGTERM, exiting..."; exit 0' SIGTERM

              # 使用一个简单的循环,而不是 sleep infinity
              # 这样可以更快地响应信号
              while true; do
                sleep 1
              done
          resources:
            limits:
              cpu: "10"
              memory: 100Gi
              nvidia.com/gpu: "1"
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          volumeMounts:
            - mountPath: /mnt/tmp-test1
              name: tmp-test1
      volumes:
        - name: tmp-test1
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: [ "ReadWriteOnce" ]
                resources:
                  requests:
                    storage: 500Gi
                storageClassName: ephemeral-els-premium
                volumeMode: Filesystem
StatefulSet 使用本地存储 yaml 示例
yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test2
  namespace: default
  labels:
    app: test2
spec:
  replicas: 1
  serviceName: test2
  selector:
    matchLabels:
      app: test2
  template:
    metadata:
      labels:
        app: test2
    spec:
      schedulerName: volcano
      terminationGracePeriodSeconds: 1
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.ebtech.com/gpu
                    operator: In
                    values: ["A800_NVLINK_80GB"]
      containers:
        - name: server
          image: registry-cn-huabei1-internal.ebcloud.com/ebsys/pytorch:2.7.0-cuda12.9-python3.12-ubuntu24.04-v01
          imagePullPolicy: IfNotPresent
          command:
            - bash
            - -c
            - |
              # 捕获 SIGTERM 信号并立即退出
              trap 'echo "Received SIGTERM, exiting..."; exit 0' SIGTERM

              # 使用一个简单的循环,而不是 sleep infinity
              # 这样可以更快地响应信号
              while true; do
                sleep 1
              done
          resources:
            limits:
              cpu: "10"
              memory: 100Gi
              nvidia.com/gpu: "1"
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          volumeMounts:
            - mountPath: /mnt/tmp-test1
              name: tmp-test1
      volumes:
        - name: tmp-test1
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: [ "ReadWriteOnce" ]
                resources:
                  requests:
                    storage: 500Gi
                storageClassName: ephemeral-els-premium
                volumeMode: Filesystem
Pod 使用本地存储 yaml 示例
yaml
apiVersion: v1
kind: Pod
metadata:
  name: test3
  namespace: default
  labels:
    app: test3
spec:
  schedulerName: volcano
  terminationGracePeriodSeconds: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: cloud.ebtech.com/gpu
                operator: In
                values: ["A800_NVLINK_80GB"]
  containers:
    - name: server
      image: registry-cn-huabei1-internal.ebcloud.com/ebsys/pytorch:2.7.0-cuda12.9-python3.12-ubuntu24.04-v01
      imagePullPolicy: IfNotPresent
      command:
        - bash
        - -c
        - |
          # 捕获 SIGTERM 信号并立即退出
          trap 'echo "Received SIGTERM, exiting..."; exit 0' SIGTERM

          # 使用一个简单的循环,而不是 sleep infinity
          # 这样可以更快地响应信号
          while true; do
            sleep 1
          done
      resources:
        limits:
          cpu: "10"
          memory: 100Gi
          nvidia.com/gpu: "1"
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      volumeMounts:
        - mountPath: /mnt/tmp-test1
          name: tmp-test1
  volumes:
    - name: tmp-test1
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: [ "ReadWriteOnce" ]
            resources:
              requests:
                storage: 500Gi
            storageClassName: ephemeral-els-premium
            volumeMode: Filesystem