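Deploying an inference service with LWS
Connecting to the cluster with kubectl
To deploy an inference service with LWS (LeaderWorkerSet), you first install the LWS operator and then submit the workload as a YAML manifest. Both steps require a running cluster that kubectl can connect to.
For detailed steps, see here.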
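Before continuing, it can help to confirm that kubectl reaches the cluster and that suitable GPU nodes exist. A minimal sketch, assuming the nodes carry the `cloud.ebtech.com/gpu` label that the node affinity rule in the manifest later in this guide selects on (the function name `check_cluster` is illustrative, not part of any tool):

```shell
# Illustrative helper; the label key and value are copied from r1_lws.yaml.
check_cluster() {
  # Fails if the API server is not reachable with the current kubeconfig.
  kubectl cluster-info || return 1
  # Lists the nodes that the manifest's node affinity rule can schedule onto.
  kubectl get nodes -l cloud.ebtech.com/gpu=H800_NVLINK_80GB
}
```

Run `check_cluster` and check that at least two matching nodes are listed, since the manifest starts two pods that each request 8 GPUs.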
Installing the LWS operator
Install it with the following commands:
VERSION=v0.5.1
kubectl apply --server-side -f https://ghfast.top/github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
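After applying the manifests, you may want to wait until the controller is available before submitting any LeaderWorkerSet. A minimal sketch, assuming the release manifest keeps the upstream default names (namespace `lws-system`, deployment `lws-controller-manager`); the function name `wait_for_lws` is illustrative:

```shell
# Assumption: upstream defaults for the namespace and deployment name.
wait_for_lws() {
  # $1: optional timeout, defaults to 5 minutes.
  kubectl wait deploy/lws-controller-manager -n lws-system \
    --for=condition=available --timeout="${1:-5m}"
}
```

If `wait_for_lws` times out, `kubectl get pods -n lws-system` shows the state of the controller pod.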
Deploying the inference service
Prepare the following YAML file:
---
# r1_lws.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        hostIPC: true
        affinity:
          nodeAffinity: # Pod scheduling affinity
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: cloud.ebtech.com/gpu
                      operator: In
                      values:
                        - H800_NVLINK_80GB
        containers:
          - name: sglang-leader
            image: registry-cn-huabei1-internal.ebcloud.com/ebsys/sglang:0.4.6-cuda12.2-python3.10-ubuntu22.04-v04
            securityContext:
              privileged: true
            env:
              - name: NCCL_IB_HCA
                value: "mlx5_100,mlx5_101,mlx5_102,mlx5_103,mlx5_104,mlx5_105,mlx5_106,mlx5_107"
              - name: GLOO_SOCKET_IFNAME
                value: "bond0"
              - name: NCCL_SOCKET_IFNAME
                value: "bond0"
              - name: NCCL_IB_DISABLE
                value: "0"
              - name: NCCL_P2P_LEVEL
                value: "NVL"
              - name: NCCL_IB_GID_INDEX
                value: "3"
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - /public/huggingface-models/deepseek-ai/DeepSeek-R1
              - --mem-fraction-static
              - "0.93"
              - --torch-compile-max-bs
              - "8"
              - --max-running-requests
              - "20"
              - --tp
              - "16" # Size of Tensor Parallelism
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --host
              - "0.0.0.0"
              - --port
              - "40000"
            resources:
              limits:
                cpu: "128"
                memory: "1Ti"
                nvidia.com/gpu: "8"
                rdma/hca_shared_devices_ib: "8"
            ports:
              - containerPort: 40000
            readinessProbe:
              tcpSocket:
                port: 40000
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - name: public-volume
                mountPath: /public
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: public-volume
            hostPath:
              path: /public
              type: Directory
    workerTemplate:
      spec:
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        hostIPC: true
        affinity:
          nodeAffinity: # Pod scheduling affinity
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: cloud.ebtech.com/gpu
                      operator: In
                      values:
                        - H800_NVLINK_80GB
        containers:
          - name: sglang-worker
            image: registry-cn-huabei1-internal.ebcloud.com/docker.io/lmsysorg/sglang:v0.4.2.post4-cu125-build1
            securityContext:
              privileged: true
            env:
              - name: NCCL_IB_HCA
                value: "mlx5_100,mlx5_101,mlx5_102,mlx5_103,mlx5_104,mlx5_105,mlx5_106,mlx5_107"
              - name: GLOO_SOCKET_IFNAME
                value: "bond0"
              - name: NCCL_SOCKET_IFNAME
                value: "bond0"
              - name: NCCL_IB_DISABLE
                value: "0"
              - name: NCCL_P2P_LEVEL
                value: "NVL"
              - name: NCCL_IB_GID_INDEX
                value: "3"
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - /public/huggingface-models/deepseek-ai/DeepSeek-R1
              - --mem-fraction-static
              - "0.93"
              - --torch-compile-max-bs
              - "8"
              - --max-running-requests
              - "20"
              - --tp
              - "16" # Size of Tensor Parallelism
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
            resources:
              limits:
                cpu: "128"
                memory: "1Ti"
                nvidia.com/gpu: "8"
                rdma/hca_shared_devices_ib: "8"
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - name: public-volume
                mountPath: /public
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: public-volume
            hostPath:
              path: /public
              type: Directory
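The `--tp` value in the manifest above lines up with the total GPU count of one leader-worker group: `size: 2` pods, each requesting 8 GPUs. A minimal sketch of that arithmetic, which may serve as a sanity check when changing the group size (the variable names are illustrative, not from the original):

```shell
# Illustrative variable names; the values are copied from r1_lws.yaml.
GROUP_SIZE=2      # leaderWorkerTemplate.size (one leader plus one worker)
GPUS_PER_POD=8    # resources.limits "nvidia.com/gpu"
TP_SIZE=$((GROUP_SIZE * GPUS_PER_POD))
printf '%s\n' "--tp $TP_SIZE"
```

If you change either number, the `--tp` argument in both the leader and the worker command presumably needs to change with it.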
Run the deployment command:
kubectl apply -f r1_lws.yaml
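Loading the model across two nodes can take a while, so it is worth watching the pods before moving on. A minimal sketch, assuming the pods carry the `leaderworkerset.sigs.k8s.io/name` label (the same label the Service selector in this guide relies on) and that the leader pod of the first group is named `<name>-0`, which is the LWS naming convention as far as I know; both function names are illustrative:

```shell
# Illustrative helpers; $1 is the LeaderWorkerSet name and defaults to "sglang".
lws_pods() {
  # Shows the leader and worker pods and the nodes they were scheduled on.
  kubectl get pods -l "leaderworkerset.sigs.k8s.io/name=${1:-sglang}" -o wide
}

lws_leader_logs() {
  # Assumption: the leader pod of group 0 is named "<name>-0".
  kubectl logs -f "${1:-sglang}-0"
}
```

The leader pod only reports Ready once its readiness probe can open TCP port 40000.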
Exposing the service
We expose the service through a public IP. Prepare the following YAML file:
---
# r1_lws_svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: sglang-leader
spec:
  type: LoadBalancer
  selector:
    leaderworkerset.sigs.k8s.io/name: sglang
    role: leader
  ports:
    - protocol: TCP
      port: 40000
      targetPort: 40000
Run the deployment command:
kubectl apply -f r1_lws_svc.yaml
Look up the public IP address:
kubectl get svc | grep sglang-leader
sglang-leader LoadBalancer 10.233.33.222 61.135.xx.xx 40000:30758/TCP 107m
Note:
- For more on working with public IPs, see here.
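For scripting, the public IP can be pulled out of the `kubectl get svc` output instead of being copied by hand. A minimal sketch, assuming the default column layout shown above, where EXTERNAL-IP is the fourth column (the function name `external_ip` is illustrative):

```shell
# Reads `kubectl get svc` rows on stdin and prints the EXTERNAL-IP (4th column)
# of the row whose NAME (1st column) equals $1.
external_ip() {
  awk -v name="$1" '$1 == name { print $4 }'
}

# Usage: kubectl get svc | external_ip sglang-leader
```

While the load balancer is still being provisioned, this column typically shows `<pending>` rather than an address.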
Accessing the service
Using the public IP address above, send a request to the service directly:
# Send the request
curl -X POST http://61.135.xx.xx:40000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "/model",
  "messages": [
    {
      "role": "user",
      "content": "你好,你是谁?"
    }
  ],
  "stream": false,
  "temperature": 0.7
}'
# Response
{"id":"fd716224a59a4b058192468a6475dfd8","object":"chat.completion","created":1742113202,"model":"/model","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n\n</think>\n\n您好!我是由中国的深度求索(DeepSeek)公司开发的智能助手DeepSeek-R1。如您有任何任何问题,我会尽我所能为您提供帮助。","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":1}],"usage":{"prompt_tokens":7,"total_tokens":49,"completion_tokens":42,"prompt_tokens_details":null}}
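For repeated requests, the call above can be wrapped in small shell helpers. This is a sketch under stated assumptions: `SGLANG_URL`, `chat_payload`, and `chat` are hypothetical names introduced here, the masked address is kept exactly as it appears above, and the user message must not contain characters that need JSON escaping (such as double quotes or backslashes).

```shell
# Hypothetical variable; replace the masked address with your service's public IP.
SGLANG_URL="${SGLANG_URL:-http://61.135.xx.xx:40000}"

# Prints the request body for a single user message ($1).
# Assumption: $1 contains no characters that require JSON escaping.
chat_payload() {
  printf '{"model": "/model", "messages": [{"role": "user", "content": "%s"}], "stream": false, "temperature": 0.7}' "$1"
}

# Sends the message ($1) to the chat completions endpoint used above.
chat() {
  curl -sS -X POST "$SGLANG_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1")"
}
```

With the helpers loaded, `chat "hello"` sends the same kind of request as the curl command above, with a different message.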