PyTorch Memory Fragment Analysis

Posted on 2025-06-29 Edited on 2025-07-13

最近在研究 PyTorch 训练显存 offload 相关的技术, 发现了一些有意思的现象和问题, 也踩了一些有意思的坑

model.to 的实现

在 PyTorch 中, 如果我们想把模型挪到某个设备中, 我们会调用 model.to 这个方法, 例如如果想把 model 从 GPU 挪到 CPU,
我们会使用 model.to('cuda') 实现模型的原地 device offload, 也就是说, 调用这个方法并不需要你写成
model = model.to('cuda')。但是我们知道如果是一个 tensor 的话是需要写成 tensor = tensor.to('cpu') 的。
那么为什么模型就无需这样操作呢? torch 是如何实现一个 tensor 原地替换呢?

注意到, pytorch 有一个有意思的方法叫 torch.utils.swap_tensors, 正如文档里描述, 我们可以这样的使用它来实现不改tensor reference的情况下改里面的storage

import torch
import torch.nn as nn
t1 = torch.arange(2)
t2 = torch.arange(3)
print(f"Before swapping, t1: {t1}, t2: {t2}")
torch.utils.swap_tensors(t1, t2)
print(f"After swapping, t1: {t1}, t2: {t2}")

那么 model 就可以使用类似的方法来实现原地替换, 参考文档。
我们也可以直接从 torch 源码中找到对应的实现

需要注意的是, 如果一个 tensor 被别人存了一个 slice, 那么 swap_tensors 会报错, 因此当我们想使用 swap_tensors 的时候, 不要做一些 slice 相关操作

奇妙的 64MiB 的显存占用

我们书写一个非常简单的程序如下

import torch

def print_mem_info(prefix: str):
    alloc, reserved = torch.cuda.memory_allocated() / 1024 / 1024, torch.cuda.memory_reserved() / 1024 / 1024
    print(f"{prefix} Current CUDA allocated {alloc:.2f} MiB, reserved {reserved:.2f} MiB.")

print_mem_info('[start]')
a = torch.zeros(1, 2, dtype=torch.bfloat16, device='cuda', requires_grad=True)
b = torch.zeros(2, 1, dtype=torch.bfloat16, device='cuda', requires_grad=True)

print_mem_info('[alloc]')

loss = torch.matmul(a, b)

print_mem_info('[matmul]')

loss.backward()

print_mem_info('[backward]')

运行代码得到了下面的输出

text

[start] Current CUDA allocated 0.00 MiB, reserved 0.00 MiB.
[alloc] Current CUDA allocated 0.00 MiB, reserved 2.00 MiB.
[matmul] Current CUDA allocated 32.00 MiB, reserved 34.00 MiB.
[backward] Current CUDA allocated 64.00 MiB, reserved 66.00 MiB.

看起来非常奇怪, 因为我们只分配了很小的三个 tensor, 分别是 a, b 和 loss, 为什么经过一次 matmul 之后,
显存涨了 32MiB, 然后经过一次 backward 之后, 显存又涨了 32MiB? 而且这部分增长的 64MiB 显存即便是
使用 torch.cuda.empty_cache() 也无法清理, 到底是哪里占了这些显存?

在 torch 的 code 中我们发现 torch 在 matmul 和 bwd 的时候需要给 cublas 一个 32M 的 workspace。
这个问题还是很隐蔽的, 其中最大的问题是会导致 torch 显存释放的时候出现碎片。
如果第一次 fwd 或者 bwd 是在代码比较靠后的位置做的, 那么前面申请的那些显存如果释放的话极有可能会导致 torch 出现较大的显存碎片,
这两个 32MiB 极端情况可能会占着一个非常大的 torch Segment 而无法释放, 因此我们需要额外关注这个问题。

一个简单的解决办法是把上面这段 demo 代码在你的代码最开头运行一次, 相当于做一次 fwd 和 bwd 的 warmup, 这样就能极大的避免显存碎片

即可以直接在代码最开头执行下面的代码:

torch.matmul(
    torch.zeros(1, 2, device='cuda', requires_grad=True),
    torch.zeros(2, 1, device='cuda', requires_grad=True),
).backward()

Monkey patch tensorboard writer 提高性能

Posted on 2024-06-28 Edited on 2024-09-22 In AI

背景

工作中会经常使用 tensorboard 进行实验管理和记录。但是我们发现，在某些场景下，tensorboard 的写性能会发现一些降速。查了下 tb 代码，发现写 tb 的时候代码写成了每一次 append write 都会 open 一个 file write 之后再 close，导致每次调用上层的 writer.add_scalar 函数的时候会多一次 open 和 close 操作，这个行为其实比较蠢，一般来说，只需要初始化一次文件指针，然后针对这个文件指针做操作即可，但是这里做了多次 open 和 close 真是多此一举。

另外，tb 外层的 flush 会清空本地队列的缓存，会等待 _byte_queue.join() 里元素都清空，这块逻辑没问题，但是底层的 fs 如果被我们换成一个统一的 fd 来 write 的话，需要支持能 flush 的 fs 中的 fd。但我们查看了 GFile 的 flush 代码，发现在我们的场景里面 self.fs_supports_append 是 True，所以看代码基本上 flush 啥都没干，close 代码也是类似的。因此我们也需要 monkey patch 一下 GFile 使其能支持内部 fs 的 flush 和 close

代码更改

思路是希望更改 tb 代码

支持使用单个文件指针做 append 的 write，然后可以使用 register_filesystem 去覆盖默认的 writer
Monkey patch 优化 GFile 的 flush 代码和 close 代码，使其支持 self.fs 的 flush 和 close

因此我们可以写下面的 append_tb_writer.py 代码

# append_tb_writer.py
import functools
import typing as t

from tensorboard.compat import tf
from tensorboard.compat.tensorflow_stub import compat
from torch.utils.tensorboard import SummaryWriter


class PatchedLocalFileSystem(tf.io.gfile.LocalFileSystem):
    stream_map: t.Dict[str, t.TextIO]

    def __init__(self):
        super().__init__()
        self.stream_map = {}

    def _stream(self, filename, binary_mode=False, create=True) -> t.Optional[t.TextIO]:
        # use append to open file
        mode = "ab" if binary_mode else "a"
        # generate key due to mode and filename
        key = f"{mode}@{filename}"
        if key not in self.stream_map or self.stream_map[key] is None:
            if not create:
                return None
            encoding = None if "b" in mode else "utf8"
            self.stream_map[key] = open(filename, mode, encoding=encoding)
        return self.stream_map[key]

    def append(self, filename, file_content, binary_mode=False):
        compatify = compat.as_bytes if binary_mode else compat.as_text
        # reuse fd to write instead of duplicated open and close
        self._stream(filename, binary_mode).write(compatify(file_content))

    def flush(self, filename, binary_mode=False):
        stream = self._stream(filename, binary_mode, create=False)
        if stream:
            stream.flush()

    def close(self, filename, binary_mode=False):
        stream = self._stream(filename, binary_mode, create=False)
        if stream:
            stream.close()


_origin_gfile = tf.io.gfile.GFile


class PatchedGFile(_origin_gfile):
    def flush(self):
        super().flush()
        assert hasattr(self.fs, "flush")
        self.fs.flush(self.filename, self.binary_mode)

    def close(self):
        super().close()
        assert hasattr(self.fs, "close")
        self.fs.close(self.filename, self.binary_mode)


@functools.cache
def patch_tb_gfile_and_fs():
    fs = PatchedLocalFileSystem()
    # monkey patch for registering a default fs instead of tf.io.gfile.LocalFileSystem
    tf.io.gfile.register_filesystem("", fs)
    tf.io.gfile.GFile = PatchedGFile


def assert_patched(writer: SummaryWriter):
    gfile = writer.file_writer.event_writer._general_file_writer
    assert isinstance(gfile, PatchedGFile)
    assert isinstance(gfile.fs, PatchedLocalFileSystem)

在主函数中，可以这么打上 patch

from append_tb_writer import patch_tb_gfile_and_fs, assert_patched
patch_tb_gfile_and_fs()
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(tensorboard_dir)
assert_patched(writer)

上面的 assert_patched 主要是需要确保给 writer 打上 patch

张量分析推导 AI 反向传播中的经典案例

Posted on 2023-12-03 Edited on 2024-09-22 In AI

张量分析推导 loss 对 $\mathbf{X}$ 的导数

利用张量分析书写矩阵乘法

在一个标准的神经网络线性层中，输出和输入的关系如下

\tag{1} \mathbf{T} = \mathbf{X} \mathbf{W} + \mathbf{b}

其中

$\mathbf{T}$ 是一个 $b \times h$ 的矩阵，这里 $b$ 一般是 batch_size， $h$ 是隐藏层大小
$\mathbf{X}$ 是一个 $b \times m$ 的矩阵，这里 $m$ 一般是 emb_size * block_size
$\mathbf{W}$ 是一个 $m \times h$ 的矩阵
$\mathbf{b}$ 在 pytorch 中一般不是一个矩阵，是一个长度为 $h$ 的向量，
通过使用了 pytorch 的扩展第一列扩展到所有列

我们可以使用张量分量的形式来表示

\tag{2} \mathbf{T} = (x_{ik} w^{k}_{\cdot j} + b_j)\mathbf{g}^i\mathbf{g}^j

用张量推导导数

在深度学习中，上述说的 $\mathbf{T}$ 其实会是 $loss$ 的一个函数，然后 $\mathbf{T}$ 又是 $\mathbf{X}$ 的函数，写出来是下述的样子

loss = l(\mathbf{T}(\mathbf{X}))

上述 $l$ 可以看做是一个二阶张量 $\mathbf{T}$ 的标量函数， $\mathbf{T}$ 是二阶张量 $\mathbf{X}$ 的二阶张量函数，而张量函数的导数存在以下链式求导规则

\frac{\partial l}{\partial \mathbf{X}} = \frac{\partial l}{\partial \mathbf{T}} \vcentcolon \frac{\partial \mathbf{T}}{\partial \mathbf{X}}

将上式按照张量分量展开如下

\tag{4} \begin{aligned} \frac{\partial l}{\partial \mathbf{X}} &= \frac{\partial l}{\partial t_{ij}}\mathbf{g}_i\mathbf{g}_j \vcentcolon \frac{\partial t_{kl}}{\partial x_{mn}}\mathbf{g}^k\mathbf{g}^l\mathbf{g}_m\mathbf{g}_n \\ &= \frac{\partial l}{\partial t_{ij}} \frac{\partial t_{kl}}{\partial x_{mn}} \delta_i^{\cdot k} \delta_j^{\cdot l} \mathbf{g}_m\mathbf{g}_n \\ &= \frac{\partial l}{\partial t_{ij}} \frac{\partial t_{ij}}{\partial x_{mn}} \mathbf{g}_m\mathbf{g}_n \\ \end{aligned}

于是我们得到了基于张量分量的链式法则。在公式 $(4)$ 中，其实 $\frac{\partial l}{\partial t_{ij}}$ 是一个已知字段，在上一次 backward 中已经计算出来了，现在我们的目标是计算 $loss$ 各相对于 $\mathbf{W}$ 、 $\mathbf{X}$ 和 $\mathbf{b}$ 的导数，将该链式法则带入 $(2)$ 式

\tag{5} \begin{aligned} \frac{\partial l}{\partial \mathbf{X}} &= \frac{\partial l}{\partial t_{ij}} \frac{\partial (x_{ik} w^{k}_{\cdot j} + b_j)}{\partial x_{mn}} \mathbf{g}_m\mathbf{g}_n \\ &= \frac{\partial l}{\partial t_{ij}} \delta_i^{\cdot m} \delta_k^{\cdot n} w^k_{\cdot j} \mathbf{g}_m\mathbf{g}_n \\ &= \frac{\partial l}{\partial t_{mj}} w^n_{\cdot j} \mathbf{g}_m\mathbf{g}_n \\ &= \frac{\partial l}{\partial t_{mj}} (w_j^{\cdot n})^\top \mathbf{g}_m\mathbf{g}_n \\ &= \frac{\partial l}{\partial \mathbf{T}} \mathbf{W}^\top \end{aligned}

上式中 $(w_{ki})^\top$ 表示矩阵的转置。因此我们将上式改为矩阵形式就可以得到

\tag{6} \frac{\partial l}{\partial \mathbf{X}} = \frac{\partial l}{\partial \mathbf{T}}\mathbf{W}^\top

由此，我们用很简单规整的数学推导推出了 $loss$ 相对于 $\mathbf{X}$ 的导数（或者称为梯度）

同样的思路，我们可以推导出 $\mathbf{W}$ 对应的导数

\tag{7} \begin{aligned} \frac{\partial l}{\partial \mathbf{W}} &= \frac{\partial l}{\partial t_{ij}} \frac{\partial (x_{ik} w^{k}_{\cdot j} + b_j)}{\partial w_{mn}} \mathbf{g}_m\mathbf{g}_n \\ &= \frac{\partial l}{\partial t_{ij}} x_{ik} \delta^{km} \delta_{j}^{\cdot n} \mathbf{g}_m\mathbf{g}_n \\ &= \frac{\partial l}{\partial t_{in}} x_i^{\cdot m} \mathbf{g}_m\mathbf{g}_n \\ &= (x^m_{\cdot i})^\top \frac{\partial l}{\partial t_{in}} \mathbf{g}_m\mathbf{g}_n \\ &= \mathbf{X}^\top \frac{\partial l}{\partial \mathbf{T}} \end{aligned}

和 $\mathbf{b}$ 对应的导数

\tag{8} \begin{aligned} \frac{\partial l}{\partial \mathbf{b}} &= \frac{\partial l}{\partial t_{ij}} \frac{\partial (w_{ik} x^{k}_{\cdot j} + b_j)}{\partial b_{m}} \mathbf{g}_m \\ &= \frac{\partial l}{\partial t_{ij}} \delta_j^{\cdot m} \mathbf{g}_m \\ &= \frac{\partial l}{\partial t_{im}} \mathbf{g}_m \\ &= \sum_i \frac{\partial l}{\partial t_{im}} \mathbf{g}_m \\ \end{aligned}

这里比较特殊，最终得到的结果其实是 $\frac{\partial l}{\partial \mathbf{T}}$ 按照行求和得到了一个长度为 $h$ 的向量。

将上述的公式表示为 pytorch 代码即为

1
2
3

dX = dT @ W.T
dW = X.T @ dT
db = dT.sum(0)

git 使用 http proxy 通过 ssh 和仓库交互

Posted on 2023-05-17 Edited on 2024-09-22 In Development

在公司内网开发的时候，有时候会无法连接外网，公司提供了一个 http proxy，例如 http://my.proxy:1087 来访问外网。因此我们需要通过这个代理来访问 github 等仓库。

访问 github 有两种方式，一种是 https 的，另一种是 ssh 的。当我们使用 https 来访问 github 的时候，直接在命令行使用 export http_proxy=http://my.proxy:1087 https_proxy=http://my.proxy:1087 就可以直接使用 ggit clone https://github.com/weixiao-huang/weixiao-huang.github.io.git 来克隆相应的仓库。

但是当我们要使用 ssh 来 git clone 的时候，会发现无法使用。我们需要更改一下 ssh 的配置，方式如下：

更改 ~/.ssh/config 添加下面的代码

1
2
3

Host  github.com
  User git
  ProxyCommand socat - PROXY:my.proxy:%h:%p,proxyport=1087

通过上述方式即可以通过代理连接上 github，使用 git clone git@github.com:weixiao-huang/weixiao-huang.github.io.git

gloo 源码解析

Posted on 2023-05-07 Edited on 2024-09-22 In Gateway

gloo 是一款基于 envoy 的云原生的API网关，它能够非常方便的和 K8S 进行集成，通过监听相关的 CRD，基于 envoy 的 xDS 接口对 envoy 配置进行 hot reload。同时也能很方便的集成 knative。另外它的文档也比较完善，设计比较简单，易于上手。

关于如何上手并使用 gloo，相信网上也能搜到一些中文资料，当然最好的方式当然是直接去 gloo 官方文档查看。本文主要深入 gloo 的源码，探究 gloo 是如何实现对 envoy 的配置下发的。

从最基本意义上讲，gloo 是转化引擎，Envoy xDS 服务器为 Envoy 提供高级配置（包括 gloo 的自定义 Envoy 过滤器）。gloo 遵循基于事件的体系结构，监视各种配置源以进行更新，并立即通过 v2 版本的 gRPC 接口更新 Envoy 配置。

本文基于 gloo 1.14.0 对 gloo 源码进行分析。

核心代码结构

gloo 的核心概念

整体上，gloo 的核心概念如下图所示：

可以看到，主要核心控制器有 3 个：Emitter、EventLoop 和 Syncer。其中 Emitter 实现了 Snapshots() 函数，这个函数会返回一个 Snapshot channel，然后在主循环 EventLoop 中会执行一个 Run() 函数去监听 Snapshot channel，一旦接收到新的 Snapshot，就会把这个 Snapshot 发送给 Syncer 实现的 Sync() 函数去处理，而这个 Sync() 函数就是最主要的将 gloo 配置同步到 envoy 的核心函数。

在上面我们可以频繁提到一个 Snapshot 的概念，这个就是 gloo 的最核心的数据结构。Snapshot 是一堆资源的集合，包括 gloo 的各种 CR 以及 K8S 的 Services 等资源，下面是 ApiSnapshot 的一个定义，可以参考源码：

type ApiSnapshot struct {
	Artifacts          gloo_solo_io.ArtifactList
	Endpoints          gloo_solo_io.EndpointList
	Proxies            gloo_solo_io.ProxyList
	UpstreamGroups     gloo_solo_io.UpstreamGroupList
	Secrets            gloo_solo_io.SecretList
	Upstreams          gloo_solo_io.UpstreamList
	AuthConfigs        enterprise_gloo_solo_io.AuthConfigList
	Ratelimitconfigs   github_com_solo_io_gloo_projects_gloo_pkg_api_external_solo_ratelimit.RateLimitConfigList
	VirtualServices    gateway_solo_io.VirtualServiceList
	RouteTables        gateway_solo_io.RouteTableList
	Gateways           gateway_solo_io.GatewayList
	VirtualHostOptions gateway_solo_io.VirtualHostOptionList
	RouteOptions       gateway_solo_io.RouteOptionList
	HttpGateways       gateway_solo_io.MatchableHttpGatewayList
	GraphqlApis        graphql_gloo_solo_io.GraphQLApiList
}

上面的 ApiSnapshot 里储存了很多东西，包括 gloo 自己的 CRD，例如 UpstreamList、VirtualServiceList、RouteTableList 等，也有 K8S 自己的资源（当然 gloo 在这里稍微做了一层转换），例如 EndpointList、SecretList 等。

Snapshot 的作用如下：

gloo 会通过一些 informer 接收到 K8S 中相应的 CRD 信息，并把这些信息汇总到 Snapshot 中
gloo 会在 Snapshots() 函数中一秒轮训一次，周期性的把 Snapshot 送入 Emitter 的 Snapshots() 接口返回的 Snapshot Channel 中

核心 Translate 流程

下图是一个总体的服务入口和 Translate 流程

gloo 如何把 K8S 的资源汇总转换到 Snapshot 中

在这里我们主要查看一下 Emitter 的逻辑，下面是一个 apiEmitter 的定义

type apiEmitter struct {
	forceEmit            <-chan struct{}
	artifact             gloo_solo_io.ArtifactClient
	endpoint             gloo_solo_io.EndpointClient
	proxy                gloo_solo_io.ProxyClient
	upstreamGroup        gloo_solo_io.UpstreamGroupClient
	secret               gloo_solo_io.SecretClient
	upstream             gloo_solo_io.UpstreamClient
	authConfig           enterprise_gloo_solo_io.AuthConfigClient
	rateLimitConfig      github_com_solo_io_gloo_projects_gloo_pkg_api_external_solo_ratelimit.RateLimitConfigClient
	virtualService       gateway_solo_io.VirtualServiceClient
	routeTable           gateway_solo_io.RouteTableClient
	gateway              gateway_solo_io.GatewayClient
	virtualHostOption    gateway_solo_io.VirtualHostOptionClient
	routeOption          gateway_solo_io.RouteOptionClient
	matchableHttpGateway gateway_solo_io.MatchableHttpGatewayClient
	graphQLApi           graphql_gloo_solo_io.GraphQLApiClient
}

可以看到 Emitter 中存了很多 Client，需要注意的是这些 Client 并不是 K8S 的 client-go，而是 gloo 自己封装的 client。

Emitter 的 Snapshots() 方法会做以下事情

初始化各 Client 并 watch 其中数据变化
一旦 Watch 到数据，对数据进行部分处理，然后塞入临时的 currentSnapshot 结构体中
每一秒一次，把 currentSnapshot 送入 Snapshot channel

下面针对这 3 个过程从源码进行解析

1. 初始化各 `Client` 并 watch 其中数据变化

第一步，apiEmitter 的 Register() 函数会调用各个 client 的 Register 接口，这些 Register 接口会对各个 client 对应的 K8S informer 进行初始化。这些 Register 会最终调用到 ResourceClient 这个接口的 Register()，这部分代码已经不在 gloo 里了，是在 gloo 依赖的 solo-kit 中。然后这个 Register() 最重要的是实现方式在 kube/resource_client.go，最终会调用到 kube/resource_client_factory.go 中。这个函数里的逻辑就很清晰了，里面针对需要监听的 namespace 初始化了对应资源的 sharedInformer 并加入缓存中。

第二步，会调用各个 client 的 List 接口，对 Register() 中注册的 informer 进行 Start，等待数据 sync 到本地 cache 之后顺便 List 一份返回。这里是一个 artifact 资源的 List 调用的地方，往里看也会调用到 solo-kit 的 kube/resource_client.go 或 kubesecret/resource_client.go 的 List 代码。

第三部，会调用各个 client 的 Watch 接口，监听该资源的情况，一旦有变化会反馈到 Watch 接口返回的 channel 中。例如这里是一个 artifact 资源的 Watch 调用的地方，最终也会调用到 solo-kit，此处不在赘述。

2. Watch 到数据，对数据进行部分处理

上面的 Watch 接口会返回一个 artifactNamespacesChan，在下面会监听这个 channel，一旦有数据过来，就会汇总发布到 artifactChan 中，然后另一个 goroutine 也在监听 artifactChan，一旦有数据产生，就会放入 currentSnapshot 中。

gloo 通过这种方式，把所有需要的资源逗整合到了 Snapshot 结构体中。

3. 把 `currentSnapshot` 送入 `Snapshot` channel

每一秒一次，将上面 List 以及 Watch 得到的结果都放入 currentSnapshot 中，最后把 currentSnapshot 深拷贝一份，放入 Snapshot channel 中。

通过上面的三步，就实现了最核心的 K8S 资源到 Snapshot 的转换，最后会将这个 Snapshot 资源传入 Sync() 函数中，实现最核心的 K8S 配置到 envoy 的 xDS 转换流程。

xDS 流程

gloo 的 Translate 最终做的工作是把 ApiSnapshot 转化为下面的 EnvoySnapshot：

// Snapshot is an internally consistent snapshot of xDS resources.
// Consistently is important for the convergence as different resource types
// from the snapshot may be delivered to the proxy in arbitrary order.
type EnvoySnapshot struct {
	// Endpoints are items in the EDS V3 response payload.
	Endpoints cache.Resources

	// Clusters are items in the CDS response payload.
	Clusters cache.Resources

	// Routes are items in the RDS response payload.
	Routes cache.Resources

	// Listeners are items in the LDS response payload.
	Listeners cache.Resources
}

Translate 转化完成之后，会调用 SetSnapshot 方法把 EnvoySnapshot 放到本地缓存中。一旦这个缓存被 Set，就会触发 xDS 服务和 envoy 的数据同步。

gloo 在初始化的时候会初始化一个 xdsServer，这个 xdsServer 会实现下面的 interface：

// Server is a collection of handlers for streaming discovery requests.
type Server interface {
	// StreamEnvoyV3 is the streaming method for Evnoy V3 XDS
	StreamEnvoyV3(
		stream StreamEnvoyV3,
		defaultTypeURL string,
	) error
	// StreamSolo is the streaming method for Solo discovery
	StreamSolo(
		stream StreamSolo,
		defaultTypeURL string,
	) error
	// Fetch is the universal fetch method.
	FetchEnvoyV3(
		context.Context,
		*envoy_service_discovery_v3.DiscoveryRequest,
	) (*envoy_service_discovery_v3.DiscoveryResponse, error)
	FetchSolo(
		context.Context,
		*sk_discovery.DiscoveryRequest,
	) (*sk_discovery.DiscoveryResponse, error)
}

然后会需要利用这个 xdsServer 生成一个 EnvoyServerV3，这个才是真正和 Envoy 做交互的，这个 Server 需要实现 Envoy xDS 相关接口

// Server is a collection of handlers for streaming discovery requests.
type EnvoyServerV3 interface {
	envoy_service_endpoint_v3.EndpointDiscoveryServiceServer
	envoy_service_cluster_v3.ClusterDiscoveryServiceServer
	envoy_service_route_v3.RouteDiscoveryServiceServer
	envoy_service_listener_v3.ListenerDiscoveryServiceServer
	envoy_service_discovery_v3.AggregatedDiscoveryServiceServer
}

如果仔细查看这些接口的实现的话，会发现基本都是使用上面的 StramEnvoyV3 来实现的，这里就不列举了。

总结

以上就是 gloo 的一些源码的主要流程了，总体来说架构和思路还算清晰，但是代码可能有些复杂，我们在阅读 gloo 代码的时候需要注意的是，*.sk.go 的代码都是生成的代码，所以不用过于在乎这部分代码的写法，大致知道逻辑即可。

Kubernetes CentOS 7 低版本 3.10 内核出现 cannot allocate memory 问题

Posted on 2023-03-09 Edited on 2024-09-22 In Kubernetes

在标准的 CentOS 7 系统中，标配的是 3.10 版本的 Linux kernel，在这个内核下搭建 Kubernetes 一版来说还是比较稳定的，但是当集群跑着跑着，创建的 Pod 多于一定数量的时候，就会发现新 Pod 再也创建不起来了，报错

Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:279: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/kubepods/burstable/pod7d54e3ab-2639-11ea-b794-801844e495bc/cba62d4976e90bfdc2fb7cf20a7c37dfbf4b2273742b78f27b6d9b5d15b284ce: cannot allocate memory\"": unknown

主要的关键字是 cannot allocate memory，当然这个问题网上有诸多解答，搜索这个问题其实也挺费劲，但是最终发现一个帖子写得不错，比较有参考意义，在此贴上链接： https://cloud.tencent.com/developer/article/2141496 。这篇文章详细的说明了这个问题发生的原因，也给出了解决方案。在这里总结一下解决方案：

方案一：升级内核到 4.x 之后能解决
方案二：更改机器 grub 配置禁用 kmem 并重启机器
方案三：更改 kubelet 代码并重新编译

我们团队经过尝试，发现方案一和方案二均能解决问题，但是方案三不一定能解决，因此在这里我会着重介绍下方案二。

方案二在上面的文章里也说了一些，这里可以再做一些补充。

首先执行

1	vim /etc/default/grub

将 GRUB_CMDLINE_LINUX 的配置

1	GRUB_CMDLINE_LINUX="crashkernel=auto spectre_v2=retpoline intel_iommu=on iommu=pt rootflags=prjquota"

末尾加上 cgroup.memory=nokmem，例如

1	GRUB_CMDLINE_LINUX="crashkernel=auto spectre_v2=retpoline intel_iommu=on iommu=pt rootflags=prjquota cgroup.memory=nokmem"

然后下面需要注意，根据机器启动方式不一样，需要生成的 grub 配置可能不一样。

先试用 ls /sys/firmware/efi 查看 /sys/firmware/efi 是否存在，如果存在的话，那么该机器是通过 UEFI 启动，则执行

1 2	cp /boot/efi/EFI/centos/grub.cfg /boot/efi/EFI/centos/grub.cfg.bak grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg

如果上述目录不存在的话，则执行

1 2	cp /boot/grub2/grub.cfg /boot/grub2/grub.cfg.bak grub2-mkconfig -o /boot/grub2/grub.cfg

执行完成之后，可以用重启机器。

机器重启成功之后，可以使用 cat /proc/cmdline 检查配置是否生效，如果里面有 cgroup.memory=nokmem 相关的配置项，则说明修改成功。

envoy 网关实现优雅退出

Posted on 2023-03-07 Edited on 2024-09-22 In Gateway

在我们的项目中，我们采用了 gloo 作为我们最前端的网关，gloo 是基于 envoy 的云原生网关， gloo 实现了一套控制器，自定义了 VirtualService, RouteTable, Upstream 等 CRD，利用 envoy 的 xDS ，将 K8S 的 CRD 转化为了 envoy 自身的配置实现反向代理。

gloo 将 envoy 的反向代理通过 K8S 部署成为一个 gateway-proxy 的 Deployment，所有网关流量可以打到这个 Deployment 对应的 Pod 或者 Service 上即能实现反向代理。

但是需要注意的是，envoy 并没有实现完善的 Gracefully Stop 功能，在我们的项目中，利用 envoy 的 2222 端口代理了我们的一个 ssh 反向代理服务，用户会通过 envoy 的 2222 端口脸上 upstream 的 ssh 服务。但是如果不做优雅退出的话，一旦我们做 gateway-proxy 这个 Deployment 的滚动更新，就会导致所有用户的 ssh 连接断连，造成非常不好的体验，因此实现一个 Gracefully Stop 是迫在眉睫的事情。

初步调研看来， gloo 有 PR 实现了一个简单的优雅退出的逻辑，也写了相关文档。但仔细看了 PR 里的细节才发现，gloo 的这个功能比较鸡肋，可能会造成所有的 conn 都退出之后，还在等 sleep 的诡异景象，使得 gateway-proxy pod 迟迟处于 Terminating 状态，不是一个合理的解决方案。

经过调研，我们发现 istio 自己实现了一个优雅退出，通过暴露一个 EXIT_ON_ZERO_ACTIVE_CONNECTIONS 配置。这个给了我们新思路，仔细查看它的代码，详细逻辑是 activeProxyConnections 这个函数

func (a *Agent) activeProxyConnections() (int, error) {
	activeConnectionsURL := fmt.Sprintf("http://%s:%d/stats?usedonly&filter=downstream_cx_active$", a.localhost, a.adminPort)
	stats, err := http.DoHTTPGet(activeConnectionsURL)
	if err != nil {
		return -1, fmt.Errorf("unable to get listener stats from Envoy : %v", err)
	}
	if stats.Len() == 0 {
		return -1, nil
	}
	// 后面的逻辑省略
}

可以看到可以采用 envoy 的 /stats 接口了解到 downstream 的连接数，通过对监控指标的 filter 即可查询到连接数，如果连接数一直不清零，就一直等待即可。

最终我们通过给 gateway-proxy Pod 书写如下的 preStop 终于实现了优雅退出

lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - for i in $(seq 1 8640); do if [[ $(wget -O- "127.0.0.1:19000/stats?usedonly&filter=^listener\.\[__\]_2222\.downstream_cx_active$"
        | awk 'BEGIN{sum=0}{sum+=$2;}END{print sum;}') -ne 0  ]]; then sleep
        10; else break; fi done

稍微解释一下上面的 command

for 循环的作用：我们采用了一个 for 循环，循环 8640 次，并且利用 sleep 命令实现 10s 检测一次，因此总共 86400s，即优雅退出 timeout 大概在 1d 以上。
wget 的作用：在容器中并没有直接照抄 istio 的 filter，而是自己写了个 filter=^listener\.\[__\]_2222\.downstream_cx_active$"，这是因为 istio 检查了所有 downstream，但是我们没有这个必要，我们主要只需要检查 2222 端口的 downstream 即可，因此我们写了一个正则表达式，这样可以提高 wget 返回的速度。
这里的 usedonly 是必须的，如果不填的话在集群规模大的情况下，此接口会需要返回特别多的信息，特别耗时间，利用 usedonly 能够极大的减小请求时间。
awk 的作用：wget 返回的结果类似这样：listener.[__]_2222.downstream_cx_active: 52，不排除会出现多行或者一行都没有，因此我们用 awk 获取了空格后面的数字，例如这里的 52，然后如果有多行，会将多行的数字相加并输出，如果一行都没有就输出 0，我们利用 awk 的结果判断是否是 0 即可知道连接数是否是 0，可谓精妙。

除了上面的配置之外，Pod 的 spec 层面需要配置 terminationGracePeriodSeconds: 86400 固定 1 天时间优雅退出，和上面的 86400s 需要匹配即可完美实现 envoy 监听的 2222 端口的优雅退出。

Dockerfile 的 VOLUME 关键字会从镜像传递到 k8s pod 中

Posted on 2023-01-27 Edited on 2024-09-22 In Container

在书写 Dockerfile 的时候，会存在一个 VOLUME 的关键字，例如

1 2	FROM busybox VOLUME /mnt/data

我们如果将这个 Dockerfile 打成一个镜像

1	docker build -t test-image .

并将这个镜像在 Kubernetes 中创建一个 Pod 的话，当我们 exec 进入该 Pod，执行

1
2
3

/ # df -Th /mnt/data/
Filesystem           Type            Size      Used Available Use% Mounted on
/dev/sda1            xfs           199.9G     53.8G    146.1G  27% /local-rootfs

会发现 /mnt/data 被挂载了一个宿主机的临时目录，这种情况其实不是 Kubernetes 能控制的，其实是底层的容器运行时控制的，但这样的话就会导致平台无法控制用户镜像中的 VOLUME，可能会导致用户往对应 VOLUME 里疯狂写入数据造成宿主机磁盘写满而存在安全风险。鉴于现在 Kubernetes 官方推荐使用 containerd，所以我们来说一下针对 containerd 的场景该问题该如何解决。

为了解决该问题，我们需要在 containerd 的 /etc/containerd/config.toml 中使用 ignore_image_defined_volumes = true 去避免该问题，参考文档，里面有段注释说明了该选项的功能

1
2
3

# ignore_image_defined_volumes ignores volumes defined by the image. Useful for better resource
# isolation, security and early detection of issues in the mount configuration when using
# ReadOnlyRootFilesystem since containers won't silently mount a temporary volume.

具体实现的代码可以在这里找到。

尝试把 Golang 从 1.15 升级到 1.16 以上时遇到的一些坑

Posted on 2022-04-04 Edited on 2024-09-22 In Development

背景

之前项目中用的 go 版本是 1.15.8，期望能升级到 1.17.4。主要驱动是我们用了 bazel，然后想在 M1 Mac 上用 bazel 编译 go 项目，发现需要升级 rules_go 到最新版本，并且 Go 在 1.16 之前官方不提供编译好的 darwin-arm64 的二进制包。因此尝试把项目升级到了 1.17

问题

升级之后，发现一些奇怪的问题，后面我构造了一个详细的最小复现例子。根据这个例子的 README.md 执行，我们发现 go 1.15 和 go 1.16 有了完全不同的两个行为。最终发现是 syscall.Setgroups 这个函数在 1.16、1.17 的行为和 1.15 的不一致，因此我给 go 提了一个 issue。目前已被解决，经过测试，在 go 1.18 中已经不存在该问题。

利用老版本 containerd 自带的 ctr 命令进行镜像 tag 操作

Posted on 2022-03-05 Edited on 2024-09-22 In Container

一、起因

在帮助客户做线上操作的时候，需要安装寒武纪卡官方的 K8S device-plugin ，
按照 daemonset yaml 部署 device-plugin 之后，
发现 yaml 里写的 cambricon-k8s-device-plugin:v1.1.3 根本 pull 不下来

1
2

$ docker pull cambricon-k8s-device-plugin:v1.1.3
Error response from daemon: pull access denied for cambricon-k8s-device-plugin, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

后面发现这个镜像需要自己打。但是其实之前有离职的同学曾经打过一个镜像，放在了一台寒武纪机器的本地 containerd image 中，因此我们没必要再打一次镜像了，最好就直接复用这个镜像，把这个镜像 push 到一个内网 registry 上供所有机器使用。

操作了一番之后发现 containerd 这个 ctr 操作和 docker 还是非常不一样的，一开始想尝试要不直接安装一个 nerdctl 得了，但是感觉这么小的一个需求没必要搞那么麻烦，因此就开始探究 ctr export 镜像并 push 的操作。

二、操作

ctr 是 containerd 内置的一个查看镜像的命令，一般安装了 containerd 的话一定会安装 ctr。用 ctr 做镜像操作的话，需要用 ctr image 命令，或者方便一些，直接用 ctr i。如果查看镜像列表，就使用 ctr i ls。但其实我们执行了一下发现

1 2	$ ctr i ls REF TYPE DIGEST SIZE PLATFORMS LABELS

并没有得到什么结果。这里主要是因为 ctr 里面有一个 namespace 的概念，用 ctr --help 可以看到这个参数

$ ctr --help
...
--namespace value, -n value  namespace to use with commands (default: "default") [$CONTAINERD_NAMESPACE]
...

我们的镜像是给 k8s 使用的，k8s 会使用 k8s.io 这个 namespace 下的镜像，因此我们需要加一个 -n k8s.io

1 2	$ ctr -n k8s.io i ls -q \| grep cambricon-k8s-device-plugin:v1.1.3 172.1.1.34:5000/library/cambricon-k8s-device-plugin:v1.1.3

这里使用 -q 参数只打印了镜像名。然后我们找到了存在在这台寒武纪机器上的这个 172.1.1.34:5000/library/cambricon-k8s-device-plugin:v1.1.3 镜像。现在我们的目的是把这个镜像 tag 成 docker-registry.i.demo/library/cambricon-k8s-device-plugin:v1.1.3，其中 docker-registry.i.demo 姑且当做是内网 docker-registry 的域名地址。

经过查阅资料发现，ctr 在后续版本合并了 tag 命令，然而一个不幸的消息是我们的 ctr 版本是 v1.2.6，似乎并没有集成 tag 命令，于是只能另辟蹊径。

后来发现，我们可以尝试先把镜像 export 出来成为 tar 包，然后再 import 回去，ctr 在 import 镜像的时候支持重新写个名字。这样看来这么绕一下看起来是可行的：

$ ctr -n k8s.io i export cambricon-k8s-device-plugin-v1.1.3.tar 172.1.1.34:5000/library/cambricon-k8s-device-plugin:v1.1.3
$ ctr -n k8s.io i import cambricon-k8s-device-plugin-v1.1.3.tar --base-name docker-registry.i.demo/library/cambricon-k8s-device-plugin
unpacking docker-registry.i.demo/library/cambricon-k8s-device-plugin:v1.1.3 (sha256:4b67b0f9c5c7fee7ed80409b2df4f68d4fa5dd170af88e7281fab7c88cbbd62a)...done
$ ctr -n k8s.io i ls -q | grep docker-registry.i.demo
docker-registry.i.demo/library/cambricon-k8s-device-plugin:v1.1.3

可以看到非常成功，我们成功做了一个 tag 操作。虽然看起来比较蠢，似乎下载一个新版的 ctr 就可以实现。但是如果客户环境不通互联网的话，这类 hack 操作还是比较好的救命稻草。

之后我们就可以使用 ctr i push 把镜像 push 上去了

1	$ ctr -n k8s.io i push docker-registry.i.demo/library/cambricon-k8s-device-plugin:v1.1.3

大功告成