Postgres TOAST

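This article introduces TOAST, the mechanism Postgres uses to store oversized field values.

(1) TOAST

1.1 Introduction

TOAST stands for The Oversized-Attribute Storage Technique. It is how Postgres stores very large field values. In Postgres, normally:

Every row is stored in a page, which is 8 kB by default.
A row never spans multiple pages.

TOAST exists so that a row whose data exceeds one 8 kB page can still be stored.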

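1.2 Technical details

TOAST comes into play when Postgres has to store a row larger than about 2 kB. Postgres first tries compression: if compressing the large field values brings the row under 2 kB, the problem is solved. If the row is still too large after compression, Postgres splits the compressed value into chunks of roughly 2 kB each and stores every chunk as a row in a TOAST table. Each ordinary table that has at least one TOAST-able column has an associated TOAST table that holds these chunks, and a value stored there is said to be TOASTed.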
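
The decision described above can be sketched in a few lines of C. This is only an illustration of the flow (small values stay inline, compressible values stay inline in compressed form, everything else is chunked), not Postgres's actual code. The names plan_storage, storage_plan, TOAST_THRESHOLD and CHUNK_SIZE are hypothetical, and the two constants are round numbers assumed for the example rather than values read from a server.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants: assumptions for this sketch only. */
#define TOAST_THRESHOLD 2048  /* "about 2 kB" trigger size */
#define CHUNK_SIZE      2000  /* "roughly 2 kB" chunk size */

typedef enum {
    STORE_INLINE,             /* small enough: stays in the main table as is */
    STORE_INLINE_COMPRESSED,  /* fits after compression: stays in the main table */
    STORE_OUT_OF_LINE         /* still too large: split into TOAST chunks */
} storage_kind;

typedef struct {
    storage_kind kind;
    size_t       n_chunks;    /* 0 unless kind == STORE_OUT_OF_LINE */
} storage_plan;

/*
 * Hypothetical helper: decide how a value would be stored, given its raw
 * size and its size after attempting compression (pass raw_size again if
 * compression did not help).
 */
storage_plan plan_storage(size_t raw_size, size_t compressed_size)
{
    storage_plan plan = { STORE_INLINE, 0 };

    assert(compressed_size <= raw_size);

    if (raw_size <= TOAST_THRESHOLD)
        return plan;                      /* TOAST is not triggered at all */

    if (compressed_size <= TOAST_THRESHOLD) {
        plan.kind = STORE_INLINE_COMPRESSED;
        return plan;
    }

    plan.kind = STORE_OUT_OF_LINE;
    /* Ceiling division: the last chunk may be shorter than CHUNK_SIZE. */
    plan.n_chunks = (compressed_size + CHUNK_SIZE - 1) / CHUNK_SIZE;
    return plan;
}
```

Under these assumed constants, a 10000-byte value that does not compress would be planned as five chunks.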

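Every TOAST table has three columns:

chunk_id: identifies which TOASTed value a chunk belongs to; all chunks of the same original value share one chunk_id.
chunk_seq: gives the order of the chunks within a value; the first chunk has chunk_seq = 0, the second has chunk_seq = 1, and so on.
chunk_data: holds the actual bytes of the chunk.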
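
To make the three columns concrete, here is a minimal C sketch that splits a byte buffer into chunk rows. It mirrors the column layout only; the struct toast_chunk, the function split_into_chunks and the extra chunk_len field are hypothetical names invented for this example, and the chunk size is passed in by the caller rather than taken from Postgres.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of a TOAST table, reduced to the three columns described above. */
typedef struct {
    uint32_t       chunk_id;    /* same for every chunk of one value */
    int32_t        chunk_seq;   /* 0, 1, 2, ... within that value */
    const uint8_t *chunk_data;  /* points into the caller's buffer */
    size_t         chunk_len;   /* number of bytes in this chunk */
} toast_chunk;

/*
 * Hypothetical helper: describe `data` as a sequence of chunk rows of at
 * most `chunk_size` bytes each. Returns the number of rows written to
 * `out`, or 0 if `out` has room for fewer rows than needed.
 */
size_t split_into_chunks(uint32_t chunk_id, const uint8_t *data, size_t len,
                         size_t chunk_size, toast_chunk *out, size_t out_cap)
{
    size_t needed, i;

    assert(chunk_size > 0);

    needed = (len + chunk_size - 1) / chunk_size;   /* ceiling division */
    if (needed > out_cap)
        return 0;

    for (i = 0; i < needed; i++) {
        size_t off = i * chunk_size;

        out[i].chunk_id   = chunk_id;
        out[i].chunk_seq  = (int32_t) i;
        out[i].chunk_data = data + off;
        /* Every chunk is full except possibly the last one. */
        out[i].chunk_len  = (len - off < chunk_size) ? len - off : chunk_size;
    }
    return needed;
}
```

The sketch stores pointers into the original buffer instead of copying, which keeps it short; a real storage layer would write each chunk out as its own row.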

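When a query needs a TOASTed value, Postgres uses the index on (chunk_id, chunk_seq) to fetch from the TOAST table all chunks with the matching chunk_id, ordered by chunk_seq. It concatenates those chunks and, if the value was compressed, decompresses the result to recover the original data.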
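
The read path can be sketched the same way: pick the chunks with the wanted chunk_id, order them by chunk_seq, and concatenate them. The sketch below is an assumption-laden illustration, not Postgres code. The names toast_chunk, by_seq and reassemble are hypothetical, sorting an in-memory array stands in for the index scan, and the final decompression step is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Chunk row with the three TOAST columns plus a length (hypothetical type). */
typedef struct {
    uint32_t       chunk_id;
    int32_t        chunk_seq;
    const uint8_t *chunk_data;
    size_t         chunk_len;
} toast_chunk;

/* qsort comparator: ascending chunk_seq. */
static int by_seq(const void *a, const void *b)
{
    const toast_chunk *x = a;
    const toast_chunk *y = b;

    return (x->chunk_seq > y->chunk_seq) - (x->chunk_seq < y->chunk_seq);
}

/*
 * Hypothetical helper: rebuild the value `wanted_id` from `rows`, which may
 * be in any order and may contain chunks of other values. Sorts `rows` in
 * place. Returns the number of bytes written to `out`, or (size_t) -1 if a
 * chunk is missing from the sequence or `out` is too small.
 */
size_t reassemble(toast_chunk *rows, size_t n_rows, uint32_t wanted_id,
                  uint8_t *out, size_t out_cap)
{
    size_t  i, total = 0;
    int32_t expected_seq = 0;

    assert(rows != NULL && out != NULL);

    qsort(rows, n_rows, sizeof rows[0], by_seq);

    for (i = 0; i < n_rows; i++) {
        if (rows[i].chunk_id != wanted_id)
            continue;                       /* chunk of some other value */
        if (rows[i].chunk_seq != expected_seq ||
            rows[i].chunk_len > out_cap - total)
            return (size_t) -1;
        memcpy(out + total, rows[i].chunk_data, rows[i].chunk_len);
        total += rows[i].chunk_len;
        expected_seq++;
    }
    return total;
}
```

If the stored value had been compressed, the bytes returned here would still need to be decompressed before they match the original data.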

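In some cases TOAST also improves performance: when a query does not need the TOASTed value itself, the chunks in the TOAST table never have to be read into memory.

1.3 Example

Starting from an ordinary table, we can find its TOAST table. First, create a messages table with a single column, message: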

CREATE TABLE messages (message text);

Then insert some random strings into the table:

INSERT INTO messages
SELECT (SELECT
        string_agg(chr(floor(random() * 26)::int + 65), '')
        FROM generate_series(1,10000))
FROM generate_series(1,10);

We now have a table containing TOASTed data. To inspect it, we first need the name of the TOAST table that belongs to messages, which the following query returns:

> SELECT reltoastrelid::regclass
> FROM pg_class
> WHERE relname = 'messages';
reltoastrelid
-------------------------
pg_toast.pg_toast_59611
(1 row)

This query reads from pg_class, the system catalog in which Postgres stores metadata about tables.

With the name of the TOAST table in hand, we can look at its contents:

> SELECT * FROM pg_toast.pg_toast_59611;
chunk_id | chunk_seq | chunk_data
----------+-----------+------------
59617 |         0 | \x4c4457...
59617 |         1 | \x424d4b...
...

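Note: chunk_data is a bytea column, displayed here in hex. It holds the raw stored bytes, which may be compressed, so in general it is not human-readable.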
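
To look at what the hex digits in chunk_data stand for, they can be decoded back into bytes. The helper below is a small sketch for that purpose; hex_digit and hex_to_bytes are hypothetical names, and the input is the hex text without the leading \x. Decoding the visible prefixes from the output above, 4c4457 and 424d4b, gives the ASCII bytes LDW and BMK. Uppercase letters are what the INSERT generated, which suggests (but does not prove) that these particular values were stored without compression.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if the character is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper: decode a string of hex digits (without the "\x"
 * prefix) into bytes. Returns the number of bytes written to `out`, or
 * (size_t) -1 on malformed input or if `out` is too small.
 */
size_t hex_to_bytes(const char *hex, unsigned char *out, size_t out_cap)
{
    size_t len, i;

    assert(hex != NULL && out != NULL);

    len = strlen(hex);
    if (len % 2 != 0 || len / 2 > out_cap)
        return (size_t) -1;

    for (i = 0; i < len / 2; i++) {
        int hi = hex_digit(hex[2 * i]);
        int lo = hex_digit(hex[2 * i + 1]);

        if (hi < 0 || lo < 0)
            return (size_t) -1;
        out[i] = (unsigned char) (hi * 16 + lo);
    }
    return len / 2;
}
```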

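(2) Reading the Postgres source

Not every data type supports TOAST. Types that can never hold large values (such as date, time, and boolean) have no need for it. A type that supports TOAST must be a variable-length (varlena) type: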

struct varlena
{
    char        vl_len_[4];        /* Do not touch this field directly! */
    char        vl_dat[FLEXIBLE_ARRAY_MEMBER];    /* Data content is here */
};

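The struct above cannot describe a TOASTed value: it is too simple and carries no indication of whether the value is compressed or stored via TOAST. It only represents de-TOASTed data. TOASTed values are represented by the following types, defined in src/include/postgres.h: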

typedef union
{
    struct                        /* Normal varlena (4-byte length) */
    {
        uint32        va_header;
        char        va_data[FLEXIBLE_ARRAY_MEMBER];
    }            va_4byte;
    struct                        /* Compressed-in-line format */
    {
        uint32        va_header;
        uint32        va_rawsize; /* Original data size (excludes header) */
        char        va_data[FLEXIBLE_ARRAY_MEMBER]; /* Compressed data */
    }            va_compressed;
} varattrib_4b;

typedef struct
{
    uint8        va_header;
    char        va_data[FLEXIBLE_ARRAY_MEMBER]; /* Data begins here */
} varattrib_1b;

/* TOAST pointers are a subset of varattrib_1b with an identifying tag byte */
typedef struct
{
    uint8        va_header;        /* Always 0x80 or 0x01 */
    uint8        va_tag;            /* Type of datum */
    char        va_data[FLEXIBLE_ARRAY_MEMBER]; /* Type-specific data */
} varattrib_1b_e;

So, given a variable-length (varlena) value, how can we tell whether it was TOASTed, and whether it is compressed? That is what the following macros are for.

In src/include/postgres.h:

/*
 * Bit layouts for varlena headers on big-endian machines:
 *
 * 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)
 * 01xxxxxx 4-byte length word, aligned, *compressed* data (up to 1G)
 * 10000000 1-byte length word, unaligned, TOAST pointer
 * 1xxxxxxx 1-byte length word, unaligned, uncompressed data (up to 126b)
 *
 * Bit layouts for varlena headers on little-endian machines:
 *
 * xxxxxx00 4-byte length word, aligned, uncompressed data (up to 1G)
 * xxxxxx10 4-byte length word, aligned, *compressed* data (up to 1G)
 * 00000001 1-byte length word, unaligned, TOAST pointer
 * xxxxxxx1 1-byte length word, unaligned, uncompressed data (up to 126b)
 *
 * The "xxx" bits are the length field (which includes itself in all cases).
 * In the big-endian case we mask to extract the length, in the little-endian
 * case we shift.  Note that in both cases the flag bits are in the physically
 * first byte.  Also, it is not possible for a 1-byte length word to be zero;
 * this lets us disambiguate alignment padding bytes from the start of an
 * unaligned datum.  (We now *require* pad bytes to be filled with zero!)
 *
 * In TOAST pointers the va_tag field (see varattrib_1b_e) is used to discern
 * the specific type and length of the pointer datum.
 */

/*
 * Endian-dependent macros.  These are considered internal --- use the
 * external macros below instead of using these directly.
 *
 * Note: IS_1B is true for external toast records but VARSIZE_1B will return 0
 * for such records. Hence you should usually check for IS_EXTERNAL before
 * checking for IS_1B.
 */

/* Big-endian variants of the header tests. Each macro evaluates to 1 when its test holds. */
/* 4-byte header (aligned): the top bit is 0. */
#define VARATT_IS_4B(PTR) \
((((varattrib_1b *) (PTR))->va_header & 0x80) == 0x00)
/* 4-byte header, uncompressed data: the top two bits are 00. */
#define VARATT_IS_4B_U(PTR) \
((((varattrib_1b *) (PTR))->va_header & 0xC0) == 0x00)
/* 4-byte header, compressed data: the top two bits are 01. */
#define VARATT_IS_4B_C(PTR) \
((((varattrib_1b *) (PTR))->va_header & 0xC0) == 0x40)
/* 1-byte header (unaligned): the top bit is 1. */
#define VARATT_IS_1B(PTR) \
((((varattrib_1b *) (PTR))->va_header & 0x80) == 0x80)
/* TOAST pointer: the whole header byte is exactly 0x80 (10000000). */
#define VARATT_IS_1B_E(PTR) \
((((varattrib_1b *) (PTR))->va_header) == 0x80)
/* Not alignment padding: pad bytes are zero, and a 1-byte header never is. */
#define VARATT_NOT_PAD_BYTE(PTR) \
(*((uint8 *) (PTR)) != 0)
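
The bit tests can be tried out on plain bytes without any Postgres headers. The sketch below re-implements the big-endian checks from the macros above on a single header byte, and also shows the mask-versus-shift difference for reading the length out of a 1-byte header that the source comment mentions. The names header_kind, classify_header_be, short_len_be and short_len_le are hypothetical and exist only in this example.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    HDR_4B_UNCOMPRESSED,  /* 00xxxxxx */
    HDR_4B_COMPRESSED,    /* 01xxxxxx */
    HDR_TOAST_POINTER,    /* 10000000 */
    HDR_1B_SHORT          /* 1xxxxxxx with a nonzero length */
} header_kind;

/*
 * Hypothetical helper: classify the physically first header byte using the
 * big-endian bit layout. The TOAST-pointer test comes first because such a
 * byte also satisfies the 1-byte-header test.
 */
header_kind classify_header_be(uint8_t first_byte)
{
    if (first_byte == 0x80)
        return HDR_TOAST_POINTER;
    if ((first_byte & 0x80) == 0x80)
        return HDR_1B_SHORT;
    if ((first_byte & 0xC0) == 0x40)
        return HDR_4B_COMPRESSED;
    return HDR_4B_UNCOMPRESSED;
}

/* Length in a 1-byte header, big-endian layout: mask off the flag bit. */
uint8_t short_len_be(uint8_t header)
{
    assert((header & 0x80) == 0x80);
    return header & 0x7F;
}

/* Length in a 1-byte header, little-endian layout: shift the flag bit out. */
uint8_t short_len_le(uint8_t header)
{
    assert((header & 0x01) == 0x01);
    return (header >> 1) & 0x7F;
}
```

For instance, the byte 0x85 in the big-endian layout and the byte 0x0B in the little-endian layout both describe a short value whose total length, header included, is 5 bytes.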

(3) References

https://malisper.me/postgres-toast/
https://my.oschina.net/Kenyon/blog/113026