Frontpage

Database types

Which is better, VARBINARY(MAX) or BLOB for storing users encrypted data?

Row-bloat is a main factor of concern.

BLOB seem optimal type.

Since you’re not performing any operations on the data itself (like searching, sorting, or comparing), BLOB is optimal because it’s intended for storing raw binary data without the need for text processing, encoding, or collation. (chatgpt)

No indexing will be done on the encrypted data column, so BLOB is the best choice. Blobs are variable-length binary strings that can store up to 65,535 bytes of data. Still effective for small amounts.

Blobs have smaller in-row storage treshold (768 bytes). Varbinary has 65,535 bytes for in-row storage supposedly.

Blobs are better...

Keep pages nice with many rows in page. Better preformance. Varbinary filling up rows to max would give less rows per page and more disk usage. The benefit of quicker in-row access with VARBINARY is lost by row-bloat.

Row Bloat and Table Size:

Using VARBINARY to store data close to the maximum row size (around 65 KB) will keep data in-row, which can lead to row bloat. Since each row can be close to the InnoDB row size limit, this can fill the data pages quickly, resulting in fewer rows per page and increased disk usage.

With BLOB, data overflows to separate pages when it exceeds the in-row threshold (usually above 768 bytes or 40 bytes depending on the row format). This keeps the main row compact, which allows more rows to fit per page, reducing row bloat and helping with overall database performance.

Preformance Trade-Offs:

Extra I/O for BLOB: Retrieving BLOB data stored out-of-row does introduce additional I/O, as it requires the database to access overflow pages. However, for basic operations like SELECT on an id column, this added I/O cost is minimal and generally manageable.

Table Scanning and Cache Efficiency: Storing data out-of-row can improve cache efficiency, as the database can load more rows into memory without the binary data taking up row space. This improves performance in cases where the primary access is via an id lookup, and you only retrieve the binary data when needed.

Future

for future, i can fill the ordinary blob, and have a flag if there is more data, then resort to select on another table for long blobs, thus keeping ordinary structure for normal blobs. Or not fill up ordinary blob, but leave empty and have a reference to the other-blob table. Thus retain original design with benefits for most ordinary blobs, and still support cases of longer blobs.