T-SQL Tuesday #79 – COMPRESS()ing LOB data

It’s T-SQL Tuesday,  the blog party that SQL Server expert Adam Machanic (blog|twitter) started. This month’s episode is hosted by Michael J Swart (blog| twitter). The topic: SQL Server 2016 is out!

SQL Server 2016 is hot off the presses, and Michael challenges us to blog about it.

I’ve been using SSMS 2016 (I’m impressed!), but I haven’t spent tons of time using the server-side bits. As I type this intro, the 2016 installer is chugging along getting a fresh instance ready for me to explore so I can write about…

COMPRESS()

I’ve written about SQL Server Data Compression in a few other blog posts. Until now, I’ve concentrated exclusively on the feature that compresses rowstore data on heaps & b-trees. One limitation of Data Compression is that it doesn’t compress LOB data. COMPRESS() gives an option for compressing that LOB data.

How’s it work?

The COMPRESS() function takes an input of string or binary data, and applies the gzip algorithm to it. It returns a value of type varbinary(max). In essence, instead of storing string or binary data, you can gzip it up and store it in a varbinary(max) column. There’s also a DECOMPRESS() function for when you are reading the data and need to unzip it.

This costs some CPU, but gzip can save quite a bit of space. How much space, and whether it’s worth the CPU cost will vary depending on your data and workload. In this blog post, we’ll take a look at one table. We’ll look at the space savings we get out of using COMPRESS(), and we’ll look at the effort necessary to implement it.

Tangent: Data Compression

Let’s take a look at the WorldWideImporters sample DB (Also new in 2016!). The Sales.Invoices table has several nvarchar(max) columns on it, so that should make for some good LOB data for testing this out. First off, let’s look at the size of the table uncompressed, and using PAGE compression:

ALTER INDEX ALL ON Sales.Invoices REBUILD WITH (DATA_COMPRESSION=NONE);
EXEC sp_spaceused 'Sales.Invoices';

ALTER INDEX ALL ON Sales.Invoices REBUILD WITH (DATA_COMPRESSION=PAGE);
EXEC sp_spaceused 'Sales.Invoices';

 

Page compression shaves about 30% of the size off the table, from ~75MB down to ~50MB.

Compression Level Reserved Space (kb) Compression Ratio
None 75744 N/A
Page 52064 1.45:1  (31% original size)

 

Creating a COMPRESS()ed version

To test how much space we can save by using COMPRESS() on the LOB data, we’ll just copy it into a new table. We’ll create a table called Sales.InvoicesBase, which has a similar schema, except I’ll replace the nvarchar(max) columns with varbinary(max) columns. We’re also going to create all of the constraints and indexes (there are quite a few constraints & indexes).

CREATE TABLE Sales.InvoicesBase(
	InvoiceID                   int NOT NULL,
	CustomerID                  int NOT NULL,
	BillToCustomerID            int NOT NULL,
	OrderID                     int NULL,
	DeliveryMethodID            int NOT NULL,
	ContactPersonID             int NOT NULL,
	AccountsPersonID            int NOT NULL,
	SalespersonPersonID         int NOT NULL,
	PackedByPersonID            int NOT NULL,
	InvoiceDate                 date NOT NULL,
	CustomerPurchaseOrderNumber nvarchar(20) NULL,
	IsCreditNote                bit NOT NULL,
	CreditNoteReason            varbinary(max) NULL,
	Comments                    varbinary(max) NULL,
	DeliveryInstructions        varbinary(max) NULL,
	InternalComments            varbinary(max) NULL,
	TotalDryItems               int NOT NULL,
	TotalChillerItems           int NOT NULL,
	DeliveryRun                 nvarchar(5) NULL,
	RunPosition                 nvarchar(5) NULL,
	ReturnedDeliveryData        varbinary(max) NULL,
	ConfirmedDeliveryTime       AS (TRY_CONVERT(datetime2(7),json_value(CAST(DECOMPRESS(ReturnedDeliveryData) AS nvarchar(max)),N'$.DeliveredWhen'),(126))),
	ConfirmedReceivedBy         AS (json_value(CAST(DECOMPRESS(ReturnedDeliveryData) AS nvarchar(max)),N'$.ReceivedBy')),
	LastEditedBy                int NOT NULL,
	LastEditedWhen              datetime2(7) NOT NULL,
 CONSTRAINT PK_Sales_InvoicesBase PRIMARY KEY CLUSTERED (InvoiceID)
	WITH (DATA_COMPRESSION = PAGE) ON USERDATA
) ON USERDATA TEXTIMAGE_ON USERDATA
GO

ALTER TABLE Sales.InvoicesBase ADD  CONSTRAINT DF_Sales_InvoicesBase_InvoiceID  DEFAULT (NEXT VALUE FOR Sequences.InvoiceID) FOR InvoiceID
ALTER TABLE Sales.InvoicesBase ADD  CONSTRAINT DF_Sales_InvoicesBase_LastEditedWhen  DEFAULT (sysdatetime()) FOR LastEditedWhen
ALTER TABLE Sales.InvoicesBase  WITH CHECK ADD  CONSTRAINT FK_Sales_InvoicesBase_AccountsPersonID_Application_People FOREIGN KEY(AccountsPersonID)
	REFERENCES Application.People (PersonID)
ALTER TABLE Sales.InvoicesBase CHECK CONSTRAINT FK_Sales_InvoicesBase_AccountsPersonID_Application_People
ALTER TABLE Sales.InvoicesBase  WITH CHECK ADD  CONSTRAINT FK_Sales_InvoicesBase_Application_People FOREIGN KEY(LastEditedBy)
	REFERENCES Application.People (PersonID)
ALTER TABLE Sales.InvoicesBase CHECK CONSTRAINT FK_Sales_InvoicesBase_Application_People
ALTER TABLE Sales.InvoicesBase  WITH CHECK ADD  CONSTRAINT FK_Sales_InvoicesBase_BillToCustomerID_Sales_Customers FOREIGN KEY(BillToCustomerID)
	REFERENCES Sales.Customers (CustomerID)
ALTER TABLE Sales.InvoicesBase CHECK CONSTRAINT FK_Sales_InvoicesBase_BillToCustomerID_Sales_Customers
ALTER TABLE Sales.InvoicesBase  WITH CHECK ADD  CONSTRAINT FK_Sales_InvoicesBase_ContactPersonID_Application_People FOREIGN KEY(ContactPersonID)
	REFERENCES Application.People (PersonID)
ALTER TABLE Sales.InvoicesBase CHECK CONSTRAINT FK_Sales_InvoicesBase_ContactPersonID_Application_People
ALTER TABLE Sales.InvoicesBase  WITH CHECK ADD  CONSTRAINT FK_Sales_InvoicesBase_CustomerID_Sales_Customers FOREIGN KEY(CustomerID)
	REFERENCES Sales.Customers (CustomerID)
ALTER TABLE Sales.InvoicesBase CHECK CONSTRAINT FK_Sales_InvoicesBase_CustomerID_Sales_Customers
ALTER TABLE Sales.InvoicesBase  WITH CHECK ADD  CONSTRAINT FK_Sales_InvoicesBase_DeliveryMethodID_Application_DeliveryMethods FOREIGN KEY(DeliveryMethodID)
	REFERENCES Application.DeliveryMethods (DeliveryMethodID)
ALTER TABLE Sales.InvoicesBase CHECK CONSTRAINT FK_Sales_InvoicesBase_DeliveryMethodID_Application_DeliveryMethods
ALTER TABLE Sales.InvoicesBase  WITH CHECK ADD  CONSTRAINT FK_Sales_InvoicesBase_OrderID_Sales_Orders FOREIGN KEY(OrderID)
	REFERENCES Sales.Orders (OrderID)
ALTER TABLE Sales.InvoicesBase CHECK CONSTRAINT FK_Sales_InvoicesBase_OrderID_Sales_Orders
ALTER TABLE Sales.InvoicesBase  WITH CHECK ADD  CONSTRAINT FK_Sales_InvoicesBase_PackedByPersonID_Application_People FOREIGN KEY(PackedByPersonID)
	REFERENCES Application.People (PersonID)
ALTER TABLE Sales.InvoicesBase CHECK CONSTRAINT FK_Sales_InvoicesBase_PackedByPersonID_Application_People
ALTER TABLE Sales.InvoicesBase  WITH CHECK ADD  CONSTRAINT FK_Sales_InvoicesBase_SalespersonPersonID_Application_People FOREIGN KEY(SalespersonPersonID)
	REFERENCES Application.People (PersonID)
ALTER TABLE Sales.InvoicesBase CHECK CONSTRAINT FK_Sales_InvoicesBase_SalespersonPersonID_Application_People
ALTER TABLE Sales.InvoicesBase  WITH CHECK ADD  CONSTRAINT CK_Sales_InvoicesBase_ReturnedDeliveryData_Must_Be_Valid_JSON CHECK  ((ReturnedDeliveryData IS NULL OR isjson(CAST(DECOMPRESS(ReturnedDeliveryData) AS nvarchar(max)))<>(0)))
ALTER TABLE Sales.InvoicesBase CHECK CONSTRAINT CK_Sales_InvoicesBase_ReturnedDeliveryData_Must_Be_Valid_JSON
GO
CREATE NONCLUSTERED INDEX FK_Sales_InvoicesBase_AccountsPersonID ON Sales.InvoicesBase (AccountsPersonID)
	WITH (DATA_COMPRESSION = PAGE) ON USERDATA;
CREATE NONCLUSTERED INDEX FK_Sales_InvoicesBase_BillToCustomerID ON Sales.InvoicesBase (BillToCustomerID)
	WITH (DATA_COMPRESSION = PAGE) ON USERDATA;
CREATE NONCLUSTERED INDEX FK_Sales_InvoicesBase_ContactPersonID ON Sales.InvoicesBase (ContactPersonID)
	WITH (DATA_COMPRESSION = PAGE) ON USERDATA;
CREATE NONCLUSTERED INDEX FK_Sales_InvoicesBase_CustomerID ON Sales.InvoicesBase (CustomerID)
	WITH (DATA_COMPRESSION = PAGE) ON USERDATA;
CREATE NONCLUSTERED INDEX FK_Sales_InvoicesBase_DeliveryMethodID ON Sales.InvoicesBase (DeliveryMethodID)
	WITH (DATA_COMPRESSION = PAGE) ON USERDATA;
CREATE NONCLUSTERED INDEX FK_Sales_InvoicesBase_OrderID ON Sales.InvoicesBase (OrderID)
	WITH (DATA_COMPRESSION = PAGE) ON USERDATA;
CREATE NONCLUSTERED INDEX FK_Sales_InvoicesBase_PackedByPersonID ON Sales.InvoicesBase (PackedByPersonID)
	WITH (DATA_COMPRESSION = PAGE) ON USERDATA;
CREATE NONCLUSTERED INDEX FK_Sales_InvoicesBase_SalespersonPersonID ON Sales.InvoicesBase (SalespersonPersonID)
	WITH (DATA_COMPRESSION = PAGE) ON USERDATA;
CREATE NONCLUSTERED INDEX IX_Sales_InvoicesBase_ConfirmedDeliveryTime ON Sales.InvoicesBase (ConfirmedDeliveryTime)
	INCLUDE (ConfirmedReceivedBy)
	WITH (DATA_COMPRESSION = PAGE) ON USERDATA

 

Now, we just have to move data from Sales.Invoices to Sales.InvoiceBase. We’ll also update FK references to point at the new table, and we’ll drop the old table.

INSERT INTO Sales.InvoicesBase 
	(InvoiceID, CustomerID, BillToCustomerID, OrderID, DeliveryMethodID, 
	ContactPersonID, AccountsPersonID, SalespersonPersonID, PackedByPersonID, 
	InvoiceDate, CustomerPurchaseOrderNumber, IsCreditNote, CreditNoteReason, 
	Comments, DeliveryInstructions, InternalComments, TotalDryItems, 
	TotalChillerItems, DeliveryRun, RunPosition, ReturnedDeliveryData, 
	LastEditedBy, LastEditedWhen)
SELECT InvoiceID, CustomerID, BillToCustomerID, OrderID, DeliveryMethodID, 
	ContactPersonID, AccountsPersonID, SalespersonPersonID, PackedByPersonID, 
	InvoiceDate, CustomerPurchaseOrderNumber, IsCreditNote, 
	COMPRESS(CreditNoteReason), COMPRESS(Comments), COMPRESS(DeliveryInstructions), 
	COMPRESS(InternalComments), TotalDryItems, TotalChillerItems, DeliveryRun, 
	RunPosition, COMPRESS(ReturnedDeliveryData), LastEditedBy, LastEditedWhen
FROM Sales.Invoices;
--
--
--Now, we can update FKs to point at InvoicesBase, and drop Invoices
--First update FKs
ALTER TABLE Sales.CustomerTransactions DROP CONSTRAINT FK_Sales_CustomerTransactions_InvoiceID_Sales_Invoices;
ALTER TABLE Sales.CustomerTransactions WITH CHECK ADD CONSTRAINT FK_Sales_CustomerTransactions_InvoiceID_Sales_Invoices FOREIGN KEY(InvoiceID)
 REFERENCES Sales.InvoicesBase (InvoiceID);
--
ALTER TABLE Sales.InvoiceLines DROP CONSTRAINT FK_Sales_InvoiceLines_InvoiceID_Sales_Invoices;
ALTER TABLE Sales.InvoiceLines WITH CHECK ADD CONSTRAINT FK_Sales_InvoiceLines_InvoiceID_Sales_Invoices FOREIGN KEY(InvoiceID)
 REFERENCES Sales.InvoicesBase (InvoiceID);
GO
--
ALTER TABLE Warehouse.StockItemTransactions DROP CONSTRAINT FK_Warehouse_StockItemTransactions_InvoiceID_Sales_Invoices;
ALTER TABLE Warehouse.StockItemTransactions WITH CHECK ADD CONSTRAINT FK_Warehouse_StockItemTransactions_InvoiceID_Sales_Invoices FOREIGN KEY(InvoiceID)
 REFERENCES Sales.InvoicesBase (InvoiceID);
--
--Now we can drop the original Invoices table
DROP TABLE Sales.Invoices;

Cool, Bro. Did you save any space?

ALTER INDEX ALL ON Sales.InvoicesBase REBUILD WITH (DATA_COMPRESSION=PAGE);
EXEC sp_spaceused 'Sales.InvoicesBase';

Yup. We eeked out another 20% space savings:

Compression Level Reserved Space (kb) Compression Ratio
None 75744 N/A
Page 52064 1.45:1  (31% original size)
Page + GZIP 37472 2.02:1  (51% original size)

 

Let’s make life a little bit easier

We can replace the Sales.Invoices table with a view of the same name, and embed the DECOMPRESS() calls inside that view:

CREATE VIEW Sales.Invoices
AS
SELECT InvoiceID, CustomerID, BillToCustomerID, OrderID, DeliveryMethodID, 
	ContactPersonID, AccountsPersonID, SalespersonPersonID, PackedByPersonID, 
	InvoiceDate, CustomerPurchaseOrderNumber, IsCreditNote, 
	DECOMPRESS(CreditNoteReason) AS CreditNoteReason, 
	DECOMPRESS(Comments) AS Comments, 
	DECOMPRESS(DeliveryInstructions) AS DeliveryInstructions, 
	COMPRESS(InternalComments) AS InternalComments, 
	TotalDryItems, TotalChillerItems, DeliveryRun, RunPosition, 
	DECOMPRESS(ReturnedDeliveryData) AS ReturnedDeliveryData, 
	ConfirmedDeliveryTime, ConfirmedReceivedBy, LastEditedBy, LastEditedWhen
FROM Sales.InvoicesBase;

 

Summary

I see this as being particularly useful with archive data. In most cases, older data is accessed less often, limiting the impact of additional CPU load. Additionally, greater data density (rows per GB) is valued with archive data. If your data takes up 20% less space, you can keep even more data!

One thing to note about the gzip algorithm is that compression ratios get better as data gets larger. War and Peace will get a much higher compression ratio than Readers’ Digest. That means that the bigger your LOBs are, the better compression you’ll see.

I’m looking forward to spending more time playing around with this function, and trying to measure what that CPU cost is. If you’ve been experimenting with COMPRESS(), comment below and let me know what you think.

2 Trackbacks & Pingbacks

  1. Compressing LOB Data – Curated SQL
  2. T-SQL Tuesday #079 Roundup: It’s 2016! | Michael J. Swart

Leave a Reply