Monday, May 14, 2012

Compress XML data

Its been while since I've posted anything. Sorry folks. I've done some very cool stuff, will try and post in the coming weeks. 1 in particular is the Minionizer or Minionator, depending on the scale and effect of the minions. ;-). Basically taking a unit of work, encapsulating inot a Powershell thread, i.e. job, and send off these threads to numerous destinations, like servers or databases. In some cases I can gather server and database information from 300+ servers in 15 minutes. But more on that in the next post.

So, XML data compression. From the reading you may have done already, you don't get too much return from conventional compression techniques like native data compression in SQL 2008. Some folk have used transparent data encryption and got a 30% compression. More PT than its worth. What really grates me is the repetition of data and elements in each XML record. The nature of the beast. This was the primary target for my compression technique. Basically, I use xpath query and show the XML element names of the XML string. So if an XML string has 15 elements, the names of these will be repeated twice, once at the beginning of the element and once at the end. I then replace each of these element names with a 2 character string, made up of special characters. So element name PostalAddress may be replaced with ## for example. I iterate through the element names with a cursor. Its the simplest way I could think of doing this. There are probably smarter ways, but my method works quite efficiently.

One thing to note, not all XML data returns great compression using the above technique. After much testing I found that XML of length 4000 - 6000 bytes returned the most compression. In some cases as much as 140 fold compression!!! Yes, thats correct. So a 10000 record sample originally used 33256 KB, after compression this was 272 KB. The trick is to be able to get the compressed XML string and the entire record to smaller than 4000 bytes, and then apply Table compression at the page level.

Another note, the element names are replaced with 2 byte special character strings. This mapping of original element name to 2 byte string is saved in a mapping table, almost like an encryption key. Lose this table and you lose your mapping. The data in the XML string is not changed, only the XML schema, so if you know the schema you can still extract the data.

The 1st bit of code you'll need is a view that generates a few thousand 2 character strings:
CREATE VIEW uv_Charr
AS
WITH charr(CharNumber, CharVal)
AS
(
SELECT 192 AS CharNumber, CHAR(192) AS CharVal UNION ALL
SELECT 193 AS CharNumber, CHAR(193) AS CharVal UNION ALL
SELECT 194 AS CharNumber, CHAR(194) AS CharVal UNION ALL
....
SELECT 255 AS CharNumber, CHAR(255) AS CharVal
)
Select c1.CharVal+c2.CharVal Charr FROM charr c1 CROSS JOIN charr c2

--Avoid 215 and 248 go all the way up to 255. This will give you 3721 2 byte strings
The rest of the code basically gets the element names, joins these to the data from the above view and then updates the XML string. The below code saves the mapping of element names to 2 byte strings to the MQMessageDetailLogNodeMap. The query below is against a table called MQMessageDetailLog
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[MQMessageDetailLogNodeMap]') AND type in (N'U'))
 
DROP TABLE [dbo].[MQMessageDetailLogNodeMap]
GO
SET NOCOUNT ON

DECLARE @StartTime DATETIME
DECLARE @EndTIME DATETIME
 
SET @StartTime = GETDATE()
DECLARE @Char TABLE
(ID INT IDENTITY(1,1),
CharVal CHAR(2))

INSERT @Char
SELECT Charr FROM uv_Charr
 
DECLARE @Xs VARCHAR(MAX)
DECLARE @Xso VARCHAR(MAX)
DECLARE @XX XML
DECLARE @Nv1 TABLE
(NodeName XML)
DECLARE @Nv2 TABLE
(ID INT IDENTITY(1,1),
NodeName XML)
 
SET @Xso = (SELECT TOP 1 CAST(MessageData AS VARCHAR(MAX)) FROM MQMessageDetailLog)

 
SET @XX = @Xso 
 
INSERT @Nv1
SELECT Nodes.Name.query('local-name(.)') FROM @XX.nodes('//*') As Nodes(Name)
 
INSERT @Nv2
SELECT DISTINCT CAST(NodeName AS VARCHAR(255)) FROM @Nv1

SELECT CAST(NodeName AS VARCHAR(255)) AS NodeName, c.CharVal 
INTO MQMessageDetailLogNodeMap
from @Nv2 n inner join @Char c on n.ID = c.ID
WHERE LEN(CAST(NodeName AS VARCHAR(255))) > 5
ORDER BY LEN(CAST(NodeName AS VARCHAR(255))) DESC

DECLARE @MessageLogID BIGINT
DECLARE @MessageXML XML
DECLARE UpdateAll CURSOR FOR 
ELECT MQMessageHeaderLogID, MessageData FROM dbo.MQMessageDetailLog
OPEN UpdateAll
FETCH NEXT FROM UpdateAll INTO @MessageLogID, @MessageXML
WHILE @@FETCH_STATUS = 0
BEGIN 
DECLARE @NodeName VARCHAR(255)
DECLARE @CharVal CHAR(2)
 
DECLARE Upd CURSOR FOR
SELECT CAST(NodeName AS VARCHAR(255)) AS NodeName, c.CharVal
from @Nv2 n inner join @Char c on n.ID = c.ID
OPEN Upd
FETCH NEXT FROM Upd INTO @NodeName, @CharVal
WHILE @@FETCH_STATUS = 0
BEGIN
SET @Xso = REPLACE(REPLACE(REPLACE(@Xso, '     ', ' '), '    ', ' '), '   ', ' ')
SET @Xso = REPLACE(@Xso,@NodeName, @CharVal)
UPDATE dbo.MQMessageDetailLog
SET MessageData = CAST(@Xso AS VARCHAR(MAX))
WHERE MQMessageHeaderLogID = @MessageLogID
 
FETCH NEXT FROM Upd INTO @NodeName, @CharVal
END
CLOSE Upd
DEALLOCATE Upd
 
FETCH NEXT FROM UpdateAll INTO @MessageLogID, @MessageXML
 
END
CLOSE UpdateALL
DEALLOCATE UpdateALL
 
SET @EndTime = GETDATE()
SELECT DATEDIFF(ms, @StartTime, @EndTime)
 
ALTER INDEX idx_MQMessageHeaderLogID ON [dbo].MQMessageDetailLog REBUILD PARTITION = ALL WITH ( PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, ONLINE = OFF, SORT_IN_TEMPDB = OFF, FILLFACTOR = 100, DATA_COMPRESSION = PAGE)

If you need any more information on the above, give me a shout.