InMemory OLTP for ETL

I’ve had a new project at work where I need to do some heavy lookups (the lookup table holds 26+ million rows on a full load), and the performance has been horrible.
I needed to do lookups on the same data set that I’m sending through a data flow in SSIS.
The reason for this is a business case of finding where the people who shopped at a particular store arrived from and are heading to.
The data we have tells us some of this, but also gives hints about the people whose origins or destinations we don’t know.
So what I’m doing is applying a set of business rules to the data and doing lookups in several steps based on those rules.

Now, in a regular disk-based database this ends in lock conflicts, so sending the data into a temp table or a regular cloned table is the only solution.
That adds to the loading times and creates other resource conflicts. The job got too heavy to function on a full load.
Let’s just say that if it didn’t crash the server, the SSIS job took 12+ hours…
I’ve been looking into ways of improving this load, and stumbled across SQL Judo’s post on in-memory staging.

As we were in the process of upgrading from SQL Server 2012 to 2016 by the end of September, I decided to investigate further. The SQL Judo post links to another important tidbit – the hash index type for memory-optimized tables that was introduced with SQL Server 2014. It supposedly has one job – point lookups that return a single value – and it does it well.
You can find code samples and an explanation here.

Naturally, the code samples are simple and might not fit exactly what you want to do, but you get the general idea. Just remember the main challenge with this index type: the bucket count has to be large enough for the number of unique values in your table…
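By the way, once the table is loaded you can sanity-check your bucket count with the hash index DMV – if avg_chain_length climbs well above 1 while the empty buckets run out, the bucket count is probably too small. Something along these lines, using the table and index names from further down in this post:

-- Rough bucket count sanity check for the hash indexes on the lookup table.
-- avg_chain_length well above 1 (with few empty buckets) usually means too few buckets.
SELECT  OBJECT_NAME(hs.object_id)  AS table_name,
        i.name                     AS index_name,
        hs.total_bucket_count,
        hs.empty_bucket_count,
        hs.avg_chain_length,
        hs.max_chain_length
FROM    sys.dm_db_xtp_hash_index_stats AS hs
JOIN    sys.indexes AS i
  ON    i.object_id = hs.object_id
 AND    i.index_id  = hs.index_id
WHERE   hs.object_id = OBJECT_ID('demo.TempLookup');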

All right, so I start by making a hashed column for what constitutes a unique row in my data set, and then put the hash index on this column. You don’t have to create a hashed column – that’s just due to the nature of my data set. Normally you’d put the index directly on the column you want to run the lookup on.
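For illustration, one way to build such a hash (not necessarily exactly how my package does it) is HASHBYTES over the columns that make a row unique – SHA2_256 returns a varbinary(32), which is why the hash columns further down are varbinary(50). The source name here is made up:

-- Sketch: compute the lookup hash over the columns that make a row unique.
-- The '|' separator stops different column splits from producing the same string.
-- [stage].[SourceRows] is a hypothetical source; swap in your own staging object.
SELECT  HASHBYTES('SHA2_256',
                  CONCAT([Bnr], '|', [Dep_Code], '|', [DateSk], '|', [TimeSk])) AS HASH_Dep
FROM    [stage].[SourceRows];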

Great! Now I’m ready to start creating my first InMemory table…
Then I get this error:

Cannot create memory optimized tables. To create memory optimized tables, the database must have a MEMORY_OPTIMIZED_FILEGROUP that is online and has at least one container.

Ah, of course – you need to prep the database for InMemory OLTP… A quick search in BOL reveals the needed steps:

ALTER DATABASE imoltp ADD FILEGROUP imoltp_mod CONTAINS MEMORY_OPTIMIZED_DATA  

ALTER DATABASE imoltp ADD FILE (name='imoltp_mod1', filename='c:\data\imoltp_mod1') 
TO FILEGROUP imoltp_mod  
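If you want to verify that the filegroup and container actually made it in, a quick look at the system catalog for the current database does the trick (my own sanity check, not part of the BOL steps):

-- List memory-optimized filegroups (type FX) and their containers in the current database.
SELECT  fg.name  AS filegroup_name,
        fg.type_desc,
        df.name  AS container_name,
        df.physical_name
FROM    sys.filegroups AS fg
LEFT JOIN sys.database_files AS df
  ON    df.data_space_id = fg.data_space_id
WHERE   fg.type = 'FX';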

Ok, this time though, we’re good to go.

CREATE TABLE [demo].[TempLookup]
(
	[OLTP_ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
	[Bnr] [nvarchar](15) COLLATE Danish_Norwegian_CI_AS NULL,
	[Dep_Code] [varchar](5) COLLATE Danish_Norwegian_CI_AS NULL,
	[Initials] [nvarchar](3) COLLATE Danish_Norwegian_CI_AS NULL,
	[Dest_Code] [varchar](5) COLLATE Danish_Norwegian_CI_AS NULL,
	[DateSk] [int] NULL,
	[TimeSk] [int] NULL,
	[platform] [nvarchar](max) COLLATE Danish_Norwegian_CI_AS NULL,
	[Sequence number] [nvarchar](10) COLLATE Danish_Norwegian_CI_AS NULL,
	[EndTimeSk] [int] NULL,
	HASH_Dep  [varbinary](50) NULL INDEX Ix_Hash_Dep  NONCLUSTERED HASH WITH (BUCKET_COUNT=30000000),
	HASH_Dest [varbinary](50) NULL INDEX Ix_Hash_Dest NONCLUSTERED HASH WITH (BUCKET_COUNT=30000000),
	HASH_Time [varbinary](50) NULL INDEX Ix_Hash_Time NONCLUSTERED HASH WITH (BUCKET_COUNT=30000000)
)WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_ONLY )

I tested both SCHEMA_AND_DATA and SCHEMA_ONLY, which are the two durability options available for InMemory tables.
SCHEMA_AND_DATA guarantees that the data won’t be lost if the server goes down, but you lose a bit of speed since the data also has to be persisted to disk.
In my tests, SCHEMA_ONLY gave so much better performance that I decided it was worth giving up the extra safety that data retention gave me.
In a staging or transformation step in SSIS, in my experience, you rarely have to worry about losing the data – just add a load_date filter if you need to and start the job from scratch.
But this might be important to you, so take a moment to decide on this setting.

Alright, back to my little table, where I run into this little error:

Msg 12317, Level 16, State 78, Line 20
Clustered indexes, which are the default for primary keys, are not supported with memory optimized tables. Specify a NONCLUSTERED index instead.

Oh yeah, sorry about that… InMemory – or Hekaton if you like – has its quirks, or rather limitations.
Just add that crucial little “NON” to the primary key and we should be OK.

CREATE TABLE [demo].[TempLookup]
(
	[OLTP_ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
	[Bnr] [nvarchar](15) COLLATE Danish_Norwegian_CI_AS NULL,
	[Dep_Code] [varchar](5) COLLATE Danish_Norwegian_CI_AS NULL,
	[Initials] [nvarchar](3) COLLATE Danish_Norwegian_CI_AS NULL,
	[Dest_Code] [varchar](5) COLLATE Danish_Norwegian_CI_AS NULL,
	[DateSk] [int] NULL,
	[TimeSk] [int] NULL,
	[platform] [nvarchar](max) COLLATE Danish_Norwegian_CI_AS NULL,
	[Sequence number] [nvarchar](10) COLLATE Danish_Norwegian_CI_AS NULL,
	[EndTimeSk] [int] NULL,
	HASH_Dep  [varbinary](50) NULL,
	HASH_Dest [varbinary](50) NULL,
	HASH_Time [varbinary](50) NULL,

	INDEX Ix_Hash_Dep  HASH(HASH_Dep)  WITH (BUCKET_COUNT=30000000),
	INDEX Ix_Hash_Dest HASH(HASH_Dest) WITH (BUCKET_COUNT=30000000),
	INDEX Ix_Hash_Time HASH(HASH_Time) WITH (BUCKET_COUNT=30000000)

)WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_ONLY )
GO

SET ANSI_PADDING OFF
GO

Alright, this did the trick. I now have a table ready for the task at hand.
Now I swap the lookup queries so they use the hashed column as the lookup value, and start the package.
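For the curious, the lookups now boil down to an equality match on the hash column – exactly the single-value seek the hash index is built for. A simplified sketch, not the exact query from my package (the variables stand in for the incoming row’s values):

-- Simplified lookup: equality predicate on the hash column so the hash index can do its job.
-- The variables stand in for the values of the incoming row; example values are made up.
DECLARE @Bnr nvarchar(15) = N'12345', @Dep_Code varchar(5) = 'OSL',
        @DateSk int = 20160901, @TimeSk int = 1200;

SELECT  t.[Dest_Code],
        t.[EndTimeSk]
FROM    [demo].[TempLookup] AS t
WHERE   t.HASH_Dep = HASHBYTES('SHA2_256',
                               CONCAT(@Bnr, '|', @Dep_Code, '|', @DateSk, '|', @TimeSk));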

The result? While chewing through 26 million rows still takes a while, the job now completes in about 3 hours. Yes, that’s roughly a quarter of the time it took before. Granted, we did have to double the memory on the server – but we had plans for doing that anyway.
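And if you’re wondering how much memory a table like this actually eats (handy ammunition when asking for that memory upgrade), the XTP DMVs will tell you:

-- How much memory the memory-optimized table and its indexes are using right now.
SELECT  OBJECT_NAME(ms.object_id)             AS table_name,
        ms.memory_used_by_table_kb   / 1024   AS table_mb,
        ms.memory_used_by_indexes_kb / 1024   AS index_mb
FROM    sys.dm_db_xtp_table_memory_stats AS ms
WHERE   ms.object_id = OBJECT_ID('demo.TempLookup');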
The job still fails at times, but when it does, it fails at the beginning of the run – leaving no error messages in the regular SSIS execution log.
This could be a bug, or it could be a resource conflict – and it perhaps warrants another blog post once I’ve figured it out.
But for now, if it fails, I just restart the job and it runs through within the normal DWH loading window. So all is well.