Partitioning Data in SQL

July 17, 2009 ~ briandrought ~ 2 Comments

(Disclaimer: Yes, I know there are ways to ‘let SQL do this for you’ but they’re bloody complicated, and I have a really nice, even hash key to use so it’s simple!)

(Disclaimer 2: This is a pretty long post.)

So, after a few years of usage gathering data, SQL tables can become pretty large. Especially if they’re wide tables. Consequently you also end up with a big database as well. What we want to do is to partition my data out into lots of smaller tables, across multiple databases. For example, a message is stored in a ‘messages’ table with the company code the message is for in the row. This makes an excellent partition as it’s very unlikely you’ll be wanting to do a query on messages for different companies. Similarly, you might partition user preferences by user.

You could simply partition everything by using the first letter of the name. Things for ‘Acme’ go into ‘MessagesA’ etc. However, the natural distribution of letters will mean some tables are vastly bigger than others. In my data I have hashcodes for everything which look like this: “F6QHC5TYB”. Think of them as mini GUID’s. Since these are pretty even, having an algorithm that decides where the data goes based on the hashcode should work fine.

Splitting across different tables is only half of the story though. You really want to be able to put those different tables into a few different databases, for example, with 4 tables split across 2 databases:

First Letter of Hash	Table	Database

A	Messages1	Database1
B	Messages2	Database2
C	Messages3	Database1
D	Messages4	Database2
E	Messages1	Database1
F	Messages2	Database2

So Messages1 & Messages3 live in Database1, Messages 2 and 4 live in Database2. All we need to do is to have something that generates a suffix for the table and the database based on the hashcode. You could then simply do:

“SELECT * From ” & GetDatabase(Hashcode) & “..” & GetTable(Hashcode) & ” Where blah blah blah”

Lets knock up a couple of functions then…..

For the Database

    Public Shared Function PartitioningGetDatabase(ByVal HashCode As String, ByVal DataType As PartitioningDataType) As String
        Dim DatabaseName As String = ""
        Dim DataBaseSuffix As String = ""
        Dim SuffixRange As Integer = 2


        Select Case DataType

            Case PartitioningDataType.Message
                DatabaseName = "Database"
        End Select


        Dim _HashNumber As Integer = Asc(Left(HashCode, 1))
        Dim _HashOutNumber As Integer = 0

        If _HashNumber > 47 And _HashNumber < 58 Then _HashOutNumber = _HashNumber - 48
        If _HashNumber > 64 And _HashNumber < 91 Then  _HashOutNumber = _HashNumber - 55

        Dim _DataBaseNumber = _HashOutNumber Mod SuffixRange

        DataBaseSuffix = Chr(49 + _DataBaseNumber)

        Return DatabaseName & DataBaseSuffix
    End Function

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

For the Table

    Public Shared Function PartitioningGetTable(ByVal HashCode As String, ByVal DataType As PartitioningDataType) As String
        Dim TableSuffix As String = ""
        Dim SuffixRange As Integer = 8


        Select Case DataType
            'Here was can set different ranges based on different data types
            Case PartitioningDataType.Message
                SuffixRange = 4
        End Select


        Dim _HashNumber As Integer = Asc(Left(HashCode, 1))
        Dim _HashOutNumber As Integer = 0

        If _HashNumber > 47 And _HashNumber < 58 Then _HashOutNumber = _HashNumber - 48
        If _HashNumber > 64 And _HashNumber < 91 Then _HashOutNumber = _HashNumber - 55


        Dim _TableNumber = _HashOutNumber Mod SuffixRange

        If _TableNumber < 9 Then
            TableSuffix = Chr(49 + _TableNumber)
        Else
            TableSuffix = Chr(56 + _TableNumber)
        End If

        Return TableSuffix
    End Function

The ‘suffixrange’ is what determines how big the spread is for the tables and the databases and is variable dependant on the type of datas. This way you can easily have it set to spread very heavily used data across more tables/db’s than less heavy data.

So let’s have a look at some data output using a simple test rig like this:

        Dim Chars As String = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

        For i As Integer = 1 To Len(Chars)

            Dim ThisChar As String = Mid(Chars, i, 1)
            Dim ThisTable As String = PartitioningGetTable(ThisChar, PartitioningDataType.Message)
            Dim ThisDatabase As String = PartitioningGetDatabase(ThisChar, PartitioningDataType.Message)
            System.Diagnostics.Debug.Print(ThisChar & "," & ThisDatabase & ",Messages" & ThisTable)
        Next i

1st Char of Hash	Database	Table
A	DataShard2	Messages3
B	DataShard1	Messages4
C	DataShard2	Messages5
D	DataShard1	Messages6
E	DataShard2	Messages7
F	DataShard1	Messages8
G	DataShard2	Messages1
H	DataShard1	Messages2
I	DataShard2	Messages3
J	DataShard1	Messages4
K	DataShard2	Messages5
L	DataShard1	Messages6
M	DataShard2	Messages7
N	DataShard1	Messages8
O	DataShard2	Messages1
P	DataShard1	Messages2
Q	DataShard2	Messages3
R	DataShard1	Messages4
S	DataShard2	Messages5
T	DataShard1	Messages6
U	DataShard2	Messages7
V	DataShard1	Messages8
W	DataShard2	Messages1
X	DataShard1	Messages2
Y	DataShard2	Messages3
Z	DataShard1	Messages4
0	DataShard2	Messages1
1	DataShard1	Messages2
2	DataShard2	Messages3
3	DataShard1	Messages4
4	DataShard2	Messages5
5	DataShard1	Messages6
6	DataShard2	Messages7
7	DataShard1	Messages8
8	DataShard2	Messages1
9	DataShard1	Messages2

All looking good ? OK, let’s change the suffixrange to spread it across 3 databases instead of 2…

8 tables, 2 DBs

8 tables, 3 DBs

1st Char of Hash	Database	Table
A	DataShard1	Messages3
B	DataShard2	Messages4
C	DataShard1	Messages5
D	DataShard2	Messages6
E	DataShard1	Messages7
F	DataShard2	Messages8
G	DataShard1	Messages1
H	DataShard2	Messages2
I	DataShard1	Messages3
J	DataShard2	Messages4
K	DataShard1	Messages5
L	DataShard2	Messages6
M	DataShard1	Messages7
N	DataShard2	Messages8
O	DataShard1	Messages1
P	DataShard2	Messages2
Q	DataShard1	Messages3
R	DataShard2	Messages4
S	DataShard1	Messages5
T	DataShard2	Messages6
U	DataShard1	Messages7
V	DataShard2	Messages8
W	DataShard1	Messages1
X	DataShard2	Messages2
Y	DataShard1	Messages3
Z	DataShard2	Messages4
0	DataShard1	Messages1
1	DataShard2	Messages2
2	DataShard1	Messages3
3	DataShard2	Messages4
4	DataShard1	Messages5
5	DataShard2	Messages6
6	DataShard1	Messages7
7	DataShard2	Messages8
8	DataShard1	Messages1
9	DataShard2	Messages2

1st Char	Database	Table
A	DataShard2	Messages3
B	DataShard3	Messages4
C	DataShard1	Messages5
D	DataShard2	Messages6
E	DataShard3	Messages7
F	DataShard1	Messages8
G	DataShard2	Messages1
H	DataShard3	Messages2
I	DataShard1	Messages3
J	DataShard2	Messages4
K	DataShard3	Messages5
L	DataShard1	Messages6
M	DataShard2	Messages7
N	DataShard3	Messages8
O	DataShard1	Messages1
P	DataShard2	Messages2
Q	DataShard3	Messages3
R	DataShard1	Messages4
S	DataShard2	Messages5
T	DataShard3	Messages6
U	DataShard1	Messages7
V	DataShard2	Messages8
W	DataShard3	Messages1
X	DataShard1	Messages2
Y	DataShard2	Messages3
Z	DataShard3	Messages4
0	DataShard1	Messages1
1	DataShard2	Messages2
2	DataShard3	Messages3
3	DataShard1	Messages4
4	DataShard2	Messages5
5	DataShard3	Messages6
6	DataShard1	Messages7
7	DataShard2	Messages8
8	DataShard3	Messages1
9	DataShard1	Messages2

Whoops!!! What happened there? The table Messages3 is now in 3 different databases. That’s not good. Whilst it’d be ‘workable’ it’s messy. If you ever want to move the data around again it’s going to be very hard work. So what went wrong? Well… look at our code that calls the functions again…..

Dim ThisTable As String = PartitioningGetTable(ThisChar, PartitioningDataType.Message)
Dim ThisDatabase As String = PartitioningGetDatabase(ThisChar, PartitioningDataType.Message)

Can you see the problem ?

The table and the database are determined on the hash of the row we’re storing. Sounds correct yes? If you ran 8 tables on 4 DB’s it’d look fine against the original data, but since we’re using a non divisible numbers, the suffixes go out of sync. You could just assume that you’d be sensible and move from say 2 DB’s to 4, or we make a teeny tiny code change……..

Dim ThisTable As String = PartitioningGetTable(ThisChar, PartitioningDataType.Message)
Dim ThisDatabase As String = PartitioningGetDatabase(ThisTable, PartitioningDataType.Message)

Not a very complicated change is it, but it should be a fundamental…. The hashcode must just determine which table to use, and the table suffix should always determine the DB. That way you always always have the same tables in the same DB’s.

Hope that helps someone out before they make a massively silly mistake!

I’ll post up real life distribution data based on my keys at a later date.

Every guy needs a picture of a rocketship on his wall.

July 16, 2009 ~ briandrought ~ 3 Comments

Quite timely this, what with the Shuttle Launch the other day. I bought a picture when visiting Kennedy Space Centre last year, but only just got round to putting it up.

It’s been there for a week now and it’s doing a great job of inspiring me. Every time I catch a glimpse of it, I’m reminded that:

With enough time, money and will power you can do pretty much anything within the boundaries of physics.
The Shuttle is huge and complicated, but you can break it down into lots of small easy problems. When designing the launch system, they didn’t have to think about how to get the loading bay doors to open.
Even something as insanely powerful as a rocket takes time to build momentum. On the left is T+1, then T+2, T+4, T+6, T+8 and T+9. It moves more from T+8- to T+9 than it does in the first 6 seconds. Nearly every project works in the same way.

What do you see every day that inspires you? If there’s nothing, then maybe there ought to be!

Who needs values these days?

July 3, 2009 ~ briandrought ~ 2 Comments

Whilst doing some SQL optimization on something else in my system I spotted something odd whizz up SQL Profiler. It was a simple query to get the last 8 U2U’s for a user. However, some queries were doing 75,000 reads and taking near enough a second. What the hell ?

I copied and pasted the query and ran it as straight TSQL (converting it from the parameterised query). Odd…. only 130 reads?!

Running the original parameterized version produced the high read count and a complex query plan, and this was repeatable time and time again. So why was using a query window so much faster that using code???

Stupid code, thats why. I was creating the query like this using AddWithValue

    Dim cmd As Data.SqlClient.SqlCommand = New Data.SqlClient.SqlCommand(SQL, con)
        cmd.Parameters.AddWithValue("@LoggedInUser", PearlUserManagement.GetInfoAboutCurrentUser(PearlConstants.UserInfoItems.UserCode))

But the resultant SQL call contained this:

‘@LoggedInUser nvarchar(9)’,@LoggedInUser=N’ABCDEFGHI’

When the code was passing the usercode (9 character alphanumeric) it was passing it as an nvarchar rather than just a char. What if we change the parameterized version from:

exec sp_executesql N'SELECT top 8 UserMessages.*, PearlUsers..Users.UserName as UserFromName, PearlUsers..Users.FullName as UserFromNameFull FROM UserMessages LEFT JOIN PearlUsers..Users ON 
UserMessages.UserFrom=PearlUsers..Users.UserCode Where MessageType = 0 and UserTo = @LoggedInUser order by datesent desc',N'@LoggedInUser nvarchar(9)',@LoggedInUser=N'ABCDEFGHI'

To:

exec sp_executesql N'SELECT top 8 UserMessages.*, PearlUsers..Users.UserName as UserFromName, PearlUsers..Users.FullName as UserFromNameFull FROM UserMessages LEFT JOIN PearlUsers..Users ON 
UserMessages.UserFrom=PearlUsers..Users.UserCode Where MessageType = 0 and UserTo = @LoggedInUser order by datesent desc',N'@LoggedInUser char(9)',@LoggedInUser='ABCDEFGHI'

Original query used 13436 reads and took 703ms.

Version using ‘char’ used 73 reads and took 171ms!!

Looking at the query plans for both you can see vastly different patterns:

The first query was ignoring the index as the index was built on a char and not an nvarchar!!

So quick change of the code from:

    Dim cmd As Data.SqlClient.SqlCommand = New Data.SqlClient.SqlCommand(SQL, con)
        cmd.Parameters.AddWithValue("@LoggedInUser", PearlUserManagement.GetInfoAboutCurrentUser(PearlConstants.UserInfoItems.UserCode))

  Dim cmd As Data.SqlClient.SqlCommand = New Data.SqlClient.SqlCommand(SQL, con)
        cmd.Parameters.Add("@LoggedInUser", Data.SqlDbType.Char).Value = PearlUserManagement.GetInfoAboutCurrentUser(PearlConstants.UserInfoItems.UserCode)

and we’ve reduced disk reads by a factor of 180, and dropped the query time to just a quarter.

So, AddWithValue is great for not having to think about things, but it can cause some REALLY REALLY poor query plans in SQL!!

Shared file storage on a web cluster.

July 1, 2009 ~ briandrought ~ 1 Comment

Part of my web app allows users to upload their own avatars as a GIF/PNG/JPG. It then resizes the pic automatically, saves out the file (and a smaller thumbnail) into ‘/static/avatars/’ and updates their Avatar_Filename field in the user table. All I had to do was to grant the user account that IIS runs under the appropriate write permissions for that dir. Simple enough, and it’s worked for ages.

So now we want to use a web cluster we have a problem. If you’re on Server A and you add an avatar, anyone on Server B won’t see it. We need to have shared file storage.

“I know, I’ll just create another website on an IIS server and that can be ‘static.mydomain.com’ and I’ll just save the files to that across the network, and then reference the different hostname… easy, and I’ll be done in time for lunch”.

HAH! Not so. You can’t save files across to a different machine with IIS. IIS has no network access. Well.. you *can* bodge it, but you have to change the user account that IIS runs on to a domain account, then grant it write access. Not only did that sound dangerous security wise, it also sounded a bit……….. shaky. What if I rebuild the server and forget that bit?

A bit of brain storming later and I had a solution. A solution so cunning, that it needed a super accurate diagram to even think of it:

So, take the file the user uploads on the web host, resize it, shred it into a byte array and whack it into a SQL blob on the main server. Then call a webservice on the static file server that says “I have a blob in SQL waiting for you with the file name of XYZ, make a file from it!”

The webservice then grabs the blob from SQL using the filename passed to it in the service call, gets the bytearray, saves it with an IO stream and returns True.

Yes, I could pass the byte array directly to the webservice but it’s not worth it.