Partitioning Data in SQL

(Disclaimer: Yes, I know there are ways to ‘let SQL do this for you’ but they’re bloody complicated, and I have a really nice, even hash key to use so it’s simple!)

(Disclaimer 2: This is a pretty long post.)

So, after a few years of usage gathering data, SQL tables can become pretty large. Especially if they’re wide tables. Consequently you also end up with a big database as well. What we want to do is to partition my data out into lots of smaller tables, across multiple databases.  For example, a message is stored in a ‘messages’ table with the company code the message is for in the row. This makes an excellent partition as it’s very unlikely you’ll be wanting to do a query on messages for different companies. Similarly, you might partition user preferences by user.

You could simply partition everything by using the first letter of the name. Things for ‘Acme’ go into ‘MessagesA’ etc. However, the natural distribution of letters will mean some tables are vastly bigger than others.   In my data I have hashcodes for everything which look like this: “F6QHC5TYB”. Think of them as mini GUID’s. Since these are pretty even, having an algorithm that decides where the data goes based on the hashcode should work fine.

Splitting across different tables is only half of the story though. You really want to be able to put those different tables into a few different databases, for example, with 4 tables split across 2 databases:

First Letter of Hash Table Database
     
A Messages1 Database1
B Messages2 Database2
C Messages3 Database1
D Messages4 Database2
E Messages1 Database1
F Messages2 Database2

So Messages1 & Messages3 live in Database1, Messages 2 and 4 live in Database2. All we need to do is to have something that generates a suffix for the table and the database based on the hashcode. You could then simply do:

“SELECT * From ” & GetDatabase(Hashcode) & “..” & GetTable(Hashcode) & ” Where blah blah blah”

Lets knock up a couple of functions then…..

 

For the Database

    Public Shared Function PartitioningGetDatabase(ByVal HashCode As String, ByVal DataType As PartitioningDataType) As String
        Dim DatabaseName As String = ""
        Dim DataBaseSuffix As String = ""
        Dim SuffixRange As Integer = 2


        Select Case DataType
            Case PartitioningDataType.Message
                DatabaseName = "Database"
        End Select


        Dim _HashNumber As Integer = Asc(Left(HashCode, 1))
        Dim _HashOutNumber As Integer = 0

        If _HashNumber > 47 And _HashNumber < 58 Then _HashOutNumber = _HashNumber - 48
        If _HashNumber > 64 And _HashNumber < 91 Then  _HashOutNumber = _HashNumber - 55

        Dim _DataBaseNumber = _HashOutNumber Mod SuffixRange

        DataBaseSuffix = Chr(49 + _DataBaseNumber)

        Return DatabaseName & DataBaseSuffix
    End Function

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

 

For the Table

    Public Shared Function PartitioningGetTable(ByVal HashCode As String, ByVal DataType As PartitioningDataType) As String
        Dim TableSuffix As String = ""
        Dim SuffixRange As Integer = 8


        Select Case DataType
            'Here was can set different ranges based on different data types
            Case PartitioningDataType.Message
                SuffixRange = 4
        End Select


        Dim _HashNumber As Integer = Asc(Left(HashCode, 1))
        Dim _HashOutNumber As Integer = 0

        If _HashNumber > 47 And _HashNumber < 58 Then _HashOutNumber = _HashNumber - 48
        If _HashNumber > 64 And _HashNumber < 91 Then _HashOutNumber = _HashNumber - 55


        Dim _TableNumber = _HashOutNumber Mod SuffixRange

        If _TableNumber < 9 Then
            TableSuffix = Chr(49 + _TableNumber)
        Else
            TableSuffix = Chr(56 + _TableNumber)
        End If

        Return TableSuffix
    End Function

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

 

The ‘suffixrange’ is what determines how big the spread is for the tables and the databases and is variable dependant on the type of datas. This way you can easily have it set to spread very heavily used data across more tables/db’s than less heavy data.

So let’s have a look at some data output using a simple test rig like this:

        Dim Chars As String = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

        For i As Integer = 1 To Len(Chars)
            Dim ThisChar As String = Mid(Chars, i, 1)
            Dim ThisTable As String = PartitioningGetTable(ThisChar, PartitioningDataType.Message)
            Dim ThisDatabase As String = PartitioningGetDatabase(ThisChar, PartitioningDataType.Message)
            System.Diagnostics.Debug.Print(ThisChar & "," & ThisDatabase & ",Messages" & ThisTable)
        Next i
 

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

1st Char of Hash Database Table
A DataShard2 Messages3
B DataShard1 Messages4
C DataShard2 Messages5
D DataShard1 Messages6
E DataShard2 Messages7
F DataShard1 Messages8
G DataShard2 Messages1
H DataShard1 Messages2
I DataShard2 Messages3
J DataShard1 Messages4
K DataShard2 Messages5
L DataShard1 Messages6
M DataShard2 Messages7
N DataShard1 Messages8
O DataShard2 Messages1
P DataShard1 Messages2
Q DataShard2 Messages3
R DataShard1 Messages4
S DataShard2 Messages5
T DataShard1 Messages6
U DataShard2 Messages7
V DataShard1 Messages8
W DataShard2 Messages1
X DataShard1 Messages2
Y DataShard2 Messages3
Z DataShard1 Messages4
0 DataShard2 Messages1
1 DataShard1 Messages2
2 DataShard2 Messages3
3 DataShard1 Messages4
4 DataShard2 Messages5
5 DataShard1 Messages6
6 DataShard2 Messages7
7 DataShard1 Messages8
8 DataShard2 Messages1
9 DataShard1 Messages2

 

All looking good ? OK, let’s change the suffixrange to spread it across 3 databases instead of 2…

 

8 tables, 2 DBs

8 tables, 3 DBs

1st Char of Hash Database Table
A DataShard1 Messages3
B DataShard2 Messages4
C DataShard1 Messages5
D DataShard2 Messages6
E DataShard1 Messages7
F DataShard2 Messages8
G DataShard1 Messages1
H DataShard2 Messages2
I DataShard1 Messages3
J DataShard2 Messages4
K DataShard1 Messages5
L DataShard2 Messages6
M DataShard1 Messages7
N DataShard2 Messages8
O DataShard1 Messages1
P DataShard2 Messages2
Q DataShard1 Messages3
R DataShard2 Messages4
S DataShard1 Messages5
T DataShard2 Messages6
U DataShard1 Messages7
V DataShard2 Messages8
W DataShard1 Messages1
X DataShard2 Messages2
Y DataShard1 Messages3
Z DataShard2 Messages4
0 DataShard1 Messages1
1 DataShard2 Messages2
2 DataShard1 Messages3
3 DataShard2 Messages4
4 DataShard1 Messages5
5 DataShard2 Messages6
6 DataShard1 Messages7
7 DataShard2 Messages8
8 DataShard1 Messages1
9 DataShard2 Messages2
1st Char Database Table
A DataShard2 Messages3
B DataShard3 Messages4
C DataShard1 Messages5
D DataShard2 Messages6
E DataShard3 Messages7
F DataShard1 Messages8
G DataShard2 Messages1
H DataShard3 Messages2
I DataShard1 Messages3
J DataShard2 Messages4
K DataShard3 Messages5
L DataShard1 Messages6
M DataShard2 Messages7
N DataShard3 Messages8
O DataShard1 Messages1
P DataShard2 Messages2
Q DataShard3 Messages3
R DataShard1 Messages4
S DataShard2 Messages5
T DataShard3 Messages6
U DataShard1 Messages7
V DataShard2 Messages8
W DataShard3 Messages1
X DataShard1 Messages2
Y DataShard2 Messages3
Z DataShard3 Messages4
0 DataShard1 Messages1
1 DataShard2 Messages2
2 DataShard3 Messages3
3 DataShard1 Messages4
4 DataShard2 Messages5
5 DataShard3 Messages6
6 DataShard1 Messages7
7 DataShard2 Messages8
8 DataShard3 Messages1
9 DataShard1 Messages2

Whoops!!! What happened there? The table Messages3 is now in 3 different databases. That’s not good. Whilst it’d be ‘workable’ it’s messy. If you ever want to move the data around again it’s going to be very hard work. So what went wrong? Well… look at our code that calls the functions again…..

Dim ThisTable As String = PartitioningGetTable(ThisChar, PartitioningDataType.Message)
Dim ThisDatabase As String = PartitioningGetDatabase(ThisChar, PartitioningDataType.Message)

 

Can you see the problem  ?

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

The table and the database are determined on the hash of the row we’re storing. Sounds correct yes? If you ran 8 tables on 4 DB’s it’d look fine against the original data, but since we’re using a non divisible numbers, the suffixes go out of sync. You could just assume that you’d be sensible and move from say 2 DB’s to 4, or we make a teeny tiny code change……..

Dim ThisTable As String = PartitioningGetTable(ThisChar, PartitioningDataType.Message)
Dim ThisDatabase As String = PartitioningGetDatabase(ThisTable, PartitioningDataType.Message)

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

 

Not a very complicated change is it, but it should be a fundamental…. The hashcode must just determine which table to use, and the table suffix should always determine the DB. That way you always always have the same tables in the same DB’s.

Hope that helps someone out before they make a massively silly mistake!

I’ll post up real life distribution data based on my keys at a later date.

Advertisements

Every guy needs a picture of a rocketship on his wall.

 

Quite timely this, what with the Shuttle Launch the other day. I bought a picture when visiting Kennedy Space Centre last year, but only just got round to putting it up.

 

 

It’s been there for a week now and it’s doing a great job of inspiring me. Every time I catch a glimpse of it, I’m reminded that:

  • With enough time, money and will power you can do pretty much anything within the boundaries of physics.
  • The Shuttle is huge and complicated, but you can break it down into lots of small easy problems. When designing the launch system, they didn’t have to think about how to get the loading bay doors to open.
  • Even something as insanely powerful as a rocket takes time to build momentum. On the left is T+1, then T+2, T+4, T+6, T+8 and T+9. It moves more from T+8- to T+9 than it does in the first 6 seconds. Nearly every project works in the same way.

What do you see every day that inspires you? If there’s nothing, then maybe there ought to be!

Who needs values these days?

 

Whilst doing some SQL optimization on something else in my system I spotted something odd whizz up SQL Profiler. It was a simple query to get the last 8 U2U’s for a user. However, some queries were doing 75,000 reads and taking near enough a second. What the hell ?

I copied and pasted the query and ran it as straight TSQL (converting it from the parameterised query). Odd…. only 130 reads?!

Running the original parameterized version produced the high read count and a complex query plan, and this was repeatable time and time again. So why was using a query window so much faster that using code???

Stupid code, thats why. I was creating the query like this using AddWithValue

    Dim cmd As Data.SqlClient.SqlCommand = New Data.SqlClient.SqlCommand(SQL, con)
        cmd.Parameters.AddWithValue("@LoggedInUser", PearlUserManagement.GetInfoAboutCurrentUser(PearlConstants.UserInfoItems.UserCode))
 

But the resultant SQL call contained this:

‘@LoggedInUser nvarchar(9)’,@LoggedInUser=N’ABCDEFGHI’

When the code was passing the usercode (9 character alphanumeric) it was passing it as an nvarchar rather than just a char. What if we change the parameterized version from:

 

exec sp_executesql N'SELECT top 8 UserMessages.*, PearlUsers..Users.UserName as UserFromName, PearlUsers..Users.FullName as UserFromNameFull FROM UserMessages LEFT JOIN PearlUsers..Users ON 
UserMessages.UserFrom=PearlUsers..Users.UserCode Where MessageType = 0 and UserTo = @LoggedInUser order by datesent desc',N'@LoggedInUser nvarchar(9)',@LoggedInUser=N'ABCDEFGHI'

To:

exec sp_executesql N'SELECT top 8 UserMessages.*, PearlUsers..Users.UserName as UserFromName, PearlUsers..Users.FullName as UserFromNameFull FROM UserMessages LEFT JOIN PearlUsers..Users ON 
UserMessages.UserFrom=PearlUsers..Users.UserCode Where MessageType = 0 and UserTo = @LoggedInUser order by datesent desc',N'@LoggedInUser char(9)',@LoggedInUser='ABCDEFGHI'

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

 

Original query used 13436 reads and took 703ms.

Version using ‘char’ used 73 reads and took 171ms!!

Looking at the query plans for both you can see vastly different patterns:

 

 

The first query was ignoring the index as the index was built on a char and not an nvarchar!!

So quick change of the code from:

    Dim cmd As Data.SqlClient.SqlCommand = New Data.SqlClient.SqlCommand(SQL, con)
        cmd.Parameters.AddWithValue("@LoggedInUser", PearlUserManagement.GetInfoAboutCurrentUser(PearlConstants.UserInfoItems.UserCode))

To

  Dim cmd As Data.SqlClient.SqlCommand = New Data.SqlClient.SqlCommand(SQL, con)
        cmd.Parameters.Add("@LoggedInUser", Data.SqlDbType.Char).Value = PearlUserManagement.GetInfoAboutCurrentUser(PearlConstants.UserInfoItems.UserCode)

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

and we’ve reduced disk reads by a factor of 180, and dropped the query time to just a quarter.

So, AddWithValue is great for not having to think about things, but it can cause some REALLY REALLY poor query plans in SQL!!

Shared file storage on a web cluster.

 

Part of my web app allows users to upload their own avatars as a GIF/PNG/JPG. It then resizes the pic automatically, saves out the file (and a smaller thumbnail) into ‘/static/avatars/’ and updates their Avatar_Filename field in the user table.  All I had to do was to grant the user account that IIS runs under the appropriate write permissions for that dir. Simple enough, and it’s worked for ages.

So now we want to use a web cluster we have a problem. If you’re on Server A and you add an avatar, anyone on Server B won’t see it. We need to have shared file storage.

“I know, I’ll just create another website on an IIS server and that can be ‘static.mydomain.com’ and I’ll just save the files to that across the network, and then reference the different hostname… easy, and I’ll be done in time for lunch”.

HAH! Not so. You can’t save files across to a different machine with IIS. IIS has no network access. Well.. you *can* bodge it, but you have to change the user account that IIS runs on to a domain account, then grant it write access. Not only did that sound dangerous security wise, it also sounded a bit……….. shaky. What if I rebuild the server and forget that bit?

A bit of brain storming later and I had a solution. A solution so cunning, that it needed a super accurate diagram to even think of it:

 

So, take the file the user uploads on the web host, resize it, shred it into a byte array and whack it into a SQL blob on the main server. Then call a webservice on the static file server that says “I have a blob in SQL waiting for you with the file name of XYZ, make a file from it!”

The webservice then grabs the blob from SQL using the filename passed to it in the service call, gets the bytearray, saves it with an IO stream and returns True.

Yes, I could pass the byte array directly to the webservice but it’s not worth it.