Partitioning Data in SQL

(Disclaimer: Yes, I know there are ways to ‘let SQL do this for you’ but they’re bloody complicated, and I have a really nice, even hash key to use so it’s simple!)

(Disclaimer 2: This is a pretty long post.)

So, after a few years of usage gathering data, SQL tables can become pretty large. Especially if they’re wide tables. Consequently you also end up with a big database as well. What we want to do is to partition my data out into lots of smaller tables, across multiple databases.  For example, a message is stored in a ‘messages’ table with the company code the message is for in the row. This makes an excellent partition as it’s very unlikely you’ll be wanting to do a query on messages for different companies. Similarly, you might partition user preferences by user.

You could simply partition everything by using the first letter of the name. Things for ‘Acme’ go into ‘MessagesA’ etc. However, the natural distribution of letters will mean some tables are vastly bigger than others.   In my data I have hashcodes for everything which look like this: “F6QHC5TYB”. Think of them as mini GUID’s. Since these are pretty even, having an algorithm that decides where the data goes based on the hashcode should work fine.

Splitting across different tables is only half of the story though. You really want to be able to put those different tables into a few different databases, for example, with 4 tables split across 2 databases:

First Letter of Hash Table Database
     
A Messages1 Database1
B Messages2 Database2
C Messages3 Database1
D Messages4 Database2
E Messages1 Database1
F Messages2 Database2

So Messages1 & Messages3 live in Database1, Messages 2 and 4 live in Database2. All we need to do is to have something that generates a suffix for the table and the database based on the hashcode. You could then simply do:

“SELECT * From ” & GetDatabase(Hashcode) & “..” & GetTable(Hashcode) & ” Where blah blah blah”

Lets knock up a couple of functions then…..

 

For the Database

    Public Shared Function PartitioningGetDatabase(ByVal HashCode As String, ByVal DataType As PartitioningDataType) As String
        Dim DatabaseName As String = ""
        Dim DataBaseSuffix As String = ""
        Dim SuffixRange As Integer = 2


        Select Case DataType
            Case PartitioningDataType.Message
                DatabaseName = "Database"
        End Select


        Dim _HashNumber As Integer = Asc(Left(HashCode, 1))
        Dim _HashOutNumber As Integer = 0

        If _HashNumber > 47 And _HashNumber < 58 Then _HashOutNumber = _HashNumber - 48
        If _HashNumber > 64 And _HashNumber < 91 Then  _HashOutNumber = _HashNumber - 55

        Dim _DataBaseNumber = _HashOutNumber Mod SuffixRange

        DataBaseSuffix = Chr(49 + _DataBaseNumber)

        Return DatabaseName & DataBaseSuffix
    End Function

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

 

For the Table

    Public Shared Function PartitioningGetTable(ByVal HashCode As String, ByVal DataType As PartitioningDataType) As String
        Dim TableSuffix As String = ""
        Dim SuffixRange As Integer = 8


        Select Case DataType
            'Here was can set different ranges based on different data types
            Case PartitioningDataType.Message
                SuffixRange = 4
        End Select


        Dim _HashNumber As Integer = Asc(Left(HashCode, 1))
        Dim _HashOutNumber As Integer = 0

        If _HashNumber > 47 And _HashNumber < 58 Then _HashOutNumber = _HashNumber - 48
        If _HashNumber > 64 And _HashNumber < 91 Then _HashOutNumber = _HashNumber - 55


        Dim _TableNumber = _HashOutNumber Mod SuffixRange

        If _TableNumber < 9 Then
            TableSuffix = Chr(49 + _TableNumber)
        Else
            TableSuffix = Chr(56 + _TableNumber)
        End If

        Return TableSuffix
    End Function

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

 

The ‘suffixrange’ is what determines how big the spread is for the tables and the databases and is variable dependant on the type of datas. This way you can easily have it set to spread very heavily used data across more tables/db’s than less heavy data.

So let’s have a look at some data output using a simple test rig like this:

        Dim Chars As String = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

        For i As Integer = 1 To Len(Chars)
            Dim ThisChar As String = Mid(Chars, i, 1)
            Dim ThisTable As String = PartitioningGetTable(ThisChar, PartitioningDataType.Message)
            Dim ThisDatabase As String = PartitioningGetDatabase(ThisChar, PartitioningDataType.Message)
            System.Diagnostics.Debug.Print(ThisChar & "," & ThisDatabase & ",Messages" & ThisTable)
        Next i
 

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

1st Char of Hash Database Table
A DataShard2 Messages3
B DataShard1 Messages4
C DataShard2 Messages5
D DataShard1 Messages6
E DataShard2 Messages7
F DataShard1 Messages8
G DataShard2 Messages1
H DataShard1 Messages2
I DataShard2 Messages3
J DataShard1 Messages4
K DataShard2 Messages5
L DataShard1 Messages6
M DataShard2 Messages7
N DataShard1 Messages8
O DataShard2 Messages1
P DataShard1 Messages2
Q DataShard2 Messages3
R DataShard1 Messages4
S DataShard2 Messages5
T DataShard1 Messages6
U DataShard2 Messages7
V DataShard1 Messages8
W DataShard2 Messages1
X DataShard1 Messages2
Y DataShard2 Messages3
Z DataShard1 Messages4
0 DataShard2 Messages1
1 DataShard1 Messages2
2 DataShard2 Messages3
3 DataShard1 Messages4
4 DataShard2 Messages5
5 DataShard1 Messages6
6 DataShard2 Messages7
7 DataShard1 Messages8
8 DataShard2 Messages1
9 DataShard1 Messages2

 

All looking good ? OK, let’s change the suffixrange to spread it across 3 databases instead of 2…

 

8 tables, 2 DBs

8 tables, 3 DBs

1st Char of Hash Database Table
A DataShard1 Messages3
B DataShard2 Messages4
C DataShard1 Messages5
D DataShard2 Messages6
E DataShard1 Messages7
F DataShard2 Messages8
G DataShard1 Messages1
H DataShard2 Messages2
I DataShard1 Messages3
J DataShard2 Messages4
K DataShard1 Messages5
L DataShard2 Messages6
M DataShard1 Messages7
N DataShard2 Messages8
O DataShard1 Messages1
P DataShard2 Messages2
Q DataShard1 Messages3
R DataShard2 Messages4
S DataShard1 Messages5
T DataShard2 Messages6
U DataShard1 Messages7
V DataShard2 Messages8
W DataShard1 Messages1
X DataShard2 Messages2
Y DataShard1 Messages3
Z DataShard2 Messages4
0 DataShard1 Messages1
1 DataShard2 Messages2
2 DataShard1 Messages3
3 DataShard2 Messages4
4 DataShard1 Messages5
5 DataShard2 Messages6
6 DataShard1 Messages7
7 DataShard2 Messages8
8 DataShard1 Messages1
9 DataShard2 Messages2
1st Char Database Table
A DataShard2 Messages3
B DataShard3 Messages4
C DataShard1 Messages5
D DataShard2 Messages6
E DataShard3 Messages7
F DataShard1 Messages8
G DataShard2 Messages1
H DataShard3 Messages2
I DataShard1 Messages3
J DataShard2 Messages4
K DataShard3 Messages5
L DataShard1 Messages6
M DataShard2 Messages7
N DataShard3 Messages8
O DataShard1 Messages1
P DataShard2 Messages2
Q DataShard3 Messages3
R DataShard1 Messages4
S DataShard2 Messages5
T DataShard3 Messages6
U DataShard1 Messages7
V DataShard2 Messages8
W DataShard3 Messages1
X DataShard1 Messages2
Y DataShard2 Messages3
Z DataShard3 Messages4
0 DataShard1 Messages1
1 DataShard2 Messages2
2 DataShard3 Messages3
3 DataShard1 Messages4
4 DataShard2 Messages5
5 DataShard3 Messages6
6 DataShard1 Messages7
7 DataShard2 Messages8
8 DataShard3 Messages1
9 DataShard1 Messages2

Whoops!!! What happened there? The table Messages3 is now in 3 different databases. That’s not good. Whilst it’d be ‘workable’ it’s messy. If you ever want to move the data around again it’s going to be very hard work. So what went wrong? Well… look at our code that calls the functions again…..

Dim ThisTable As String = PartitioningGetTable(ThisChar, PartitioningDataType.Message)
Dim ThisDatabase As String = PartitioningGetDatabase(ThisChar, PartitioningDataType.Message)

 

Can you see the problem  ?

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

The table and the database are determined on the hash of the row we’re storing. Sounds correct yes? If you ran 8 tables on 4 DB’s it’d look fine against the original data, but since we’re using a non divisible numbers, the suffixes go out of sync. You could just assume that you’d be sensible and move from say 2 DB’s to 4, or we make a teeny tiny code change……..

Dim ThisTable As String = PartitioningGetTable(ThisChar, PartitioningDataType.Message)
Dim ThisDatabase As String = PartitioningGetDatabase(ThisTable, PartitioningDataType.Message)

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

 

Not a very complicated change is it, but it should be a fundamental…. The hashcode must just determine which table to use, and the table suffix should always determine the DB. That way you always always have the same tables in the same DB’s.

Hope that helps someone out before they make a massively silly mistake!

I’ll post up real life distribution data based on my keys at a later date.

Advertisements

2 thoughts on “Partitioning Data in SQL

  1. What are you trying to achieve?

    Would be better to do in SQL rather than ASP?

    Have a table of Companies
    With some fields of: CompanyName, CompanyID (as GUID)

    The CompanyID then is your hash function (good enough for these purposes)

    Then when a new company is added use ASP or StoredProc to allocate the next free DB / Table combo (using simple counters from a reference data table).

    Then when doing the select do two pre-selects to get the Table and DB for the CustomerGUID in question.

    Concatenate these into the SELECT statements.

    Done.

    But I still don’t know what you are trying to achieve. SQL can work very well on humongous data sets. Take a look at query analyzer and tuning tools (y)

  2. Is it better to have one large table or lots of little ones? Given both options wouldn’t differ that greatly in space usage what are the advantages of splitting data out? I can only really see the increased dependencies and complexity between your code and your data. Add in multiple indexes on each table and it all sounds exponentially more difficult to maintain.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s