(Disclaimer: Yes, I know there are ways to ‘let SQL do this for you’ but they’re bloody complicated, and I have a really nice, even hash key to use so it’s simple!)
(Disclaimer 2: This is a pretty long post.)
So, after a few years of usage gathering data, SQL tables can become pretty large. Especially if they’re wide tables. Consequently you also end up with a big database as well. What we want to do is to partition my data out into lots of smaller tables, across multiple databases. For example, a message is stored in a ‘messages’ table with the company code the message is for in the row. This makes an excellent partition as it’s very unlikely you’ll be wanting to do a query on messages for different companies. Similarly, you might partition user preferences by user.
You could simply partition everything by using the first letter of the name. Things for ‘Acme’ go into ‘MessagesA’ etc. However, the natural distribution of letters will mean some tables are vastly bigger than others. In my data I have hashcodes for everything which look like this: “F6QHC5TYB”. Think of them as mini GUID’s. Since these are pretty even, having an algorithm that decides where the data goes based on the hashcode should work fine.
Splitting across different tables is only half of the story though. You really want to be able to put those different tables into a few different databases, for example, with 4 tables split across 2 databases:
First Letter of Hash | Table | Database |
A | Messages1 | Database1 |
B | Messages2 | Database2 |
C | Messages3 | Database1 |
D | Messages4 | Database2 |
E | Messages1 | Database1 |
F | Messages2 | Database2 |
So Messages1 & Messages3 live in Database1, Messages 2 and 4 live in Database2. All we need to do is to have something that generates a suffix for the table and the database based on the hashcode. You could then simply do:
“SELECT * From ” & GetDatabase(Hashcode) & “..” & GetTable(Hashcode) & ” Where blah blah blah”
Lets knock up a couple of functions then…..
For the Database
Public Shared Function PartitioningGetDatabase(ByVal HashCode As String, ByVal DataType As PartitioningDataType) As String Dim DatabaseName As String = "" Dim DataBaseSuffix As String = "" Dim SuffixRange As Integer = 2 Select Case DataType
Case PartitioningDataType.Message DatabaseName = "Database" End Select Dim _HashNumber As Integer = Asc(Left(HashCode, 1)) Dim _HashOutNumber As Integer = 0 If _HashNumber > 47 And _HashNumber < 58 Then _HashOutNumber = _HashNumber - 48 If _HashNumber > 64 And _HashNumber < 91 Then _HashOutNumber = _HashNumber - 55 Dim _DataBaseNumber = _HashOutNumber Mod SuffixRange DataBaseSuffix = Chr(49 + _DataBaseNumber) Return DatabaseName & DataBaseSuffix End Function
.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }
For the Table
Public Shared Function PartitioningGetTable(ByVal HashCode As String, ByVal DataType As PartitioningDataType) As String Dim TableSuffix As String = "" Dim SuffixRange As Integer = 8 Select Case DataType 'Here was can set different ranges based on different data types Case PartitioningDataType.Message SuffixRange = 4 End Select Dim _HashNumber As Integer = Asc(Left(HashCode, 1)) Dim _HashOutNumber As Integer = 0 If _HashNumber > 47 And _HashNumber < 58 Then _HashOutNumber = _HashNumber - 48 If _HashNumber > 64 And _HashNumber < 91 Then _HashOutNumber = _HashNumber - 55 Dim _TableNumber = _HashOutNumber Mod SuffixRange If _TableNumber < 9 Then TableSuffix = Chr(49 + _TableNumber) Else TableSuffix = Chr(56 + _TableNumber) End If Return TableSuffix End Function
.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }
The ‘suffixrange’ is what determines how big the spread is for the tables and the databases and is variable dependant on the type of datas. This way you can easily have it set to spread very heavily used data across more tables/db’s than less heavy data.
So let’s have a look at some data output using a simple test rig like this:
Dim Chars As String = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789" For i As Integer = 1 To Len(Chars)
Dim ThisChar As String = Mid(Chars, i, 1) Dim ThisTable As String = PartitioningGetTable(ThisChar, PartitioningDataType.Message) Dim ThisDatabase As String = PartitioningGetDatabase(ThisChar, PartitioningDataType.Message) System.Diagnostics.Debug.Print(ThisChar & "," & ThisDatabase & ",Messages" & ThisTable) Next i
.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }
1st Char of Hash | Database | Table |
---|---|---|
A | DataShard2 | Messages3 |
B | DataShard1 | Messages4 |
C | DataShard2 | Messages5 |
D | DataShard1 | Messages6 |
E | DataShard2 | Messages7 |
F | DataShard1 | Messages8 |
G | DataShard2 | Messages1 |
H | DataShard1 | Messages2 |
I | DataShard2 | Messages3 |
J | DataShard1 | Messages4 |
K | DataShard2 | Messages5 |
L | DataShard1 | Messages6 |
M | DataShard2 | Messages7 |
N | DataShard1 | Messages8 |
O | DataShard2 | Messages1 |
P | DataShard1 | Messages2 |
Q | DataShard2 | Messages3 |
R | DataShard1 | Messages4 |
S | DataShard2 | Messages5 |
T | DataShard1 | Messages6 |
U | DataShard2 | Messages7 |
V | DataShard1 | Messages8 |
W | DataShard2 | Messages1 |
X | DataShard1 | Messages2 |
Y | DataShard2 | Messages3 |
Z | DataShard1 | Messages4 |
0 | DataShard2 | Messages1 |
1 | DataShard1 | Messages2 |
2 | DataShard2 | Messages3 |
3 | DataShard1 | Messages4 |
4 | DataShard2 | Messages5 |
5 | DataShard1 | Messages6 |
6 | DataShard2 | Messages7 |
7 | DataShard1 | Messages8 |
8 | DataShard2 | Messages1 |
9 | DataShard1 | Messages2 |
All looking good ? OK, let’s change the suffixrange to spread it across 3 databases instead of 2…
8 tables, 2 DBs |
8 tables, 3 DBs |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
Whoops!!! What happened there? The table Messages3 is now in 3 different databases. That’s not good. Whilst it’d be ‘workable’ it’s messy. If you ever want to move the data around again it’s going to be very hard work. So what went wrong? Well… look at our code that calls the functions again…..
Dim ThisTable As String = PartitioningGetTable(ThisChar, PartitioningDataType.Message) Dim ThisDatabase As String = PartitioningGetDatabase(ThisChar, PartitioningDataType.Message)
Can you see the problem ?
.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }
The table and the database are determined on the hash of the row we’re storing. Sounds correct yes? If you ran 8 tables on 4 DB’s it’d look fine against the original data, but since we’re using a non divisible numbers, the suffixes go out of sync. You could just assume that you’d be sensible and move from say 2 DB’s to 4, or we make a teeny tiny code change……..
Dim ThisTable As String = PartitioningGetTable(ThisChar, PartitioningDataType.Message) Dim ThisDatabase As String = PartitioningGetDatabase(ThisTable, PartitioningDataType.Message)
.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }
Not a very complicated change is it, but it should be a fundamental…. The hashcode must just determine which table to use, and the table suffix should always determine the DB. That way you always always have the same tables in the same DB’s.
Hope that helps someone out before they make a massively silly mistake!
I’ll post up real life distribution data based on my keys at a later date.