It’s O.K. It’s Just Us Two Rows Joined: Vertical Partitioning, Or When a 1:1 Table Relationship Makes Sense

From a theoretical standpoint, there should never be a reason to segregate data which has a one-to-one relationship. Multiple tables exist to allow one datum to link to many others with minimal or no duplication. Unfortunately, practice and theory often diverge when applying techniques to real-world applications; less-than-perfect scenarios arise, and theoretical absolutes get left behind. Because of this, in certain scenarios a one-to-one table relationship can be preferable to keeping all the data in a single table.

Security

This is probably the easiest example to show the day-to-day impact of segregating data. It’s very easy to restrict access to a table, but it’s very hard to restrict access to a single column. Say the column holds something like employee salary information or customer Social Security numbers (you’re encrypting those anyway, right?); the organization may not want, or even allow, anyone but a select set of individuals to view the information. There are a number of ways to handle this scenario when everything is in the same table, like enforcing a company policy that applications may only retrieve data through views or stored procedures rather than accessing the table directly. This is a great approach, but it has possible flaws.

  1. It may not be possible. If you are adding the field to an existing table, there may already be legacy applications accessing the table directly and, worse yet, using SELECT *. The organization would need to spend significant amounts of time and money refactoring all of those applications to remediate the problem. Since this is most likely the costliest solution, and there are other ways to solve the problem, it probably isn’t the best course of action.
  2. Security is only as good as the organization enforcing it. It’s easy to say, “Under no circumstance will this occur.” The problem is that most people are busy, and there are always scenarios where “an application can’t use stored procedures or views.” In reality, this probably isn’t true, but organizations always try to move at the fastest possible pace, and instead of taking time to think of an alternative, someone grants an elevated level of access. Should the admins do this? No, but when people get overworked things slip, and, in truth, they don’t deserve all the blame for making a mistake when the organization places them in that position. Regardless of who’s at fault, it’s still a problem and needs to be avoided.
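
As a minimal sketch of the table-split alternative (the table, column, and role names here are assumptions for illustration), the sensitive column lives in its own table with a one-to-one relationship, and table-level permissions handle the rest:

    -- Core employee data most applications may read.
    CREATE TABLE dbo.Employees
    (
        EmployeeId INT IDENTITY(1,1) PRIMARY KEY,
        FirstName  NVARCHAR(50) NOT NULL,
        LastName   NVARCHAR(50) NOT NULL
    );

    -- Sensitive data segregated into its own table, keyed one-to-one.
    CREATE TABLE dbo.EmployeeSalaries
    (
        EmployeeId INT PRIMARY KEY
            REFERENCES dbo.Employees (EmployeeId),
        Salary     DECIMAL(18,2) NOT NULL
    );

    -- Table-level permissions are easy to reason about.
    CREATE ROLE GeneralApplications;
    CREATE ROLE PayrollAdministrators;

    GRANT SELECT ON dbo.Employees        TO GeneralApplications;
    DENY  SELECT ON dbo.EmployeeSalaries TO GeneralApplications;
    GRANT SELECT ON dbo.EmployeeSalaries TO PayrollAdministrators;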

Performance

Performance is a second reason for one-to-one relationships, and it can have just as significant an impact. When looking at performance, it’s necessary to consider all the scenarios where table size has an impact. With small tables everything is fast, but a small performance oversight at the start can easily become a large problem later on.

Data Access and Locking

Every time SQL Server updates data, it has to lock the updated row. Normally, this is a pretty fast operation, but if the row in question is part of a larger transaction, access to it becomes problematic. The longer the update takes, the longer other queries must wait before accessing the data, and if this happens frequently, it can severely limit the speed of applications working with that table. There are other solutions to this problem, such as using NOLOCK, but that removes the guarantee that the accessed data is accurate. If an update is running, something uses NOLOCK to access the data, and the update then aborts due to an error, the accessed data could be inaccurate, as it might contain information from an update which was never committed. (Consider a query that accesses your bank account information using uncommitted reads. Are you sure you want your account balance to possibly be inaccurate?)

Depending on the scenario, a better solution might be to partition the data into two tables: one holding the data that changes minimally and can be queried without fear of locking, and a second housing the columns which update frequently and lock rows:

Read and Update
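
As a minimal sketch of that kind of split (table and column names are assumptions): the stable, descriptive columns stay in one table, and the frequently updated status columns move to a second table sharing the same key.

    -- Rarely changing data; reads here are not blocked by status updates.
    CREATE TABLE dbo.Orders
    (
        OrderId      INT IDENTITY(1,1) PRIMARY KEY,
        CustomerName NVARCHAR(100) NOT NULL,
        OrderDate    DATETIME      NOT NULL
    );

    -- Frequently updated columns; long-running updates only lock rows here.
    CREATE TABLE dbo.OrderStatuses
    (
        OrderId       INT PRIMARY KEY
            REFERENCES dbo.Orders (OrderId),
        StatusCode    INT      NOT NULL,
        LastUpdatedOn DATETIME NOT NULL
    );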

This can be taken to an extreme and is not ideal for all situations, but in scenarios where time-consuming updates slow down query access to other columns, it’s an efficient way of both ensuring data integrity when querying and increasing application performance.

Triggers

Triggers can be set to fire when anything changes on a table, and this is true even when the exact same value is placed in the cell. Take the following table as an example:

Users Single Table

Let’s say you want to track every time someone changes their username, first name, last name, or password. In this scenario, you could just have the application log the changes in an audit table. That’s a good approach, but it’s limiting in that it assumes only one application handles user management. You could place the user-update SQL in a stored procedure and have all applications call it. This assumes that

  1. All developers know (and care) to use the stored procedure
  2. People administrating the database and accounts in production know to use the procedure when updating accounts outside of the application. (Yes, it is true this probably shouldn’t happen, but in reality how often does that stop people who are trying to fix things as fast as possible to keep the organization running?)

This leaves using a trigger which fires and logs updates every time the table changes. The problem is that the trigger will fire every time anything changes. Even if the trigger is designed to ignore unnecessary fields and only insert data into the log table when certain fields change, it still has to run on every update, creating unnecessary overhead. Changing the table structure to:
Multiple Tables

allows the trigger to fire only when the rarely changed values are updated and removes the unnecessary overhead of having it evaluate (and possibly log) changes it’s not concerned with.
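
A sketch of what the audit trigger might look like on the split structure (the audit table, trigger name, and column names are assumptions; dbo.Users is presumed to hold only the rarely changed fields):

    -- Audit log for changes to the rarely updated user fields.
    CREATE TABLE dbo.UserAudit
    (
        AuditId   INT IDENTITY(1,1) PRIMARY KEY,
        UserId    INT          NOT NULL,
        UserName  NVARCHAR(50) NULL,
        ChangedOn DATETIME     NOT NULL DEFAULT (GETDATE())
    );
    GO

    -- The trigger lives on the table that rarely changes, so the frequent
    -- updates to the other table never cause it to run.
    CREATE TRIGGER dbo.trg_Users_Audit
    ON dbo.Users
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.UserAudit (UserId, UserName, ChangedOn)
        SELECT i.UserId, i.UserName, GETDATE()
        FROM inserted AS i;
    END
    GO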

Backup and Restore

Operational logistics are something most people aren’t concerned with when building applications, but they should be. Designing a set of tables without thinking about the target environment, or without understanding the physical layout of the network, can cause long-term problems.

A number of companies back up their production databases and restore them into test and development environments. This allows developers to see what the real data looks like and ensures they are writing code against the production database schema. Production and development database structures eventually diverge (abandoned development code, quick fixes in production to avert a crisis, etc.), meaning developers aren’t writing code against the actual schema unless the lower environments are refreshed to match production. To solve this problem, organizations back up the production databases (and hopefully scrub out sensitive information) and restore them to the test environments.

For organizations which house all their environments in a single data center, moving large databases often isn’t an issue, but not everyone has that luxury. Let’s say your development team is in Indiana, your production applications are in Vermont, and your test systems are in North Carolina.

Moving a small database between them would be relatively easy, but this can become a problem as the database grows. Either the time to transfer the databases increases or the expense of having a larger connection between the centers does.

There are ways to help reduce this issue. The easiest is to

  1. Back up the database
  2. Restore it to someplace else in the same datacenter
  3. Remove the data which isn’t necessary for the test environment and compact the database
  4. Move the smaller database to the other environments.
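
A rough sketch of those steps in T-SQL (database names, file paths, logical file names, and the table being emptied are all assumptions):

    -- 1. Back up the production database.
    BACKUP DATABASE ProductionDb
        TO DISK = 'D:\Backups\ProductionDb.bak';

    -- 2. Restore it to a staging copy in the same data center.
    RESTORE DATABASE ProductionDb_Staging
        FROM DISK = 'D:\Backups\ProductionDb.bak'
        WITH MOVE 'ProductionDb'     TO 'D:\Data\ProductionDb_Staging.mdf',
             MOVE 'ProductionDb_log' TO 'D:\Data\ProductionDb_Staging.ldf';

    -- 3. Remove data the test environment does not need and compact the files.
    TRUNCATE TABLE ProductionDb_Staging.dbo.DocumentPdfs;
    DBCC SHRINKDATABASE (ProductionDb_Staging);

    -- 4. Back up the now much smaller database and ship that file instead.
    BACKUP DATABASE ProductionDb_Staging
        TO DISK = 'D:\Backups\ProductionDb_Small.bak';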

This approach doesn’t require tables to be split, but it helps. If you have something like,

Binary Data in one table

and you want to keep the XML columns for development to analyze but don’t care about the binary PDFs, you can update each PDF entry to NULL. Depending on the size of the table this may be trivial, but if you have something like 10 million records, that could take some time. Alternatively, it’s much easier to segregate the data like so

Segregated Binary Data

and use TRUNCATE TABLE (or DROP TABLE and then recreate it). A small change to the structure can save a huge amount of processing time.
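
The difference in effort, as a sketch (table and column names assumed):

    -- Single-table layout: every row has to be touched and the change logged.
    UPDATE dbo.Documents
    SET PdfData = NULL;

    -- Segregated layout: the binary table is emptied in one minimally logged step.
    TRUNCATE TABLE dbo.DocumentPdfs;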

Partitioning

Depending on the database engine, moving data to different tables to partition it may not be necessary, but for those where it is, segregating data can yield a marked performance gain. Most people think of a server having multiple disks, but it’s less common to immediately think of a server having access to several different types of disk. This is an idea which is gaining popularity and is even moving to desktops: faster drives, say a solid state drive, store commonly accessed data, and slower drives, maybe a 7200 rpm spindle drive, house data where speed isn’t an issue. High performance drives are often smaller and more expensive, and because of this, it’s necessary to use them as efficiently as possible. With modern databases, partitioning columns onto different disks is possible, but depending on the layout and how applications access the data, it might be tricky, and with legacy systems it may not be an option at all. With this in mind, moving large, rarely accessed data to a different table housed on a slower drive allows the database to access the remaining data faster and increases application performance.
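
A sketch of what that looks like in SQL Server using filegroups (the database name, file path, and table definitions are assumptions): the large, rarely accessed column sits in a table created on a filegroup backed by the slower drive, while the frequently queried columns stay on the default filegroup.

    -- Add a filegroup backed by a file on the slower (larger) drive.
    ALTER DATABASE DocumentStore ADD FILEGROUP SlowStorage;
    ALTER DATABASE DocumentStore
        ADD FILE ( NAME = SlowStorage_Data,
                   FILENAME = 'E:\SlowDisk\DocumentStore_Slow.ndf' )
        TO FILEGROUP SlowStorage;

    -- Frequently queried columns stay on the fast, default filegroup.
    CREATE TABLE dbo.Documents
    (
        DocumentId  INT IDENTITY(1,1) PRIMARY KEY,
        Title       NVARCHAR(200) NOT NULL,
        MetadataXml XML           NULL
    ) ON [PRIMARY];

    -- Large, rarely accessed binary data lives on the slow drive.
    CREATE TABLE dbo.DocumentPdfs
    (
        DocumentId INT PRIMARY KEY,
        PdfData    VARBINARY(MAX) NULL
    ) ON SlowStorage TEXTIMAGE_ON SlowStorage;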

Reclaiming Stored Procedure Code

One of the largest problems people have with keeping stored procedures in source control is not a disagreement that they should be, but the change to the process they are accustomed to. Normally with code, you create a physical file which is then piped to the compiler to output an assembly. Microsoft’s SQL Server Management Studio (SSMS) changes the process of creating a file and then running it, because the database itself stores the SQL code for procedures, negating the need for a physical file. This is a convenient shortcut, but it makes it difficult to keep SQL source in sync, because inevitably people forget to create the file from Management Studio when they’re done. SSMS has a menu item for generating a file, but the creation options are limited depending on the need, and it adds additional code around the procedure which most of the time isn’t necessary. Not to mention that generating one file is simple enough, but having to do it for hundreds of changed procedures is tedious and time consuming.

Script Procedure

Fortunately, there is a convenient fix for this. Since SQL Server stores the code in the database, you can easily retrieve it and save it yourself. Microsoft created the INFORMATION_SCHEMA.ROUTINES view, which returns information such as each routine’s name, type, schema, and, most importantly, its code. Using this approach you can also format the code the way you want when saving it to a file. I always make the scripts re-runnable, so I prepend the necessary header code and change the CREATE to an ALTER command before saving each one.
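
The query at the heart of that approach looks something like this (a sketch):

    -- Name, schema, type, and source for every stored procedure in the database.
    SELECT  ROUTINE_SCHEMA,
            ROUTINE_NAME,
            ROUTINE_TYPE,
            ROUTINE_DEFINITION   -- the procedure's source code
    FROM    INFORMATION_SCHEMA.ROUTINES
    WHERE   ROUTINE_TYPE = 'PROCEDURE'
    ORDER BY ROUTINE_SCHEMA, ROUTINE_NAME;

    -- Note: ROUTINE_DEFINITION is capped at 4000 characters;
    -- OBJECT_DEFINITION(OBJECT_ID('schema.name')) returns the full text.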

File on Github at: https://github.com/kemiller2002/StructuredSight/blob/master/GettingProcedures/GetProcedures.ps1

Wait Wait Don’t Tell Me…..On Second Thought

Consider the following:
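
Something along these lines (a sketch of the kind of loop being described; the body is a stand-in):

    DECLARE @Counter INT = 100;

    WHILE @Counter > 0
    BEGIN
        WAITFOR DELAY '00:00:01';        -- stand-in for the real work
        -- SET @Counter = @Counter - 1;  -- the decrement that should be here is missing
    END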

Ideally, the example would start at 100 and loop until the counter hits zero, but the statement to decrement the counter is missing. It won’t ever stop running, and there is no visual cue to alert that there is an issue. The most logical solution would be to add a PRINT statement.

Sadly, this does not solve the issue at hand, as the PRINT statement only displays output sent to it after the script finishes executing. It’s great for knowing what occurred once the script has finished, but it’s of no use for telling what is currently happening. Conveniently, there is a solution using the RAISERROR function.

The first parameter is the message to display, and the second and third are the error level and the system state. RAISERROR treats anything with an error level from 0-10 as a message and does not affect code execution. The system state ranges from -1 to 255, and anything outside of that range results in an exception (for the purposes of printing messages, this value is inconsequential). Using RAISERROR with the WITH NOWAIT option outputs the message immediately, describing how the script is executing as it happens.
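
Putting that together (message text assumed), the progress shows up immediately on every pass:

    DECLARE @Counter INT = 100;
    DECLARE @Message NVARCHAR(200);

    WHILE @Counter > 0
    BEGIN
        SET @Message = 'Processing, counter is at ' + CAST(@Counter AS NVARCHAR(10));

        -- Severity 0 is informational; WITH NOWAIT flushes the message right away.
        RAISERROR (@Message, 0, 1) WITH NOWAIT;

        SET @Counter = @Counter - 1;
    END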

informational message

When the error level is raised above 10, SQL Server raises an exception:

Error

This works really well when executing a command from within a .NET application, but an issue arises when executing code from an external process like SqlCmd. When using the RAISERROR function and supplying a severity above 10, it doesn’t always cause the process’s exit code (Process.ExitCode) to change to something other than 0, which is what alerts the calling application that there was an issue.

Exit code 0

Why use SqlCmd when .NET has a built-in way of executing code with SqlCommand? The SqlCommand object does not allow the use of the batch separator: GO. For applications doing Create/Read/Update/Delete (CRUD) operations, not using the GO statement isn’t a problem, but it is important to applications which perform maintenance on databases. Certain SQL statements, like CREATE PROCEDURE, must be the first item in a batch, meaning chaining multiple statements together in one execution is impossible without using GO.

Fortunately, there is an easy solution to this problem. Change the error level to 20-25 and add the WITH LOG option (values greater than 25 are interpreted as 25). This closes the connection to SQL Server and forces applications like SqlCmd to return an exit code that is something other than 0. Unfortunately, to set a severity level between 19 and 25, the account running the script must either be a member of the sysadmin role or have the ALTER TRACE permission.
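
A sketch of forcing the failure (message text assumed):

    -- Severity 20+ WITH LOG terminates the connection, so SqlCmd
    -- reports a non-zero exit code to the calling process.
    RAISERROR ('Deployment script failed; aborting.', 20, 1) WITH LOG;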

Exit Code Not Zero

It is possible to parse the output and error messages and determine whether an error occurred that way instead, but this is error prone and not the standard method of checking for system exceptions. It may work for applications specifically written for this scenario, but applications written to interact with others in the Windows ecosystem look at the exit code to determine process success or failure, which makes setting it to the correct value vital when interacting with different systems.

Oh Snap! (Restore)

One of the most difficult aspects of the software development process is developing for and testing database updates. Unlike web sites and executables, which can be updated with an X-Copy or by re-installing an MSI, databases always maintain state. Once a database has been altered, the only way to go back to the previous incarnation is a restore.

In 2008, Microsoft introduced Snapshot Restore (Snap Restore). It creates a secondary database on the server which tracks the deltas between the state when it was created and the changes made afterward. Furthermore, the live database can be refreshed back to the state of the snapshot, making it great for development and testing. You can write your scripts, test the updates, and if something is amiss, run the restore and try again. It’s also great for production deployments where getting it right the first time is paramount. As long as the deployment occurs when no other updates are happening, all the updates can be validated and quickly backed out if need be.

This can all be accomplished by backing up and restoring a copy of the same database. The problem with that approach is that the secondary database will be the same size as the original. With small databases this isn’t an issue, but it becomes cumbersome as the database grows.
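
Creating the snapshot is a single statement (the database name, logical file name, and path here are assumptions):

    -- NAME must match the logical name of the source database's data file.
    CREATE DATABASE InventoryDb_Snapshot
    ON ( NAME = InventoryDb_Data,
         FILENAME = 'D:\Snapshots\InventoryDb_Data.ss' )
    AS SNAPSHOT OF InventoryDb;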

DB Snapshot creation

Notice the size on disk is only 128k. As updates are made on the real database, this size grows, but it is still far smaller than making a full copy of it.

When comparing the live database and the snapshot, they have the same data right after creation:

Database comparison

Now after deleting the data from the live database:

After delete

To get the database back to the original state, just execute the restore command:
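
Using the names assumed in the creation sketch above:

    RESTORE DATABASE InventoryDb
        FROM DATABASE_SNAPSHOT = 'InventoryDb_Snapshot';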

Data After Restore

Unfortunately, this only works on the Developer and Enterprise editions of SQL Server, and it doesn’t work on any database with Full Text Indexing. With all that being said, MSDN licenses cover the use of Developer Edition, making it available in development and test environments, where it can help refine the update scripts before running them in production.

All the scripts for this can be found here.

MSDN Database Snapshot Documentation

Language Comparison: Euler Problem One

Question:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

F#

F# Example One

or

F# Example Two

C#

C# Example One

or

C# Example Two

JavaScript

JavaScript Example One

SQL
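
One way to express it in T-SQL (a sketch using a recursive numbers CTE):

    -- Sum the numbers below 1000 that are divisible by 3 or 5.
    WITH Numbers (N) AS
    (
        SELECT 1
        UNION ALL
        SELECT N + 1 FROM Numbers WHERE N < 999
    )
    SELECT SUM(N) AS Total
    FROM Numbers
    WHERE N % 3 = 0 OR N % 5 = 0
    OPTION (MAXRECURSION 1000);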

or

PowerShell

or

You Get an Update, and You Get an Update, and You Get an Update

What’s the difference between

and

When they’ve completed, both ensure all the records in the table have the ReadyToProcess value set to 1. Looking strictly at the records in the database, there is no difference. In this example, the end result isn’t what’s important; it’s how the system gets there. The first one updates all records regardless of their current value, and the second only makes an update if the value is different from what is being set. The first update statement isn’t telling SQL Server, “Hey, I want the value to be this.” What it’s really telling SQL Server is, “I want you to update the field in this record to this. It may look the same as the previous value, but I still want you to update it.”
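
As a sketch of the two statements being compared (the table name is an assumption):

    -- Updates every row, even the ones already set to 1.
    UPDATE dbo.QueueItems
    SET ReadyToProcess = 1;

    -- Updates only the rows whose value actually needs to change.
    UPDATE dbo.QueueItems
    SET ReadyToProcess = 1
    WHERE ReadyToProcess <> 1;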

Depending on the number of records in the table, and on what is affected by the table updating, it might not be much of a problem. The first is less efficient than the second, but it still accomplishes the task. It really starts to be an issue when updating a large number of records, or when something like a trigger is associated with the table. Every time SQL Server updates a record, the associated trigger fires, and this is true even if the new value and the old value are the same. An update is an update no matter what the end value is. If the system logs audit data for updates, the non-qualified statement forces the system to do a lot more work for absolutely no gain.

Where’s The Scope

When I started programming, it was customary to declare the variables at the top of the method. Many people stated that it was clearer to declare everything at the beginning, in the order it was used (or alphabetically for easier searching), and some languages, such as Delphi (Turbo Pascal), even required it. Over the past fifteen years, this has been a debated topic.

From my experience, the shift has been towards declaring a variable only just before needing it. In truth this makes a lot of sense: it allows the compiler to enforce scope and gives programmers the luxury of a compile error when a variable is used where it shouldn’t be. This comes with the drawback that variable declarations become more difficult to find when searching through the code. If everything is declared at the top, it’s always in the same place, although modern IDEs alleviate this problem since it’s normally a single keystroke to jump to a declaration.

So, what is the correct answer between the two styles? As a blanket rule, the answer is: they’re both wrong. Where to declare variables is a significantly different question from what to name them. Most languages don’t care what a variable is named. As long as it follows certain rules (doesn’t start with a number in some, must start with a $ in others), the compiler/interpreter doesn’t do much else with the name. Where variables are declared, though, can cause significant changes to how the system functions, and most developers forget that between languages this can make a huge difference.
Take C# and JavaScript for example. They look somewhat the same. Are they? No, they aren’t really even close. In C# variables are created at the branch level:

Running Example

Non-running example

In JavaScript they are declared at the function level:

Running Example

The second alert shows “undefined” instead of “I’m Global” even though it’s at the top of the function, because JavaScript moves all variable declarations to the top of the function regardless of where they are written. (Assigning the variable where it’s declared in the branch doesn’t initialize it until that branch executes either; only the declaration is moved to the top of the function.)

Believing that all languages work the same because one can write similar syntax in each without error can cause major problems. The language that trips a lot of people up on where variables should be declared is SQL. Take the following example:
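
A sketch of the kind of script in question (the variable names match the explanation below):

    DECLARE @Counter INT = 10;

    WHILE @Counter > 0
    BEGIN
        -- Declared "inside" the loop, but T-SQL variables are scoped to the
        -- batch, so the value survives from one iteration to the next.
        DECLARE @LocalVariable INT;

        IF @LocalVariable IS NULL
            SET @LocalVariable = @Counter;  -- only happens on the first pass

        PRINT @LocalVariable;               -- prints 10 on every iteration

        SET @Counter = @Counter - 1;
    END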

Here’s the output:

Notice how it only prints out 10 even though the counter is being updated on each pass through the loop? If @LocalVariable were actually re-initialized each time the loop executed, it would have counted down to 1, because it would be set back to NULL each time.

What you have to remember is that just because you wrote it a certain way doesn’t mean the compiler/interpreter will output the exact instruction set you thought it would. Assuming a variable will behave a certain way without testing or explicitly setting it can be the difference between working code and spending several hours tracking down a problem.

Non Failing Scripts

One of the most frustrating parts of database development is double applying scripts. If you are working directly on the production server, this problem is alleviated by altering the SQL commands as you go. When doing development in a test environment first, it becomes much more difficult. If you are creating a new stored procedure, the first time the script runs it must use the CREATE command, and every time after that the ALTER command. This makes handing the scripts off to a database team to run on the production servers difficult, because you have to account for the environment you are developing in becoming out of sync with the environment your scripts will ultimately run in.

Ideally, you could refresh your database environment to mirror production frequently to keep them in sync, but in reality this is time consuming and many times not possible. The key is to write scripts so they can be applied to any environment, either once or multiple times, without causing errors. In my previous post I talked about how to use generators to make data scripts re-runnable without accidentally double inserting records. Using SQL Server’s system tables, it’s relatively easy to do the same for tables, columns, stored procedures, etc.

For stored procedures, most places just prepend a DROP PROCEDURE command and then create the stored procedure each time. The problem with this is that it removes permissions, and it’s easy to miss recreating them. I prefer to create a dummy procedure first and then use the ALTER command to change it into what it should be. If the procedure is already in the environment, the script skips the creation step and immediately moves to the ALTER portion.
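
A sketch of that pattern (the schema, procedure, and referenced table are assumptions):

    -- Create a placeholder only if the procedure does not exist yet,
    -- so existing permissions are never dropped.
    IF NOT EXISTS (SELECT 1
                   FROM sys.procedures
                   WHERE name = 'GetActiveUsers'
                     AND SCHEMA_NAME(schema_id) = 'dbo')
    BEGIN
        EXEC ('CREATE PROCEDURE dbo.GetActiveUsers AS SELECT 1;');
    END
    GO

    -- The real definition is always applied with ALTER.
    ALTER PROCEDURE dbo.GetActiveUsers
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT UserId, UserName
        FROM dbo.Users
        WHERE IsActive = 1;
    END
    GO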

The same thing can be done when adding or dropping columns:
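
For example, guarding a column addition (table and column names assumed); dropping a column works the same way with IF EXISTS:

    IF NOT EXISTS (SELECT 1
                   FROM sys.columns
                   WHERE object_id = OBJECT_ID('dbo.Users')
                     AND name = 'MiddleName')
    BEGIN
        ALTER TABLE dbo.Users ADD MiddleName NVARCHAR(50) NULL;
    END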

Virtually everything in MSSQL Server can be queried through the system tables:

  • views: sys.views
  • constraints: sys.sysconstraints
  • schemas: sys.schemas
    (you can join to this in your queries if your tables etc. aren’t under the default schema)

These are just a few of the system tables for querying data about database objects. If you need additional information about an object, you can always link back to sys.objects. For example, sys.sysconstraints doesn’t contain the constraint name, so to get it, all you need to do is create a join:
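
A sketch of that join (constid on sys.sysconstraints is the constraint's object id):

    SELECT  o.name            AS ConstraintName,
            OBJECT_NAME(c.id) AS TableName
    FROM    sys.sysconstraints AS c
            INNER JOIN sys.objects AS o
                ON o.object_id = c.constid;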