Quick and Dirty Data Auditing

Many projects, especially as they progress, lack proper requirements, time, or resources, and tracking changes to data is often an afterthought. The database schema and code change frequently, and with the trend toward building software with fewer checks and balances (fewer quality assurance staff, less emphasis from management on allowing proper time for development), errors in saved data become more and more prevalent.

With all of this in mind, the ability to track changes in data is a powerful tool for both development and production issue remediation. However, one of the biggest issues is maintaining audit tables as the database schema changes, and ensuring they don’t cause a system failure of their own.

Take for example the following table:
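
As a stand-in, assume something simple like this (the names and types are illustrative, not the original table):

    CREATE TABLE dbo.Users
    (
        UserId    INT IDENTITY(1,1) PRIMARY KEY,
        UserName  NVARCHAR(100) NOT NULL,
        FirstName NVARCHAR(100) NOT NULL,
        LastName  NVARCHAR(100) NOT NULL,
        Password  NVARCHAR(256) NOT NULL
    );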

With this, a common scenario is to create an auditing table like so:
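
A sketch of that audit table, mirroring the columns above (again, the names are illustrative):

    CREATE TABLE dbo.UsersAudit
    (
        UserId    INT,
        UserName  NVARCHAR(100),
        FirstName NVARCHAR(100),
        LastName  NVARCHAR(100),
        Password  NVARCHAR(256),
        AuditDate DATETIME NOT NULL DEFAULT GETDATE()
    );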

and a trigger which fires on updates:
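
A minimal sketch; note the deliberate lack of a column list, which is the fragile part discussed next:

    CREATE TRIGGER dbo.Users_Audit ON dbo.Users
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Copy the prior version of every updated row into the audit table.
        -- No column list is given, so the two tables must line up exactly.
        INSERT INTO dbo.UsersAudit
        SELECT d.*, GETDATE()
        FROM deleted d;
    END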

With this scenario there are some problems. The biggest is that if the live table’s schema is updated and the audit table’s isn’t, the trigger fails and blocks the original statement. (This is easily fixed in development, not so much in production.) Hopefully, if this happens, the calling application is verifying that the database update succeeded.

This problem can be solved by naming the columns explicitly.
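
For example, a sketch using the tables sketched above:

    ALTER TRIGGER dbo.Users_Audit ON dbo.Users
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Naming the columns means a new column on dbo.Users no longer breaks the insert.
        INSERT INTO dbo.UsersAudit (UserId, UserName, FirstName, LastName, Password, AuditDate)
        SELECT d.UserId, d.UserName, d.FirstName, d.LastName, d.Password, GETDATE()
        FROM deleted d;
    END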

But this leaves another problem: if columns are added to the live table and either the audit table or the trigger isn’t also updated, new fields slip through the cracks and won’t be noticed until someone actively checks the table for information. Unfortunately, this is a commonly occurring worst-case scenario, because most of the time you only look in an audit table when you need to track down something that might be wrong.

There is actually a fairly simple solution to this problem: store the updated data as a single text entry. Better yet, store it as XML, which most runtime environments have a convenient way of parsing. To accomplish this, create an auditing table with the appropriate fields:
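
A sketch of such a table; the exact names and types are assumptions, but the fields match those described below:

    CREATE TABLE dbo.AuditEntry
    (
        AuditEntryId       BIGINT IDENTITY(1,1) PRIMARY KEY,
        TableId            INT NOT NULL,              -- sys.tables.object_id of the audited table
        TransactionEntryId UNIQUEIDENTIFIER NOT NULL, -- ties the delete/insert pair of an update together
        Action             CHAR(1) NOT NULL,          -- 'I' = insert, 'D' = delete
        AuditData          XML NOT NULL,              -- the affected rows serialized as XML
        AuditDate          DATETIME NOT NULL DEFAULT GETDATE()
    );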

The TableId is the sys.tables.object_id, so when a trigger enters data into the table, it is easily discernible which data came from where. The Action field says whether the row came from an insert or a delete. (An update is recorded as a delete of the previous data followed by an insert.) The TransactionEntryId links multiple entries in the audit table together to show which inserts and deletes are the result of a single update.

Next, create the trigger and apply it to the appropriate table or tables.
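
One way to sketch that generic trigger follows; how the original resolves the table id is an assumption, and here @@PROCID is used so the identical body can be attached to any table:

    CREATE TRIGGER dbo.Users_GenericAudit ON dbo.Users
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- @@PROCID is this trigger's object_id; its parent is the audited table,
        -- so the body never needs to name the table or its columns.
        DECLARE @TableId INT;
        SELECT @TableId = parent_object_id
        FROM sys.objects
        WHERE object_id = @@PROCID;

        DECLARE @TransactionEntryId UNIQUEIDENTIFIER = NEWID();

        INSERT INTO dbo.AuditEntry (TableId, TransactionEntryId, Action, AuditData)
        SELECT @TableId, @TransactionEntryId, 'D',
               (SELECT * FROM deleted FOR XML RAW, ELEMENTS)
        WHERE EXISTS (SELECT 1 FROM deleted);

        INSERT INTO dbo.AuditEntry (TableId, TransactionEntryId, Action, AuditData)
        SELECT @TableId, @TransactionEntryId, 'I',
               (SELECT * FROM inserted FOR XML RAW, ELEMENTS)
        WHERE EXISTS (SELECT 1 FROM inserted);
    END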

In truth, this doesn’t work for all scenarios. Tables with a high transaction volume, or ones with large amounts of data in each record, may suffer problems, but it works well enough in many scenarios to be a viable solution even if it isn’t one for the long term. (Like any solution, before jumping to conclusions about performance, the change should be tested.) Also, by keeping the trigger contents generic, the same code can be applied in multiple places without needing to change it, which reduces the chance of accidentally making a mistake in the setup process.

Depending on the scenario, it can be beneficial to enable auditing at will. Ideally, it would be as easy as applying the trigger to the table when necessary, but in many organizations this isn’t possible without a formal deployment process, and when investigating a problem, auditing normally needs to be added quickly. By adding another table which tells the triggers when to record data, auditing can be turned on simply by adding an entry:
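
A sketch of that control table (the names are assumptions; a row per table turns auditing on):

    CREATE TABLE dbo.AuditControl
    (
        TableId      INT NOT NULL PRIMARY KEY,  -- object_id of the table to audit
        AuditEnabled BIT NOT NULL DEFAULT 1
    );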

Then edit the trigger template so it checks that table before recording anything:
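
As a sketch, the only change from the version above is an early exit when no enabling row exists:

    ALTER TRIGGER dbo.Users_GenericAudit ON dbo.Users
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        DECLARE @TableId INT;
        SELECT @TableId = parent_object_id
        FROM sys.objects
        WHERE object_id = @@PROCID;

        -- Do nothing unless auditing has been switched on for this table.
        IF NOT EXISTS (SELECT 1 FROM dbo.AuditControl
                       WHERE TableId = @TableId AND AuditEnabled = 1)
            RETURN;

        DECLARE @TransactionEntryId UNIQUEIDENTIFIER = NEWID();

        INSERT INTO dbo.AuditEntry (TableId, TransactionEntryId, Action, AuditData)
        SELECT @TableId, @TransactionEntryId, 'D',
               (SELECT * FROM deleted FOR XML RAW, ELEMENTS)
        WHERE EXISTS (SELECT 1 FROM deleted);

        INSERT INTO dbo.AuditEntry (TableId, TransactionEntryId, Action, AuditData)
        SELECT @TableId, @TransactionEntryId, 'I',
               (SELECT * FROM inserted FOR XML RAW, ELEMENTS)
        WHERE EXISTS (SELECT 1 FROM inserted);
    END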

The drawback to this approach is that the triggers still fire on every data change in the table, but it does alleviate some of the overhead of the second insert statement and of the audit tables growing.

It’s O.K. It’s Just Us Two Rows Joined: Vertical Partitioning Or When a 1×1 Table Relationship Makes Sense

From a theoretical standpoint, there would never be any reason to segregate data which has a one-to-one relationship. Multiple tables were meant to allow one datum to link to multiple others with minimal or no duplication. Unfortunately, practice and theory often diverge: when applying techniques to real-world applications, less-than-perfect scenarios arise, and theoretical absolutes are often left behind. Because of this, in certain scenarios a one-to-one table relationship can be advantageous compared to keeping all the data in a single table.

Security

This is probably the easiest example of the day-to-day impact of segregating data. It’s very easy to restrict access to a table, but it’s very hard to restrict access to a single column. Say the column is something like employee salary information or customer social security numbers (you’re encrypting those anyway, right?); the organization may not want or even allow anyone but a select set of individuals to view the information. There are a number of ways to handle this scenario when everything is in the same table, like enforcing a company policy that applications may only retrieve data through views or stored procedures rather than accessing the table directly. This is a great approach, but it has possible flaws.

  1. It may not be possible. If you are adding the field to an existing table, there may already be legacy applications with direct access to the table, and worse yet, some may use SELECT *. The organization would need to spend a significant amount of time and money refactoring all of those applications to remediate the problem. Since this is most likely the costliest solution, and there are other ways of solving the problem, it probably isn’t the best course of action.
  2. Security is only as good as the organization enforcing it. It’s easy to say “Under no circumstance will this occur.” The problem is that most people are busy, and there are always scenarios where “an application can’t use stored procedures or views.” In reality, this probably isn’t true, but organizations always try to move at the fastest possible pace, and instead of taking time to think of an alternative, someone grants an elevated level of access. Should the admins do this? No, but when people get overworked, things slip, and in truth they don’t deserve all the blame for making a mistake when an organization places them in that position. Regardless of who’s at fault, it’s still a problem and needs to be avoided.
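
Splitting the sensitive columns into their own table sidesteps both issues, because the restriction becomes an ordinary table-level permission. A minimal sketch (the table and role names are assumptions):

    CREATE TABLE dbo.Employees
    (
        EmployeeId INT IDENTITY(1,1) PRIMARY KEY,
        FirstName  NVARCHAR(100) NOT NULL,
        LastName   NVARCHAR(100) NOT NULL
    );

    CREATE TABLE dbo.EmployeeSalaries
    (
        EmployeeId INT NOT NULL PRIMARY KEY REFERENCES dbo.Employees (EmployeeId),
        Salary     DECIMAL(19, 4) NOT NULL
    );

    -- Existing queries (even SELECT *) keep working against dbo.Employees;
    -- only members of PayrollRole can read the salary table at all.
    GRANT SELECT ON dbo.Employees TO GeneralAppRole;
    GRANT SELECT ON dbo.EmployeeSalaries TO PayrollRole;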

Performance

Performance is a second reason for one-to-one relationships and can be just as significant a factor. When looking at performance, it’s necessary to consider all the scenarios where table size has an impact. With small tables, everything is fast, but a small oversight at the start can easily become a large problem later on.

Data Access and Locking

Every time SQL Server updates data, it has to lock the updated row. Normally, this is a pretty fast operation, but if the row in question is part of a larger transaction, access to it becomes problematic. The longer it takes to update, the longer other queries must wait before accessing the data, and if this happens frequently, it can severely limit the speed of applications performing actions on that table. There are other solutions to this problem such as using NOLOCK, but this removes the guarantee the accessed data is accurate. If an update is running and something uses NOLOCK to access the data and the update aborts due to an error, the accessed data could be inaccurate as it might contain information from an update which wasn’t committed. (Consider a query to access your bank account information using uncommitted reads. Are you sure you want your account balance to possibly be inaccurate?)

Depending on the scenario, a better solution might be to partition the data into two tables: one holding the data that changes minimally and can be queried without fear of locking, and a second housing the columns which update frequently and lock rows:

Read and Update
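
A sketch of the split, using the bank account example (the names are assumptions):

    -- Rarely changing account details: read-heavy, almost never locked by writers.
    CREATE TABLE dbo.Accounts
    (
        AccountId     INT IDENTITY(1,1) PRIMARY KEY,
        AccountNumber NVARCHAR(20) NOT NULL,
        HolderName    NVARCHAR(200) NOT NULL
    );

    -- Frequently updated values: long-running updates lock rows here without
    -- blocking readers of dbo.Accounts.
    CREATE TABLE dbo.AccountBalances
    (
        AccountId      INT NOT NULL PRIMARY KEY REFERENCES dbo.Accounts (AccountId),
        CurrentBalance DECIMAL(19, 4) NOT NULL,
        LastUpdated    DATETIME NOT NULL DEFAULT GETDATE()
    );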

This can be taken to an extreme and is not ideal for all situations, but in scenarios where time-consuming updates slow down query access to other columns, it’s an efficient way of both keeping queries accurate and increasing application performance.

Triggers

Triggers can be set to fire when anything changes on a table, and this is true even when the exact same value is placed in the cell. Take the following table as an example:

Users Single Table
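
Something along these lines (the activity columns are assumptions added to make the point):

    CREATE TABLE dbo.Users
    (
        UserId           INT IDENTITY(1,1) PRIMARY KEY,
        UserName         NVARCHAR(100) NOT NULL,
        FirstName        NVARCHAR(100) NOT NULL,
        LastName         NVARCHAR(100) NOT NULL,
        Password         NVARCHAR(256) NOT NULL,
        LastLoginDate    DATETIME NULL,          -- changes constantly
        FailedLoginCount INT NOT NULL DEFAULT 0  -- changes constantly
    );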

Let’s say you want to track every time someone changes their username, first name, last name, or password. In this scenario, you could just have the application log the changes in an audit table. That’s a good approach, but it’s limiting in that it assumes only one application handles user management. You could place the user-update SQL in a stored procedure and have all applications call it. This assumes that

  1. All developers know (and care) to use the stored procedure
  2. People administrating the database and accounts in production know to use the procedure when updating accounts outside of the application. (Yes, it is true this probably shouldn’t happen, but in reality how often does that stop people who are trying to fix things as fast as possible to keep the organization running?)

This leaves using a trigger which fires and logs updates every time the table changes. The problem is that the trigger fires every time anything changes. Even if the trigger is designed to ignore unnecessary fields and only insert data into the log table when certain fields change, it still has to run on every update, creating unnecessary overhead. By changing the table structure to:
Multiple Tables
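
A sketch of the split, using the same assumed columns divided by how often they change:

    -- Identity data the audit trigger cares about.
    CREATE TABLE dbo.Users
    (
        UserId    INT IDENTITY(1,1) PRIMARY KEY,
        UserName  NVARCHAR(100) NOT NULL,
        FirstName NVARCHAR(100) NOT NULL,
        LastName  NVARCHAR(100) NOT NULL,
        Password  NVARCHAR(256) NOT NULL
    );

    -- Frequently changing activity data the trigger never has to see.
    CREATE TABLE dbo.UserActivity
    (
        UserId           INT NOT NULL PRIMARY KEY REFERENCES dbo.Users (UserId),
        LastLoginDate    DATETIME NULL,
        FailedLoginCount INT NOT NULL DEFAULT 0
    );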

This allows the trigger to fire only when the rarely changed values are updated and removes the overhead of having it evaluate, and possibly log, changes it isn’t concerned with.

Backup and Restore

Operational logistics are something most people aren’t concerned with when building applications, but they should be. Designing a series of tables without thinking about the target environment can cause long-term problems if the designers don’t understand the physical layout of the network.

A number of companies back up their production databases and restore them into test and development environments. This allows developers to understand what the real data looks like and ensures they are writing code against the production database schema. Production and development database structures eventually diverge (abandoned development code, quick fixes in production to avert a crisis, etc.), meaning that developers aren’t writing code against the actual schema unless the lower environments are refreshed to match production. To solve this problem, organizations back up the production databases (and hopefully scrub out sensitive information) and restore them to the test environments.

For organizations which house all their environments in a single data center, moving large databases often isn’t an issue, but not everyone has that luxury. Let’s say your development team is in Indiana, your production applications are in Vermont, and your test systems are in North Carolina.

Moving a small database between them would be relatively easy, but this can become a problem as the database grows. Either the time to transfer the databases increases or the expense of having a larger connection between the centers does.

There are ways to help reduce this issue. The easiest is to

  1. Back up the database
  2. Restore it to someplace else in the same datacenter
  3. Remove the data which isn’t necessary for the test environment and compact the database
  4. Move the smaller database to the other environments.

This approach doesn’t require tables to be split, but splitting helps. Suppose you have something like this:

Binary Data in one table
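
For instance, a single table holding both the XML and the binary documents (the names are assumptions):

    CREATE TABLE dbo.Statements
    (
        StatementId  INT IDENTITY(1,1) PRIMARY KEY,
        AccountId    INT NOT NULL,
        StatementXml XML NOT NULL,        -- useful to developers
        StatementPdf VARBINARY(MAX) NULL  -- large, and not needed outside production
    );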

If you want to keep the XML columns for development to analyze but don’t care about the binary PDFs, you can update each PDF entry to NULL. Depending on the size of the table this may be trivial, but with something like 10 million records, it could take some time. Alternatively, it’s much easier to segregate the data like so:

Segregated Binary Data
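
A sketch of the same data split in two, so the large binaries live by themselves:

    CREATE TABLE dbo.Statements
    (
        StatementId  INT IDENTITY(1,1) PRIMARY KEY,
        AccountId    INT NOT NULL,
        StatementXml XML NOT NULL
    );

    CREATE TABLE dbo.StatementPdfs
    (
        StatementId  INT NOT NULL PRIMARY KEY REFERENCES dbo.Statements (StatementId),
        StatementPdf VARBINARY(MAX) NOT NULL
    );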

Then you can use TRUNCATE TABLE (or DROP TABLE and recreate it). A small change to the structure can save a huge amount of processing time.

Partitioning

Depending on the database engine, moving data to different tables to partition it may not be necessary, but where it is, segregating data can yield a marked performance gain. Most people think of a server having multiple disks, but it’s uncommon to immediately think of a server having access to several different types of disk. This idea is gaining popularity and is even moving to desktops: faster drives, say solid state drives, store commonly accessed data, and slower drives, maybe 7200 rpm spindle drives, house data where speed isn’t an issue. High-performance drives are often smaller and more expensive, and because of this it’s necessary to use them as efficiently as possible. With modern databases, partitioning columns onto different disks is possible, but depending on the table layout and how applications access it, it might be tricky, and with legacy systems it may not be an option. With this in mind, moving large, rarely accessed data to a different table placed on a slower drive allows the database to access the remaining data faster and increases application performance.
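
As a sketch of what that looks like in SQL Server (the database, filegroup, table, and path names are all assumptions):

    -- Add a filegroup backed by a file on the slower disk.
    ALTER DATABASE ExampleDb ADD FILEGROUP SlowStorage;
    ALTER DATABASE ExampleDb ADD FILE
    (
        NAME = SlowStorage_Data,
        FILENAME = 'E:\SlowDisk\ExampleDb_Slow.ndf'
    ) TO FILEGROUP SlowStorage;

    -- Put the large, rarely read table (and its LOB data) on that filegroup.
    CREATE TABLE dbo.DocumentArchive
    (
        DocumentId  INT IDENTITY(1,1) PRIMARY KEY,
        DocumentPdf VARBINARY(MAX) NOT NULL
    ) ON SlowStorage
      TEXTIMAGE_ON SlowStorage;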

Reclaiming Stored Procedure Code

One of the largest problems people have with keeping stored procedures in source control is not a disagreement that they should be kept there, but the change to the process they are accustomed to. Normally with code, you create a physical file which is then piped to the compiler to output an assembly. Microsoft’s SQL Server Management Studio (SSMS) changes the process of creating a file and then running it, because the database stores the SQL code for procedures itself, negating the need for a physical file. This is a convenient shortcut, but it makes it difficult to keep SQL source in sync, because inevitably people forget to create the file from Management Studio when they’re done. SSMS has a menu item for generating a file, but the creation options are limiting depending on the need, and it adds additional code around the procedure which most of the time isn’t necessary. Not to mention that generating one file is simple enough, but having to do it for hundreds of changed procedures is tedious and time consuming.

Script Procedure

Fortunately, there is a convenient fix for this. Since SQL Server stores the code in the database, you can retrieve it and save it yourself. Microsoft provides the INFORMATION_SCHEMA.ROUTINES view, which returns information such as each routine’s name, type, schema, and most importantly, its code. Using this approach you can also format the code the way you want when saving it to a file. I always write the scripts so they won’t fail: I prepend the necessary header code and change the CREATE to an ALTER command before saving.
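
The query side of it is simple; a sketch of the kind of statement involved (this is not the linked script, just the idea):

    SELECT ROUTINE_SCHEMA,
           ROUTINE_NAME,
           ROUTINE_TYPE,
           -- ROUTINE_DEFINITION is capped at 4000 characters, so pull the full text instead.
           OBJECT_DEFINITION(OBJECT_ID(QUOTENAME(ROUTINE_SCHEMA) + '.' + QUOTENAME(ROUTINE_NAME))) AS RoutineCode
    FROM INFORMATION_SCHEMA.ROUTINES
    WHERE ROUTINE_TYPE = 'PROCEDURE';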

File on Github at: https://github.com/kemiller2002/StructuredSight/blob/master/GettingProcedures/GetProcedures.ps1

Wait Wait Don’t Tell Me…..On Second Thought

Consider the following:
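
A sketch of the scenario (the loop body is a stand-in; the point is the missing decrement):

    DECLARE @Counter INT = 100;

    WHILE @Counter > 0
    BEGIN
        WAITFOR DELAY '00:00:01';       -- stand-in for the real work
        -- SET @Counter = @Counter - 1; -- the decrement that should be here is missing
    END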

Ideally, the example would start at 100 and loop until the counter hits zero, but the statement to decrement the counter is missing. It won’t ever stop running, and there is no visual cue to alert you that there is an issue. The most logical solution would be to add a PRINT statement.

Sadly, this does not solve the issue at hand, as the PRINT statement only displays output after the script finishes executing. It’s great for knowing what occurred once it’s finished but is of no use for telling you what is currently happening. Conveniently, there is a solution using the RAISERROR function.
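
A sketch of the loop rewritten with RAISERROR (the message text is an assumption):

    DECLARE @Counter INT = 100;

    WHILE @Counter > 0
    BEGIN
        -- Severity 0, state 1: treated as an informational message, not an error.
        -- WITH NOWAIT flushes the message immediately rather than at the end of the batch.
        RAISERROR('Processing, %d iterations remaining.', 0, 1, @Counter) WITH NOWAIT;

        SET @Counter = @Counter - 1;
    END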

The first parameter is the message to display, and the second and third are the error level and the system state. RAISERROR treats anything with an error level from 0 to 10 as a message and does not affect code execution. The system state ranges from -1 to 255, and anything outside of that results in an exception (for the purposes of printing messages, this value is inconsequential). Using RAISERROR with the WITH NOWAIT option outputs the message immediately, describing how the script is executing as it happens.

informational message

When the error level is raised above 10, SQL Server raises an exception:

Error

This works really well when executing a command from within a .NET application, but an issue arises when executing code from an external process like SqlCmd. When using the RAISERROR function and supplying a severity above 10, it doesn’t always cause the process’s exit code (Process.ExitCode) to change to something other than 0, which is what alerts the calling application that there was an issue.

Exit code 0

Why use SqlCmd when .NET has a built-in way of executing code with SqlCommand? The SqlCommand object does not allow the use of the batch separator, GO. For applications doing Create/Read/Update/Delete (CRUD) operations, not using the GO statement isn’t a problem, but it is important to applications which perform maintenance on databases. Certain SQL statements, like CREATE PROCEDURE, must be the first item in a batch, meaning chaining multiple statements together in one execution is impossible without using GO.

Fortunately, there is an easy solution to this problem. Change the error level to 20 through 25 and add the WITH LOG option (values greater than 25 are interpreted as 25). This closes the connection to SQL Server and forces applications like SqlCmd to return an exit code that is something other than 0. Unfortunately, to set a severity level between 19 and 25, the account running the script must either be a member of the sysadmin role or have the ALTER TRACE permission.
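
For example, a sketch of the failure path:

    -- Severity 20 terminates the connection, so callers such as SqlCmd report a
    -- non-zero exit code. Requires sysadmin membership or the ALTER TRACE permission.
    RAISERROR('Deployment script failed; stopping execution.', 20, 1) WITH LOG;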

Exit Code Not Zero

It is possible to look at the output and error messages and attempt to parse them and determine if an error occurred instead, but this is error prone and not the standard method for checking for system exceptions. It may work for applications specifically written for this scenario, but applications written to interact with others in the Windows ecosystem look at the exit code to determine process success or failure making setting it to the correct value vital when interacting with different systems.

Environments, You All Look the Same to Me

Several years ago, I was working on moving data from the reporting server to the development environment so I could test structural changes. Thinking I was in the development environment, and not realizing my level of access was such that it would work in production, I ran TRUNCATE TABLE Users and didn’t bother to verify which environment I was actually running commands in. It was shortly after my panicked call to the DBA requesting help that I found out the default access to the reporting server was Administrator. (Clearly there needed to be a discussion about permissions granted on production servers, but that’s a different story.) Fortunately, he easily fixed it, and I learned that the only way to be sure this wasn’t going to happen again was never to have two different environments open in the same SQL Server Management Studio (SSMS).

Registered Server Menu
One problem I’ve always had with SSMS is the fact that you can have multiple servers open at the same time with no differentiation. It’s nice to be able to open them simultaneously, but it is also dangerous from a data integrity standpoint. Did you run that update on the production database server or the development one? It’s easy to say, “Well, keep track of what you are doing!” That’s true, you should be cognizant, but people are people and sometimes you get distracted. It’s too easy to make a mistake, and there should be something that says, “HEY, YOU’RE ON A PRODUCTION SERVER, BE CAREFUL WHAT YOU ARE DOING!” After several years of bemoaning this issue, I found a solution to my problem. You can change the properties on a registered server to show a different color on the open connections to it. Simply open the Registered Server Properties.

Registered Server Properties

Select the Connection Properties tab.

Connection Properties

Check the Use Custom Color, press the Select button, and choose the color.

Selected Color

Now when you open a new query window from that server, the bottom of the window is the selected color.

Query Window

Oh Snap! (Restore)

One of the most difficult aspects of the software development process is developing for and testing database updates. Unlike web sites and executables, which can be updated by an X-Copy or by re-installing an MSI, databases always maintain state. Once one has been altered, the only way to go back to the previous incarnation is a database restore.

In 2008, Microsoft introduced Snapshot Restore (Snap Restore). It creates a secondary database on the server which tracks the deltas between the state when it was created and the changes made afterward. Furthermore, the live database can be refreshed back to the state of the snapshot, making it great for development and testing. You can write your scripts, test the updates, and if something is amiss, run the restore and try again. It’s also great for production deployments where getting it right the first time is paramount. As long as the deployment occurs when no other updates are happening, all the updates can be validated and quickly backed out if need be.

This could all be accomplished by backing up and restoring a copy of the same database. The problem with that approach is that the secondary will be the same size as the original. With small databases this isn’t an issue, but it becomes cumbersome as the database gets larger.
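
Creating a snapshot looks something like this (the database name, logical file name, and path are assumptions; the logical name must match the source database’s data file):

    CREATE DATABASE ExampleDb_Snapshot
    ON ( NAME = ExampleDb_Data, FILENAME = 'C:\Data\ExampleDb_Snapshot.ss' )
    AS SNAPSHOT OF ExampleDb;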

DB Snapshot creation

Notice the size on disk is only 128k. As updates are made on the real database, this size grows, but it is still far smaller than making a full copy of it.

When comparing the live database and the snapshot, they have the same data after the creation:

Database comparison

Now after deleting the data from the live database:

After delete

To get the database back to the original state, just execute the restore command:
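
A sketch using the same assumed names (the database needs exclusive access while it reverts):

    USE master;
    RESTORE DATABASE ExampleDb
    FROM DATABASE_SNAPSHOT = 'ExampleDb_Snapshot';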

Data After Restore

Unfortunately, this only works on the Developer and Enterprise editions of SQL Server, and it doesn’t work on any database with Full Text Indexing. With all this being said, MSDN licenses cover the use of the Developer Edition, making it available in development and test environments, where it can help refine the update scripts before running them in production.

All the scripts for this can be found here.

MSDN Database Snapshot Documentation

Language Comparison Euler Problem One

Question:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

F#

F# Example One

or

F# Example Two

C#

C# Example One

or

C# Example Two

JavaScript

JavaScript Example One

SQL
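
A sketch of one T-SQL approach (not the original listing):

    WITH Numbers AS
    (
        SELECT TOP (999) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
        FROM sys.all_objects
    )
    SELECT SUM(n) AS Total
    FROM Numbers
    WHERE n % 3 = 0 OR n % 5 = 0;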

or

PowerShell

or

You Get an Update, and You Get an Update, and You Get an Update

What’s the difference between the following two UPDATE statements?
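
As a sketch (the table name is an assumption, and ReadyToProcess is assumed NOT NULL):

    -- Statement one: updates every row, even those already set to 1.
    UPDATE dbo.QueueItems
    SET ReadyToProcess = 1;

    -- Statement two: updates only the rows whose value actually needs to change.
    UPDATE dbo.QueueItems
    SET ReadyToProcess = 1
    WHERE ReadyToProcess <> 1;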

When they’ve completed, both ensure all the records in the table have the ReadyToProcess value set to 1. Looking strictly at the records in the database, there is no difference. In this example, the end result isn’t what’s important; it’s how the system gets there. The first updates every record regardless of its current value, and the second only updates a record if the value differs from what is being set. The first update statement isn’t telling SQL Server, “Hey, I want the value to be this.” What it’s really telling SQL Server is, “I want you to update the field in this record to this. It may look the same as the previous value, but I still want you to update it.”

Depending on the number of records in the table and what is affected by the table updating, it might not be much of a problem. The first is less efficient than the second, but it still accomplishes the task. It really starts to be an issue when updating a large number of records, or when something like a trigger is associated with the table. Every time SQL Server updates a record, the associated trigger fires, and this is true even if the new value and the old value are the same. An update is an update no matter what the end value is. If the system logs audit data for updates, the unqualified statement forces the system to do a lot more work for absolutely no gain.

Where’s The Scope

When I started programming, it was customary to declare variables at the top of the method. Many people stated that it was clearer to declare everything at the beginning, in the order it was used (or alphabetically for easier searching), and some languages, such as Delphi (Turbo Pascal), even required it. Over the past fifteen years, this has been a debated topic.

From my experience, the shift has been towards declaring a variable only just before needing it. In truth, this makes a lot of sense: it allows the compiler to enforce scope and gives programmers the luxury of a compile failure when a variable is used where it shouldn’t be. This comes with the drawback that variable declarations become more difficult to find when searching through the code. If everything is declared at the top, it’s always in the same place, although modern IDEs alleviate this problem, as it’s normally a single keystroke to find a declaration.

So, what is the correct answer between the two styles? As a blanket rule, the answer is: they’re both wrong. Where to declare variables is significantly different from what to name them. Most languages don’t care what a variable is named. As long as the name follows certain rules (don’t start with a number in some, must start with a $ in others), the compiler/interpreter doesn’t really do much else with it. Where variables are declared, though, can significantly change how the system functions, and most developers forget that this can differ dramatically between languages.
Take C# and JavaScript for example. They look somewhat the same. Are they? No, they aren’t really even close. In C# variables are created at the branch level:

Running Example

Non-running example

In JavaScript they are declared at the function level:

Running Example

The second alert shows “undefined” instead of “I’m Global” even though it’s at the top of the function, because JavaScript moves all variable declarations to the top of the function regardless of where they are written. (Initializing the variable where it’s declared in the branch doesn’t take effect until the branch executes either; only the declaration is moved to the top of the function.)

Believing that all languages work the same because you can write similar syntax without error can cause major problems. A language that trips a lot of people up with where variables should be declared is SQL. Take the following example:
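
A minimal sketch of the kind of loop in question (the variable names are assumptions):

    DECLARE @Counter INT = 10;

    WHILE @Counter > 0
    BEGIN
        -- The declaration is processed once for the batch; the variable is NOT
        -- reset to NULL on each pass through the loop.
        DECLARE @LocalVariable INT;

        IF @LocalVariable IS NULL
            SET @LocalVariable = @Counter;

        PRINT @LocalVariable;
        SET @Counter = @Counter - 1;
    END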

Here’s the output:

Notice how it only prints out 10 even though the counter is being updated each loop? If the @LocalVariable was actually initialized each time the loop executed, it would have counted down to 1 because it would be set back to NULL each time.

What you have to remember is that writing something a certain way doesn’t mean the compiler/interpreter will output the exact instruction set you thought it would. Assuming a variable will behave a certain way without testing or explicitly setting it can be the difference between working code and spending several hours tracking down a problem.

Find a Guid In a Haystack

A little while ago we had a problem when an unidentified Guid came up in an error email. Whoever wrote the original message knew exactly where to look to find the associated record, but by now that person was long gone, and we were left with a problem on our hands. There was an error with an account, and we didn’t know where to look. Obviously, if there were only one table with a Guid field, it would be easy to find, but the problem becomes a lot larger when every table has one or more fields of that type. You’re left blankly staring at it, ultimately facing the realization that you have to go through each table, one at a time, field by field.

We all agreed this was a waste, and we needed to speed up the process. One, we couldn’t stop development so everyone could look for it, and two, if it happened again, we’d be back in the same spot. To handle this, I created the following script. It uses the system tables to look up each column of the appropriate type and build a select statement to search the database one field at a time.
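
A minimal sketch of that approach (not the original script; the variable and cursor names are assumptions):

    DECLARE @SearchId NVARCHAR(36) = '00000000-0000-0000-0000-000000000000'; -- the Guid to find
    DECLARE @SchemaName SYSNAME, @TableName SYSNAME, @ColumnName SYSNAME;
    DECLARE @Sql NVARCHAR(MAX);

    -- Every uniqueidentifier column in the database.
    DECLARE ColumnCursor CURSOR FAST_FORWARD FOR
        SELECT s.name, t.name, c.name
        FROM sys.columns c
            INNER JOIN sys.tables t ON t.object_id = c.object_id
            INNER JOIN sys.schemas s ON s.schema_id = t.schema_id
            INNER JOIN sys.types tp ON tp.user_type_id = c.user_type_id
        WHERE tp.name = 'uniqueidentifier';

    OPEN ColumnCursor;
    FETCH NEXT FROM ColumnCursor INTO @SchemaName, @TableName, @ColumnName;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Build and run a SELECT against this one column.
        SET @Sql = N'SELECT * FROM ' + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName)
                 + N' WHERE ' + QUOTENAME(@ColumnName) + N' = ''' + @SearchId + N'''';
        EXEC sys.sp_executesql @Sql;

        FETCH NEXT FROM ColumnCursor INTO @SchemaName, @TableName, @ColumnName;
    END

    CLOSE ColumnCursor;
    DEALLOCATE ColumnCursor;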

Did it take a little time? Yep. Did it get the job done? Absolutely. It didn’t give us the table name, but it gave us all the associated fields to the record, and that was more than enough to find it. The real win was that none of us had to spend time doing something that a computer can do for us. Its time spent searching the database is nothing compared to how much money the company would spend if we had to look for it.

You can modify it to look for anything in the database by changing the type the cursor statement looks for (tp.name = 'uniqueidentifier'). You can also change the comparison from = to LIKE if you want to search within character fields (@ColumnName + ' like ''%' + @SearchId + '%'''). This is what I did when I had to modify the script to search for password information a system was saving to the database, so it could be scrubbed out.