The database for the main application that I work on is fairly large, about 50 GB or so. Not the largest SQL Server database I’ve seen, by far, but one that is non-trivial to move across the network. This has a significant impact on the development process of our team. The size of the database, combined with the modest hard drives on the laptops we use, means that keeping a local copy of production data is unfeasible.
Getting a local database’s schema in sync with that on production would be easy, but in our case, that’s not enough to create a working system. Our application has a large amount of what I call “static data”, such as menu structures or sets of permissions. So getting a database that’s “up to date” means not only getting the schema in sync with production, but also ensuring this static data is up to date as well.
Using some sort of tool for evolutionary database design like Rails ActiveRecord migrations would alleviate some of these problems, because schema and static data changes would be a part of the source code. Developers could just run the migrations on their local databases after they updated from source control. However, this still wouldn’t solve the whole problem. In order to effectively develop features within the application, our developers need a reasonable set of test data so that the system isn’t just completely empty.
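To make the idea concrete, here’s a minimal sketch of how a migration runner works, in Python against an in-memory SQLite database (the table names are invented for the example; real tools like ActiveRecord are far more capable). The key point is that schema changes and static data changes are both just versioned scripts, applied in order and tracked so they run exactly once:

```python
import sqlite3

# Hypothetical migrations: each is a (version, SQL) pair. Static data ships
# alongside the schema change that introduces it, so a freshly migrated
# database is immediately usable.
MIGRATIONS = [
    (1, "CREATE TABLE Menus (Id INTEGER PRIMARY KEY, Name TEXT)"),
    (2, "INSERT INTO Menus (Id, Name) VALUES (1, 'Invoicing')"),
]

def migrate(conn):
    # Track which versions have already been applied.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    for version, sql in MIGRATIONS:
        if version not in applied:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # re-running is a no-op; already-applied versions are skipped
```

After an update from source control, a developer just runs the migrations again and only the new ones execute.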
There are tools out there, such as Red Gate’s SQL Data Generator or the data generators in Visual Studio, that will do a pretty good job creating test data by looking at database column and table names, foreign keys, and such. This might work out even for such a large system as ours, except that a lot of key tables have “polymorphic” relationships, meaning that the foreign key that they contain could point to the primary key in a number of different tables, depending on the particular piece of data.
For example, say we have an “Invoices” table. We have a multi-tenant system, and our customers often base their invoicing on different things. Some might base invoices on each individual service they performed for their clients, while others might base them on the amount of time logged in the time and expense module for a client. In each case, the invoice database record needs to point back to a record in the table that’s most relevant, given the customer’s business processes. Another example of this kind of relationship might be audit records, which might point back to just about any other table in the system.
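A “polymorphic” reference is typically stored as a type/id pair, which is exactly why the database can’t declare a foreign key for it. Here’s an illustrative sketch in Python with SQLite (all table and column names are made up for the example, not taken from our actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Services (Id INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE TimeEntries (Id INTEGER PRIMARY KEY, Hours REAL);
CREATE TABLE Invoices (
    Id INTEGER PRIMARY KEY,
    SourceType TEXT NOT NULL,   -- e.g. 'Service' or 'TimeEntry'
    SourceId INTEGER NOT NULL   -- points into whichever table SourceType names
);
INSERT INTO Services VALUES (1, 'Annual audit');
INSERT INTO TimeEntries VALUES (7, 12.5);
INSERT INTO Invoices VALUES (1, 'Service', 1);
INSERT INTO Invoices VALUES (2, 'TimeEntry', 7);
""")

# Resolving the reference requires application-level knowledge of the type
# column; the database itself has no single table to enforce a FK against.
TYPE_TO_TABLE = {"Service": "Services", "TimeEntry": "TimeEntries"}
resolved = []
for inv_id, src_type, src_id in conn.execute(
        "SELECT Id, SourceType, SourceId FROM Invoices"):
    table = TYPE_TO_TABLE[src_type]
    row = conn.execute(f"SELECT Id FROM {table} WHERE Id = ?", (src_id,)).fetchone()
    resolved.append((inv_id, table, row[0]))
```

Since the target table is encoded in data rather than in the schema, a generic data generator has nothing to introspect.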
Since these “polymorphic” associations are not defined as proper foreign keys in the database, those data generation tools wouldn’t be able to figure out that the columns were foreign keys at all, and as far as I’ve been able to figure, it’s not possible to manually define foreign key relationships with a number of different tables. And even if it were, I don’t think I could prevent the tool from associating an invoice from a company that bases its invoices on services performed with a time and expense entry.
There are a couple of ways that our developers cope with this, neither of which is ideal. The first, which most of our team members use, is to develop against one of several shared database instances on our data tier servers. The problems associated with using shared databases for development are well established; developers simply can’t be as productive when stepping all over each other with data and schema changes.
The second, which I use, is to keep an instance of the database on an external hard drive. This keeps me isolated from the changes made by other developers, and it’s a significantly better experience than using a shared database, but problems start to crop up when I get latest from source control. Developers will check in source code changes that require data or schema changes in order to work, and my local database won’t have those changes.
So, at the end of the day, the only reliable way to get an up-to-date copy of the schema is to restore a database from the last backup of production. Since the database is so big, that restore takes multiple hours, which can seriously impede the development process. This actually impacts developers using shared databases even more than me, because when one of those shared databases has to be refreshed, multiple developers are put out of commission.
The only way I’ve thought of to make this a little better is to manually create a script that will cherry-pick a certain number of rows from what’s essentially the “root” table of our database, and spider out to include all the data related to those cherry-picked rows, while also including all rows from the tables that contain static data. The end result would be a much smaller test database that contains a meaningful subset of production data and that could be moved around and refreshed in minutes or seconds rather than hours. The problems with this idea are that creating the script in the first place would be onerous, since our database contains over 500 tables, and that keeping it up to date with any changes to tables or columns would be an ongoing chore.
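The spidering step itself is just a breadth-first traversal over a hand-maintained relationship map; the map is the onerous, human-knowledge part. Here’s a sketch of the idea in Python with SQLite, using invented table names (`Customers` as the root, `Menus` as static data) purely for illustration:

```python
import sqlite3
from collections import deque

# The hand-maintained part: for each table, which child tables reference it,
# and through which column. Polymorphic links would need extra entries here
# keyed on the type column -- this is where human knowledge comes in.
RELATIONSHIPS = {
    "Customers": [("Invoices", "CustomerId")],
    "Invoices": [("InvoiceLines", "InvoiceId")],
}
STATIC_TABLES = ["Menus"]  # copied wholesale into the subset

def collect_subset(conn, root_table, root_ids):
    """Return {table: set of row Ids} reachable from the chosen root rows."""
    subset = {root_table: set(root_ids)}
    queue = deque((root_table, rid) for rid in root_ids)
    while queue:
        parent, parent_id = queue.popleft()
        for child, fk_col in RELATIONSHIPS.get(parent, []):
            rows = conn.execute(
                f"SELECT Id FROM {child} WHERE {fk_col} = ?", (parent_id,))
            for (child_id,) in rows:
                if child_id not in subset.setdefault(child, set()):
                    subset[child].add(child_id)
                    queue.append((child, child_id))
    for table in STATIC_TABLES:
        subset[table] = {row[0] for row in conn.execute(f"SELECT Id FROM {table}")}
    return subset

# Tiny demonstration database: two customers, but we cherry-pick only one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (Id INTEGER PRIMARY KEY);
CREATE TABLE Invoices (Id INTEGER PRIMARY KEY, CustomerId INTEGER);
CREATE TABLE InvoiceLines (Id INTEGER PRIMARY KEY, InvoiceId INTEGER);
CREATE TABLE Menus (Id INTEGER PRIMARY KEY);
INSERT INTO Customers VALUES (1), (2);
INSERT INTO Invoices VALUES (10, 1), (20, 2);
INSERT INTO InvoiceLines VALUES (100, 10), (200, 20);
INSERT INTO Menus VALUES (1), (2);
""")
subset = collect_subset(conn, "Customers", [1])
# Customer 1's invoices and lines come along; customer 2's do not, and the
# static Menus table is included in full.
```

A real version would also have to emit `INSERT` scripts (or `BCP` the selected rows out), but the traversal shows why the relationship map, not the copying, is the hard part to maintain.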
I wish there was an easier answer to this. I have a few ideas in the back of my head about writing a tool that might help me create those scripts, but I think it would still end up being a very manual process of defining relationships that only a human being with knowledge of the system would be able to come up with. If any readers have experience with this kind of thing, I’d love to hear how you dealt with it.