I'm a huge fan of SQL Server's database mirroring concept. We've been using it on our application (60GB DB over 220 tables, 10's to 100's of millions of rows) for almost 3 years on SQL 2005. Log shipping has its place (it's pivotal to our offsite disaster recovery plan), and clustering is great if you have a huge replicated SAN, but, at least on paper, DB mirroring is the lowest-cost and most approachable option. In reality however, it has some warts.
We started out with synchronous mirroring in a high safety + witness configuration. This is great, as we could easily take down the primary DB server for maintenance during "slow" periods with minimal effect on the running application (a few transactions might fail, which we recover from gracefully). As our database grew, though, we started seeing massive system slowdowns during peak usage periods. Investigation showed that the lag was coming from the commit overhead on the mirror, which might grow to 30s or more causing timeouts (high safety mode requires that the transaction be fully stored on the mirror server before returning control to the client). More investigation revealed that the disk write I/O on the mirror server's data volume was between 10x-500x the principal, which outstripped the disk array's ability to keep up. With a lot of angry customers and idled operators waiting around, we didn't have a lot of time to invest in root-cause analysis, so we switched over to asynchronous mirroring to keep the doors open (async mirroring doesn't hold up the client transaction waiting for the log to copy to the mirror). Luckily, Microsoft Startup Accelerator (now Bizspark) had hooked us up with SQL Enterprise licenses, so async mirroring was an option for us- it's not on SQL Standard! With async mirroring, a catastrophic loss of the primary server pretty much guarantees some data loss, so it's not ideal.
Awhile back, we upgraded all our DB server storage to SSDs in a RAID10 config, resulting in a massive performance boost on our aging hardware. We figured this would allow us to go back to synchronous mirroring mode with no problems. While not as severe, we still experienced painful slowdowns during peak write periods, and had to switch back to async mirroring again. Even with async mirroring, the write volume to the mirror data disk was still consistently hundreds of times that of the primary. As we hadn't planned for these ridiculous mirror write volumes, we were starting to worry about our mirror server's SSDs burning out prematurely (SSDs have a limited write volume before the flash cells start to fail).
Flash forward to last month- we've purchased spanking new 12-core DB servers with the latest and greatest SSDs in RAID10, 64G of memory, and SQL 2008 on Windows Server 2008R2. We wanted to spend the time to get high safety synchronous mirroring in place again, so we wrote a little simulator app to see if SQL 2008 on our new servers had the same nasty I/O issues. It did. On average, the data write volume was constant, and 250-500x higher on the mirror (writing constant 3-7MB/s 24/7 is a quick death sentence for an SSD rated at 5GB/day for 5 years)!
Time to call in Microsoft. After explaining the situation, the first response was "as designed". Really? Our write volumes aren't all that high, so if this is true, I have a hard time believing that database mirroring is useful on a database of any size. In any case, had we gone live this way, our mirror machine's SSDs would've been shot within a matter of months. After an initial call of "BS!", I got a little more detail: apparently SQL Server not only ships the log data over in real-time, it also performs recovery on the DB for every transaction to minimize the failover time (which IS nice and snappy, usually <1s). Turns out, there is an undocumented trace flag that disables the per-transaction recovery process, at the cost of a higher failover delay. This sounded like exactly what I needed. So what is this magic trace flag?
DBCC TRACEON(3499, -1)
This should be run on both the primary and mirror DBs, since they switch roles during failover. It worked exactly as advertised for us. The mirror server's disk I/O was now in lock-step with the primary, and we could once again use full-safety mirroring with a witness. The failover times were definitely increased, but in our testing, they're still sub-10s, which is perfectly workable for us.
I've only found two references to this trace flag online- one in a presentation by an MS employee that says you should test extensively (which we are), the other in an unrelated KB article about upgrading DBs with fulltext indexes to 2008 from 2005. I've found a handful of people griping about this problem in forums over the years, with no responses. Hopefully this will take care of others' issues as well as it did ours. We were within inches of switching to a sub-second log shipping scenario to replace mirroring because of this issue, and now it's looking like we won't have to. Just wish it was a little better documented.