fwiw, Cassandra automatically takes a snapshot (which is basically instantaneous -- see http://www.riptano.com/blog/whats-new-cassandra-066 for a brief explanation of why) before truncating or dropping a column family, exactly because it makes recovering from a foot-gun incident like this so much less painful.
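The snapshot is cheap because it's just hard links to the immutable SSTable files, so no data gets copied. You can take one by hand too; roughly this (keyspace name illustrative, and exact paths/flags vary by version):

    # near-instantaneous: creates hard links, copies no data
    nodetool -h localhost snapshot MyKeyspace

    # snapshots accumulate on disk until you clear them
    nodetool -h localhost clearsnapshot MyKeyspace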
> MySQL. Not that using anything else would have prevented this or would help restore the table faster...
The slowness of the restore suggests they had to reload a text dump into the database, rather than simply copying over the binary files. Even copying 1 TB off a single hard drive, or between EC2 instances, would only take a few hours, not days.
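(Back-of-the-envelope: at ~100 MB/s of sequential throughput, 1 TB is about 10^6 MB / 100 MB/s = 10,000 seconds, call it three hours. A multi-day restore really does point at row-by-row reinsertion and index rebuilding, not a straight file copy.)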
And they weren't using binlogs.
And they didn't separate their test environment from production, which is understandable when you're in demo mode but not after a couple of years and half a million users.
If your database is on a socket, then that means tests are running on a production machine, which just seems crazy. If it is on TCP, then that means access isn't restricted properly. At any rate, how is this Rails-specific?
I spent years in Java prior to Ruby, across the various Java frameworks, and I don't remember ever making the mistake, or working with anyone who did, of blowing away production data when running tests.
Since working in Rails, I've had at least 2-3 experiences on the team with truncated/dropped tables or mangled data in production, and it isn't just the current team.
Due to the conventions involved, most other languages and frameworks don't make this kind of accident part of the normal developer workflow the way Rails 2 seems to.
With Rails, the same project is, for the most part, shared across dev, test, and prod. Developers want database.yml revisioned with the rest of the app. Many Rails developers at some point run the app locally against the production DB, in order to run migrations on production prior to Capistrano deploys, or maybe just to open a production console to run reports or tweak data. So there is a very real chance for wild things to happen in such a risky environment. I'm not anti-Rails by any means, but I've not heard of this happening in other frameworks as much as it historically has with Rails.
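To make the failure mode concrete, here's roughly what that one revisioned config looks like (names hypothetical); every environment is a single RAILS_ENV away from every other:

    # config/database.yml -- dev, test, and prod all live in one file
    development:
      adapter: mysql
      database: myapp_development
      host: localhost

    test:
      adapter: mysql
      database: myapp_test
      host: localhost

    production:
      adapter: mysql
      database: myapp_production
      host: db.example.com  # hypothetical production host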
I think it could happen just as easily in Java, especially in continuous integration scripts, which are nearly fully automated. I've nearly done something similar with Hibernate just by running Ant scripts that looked at my environment variables to see which database to truncate. If I forgot to change the environment variable back and my machine could reach the prod servers, the same thing could have happened. The only reason it didn't is that I usually keep pretty strict rules in my pg_hba.conf file that prevent foreign hosts from connecting.
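For reference, the kind of rules I mean (illustrative; pg_hba.conf is first-match-wins):

    # pg_hba.conf -- local connections only; everything else is rejected
    # TYPE  DATABASE  USER  CIDR-ADDRESS   METHOD
    local   all       all                  md5
    host    all       all   127.0.0.1/32   md5
    host    all       all   0.0.0.0/0      reject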
Not really. Reasonably complex Java apps would use JNDI to store a single DB connection name, which the app server maps internally to DB credentials.
That way, the app just says "connect me to /x" and the container hands it /x, which might be the test or prod DB connection; the app never knows, or cares. You get the DB that the operations guys configured for -that- environment.
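For those who haven't seen it, the lookup side is tiny (a sketch; the JNDI name is illustrative):

    import java.sql.Connection;
    import javax.naming.InitialContext;
    import javax.sql.DataSource;

    public class Db {
        // The app only ever sees the logical name; the container maps
        // "jdbc/myapp" to whatever DataSource ops wired up for this environment.
        static Connection connect() throws Exception {
            InitialContext ctx = new InitialContext();
            DataSource ds = (DataSource) ctx.lookup("java:comp/env/jdbc/myapp");
            return ds.getConnection();
        }
    }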
JNDI is horrible in many ways, and so are app containers, but there are a few niceties that Ruby-land could learn from.
However, in this case it happened on their continuous integration server, which probably does more than just "rake test". Our Hudson scripts do something like "rake db:reset", because some concurrent tests require data to be persisted in the database and not rolled back per test.
So when it's running something like "rake db:reset", which simply recreates all tables and seeds the database, it just looks at your environment and assumes you know what you're doing. Hudson (like other continuous integration systems) is pretty automated, so unless it prompted you, it would run the db scripts quietly.
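One cheap guard (a sketch, not something from their setup): bolt a prerequisite onto the destructive tasks that refuses to run them against production unless explicitly forced:

    # lib/tasks/safety.rake (hypothetical) -- refuse destructive db tasks in prod
    task :guard_production do
      if ENV['RAILS_ENV'] == 'production' && ENV['FORCE'] != 'yes'
        abort "Refusing to run a destructive task in production (set FORCE=yes to override)"
      end
    end

    %w(db:reset db:drop db:schema:load).each do |name|
      Rake::Task[name].enhance(['guard_production']) if Rake::Task.task_defined?(name)
    end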
No system could prevent it, but there are databases (Oracle and MongoDB off the top of my head, and I'm sure there are others) that implement slave delays specifically to help mitigate human error like this.
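In MongoDB, for instance, it's a per-member replica set option (field names from that era; the member index is illustrative):

    // run in the mongo shell against the primary: keep one secondary
    // an hour behind, hidden from reads and never electable as primary
    cfg = rs.conf()
    cfg.members[2].priority = 0
    cfg.members[2].hidden = true
    cfg.members[2].slaveDelay = 3600  // seconds
    rs.reconfig(cfg)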
Incorrect. The article states that they lost data between backup time and whenever the incident happened.
Any reasonable-quality transactional database system has a transaction log for a reason. If you were using a commercial DB system like Microsoft SQL Server or Oracle (even a decade ago), this would not be an issue: no data loss. Businesses should care about their data, and I guess this is why the commercial databases are still doing fine in a landscape increasingly dominated by FOSS everywhere.
I realize licensing costs do matter, but I can't fathom why people put up with the sad excuse for an RDBMS that is MySQL. For any nontrivial task it is slow and unreliable; attempting to secure your data by taking database backups (which seemingly can't even provide you with transactional safety!!!) leaves the DB locked and unusable while the backup is performed; having a database bigger than what you can fit in memory makes it perform like a flat file; etc. etc. ad infinitum.
Surely there must be something better people can use which is still free? Postgres for instance?
FWIW, even MySQL lets you store "binlogs", which allow you to "rerun" all commands that changed the data since your last backup -- if you've configured it to.
Which makes me think: last time I had anything to do with adminning MySQL, binlogs were required for replication, so I'm guessing this means GitHub isn't replicating that database anywhere either...
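For anyone who hasn't set it up: it's a couple of lines of my.cnf, and point-in-time recovery is a pipe (paths and dates illustrative):

    # my.cnf
    [mysqld]
    log-bin          = /var/log/mysql/mysql-bin
    expire_logs_days = 14

    # to recover: restore the last full backup, then replay the binlog
    # up to just before the fat-fingered statement
    mysql < last_full_backup.sql
    mysqlbinlog --stop-datetime="2010-11-14 14:59:00" \
        /var/log/mysql/mysql-bin.[0-9]* | mysql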
You are right on. The cardinal rule of traditional relational databases is that data loss is unacceptable. A database administrator who allows data loss is by definition incompetent. A database that allows data loss is by definition not a database.
Commercial databases such as Oracle, SQL Server, and DB2 are built from the ground up to make it possible to recover everything up through the last committed transaction before the incident. I have always presumed that the leading open-source traditional relational databases (MySQL and PostgreSQL) provide this same capability.
In the GitHub case (MySQL) it seems that full recovery may have been possible, but not within an acceptable timeframe. This may be a flaw in the software (lack of features that make quick recovery possible) or a flaw in the strategy (not being configured for quick recovery).
However, any of these databases can easily be administered in a manner that allows data loss. The GitHub event is not necessarily an indication of a problem with MySQL. It is unquestionably an indication of a problem with the administration of the database.
Can anyone confirm that full recovery from this type of incident (within a reasonable timeframe) should be possible with both MySQL and PostgreSQL if administered properly?
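For PostgreSQL, at least, yes: "administered properly" means continuous WAL archiving plus point-in-time recovery, roughly like this (8.x-era syntax; paths and timestamp are illustrative). MySQL's equivalent is the binlog replay sketched upthread.

    # postgresql.conf -- ship every completed WAL segment somewhere safe
    archive_mode = on
    archive_command = 'cp %p /mnt/wal_archive/%f'

    # recovery.conf -- dropped into the data directory after restoring
    # the last base backup; replays WAL up to just before the incident
    restore_command = 'cp /mnt/wal_archive/%f %p'
    recovery_target_time = '2010-11-14 14:59:00'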