Alright, so the company that hosts the realm server had a drive fail on them recently. Usually this is not a big deal, as they have a backup drive which takes over, resulting in virtually no downtime (maybe a quick disconnect but nothing major). However, their backup drive also failed, and they needed to set up new drives and restore all of their customers' data onto them from an older save.
It took them 3 and a half days to fix this issue. This is in part due to the rarity and difficulty of this issue and the fact that it happened on a weekend when support is minimal. I was very unhappy with this response time but the company has given us some reassuring words:
"As previously mentioned the event that both drives should fail is extremely rare but as part of this fix we have now put in place a disaster relief measure. If this was to ever happen again then we can reboot the whole server with one action. In addition, individual accounts are back up every 2 days as standard."
The problem is that their backup was much older than 2 days, which means all of our character data and the realm's own backup data, all of which were up to date before the crash, got replaced by the older backup the company had. This means some accounts no longer exist and some characters have been severely rolled back.
Normally this wouldn't be a problem, because we recently set up a new backup system that saves players' characters on a different server so that, if something like this happens, I can still give everyone a fresh copy of their character. However, while downloading these backups today I realized that not all characters were being backed up correctly. The server had reached such a high number of files in one folder that it stopped showing files past a certain amount, so the program in charge of comparing players' save files and determining which ones need a backup was not scanning a portion of the files. We were not aware of this issue until all of this happened.
So while the main issue comes from our server provider, we also didn't have proper systems in place to limit the damage. This has since been fixed: our offsite backup system no longer suffers from this limitation, and our service provider is keeping 2-day backups.
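For anyone curious what a check against this kind of silent gap could look like, here is a minimal sketch in Python. It is an illustration only, not the realm's actual backup program: the function names, the size-plus-modification-time comparison, and the `expected_count` value (for example, the number of characters the realm itself reports) are all assumptions made for the example.

```python
import os


def snapshot(save_dir):
    """Return {filename: (size, mtime_ns)} for every regular file in save_dir.

    os.scandir walks the directory entry by entry, so it does not depend on
    a listing that might be cut off after a certain number of files.
    """
    result = {}
    with os.scandir(save_dir) as entries:
        for entry in entries:
            if entry.is_file():
                info = entry.stat()
                result[entry.name] = (info.st_size, info.st_mtime_ns)
    return result


def files_needing_backup(current, previous):
    """Names of files that are new or changed since the previous snapshot."""
    return sorted(
        name for name, signature in current.items()
        if previous.get(name) != signature
    )


def check_listing_complete(current, expected_count):
    """Fail loudly if fewer files were scanned than expected.

    expected_count is a hypothetical figure from an independent source;
    without a check like this, a truncated scan goes unnoticed.
    """
    if len(current) < expected_count:
        raise RuntimeError(
            f"scanned {len(current)} save files, expected at least {expected_count}"
        )
```

The point of the last function is that the backup job raises an error when the scan comes up short, instead of quietly skipping the files it never saw.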
This was an absolute nightmare and I'm already looking forward to the ladder reset in about 1 month so we can put this all behind us. To apologize, the company has offered us 1 month free on the realm cost and a RAM upgrade on the machine. This, combined with the new systems they have put in place to prevent this from happening again, assures me we are still in good hands. I was very upset with them, but honestly this was their first error and they dealt with it well (albeit too slowly).
Thanks to everyone who has patiently waited.
Posted by GreenDude, 10 months 21 days ago