Time for Tech Talk
Since I went ahead and posted something, I guess the flood gates have been opened. Well let's get into something substantial, shall we? Yes, we shall.
The past few weeks I've been reobsessed with BanLink. It's been running fairly well for about four months now. During that time period, both Travis and I have taken breaks from BanLink and even SL in general. We'd been working on it for a while before its release, so we deserved it, dangit. Thankfully, it runs pretty well on its own. I've popped additional enhancements and tweaks into the web site from time to time, and Travis has done the same to the in-world boxes. With each release, the system gets a little better. Most of this "better" is transparent to the end-user, but system-wise things are improving.
None of this is to say I haven't had a large to-do list for the web side of things. There is a lot on there, much of which is fairly easy to do, but I needed the inspiration to do it. The real fun stuff, however, sort of kicked in in the past month, when I decided to pay closer attention to the servers running this shebang. The servers, I noticed, have been running with a fairly high load average, jumping up past 20 on occasion. Now these are dual-CPU machines, so 20 is a tad on the excessive side (we'd prefer less than 2, for those unsure of what the heck a load average is).
It was time to dive into the servers themselves and take a closer look. It seems that there has been a whole lot of email traffic being queued up, and bounced, and retried, and so forth, which wasn't pleasing. After some digging, I discovered that boxtrapper had been enabled on a couple of user accounts. Boxtrapper is a spam blocker that runs on the server. It's the kind that automatically responds to your email message when you send one to someone, requiring you to confirm that you're not a spammer. Once you respond to its automated email, it'll allow the original one to go through. In theory, it works great. In practice, it sucks up disk space and CPU time. I'd had it disabled for quite some time, but apparently some users had found it prior to that disabling and had it running on their accounts. Once I was able to disable boxtrapper completely, the servers breathed a sigh of relief.
But it didn't end there. Oh, no. The servers, my friends, were still being hit hard. The load average was hanging out at around 1.8 for much of the day. Under light load, like in the morning when fewer people seem to be in SL, the load average would be around 0.7 – 0.8. I can handle a load average that's less than one. That would be wonderful, in fact. But it just wasn't consistent enough. It was time to take a look at the stats on BanLink to see what was going on in there. I've had ideas on how to optimize things, but hadn't implemented anything major in quite some time. After watching the process list for a bit, it was clear what was going on. The BanLink controller app would run just fine for a while, then about six instances would be kicked off at once, at which point the database would go into overload. I knew the database would eventually be an issue since it really didn't take long to surpass one million records, but in general the database tables are pretty lean. I didn't expect the DB to become quite the bottleneck that it had become. It was time for some performance tuning.
After some poking around, I realized that I hadn't really tuned the database at all since its install on this server. Well that won't do. So after actually fiddling with the DB settings and a restart, things looked like they had improved. The load average was hanging out at around 0.8 for much of the time that I was watching it. Yay!
No, boo! Of course that's not the end of this story.
Taking a peek at things a few days later, we were back to square one. Load averages were jumping up to 5 or 6; sometimes into the teens. Yup, same issue as before. Once a few controllers were launched, the database would run a little slower, more processes would be launched before the first could complete, more processes, more slowness, until we had a real backlog of processes waiting to complete. I was envisioning BanLink objects sending query after query until all of them in-world had been queued up and we've got 40+ processes waiting for a response from the server, which is now running so slow it can't even display this blog entry. So I did something delightfully simple…and maybe a little bit evil. I changed the BanLink controller just ever so slightly. It now decides whether or not it should run. Ooh, artificial intelligence, you say? Well I will venture to say that this is smarter than many people who don't know when to quit when they're ahead, but I don't expect it will be sending out an army of robots to take over the world just yet. All it does is check the server load. If it's just too darned high, it quits. Simple.
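For the curious, a check like that can be sketched in a few lines. This is Python purely for illustration, not the actual BanLink controller code; the threshold value and the names `MAX_LOAD`, `should_run`, and `handle_query` are all made up, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os

# Hypothetical cutoff: the post does not say what value BanLink really uses.
MAX_LOAD = 4.0


def should_run(load_1min, max_load=MAX_LOAD):
    """Return True when the 1-minute load average is at or below the cutoff."""
    return load_1min <= max_load


def handle_query(run_controller, read_load=os.getloadavg, max_load=MAX_LOAD):
    """Run the controller only if the server is not overloaded.

    Returns the controller's result, or None when the request is skipped,
    leaving the in-world object to ask again on its next poll.
    """
    load_1min = read_load()[0]  # getloadavg() returns (1 min, 5 min, 15 min)
    if not should_run(load_1min, max_load):
        return None  # bail out early instead of piling onto the database
    return run_controller()
```

Passing the load reader in as a parameter is just a convenience so the check can be exercised with fake load values; the important part is that the bail-out happens before any database work starts.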
"But," you say, "what about the systems that are running and a griefer shows up and starts causing trouble? What about that? How will they be banned or ejected with this system if it's under heavy load?" Well, honestly, it won't really make a whole lot of difference with this addition. All it *is* doing is making sure that the system runs nicely. In other words, you can read this blog because BanLink won't take up all the CPU time even if there's a lot going on. And will it change things if you've got a griefer and are waiting for them to be ejected? Not really. If the server was already under a heavy load, you'd have to wait five or six seconds for the process to complete, if you were lucky. Now all it does is bail out and wait for you to query the system again. Most BanLink systems are querying every 5-10 seconds, so there's not a lot of time loss. And with this new code in place, the load average hasn't really jumped up into the teens like it used to. Sure, there's the potential for a site to be lost in the mix, but this is a quick and dirty stop-gap until the next code release (which is coming, I promise).
Yay! Right? It is Yay time, isn't it?
No, not quite, but we're almost there.
Now we've got the DB running better than it was. We've also got BanLink running a little better than it was. Now, let's really look and see how things are performing. Last weekend I wrote a bit of code which allows me to take a look at what BanLink is really doing. I should've done this long ago, as it's really interesting to see. After putting this up, we found that we're getting over 60,000 BanLink queries from the in-world objects every day. This averages out to a little over 40 queries every minute. That may not sound like a whole lot, but we're close to a query per second. And honestly, at night, it's very much that. But the stats also showed a little more info that was useful. For instance, the average server load during the 24-hour period was over 2.5. Not quite what I'd been hoping for. Also, the average execution time for the BanLink controller is over 3.5 seconds. Yikes!
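The arithmetic behind those rates is quick to check. The sketch below uses the round figures from this post (60,000 queries a day, 3.5 seconds per run) as lower bounds; the last line is only a back-of-the-envelope estimate using Little's law (average number in flight = arrival rate × average duration), not something the stats page reports.

```python
QUERIES_PER_DAY = 60_000   # "over 60,000" per the stats
AVG_EXEC_SECONDS = 3.5     # "over 3.5 seconds" per the stats

MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60

per_minute = QUERIES_PER_DAY / MINUTES_PER_DAY
per_second = QUERIES_PER_DAY / SECONDS_PER_DAY

# Little's law: rough average number of controller runs in progress at once.
in_flight = per_second * AVG_EXEC_SECONDS

print(f"{per_minute:.1f} queries/minute")   # 41.7 queries/minute
print(f"{per_second:.2f} queries/second")   # 0.69 queries/second
print(f"{in_flight:.1f} runs in flight")    # 2.4 runs in flight
```

If that estimate is in the right ballpark, a couple of controller runs are in progress at any given moment on average, which sits uncomfortably close to the observed load average.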
Alright, time for one more round of optimizations.
This time I went a bit deeper into the DB optimizations. It might be a bit more memory intensive, but I think it's needed at this point. I added a few additional tweaks which I hadn't thrown in there before. Next was to see if we could tune PHP just a bit. This time, it was up to eAccelerator. I'd thought about installing it a while ago, but hadn't quite gone the distance. eAccelerator is an optimizer for PHP which simply caches the PHP code in its compiled state so that each subsequent run doesn't have to do the compile. Sounds simple enough. Kind of stupidly simple concept, in fact. But it's not what PHP does by default. I'm a pretty big PHP advocate, but that really is just silly. So after a bit of time installing that, tweaking the OS just a tad, and letting it loose, we saw some major improvements. Load average now sits at 0.5 and the average execution time is down to 1 second. There's still a bit of inconsistency, as I'll see the average execution time as low as 0.2 seconds. I'd like to see 0.2 seconds as the average, but I guess this leaves room for those improvements that I hope to be getting to in the near future.
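To show the idea of caching compiled code, here is a toy version in Python. It is only an illustration of the concept, not how eAccelerator itself is implemented; `compile_cached` and the in-process dictionary are invented for this sketch, whereas a real opcode cache has to share its cache across web server processes.

```python
# Toy compiled-code cache: compile each script once, reuse it afterwards.
_cache = {}  # (script name, source text) -> compiled code object


def compile_cached(name, source):
    """Return a compiled code object, compiling only on the first request."""
    key = (name, source)
    code = _cache.get(key)
    if code is None:
        # Cache miss: pay the compile cost once and remember the result.
        code = compile(source, name, "exec")
        _cache[key] = code
    return code


# Usage: the second call skips the compile step and returns the same object.
first = compile_cached("controller.py", "answer = 6 * 7")
second = compile_cached("controller.py", "answer = 6 * 7")
namespace = {}
exec(second, namespace)
```

Keying on the source text means an edited script simply misses the cache and gets recompiled, which is the same guarantee a real cache provides by watching for file changes.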
So is it Yay time? Yes, my friends, it is now Yay time. The servers are purple.