Phish.net Technology Report

Thursday 01/05/2012 by sethadam1

PHISH.NET TECHNOLOGY REPORT

sethadam1 From time to time, people ask me to post something about the technology behind Phish.net. I'd started writing this piece several times, but ultimately, never completed it. Given the problems that Phish.net experienced in the last few days, it seemed the right time to share not only a little bit about how we can host a site that sustains an onslaught of traffic during shows, but also powers our Phish.net API and, in turn, sites around the internet.

As Phish.net has grown in breadth and size, the codebase has ballooned into a major project. There are thousands of users - often tens of thousands of concurrent users - during shows. Maintaining a site of this size can be unwieldy. There are currently over a quarter of a million lines of PHP and CSS, over 700,000 lines of code altogether that make up Phish.net and its services (including the API, the administrative backend, the mobile site, etc). Read on for the painfully technical details.

In designing the site, which relies heavily on dynamically built pages assembling data from a database, it's important to constantly be thinking about efficiency. Poorly written database queries, inefficient code, memory intensive scripts, even slow-to-parse client-side scripting can affect the user experience dramatically. Because there was so much code, I decided to write myself a so-called "database API" that would allow me to talk to the database in a consistent and measured way. To be fair, there are already PHP libraries that do this, however, writing my own lightweight one meant there was no unnecessary overhead or initialization. So, to do this, I wrote a set of functions I've started to call "generics," because these core functions can be altered if you change database platforms, and, for the most part, you have a cross-platform codebase[1]. These wrapper functions control database access which accomplishes a few things: they reduce the code needed in any given function library, they allow for a database-neutral codebase, and they make writing code much faster.

Let's start by looking at the core generics for a MySQL database. These functions are the basis for much of the Phish.net procedural code.

The generics allow us to write procedure code libraries that are introduced on demand (e.g. at the top of a script: include $libs.'setlist.lib.php';) . These functions are specific and fast. So, for example, a function might be as simple as this:

function get_user($userid) { return _dbrow("SELECT userid,username,email,role,whatever,else FROM users WHERE userid='".intval($userid)."'"); }

We also rely heavily on caching. So much of our data is static. As a result, we cache off information frequently. We do so using one of two methods.

Method 1: Time-based caching We build shell scripts that are executed via cron jobs and use links to request certain pages and write them to a cache. This cache is the only thing used to build live data. We use this for the front page, the song list, recent setlists, and most of our public API methods. Method 2: Request-based caching When a user requests a resource that isn't heavily requested, such as a user-specific RSS feed, any archived setlist, or the forum scoreboard, the page checks the last modification date of the cache. If the cache is stale, the requesting user reloads it. The page is wrapped in ob_start(), the HTML is written to the output buffer, then it's captured and cached to a file. Lastly, the buffer is flushed to the screen and output buffering is disabled via ob_end_flush().

Caching reduces load on a server significantly when you use it for query intensive pages.

Another trick is to tune your SQL queries. On one site, I was able to reduce the number of db queries from 70+ to just 3. However, the three queries took longer to run than the 70. Eventually, I got it down to 12. Sometimes, using JOINs can reduce the number of queries needed, but make sure your queries modifications are an improvement to the baseline. Furthermore, use Session data for mostly static info (or something like memcached, if possible).

When you're looking for speed, make sure your javascript and css are well written. Libraries like jQuery, MooTools, and Prototype add a lot of overhead. That said, they offer an amazing array of tools. We use jQuery largely because the Sizzle selector library is second to none for page manipulation and it's a pretty common platform. Notice the Every Time Played charts, which now use client-side scripting for things like sorting, moving a ton of work off of the server. Off-loading that processing onto the client brings us enough benefit to justify loading another 100k of a jQuery extension.

Loading element dynamically via AJAX can make a significant difference in response time. Even though a heavy page may take 15 seconds to load, you might be able to load in 1 second and dynamically drop in the heavy portion. We've experimented with this in many different areas of Phish.net.

Let's Talk about Yesterday

Starting on January 3, we started to see a slow down in performance. We hadn't yet introduced our newest feature, Achievements, so it wasn't that, and it was post-NYE, so it wasn't a spike in traffic. The load on the server was hovering at 8.

Upon closer examination, we saw major issues with database performance. The whole thing was slowing down in a big way. It's important to note that it was performance that was hit, but not operation of the service. So, there was no corrupt data, there was nothing broken, there wasn't anything "wrong" per se, just that the database system couldn't keep up. We were issuing something like 80+ queries per second to the database, mostly from the forum. So parts of the site that didn't talk to the DB were fine. The parts that used our central discussion table, which is most of the site, were dying.

SQL We converted the database tables from MyISAM to InnoDB. MyISAM locks a table while it's updating. The discussion table, which holds forum comments, reviews, blog comments, news comments, etc, has well over 250,000 records. So, when you update a record (like, say a thread count in the forum, or when you edit a comment), the whole table is locked as that update occurs, and as such, the pending queries back up as they wait for that UPDATE query to finish. InnoDB locks only the rows being edited. The trade-off - because, of course, there has to be one - is that the database size is much larger and requires more CPU power to work well. We've got the resources, so we made the jump. We converted all tables to InnoDB and set it as the default for new tables.

I trimmed the slow query log. We log db queries that take a long time to run. I renamed it, archived it, and re-touched the file, then we looked at the last logged queries for starting points.

We turned on MySQL's internal query cache. Wow. In the first two minutes, we had over 37,000 hits to the cache.

JSON Since the forum is so active and is accessing mostly the same data, I cached the front page of the forum, and the AJAX-based reload of the forum; this cache builds every 30 seconds. It's not an HTML cache, like most "cached" sites, it's fetching the data and storing the results of the query in JSON. As a result, ignore lists are still respected. That reduces the queries issues by a lot. I also reduced the AJAX reload frequency to every 45 seconds.

The cache is reloaded as such: a unique filename is built per user. The cache is loaded into this unique file, and then PHP's rename() does the work to replace the cache. Why? Because *nix based system rename files atomically. As such, we don't have file lock issues with multiple people having collisions as they try to reload the same cache file. Yes, it might get reloaded a few times in a second, but then it's cached properly for the next 30 seconds.

We reconfigured MySQL to use new InnoDB configuration directives. You probably don't want those details, but it was pretty nerd-awesome to see how you shape them for your specific resources.

We removed a database field index. Now, this is interesting: the query to get a single forum thread was running 8-50 seconds. 8-50 SECONDS! It used to run in about .3 seconds. No idea how this suddenly got slow. Probably just the size of the database, the number of records, reaching critical mass. So we actually removed a database index on the field that tells us the type of comment (forum, review, blog comment, etc), which made a HUGE difference.

We applied a ton of security updates and rebooted the server. When in doubt, a reboot does a lot of things, including restarting all daemons and caching.

This was a big problem, for sure, but the result is that the configuration is "right" for our size and use. And the server is flying now.

[1] Not really, given differences between SQL implementations, but it's true that in many cases, things run unaltered.

Update 1/5/12 4:45 PM: I'd like to also acknowledge for the amazing team at Endpoint for their support. Their expertise was critical in delivering the solutions, and hey, they're proud to host us.

If you liked this blog post, one way you could "like" it is to make a donation to The Mockingbird Foundation, the sponsor of Phish.net. Support music education for children, and you just might change the world.

35 comments - Link: http://phi.sh/b/4f05e2d1

Comments

2012-01-05 3:56 pm, comment by kipconner

Nerdy Tech Talk! Yes, Please! I noticed the slowdown the other morning and just figured you were uploading code. I have been using the site like crazy for the last few days to update my spread sheet so it was clearly my fault!

Your work doesn't go unnoticed!

Score: 1

2012-01-05 3:56 pm, comment by travelin_light

I have absolutely no idea what any of that means, but thanks!

Score: 6

2012-01-05 4:01 pm, comment by tripsforjoeg

Methinks I should've been listening in all those computer science classes instead of doing friends calc 2 homework for cash.

Score: 2

2012-01-05 4:04 pm, comment by switz

This is a great read, for both the technically inclined and the not so technically inclined. Thank you, Adam, for this window into Phish.net's server!

Score: 0

2012-01-05 4:06 pm, comment by RunawayJim

As if there wasn't enough nerdery on this site.

Score: 4

2012-01-05 4:12 pm, comment by MainePhishingSince88

wow. thats serious. thanks sir.

Score: 0

2012-01-05 4:22 pm, comment by Croat

Nice work and I have never thought about the massive scalability of a site like this. I guess it's naive of me to think that only a small esoteric group of fans use this site...

I recruit for a well-known "search" company. I would love to see a technical interview question regarding the design of phish.net considering it's highly fluctuating user base.

Score: 0

2012-01-05 4:39 pm, comment by fessmiq

Thank You for all you do!

Score: 0

2012-01-05 4:44 pm, comment by PiNT23

As an 'oldie' of the techno-nerd universe... next time, reboot first!

Score: 1

2012-01-05 5:08 pm, comment by YorkvilleBeerLover

very cool.

I'm a Technical Recruiter - let me find you your next gig!
mikedolan@hotmail.com

Score: 0

2012-01-05 5:13 pm, comment by ProfJibboo

Thank you for everything.

Score: 0

2012-01-05 5:21 pm, comment by frantic0blivion

thank you to all involved in maintaining the phish net. this site provides an invaluable service/resource & the phish community at large is better for it. be safe in the offseason, everyone & we'll see you all on lot in a few months. peace

Score: 1

2012-01-05 5:26 pm, comment by sausagemahoney

Adam, you rock beyond words. Thanks for everything you do for this site.

Score: 0

2012-01-05 5:50 pm, comment by brando

Well done Adam.
I heard about the slow down and knew you were on it.
A great read for sure. The different caching steps I find the most interesting, as well as the new database structure of InnoDB. Now that you write this, it makes so much sense. Never really thought about how a table locks while updating a record with many lines of data. Cool how InnoDB only locks the record and not the whole table.

Great to hear how Endpoint was so successful with their support as well !!

As we are beginning a full overhaul of CashorTrade.org this January, this article raises some interesting points of focus for us. With so many lines of code and so many records in a database...using the proper techniques become imperative for the overall speed and delivery of content. I am hoping to use caching methods and JSON where needed. We'll see. There is a lot to learn in the process.

Thanks for all you do...and all you have done Adam. We sincerely appreciate it.
Brando and Dusty

Score: 1

2012-01-05 5:53 pm, comment by TheBag

Score: 1

2012-01-05 6:12 pm, comment by Dundun

A wise man once said computers are all magic-- powered by smoke and sparks. Once the smoke and sparks leave the computer, it's dead.

That said, great that you kept the magic in the system and the site is back running sprints. I've dealt with a couple threshold/locking issues in some of my production code and they're never easy to deal with (although reboots tend to do wonders). Sounds like you found the root cause pretty quick though, impressive.

I never knew there was a .net API. Maybe I'll cook something up for Android in the next couple weeks (or is there already something in the works?)

@YorkvilleBeerLover said:

very cool.

I'm a Technical Recruiter - let me find you your next gig!
mikedolan@hotmail.com

A technical recruiter with a hotmail address? That's a -1, buddy.

Joking... but, seriously...

Score: 2

2012-01-05 8:11 pm, comment by showhe

Loved the insight! Thanks!

Score: 1

2012-01-05 8:28 pm, comment by HARRYHOOD213

You lost me at "From time to time".......

Score: 1

2012-01-05 9:32 pm, comment by Phlat_Brim_Kid

Thanks @sethadam1, for fixing the issue. Also thanks for that explanation. But the only part i understood was that you jumped outta a phone booth with a S on your chest!

Score: 0

2012-01-05 11:08 pm, comment by waxbanks

AWESOME

Score: 0

2012-01-05 11:39 pm, comment by forbin1

no idea what all that means...but thank you to everyone who does a great job with this site..

Score: 0

2012-01-06 1:36 am, comment by ADAWGWYO

Very well done sir. You are a true asset and professional.

Score: 0

2012-01-06 8:51 am, comment by johnnyd

WOW!

Score: 0

2012-01-06 9:21 am, comment by jackl

Huzzah for Adam, our resident IT guru who has made Phish.net 3.0 what it is along with the amazing content of our site team contributors and users. Truly the "third edition" of The Phish Companion raised like the phoenix on the Internet.

I'd recommend this blog post be linked to our front page menu bar "Site > Technology" for future reference and updated as required.

"Any sufficiently advanced technology is indistinguishable from magic."

--Arthur C. Clarke, "Profiles of The Future", 1961 (Clarke's third law)
English physicist & science fiction author (1917 - )

Score: 0

2012-01-06 11:00 am, comment by YorkvilleBeerLover

@dundun said:

A wise man once said computers are all magic-- powered by smoke and sparks. Once the smoke and sparks leave the computer, it's dead. That said, great that you kept the magic in the system and the site is back running sprints. I've dealt with a couple threshold/locking issues in some of my production code and they're never easy to deal with (although reboots tend to do wonders). Sounds like you found the root cause pretty quick though, impressive. I never knew there was a .net API. Maybe I'll cook something up for Android in the next couple weeks (or is there already something in the works?) @YorkvilleBeerLover said:
very cool. I'm a Technical Recruiter - let me find you your next gig! mikedolan@hotmail.com
A technical recruiter with a hotmail address? That's a -1, buddy. Joking... but, seriously...

thats the address I use for phish folks

Score: 0

2012-01-06 1:26 pm, comment by Robert_Zimmerman

The files are INSIDE the computer...

Score: 0

2012-01-06 7:54 pm, comment by barefootbob

Gredo Shot First.

[/img][/img]

Score: 2

2012-01-07 1:29 pm, comment by pemulis

As a sysadmin and phish nerd, I think this is my favorite thing ever posted on .net.

Thank you so, so much for your work on this site.

Score: 0

2012-01-08 11:03 pm, comment by kayakush22

Would love to help if u guys are having problems. I have worked alot with PJSON and ajax if you need any assistance

Score: 0

2012-01-10 4:43 am, comment by reydempto

As a web designer, this article was fascinating to me! Thanks for giving me a little insight to the epic lines of code within the dawtnet

Score: 0

2012-01-10 5:06 pm, comment by King_Williamson

Even in the industrial industry, in which I work, a reboot (or cold-cock, which is an industry accepted term) fixes 70-80 percent of comm issues. I'm asking a PLC to execute code based on I/O conditons. To throw us for a loop, sometimes the PLC decides to tell US to fuck off; I'm not gonna do what you tell me.

I think by approaching the machine as somewhat human, when I tell IT to fuck off, in some weird way, I feel better. It's like playing God with the disconnect handle.

I need a vacation.

Score: 0

2012-01-12 11:51 pm, comment by phishybanjo

tl dr

Score: 0

2012-01-14 8:29 pm, comment by corybill

UMMMMMM....why in the hell are you locking down an entire table on a database request. You should be locking a single cell or single tuple on updates, inserts, & deletes and only the tuples you are retrieving your SELECT statements. Connection Pooling is used and databases use an 'all or none' mentality with updates. The architecture used of locking an entire table unnecessarily slows down client server response time and seems to be architecturally inferior.

Is there a reason why you are doing this? I'd be really interested to hear because I'm sure you have a good reason, as all programmers usually do.

Thanks for an amazing site!

Cory

Score: 0

2012-01-14 8:36 pm, comment by corybill

Sorry, didn't mean to be so negative at the beginning of that post. Love your site guys. Thanks so much.

Cory

Score: 0

2012-01-17 1:03 am, comment by thebuttlerhitit

Thank you for your efforts.

Score: 0