From time to time, people ask me to post something about the technology behind Phish.net. I'd started writing this piece several times but never completed it. Given the problems Phish.net experienced over the last few days, it seemed the right time to share not only a little bit about how we host a site that sustains an onslaught of traffic during shows, but also how that same setup powers the Phish.net API and, in turn, sites around the internet.
As Phish.net has grown in breadth and size, the codebase has ballooned into a major project. There are thousands of users - often tens of thousands of concurrent users - during shows, and maintaining a site of this size can be unwieldy. There are currently over a quarter of a million lines of PHP and CSS, and over 700,000 lines of code altogether, making up Phish.net and its services (including the API, the administrative backend, the mobile site, etc.). Read on for the painfully technical details.
In designing the site, which relies heavily on dynamically built pages assembling data from a database, it's important to constantly be thinking about efficiency. Poorly written database queries, inefficient code, memory-intensive scripts, even slow-to-parse client-side scripting can affect the user experience dramatically. Because there was so much code, I decided to write myself a so-called "database API" that would allow me to talk to the database in a consistent and measured way. To be fair, there are already PHP libraries that do this; however, writing my own lightweight one meant there was no unnecessary overhead or initialization. So, to do this, I wrote a set of functions I've started to call "generics": if you change database platforms, only these core functions need to be altered, and, for the most part, you have a cross-platform codebase[1]. These wrapper functions control database access, which accomplishes a few things: they reduce the code needed in any given function library, they allow for a database-neutral codebase, and they make writing code much faster.
Let's start by looking at the core generics for a MySQL database. These functions are the basis for much of the Phish.net procedural code.
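The generics aren't reproduced in full here, but a minimal sketch of the idea, using the mysql_* API that was current at the time, might look like the following. Only _dbrow() is confirmed by the example further down; _db(), _dbrows(), and _dbone() are illustrative names.

```php
// Minimal sketch of the "generics" concept -- thin wrappers that are the
// only place in the codebase that touches the database API directly.
// Only _dbrow() appears in the post; the other names are guesses.

// Run a query and return the raw result (logging failures).
function _db($sql) {
    $result = mysql_query($sql);
    if ($result === false) {
        error_log("Query failed: " . mysql_error() . " -- " . $sql);
    }
    return $result;
}

// Return the first row of a result as an associative array.
function _dbrow($sql) {
    $result = _db($sql);
    return $result ? mysql_fetch_assoc($result) : false;
}

// Return all rows as an array of associative arrays.
function _dbrows($sql) {
    $rows = array();
    $result = _db($sql);
    while ($result && ($row = mysql_fetch_assoc($result))) {
        $rows[] = $row;
    }
    return $rows;
}

// Return a single scalar: the first column of the first row.
function _dbone($sql) {
    $row = _dbrow($sql);
    return $row ? array_shift($row) : false;
}
```

The payoff is that switching database platforms means rewriting these few wrappers, not every call site.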
The generics allow us to write procedural code libraries that are included on demand (e.g., at the top of a script: include $libs.'setlist.lib.php';). These functions are specific and fast. So, for example, a function might be as simple as this:
```php
function get_user($userid) {
    return _dbrow("SELECT userid,username,email,role,whatever,else FROM users WHERE userid='".intval($userid)."'");
}
```

We also rely heavily on caching. So much of our data is static. As a result, we cache off information frequently. We do so using one of two methods.
Method 1: Time-based caching. We build shell scripts, executed via cron jobs, that use links (the text-mode browser) to request certain pages and write them to a cache. This cache is the only thing used to build live data. We use this for the front page, the song list, recent setlists, and most of our public API methods.

Method 2: Request-based caching. When a user requests a resource that isn't heavily trafficked, such as a user-specific RSS feed, an archived setlist, or the forum scoreboard, the page checks the last modification date of the cache. If the cache is stale, the requesting user reloads it: the page is wrapped in ob_start(), the HTML is written to the output buffer, then captured and cached to a file. Lastly, the buffer is flushed to the screen and output buffering is disabled via ob_end_flush().

Caching reduces load on a server significantly when you use it for query-intensive pages.
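A minimal sketch of that request-based pattern, with a hypothetical path, TTL, and page-building function standing in for the real ones:

```php
// Request-based caching sketch. The path, the TTL, and the
// build_scoreboard_page() function are hypothetical stand-ins.
$cachefile = '/var/cache/phishnet/scoreboard.html';
$ttl = 3600; // seconds before the cache is considered stale

if (file_exists($cachefile) && (time() - filemtime($cachefile)) < $ttl) {
    // Fresh cache: serve it directly and skip all the queries.
    readfile($cachefile);
    exit;
}

// Stale cache: this requesting user rebuilds it.
ob_start();
build_scoreboard_page(); // hypothetical: echoes the full HTML

// Capture the buffered HTML and write it out to the cache file...
file_put_contents($cachefile, ob_get_contents());

// ...then flush the buffer to the screen and disable buffering.
ob_end_flush();
```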
Another trick is to tune your SQL queries. On one site, I was able to reduce the number of db queries from 70+ to just 3; however, the three queries took longer to run than the original 70. Eventually, I got it down to 12. Sometimes using JOINs can reduce the number of queries needed, but make sure your query modifications are actually an improvement over the baseline. Furthermore, use session data for mostly static info (or something like memcached, if possible).
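A hypothetical before-and-after (not actual Phish.net code) of the JOIN technique, using the generics sketched earlier:

```php
// Before: 26 queries for 25 threads -- one list query plus one
// author lookup per row (the classic N+1 pattern).
$threads = _dbrows("SELECT threadid, title, userid FROM threads ORDER BY lastpost DESC LIMIT 25");
foreach ($threads as &$t) {
    $t['author'] = _dbone("SELECT username FROM users WHERE userid=" . intval($t['userid']));
}
unset($t);

// After: one JOIN returns the same data in a single query.
$threads = _dbrows(
    "SELECT t.threadid, t.title, u.username AS author
     FROM threads t
     JOIN users u ON u.userid = t.userid
     ORDER BY t.lastpost DESC
     LIMIT 25"
);
```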
When you're looking for speed, make sure your JavaScript and CSS are well written. Libraries like jQuery, MooTools, and Prototype add a lot of overhead; that said, they offer an amazing array of tools. We use jQuery largely because the Sizzle selector engine is second to none for page manipulation and it's a pretty common platform. Notice the Every Time Played charts, which now use client-side scripting for things like sorting, moving a ton of work off of the server. Off-loading that processing onto the client brings us enough benefit to justify loading another 100k of a jQuery extension.
Loading elements dynamically via AJAX can also make a significant difference in response time. Even if a heavy page takes 15 seconds to build, you might be able to deliver the rest of the page in 1 second and dynamically drop in the heavy portion afterward. We've experimented with this in many different areas of Phish.net.
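On the server side, the pattern is just to give the heavy portion its own URL that returns a fragment; a hypothetical sketch (the filename and function are ours, not the site's):

```php
// etp.php -- hypothetical endpoint returning only the slow-to-build
// fragment. The lightweight page shell renders immediately; client-side
// script then fetches this URL and drops the result into a placeholder.
include $libs . 'setlist.lib.php';

header('Content-Type: text/html; charset=utf-8');
echo build_every_time_played_chart(); // hypothetical heavy HTML builder
```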
Let's Talk About Yesterday

Starting on January 3, we began to see a slowdown in performance. We hadn't yet introduced our newest feature, Achievements, so it wasn't that, and it was post-NYE, so it wasn't a spike in traffic. The load on the server was hovering at 8.
Upon closer examination, we saw major issues with database performance. The whole thing was slowing down in a big way. It's important to note that performance was hit, but not the operation of the service: there was no corrupt data, nothing was broken, nothing was "wrong" per se; the database system simply couldn't keep up. We were issuing something like 80+ queries per second to the database, mostly from the forum. Parts of the site that didn't talk to the DB were fine, but the parts that used our central discussion table, which is most of the site, were dying.
We converted the database tables from MyISAM to InnoDB. MyISAM locks a whole table while it's updating. The discussion table, which holds forum comments, reviews, blog comments, news comments, etc., has well over 250,000 records. So when you update a record (say, a thread count in the forum, or an edited comment), the entire table is locked while that update occurs, and pending queries back up as they wait for the UPDATE to finish. InnoDB locks only the rows being edited. The trade-off - because, of course, there has to be one - is that the database is much larger on disk and requires more CPU power to work well. We've got the resources, so we made the jump: we converted all tables to InnoDB and set it as the default for new tables.
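For reference, the conversion itself is one statement per table (the table name here is illustrative):

```sql
-- Convert an existing MyISAM table in place (name illustrative).
ALTER TABLE discussion ENGINE=InnoDB;

-- Making InnoDB the default for new tables is a my.cnf directive:
--   default-storage-engine = InnoDB
```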
I trimmed the slow query log. We log db queries that take a long time to run; I renamed the log, archived it, and re-touched the file, and then we used the last queries logged as starting points.
We turned on MySQL's internal query cache. Wow. In the first two minutes, we had over 37,000 hits to the cache.
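If you want to try it yourself, the relevant knobs look like this (the values are illustrative, not ours):

```sql
-- Enable the query cache (sizes illustrative; my.cnf is the durable
-- place for these, though older MySQL versions accept them live).
SET GLOBAL query_cache_type = 1;
SET GLOBAL query_cache_size = 67108864;  -- 64 MB

-- The hit counter referenced above:
SHOW STATUS LIKE 'Qcache_hits';
```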
Since the forum is so active and is accessing mostly the same data, I cached the front page of the forum and the AJAX-based reload of the forum; this cache rebuilds every 30 seconds. It's not an HTML cache, like most "cached" sites use: it fetches the data and stores the results of the query as JSON, so per-user filtering such as ignore lists is still respected. That reduces the queries issued by a lot. I also reduced the AJAX reload frequency to every 45 seconds.
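A sketch of that split between cached data and per-user rendering (paths, field names, and render_thread_row() are hypothetical):

```php
// Rebuild side, run every 30 seconds: store query results as JSON,
// not finished HTML. (Written atomically -- see the next sketch.)
$threads = _dbrows(
    "SELECT threadid, userid, title, replies, lastpost
     FROM discussion WHERE type='forum'
     ORDER BY lastpost DESC LIMIT 30"
);
file_put_contents('/var/cache/phishnet/forum.json', json_encode($threads));

// Render side, on every page view: read the cached data, then apply
// per-user filtering -- which is why ignore lists are still respected.
$threads = json_decode(file_get_contents('/var/cache/phishnet/forum.json'), true);
foreach ($threads as $t) {
    if (!in_array($t['userid'], $ignored_userids)) {
        render_thread_row($t); // hypothetical HTML renderer
    }
}
```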
The cache is reloaded as such: a unique filename is built per user, the new cache contents are written to that unique file, and then PHP's rename() does the work of replacing the live cache. Why? Because *nix-based systems rename files atomically, we don't have file-locking issues when multiple people collide trying to reload the same cache file. Yes, it might get rebuilt a few times in a second, but then it's cached properly for the next 30 seconds.
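In code, the pattern looks roughly like this (paths hypothetical):

```php
// Atomic cache replacement via rename(). Write to a per-request temp
// file first -- never directly to the live cache file.
$cachefile = '/var/cache/phishnet/forum.json';
$tmpfile   = $cachefile . '.' . getmypid() . '.' . uniqid();

file_put_contents($tmpfile, json_encode($threads));

// rename() is atomic on *nix (within one filesystem), so readers always
// see either the old cache or the new one -- never a half-written file,
// and no explicit locking is needed.
rename($tmpfile, $cachefile);
```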
We reconfigured MySQL to use new InnoDB configuration directives. You probably don't want those details, but it was pretty nerd-awesome to see how you shape them for your specific resources.
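To give a flavor of what that shaping involves, directives of this kind are the usual levers (values here are illustrative, not our actual configuration):

```ini
# Illustrative InnoDB tuning for my.cnf -- NOT Phish.net's real values,
# which were shaped to that server's RAM and disks.
innodb_buffer_pool_size        = 4G    # cache data and indexes in RAM
innodb_log_file_size           = 256M  # larger redo logs absorb write bursts
innodb_flush_log_at_trx_commit = 2     # trade a little durability for speed
innodb_file_per_table          = 1     # one .ibd file per table
```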
We removed a database field index. Now, this is interesting: the query to fetch a single forum thread was running 8-50 seconds. 8-50 SECONDS! It used to run in about .3 seconds. No idea how it suddenly got slow; probably just the size of the database, the number of records, reaching critical mass. So we actually removed the database index on the field that tells us the type of comment (forum, review, blog comment, etc.), which made a HUGE difference. That field has only a handful of distinct values, and a low-cardinality index like that can steer the optimizer toward a far worse plan than it would pick without it.
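In MySQL terms, the change was roughly this (table and index names illustrative):

```sql
-- Check what plan the optimizer chooses before and after:
EXPLAIN SELECT * FROM discussion WHERE type='forum' AND threadid=12345;

-- Drop the low-cardinality index on the comment-type field:
ALTER TABLE discussion DROP INDEX idx_type;
```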
We applied a ton of security updates and rebooted the server. When in doubt, a reboot does a lot of things, including restarting all daemons and caching layers.
This was a big problem, for sure, but the result is that the configuration is "right" for our size and use. And the server is flying now.
[1] Not really, given differences between SQL implementations, but it's true that in many cases, things run unaltered.
Update 1/5/12 4:45 PM: I'd also like to acknowledge the amazing team at Endpoint for their support. Their expertise was critical in delivering these solutions, and hey, they're proud to host us.
If you liked this blog post, one way you could "like" it is to make a donation to The Mockingbird Foundation, the sponsor of Phish.net. Support music education for children, and you just might change the world.
Your work doesn't go unnoticed!
I recruit for a well-known "search" company. I would love to see a technical interview question about the design of phish.net, considering its highly fluctuating user base.
I'm a Technical Recruiter - let me find you your next gig!
mikedolan@hotmail.com
I heard about the slow down and knew you were on it.
A great read for sure. I find the different caching steps the most interesting, as well as the new InnoDB database structure. Now that you write this, it makes so much sense. I'd never really thought about how a table locks while updating a record when it holds that many rows. Cool how InnoDB only locks the record and not the whole table.
Great to hear that Endpoint was so successful with their support as well!
As we are beginning a full overhaul of CashorTrade.org this January, this article raises some interesting points of focus for us. With so many lines of code and so many records in a database, using the proper techniques becomes imperative for the overall speed and delivery of content. I am hoping to use caching methods and JSON where needed. We'll see. There is a lot to learn in the process.
Thanks for all you do...and all you have done Adam. We sincerely appreciate it.
Brando and Dusty
That said, great that you kept the magic in the system and the site is back running sprints. I've dealt with a couple threshold/locking issues in some of my production code and they're never easy to deal with (although reboots tend to do wonders). Sounds like you found the root cause pretty quick though, impressive.
I never knew there was a .net API. Maybe I'll cook something up for Android in the next couple weeks (or is there already something in the works?)
@YorkvilleBeerLover said: A technical recruiter with a hotmail address? That's a -1, buddy.
Joking... but, seriously...
I'd recommend this blog post be linked to our front page menu bar "Site > Technology" for future reference and updated as required.
"Any sufficiently advanced technology is indistinguishable from magic."
--Arthur C. Clarke, "Profiles of the Future", 1962 (Clarke's third law)
English physicist & science fiction author (1917-2008)
Thank you so, so much for your work on this site.
I think by approaching the machine as somewhat human, when I tell IT to fuck off, in some weird way, I feel better. It's like playing God with the disconnect handle.
I need a vacation.
Is there a reason why you are doing this? I'd be really interested to hear because I'm sure you have a good reason, as all programmers usually do.
Thanks for an amazing site!
Cory