Scaling To The Extreme

Daniel Marashlian

Co-founder & CTO at Portfolium

Problem

"I was the CTO of Plixi (formerly TweetPhoto), the company that let people post photos on Twitter before Twitter took that over itself in 2011. As Twitter scaled to become a top-10 site in the world, we scaled with it, since we provided the media for it through third-party developers. We grew to a top-180 site in the world, which meant a lot of traffic. As we approached about 100 million API calls a day, we started getting frequent outages on the web servers. We needed to understand what was going on."

Actions taken

"There was a period of craziness when nobody slept for a month. Our lead architect, our senior developer and I tried to figure out what was wrong, and we even brought in an expert in server management to help. As a temporary patch, we started automatically restarting the IIS web server every hour to free up the web servers. We knew it was a terrible patch, but it covered part of the problem while we looked for a longer-term solution. We then added hundreds of watchers in Windows Server to monitor everything the operating system was doing. We found that when spikes of traffic came in and the disk tried to write to the database, the writes would lock the reads. Each lock lasted only microseconds, but because we had 100 million API calls a day, this continuous cause-and-effect chain eventually led to everything blocking up. Once we figured this out, we called our Rackspace rep and explained the problem. They gave us a few options: there were a few devices they recommended, and they offered to implement a giant SAN cluster, but that would have been an expensive nightmare to manage. After the call, however, I got a recommendation from one of my colleagues. He suggested talking to a new company called Fusion-io, which at the time was making the fastest drives in the world. We called them and worked with Rackspace to get four of the servers shipped over to Texas and installed. It took us just ten minutes to install them, and as soon as we had moved our data over to the new drives, everything was suddenly smooth. From the technology side of the business, it really saved the company."
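
To make the scale concrete, here is a rough back-of-the-envelope sketch in Python. It is not Plixi's actual analysis. It turns 100 million calls a day into an average per-second rate (about 1,157 calls per second) and then uses a simple single-queue (M/M/1) model to show how a very short stall on a shared disk can snowball during a traffic spike. The 200-microsecond stall and the spike multipliers are assumptions chosen purely for illustration, as are the function names.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(calls_per_day: float) -> float:
    """Convert a daily call count into an average per-second rate."""
    return calls_per_day / SECONDS_PER_DAY


def mean_queue_wait(arrival_rate: float, service_time: float) -> float:
    """Mean time a request waits in an M/M/1 queue before being served.

    arrival_rate is in requests per second and service_time is in seconds.
    Returns infinity once the resource is saturated (utilization >= 1).
    """
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")
    return utilization * service_time / (1.0 - utilization)


if __name__ == "__main__":
    average = average_rate_per_second(100_000_000)
    print(f"average load: {average:.0f} calls/s")

    # Hypothetical: every call holds the disk for 200 microseconds.
    stall_seconds = 200e-6

    # Hypothetical spike sizes, expressed as multiples of the average rate.
    for multiplier in (1, 2, 4, 5):
        rate = average * multiplier
        wait = mean_queue_wait(rate, stall_seconds)
        print(f"{multiplier}x spike: mean wait {wait * 1e6:.0f} microseconds")
```

Under these assumed numbers the wait stays small at average load but grows sharply as a spike pushes the disk toward saturation, which matches the pattern described above: each lock is tiny, yet under sustained load requests pile up until everything blocks.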

Lessons learned

"A lot of people will build websites with tons of page views; however, it's generally not as much traffic as Plixi experienced. When you say scale, I always think about that company. We only had eight people in the whole company, but we had a huge amount of traffic and 45 million users to support. To fix the problem, it was important for me to be heavily involved with both the hardware and the software so that I understood the whole stack. Team collaboration, not only internally but also with our vendor, Rackspace, was key as well."

