How I designed an SRE Hackathon; wait, what?
Hackathons generally focus on rapid ideation and creation of software by small, sharp teams. Ideathons, on the other hand, bring together people from different backgrounds to brainstorm solutions to predefined problems. The former is about software development; the latter is about problem solving.
Time to pull my hair out!
In my 18 years of TechOps, never have I heard the term “SRE Hackathon”. I’ve heard of AI, Mobility and IoT focused hackathons, and they’ve all been geared towards building software in those specific technology streams. When it comes to SRE, you’re talking about maintaining and instilling reliability in existing software. So how does one go about creating an event for SREs? To complicate matters further, what if the target audience is groups of undergraduate students who haven’t the foggiest about what SRE means? Making matters worse, we had to pull off this first-of-its-kind SRE-themed hackathon within a month while balancing our day jobs, which are, as you would guess, Site Reliability Engineering.
The objective
I decided we needed to first create a purpose for such an event before thinking of how we’d structure it. First, in Sri Lanka, the word “SRE” has been under constant abuse, with various interpretations put forward in the recent past. It has become a trend of sorts, similar to NFTs, with many organizations adopting the SRE moniker as a new label for traditional IT Operations. Second, SRE and IT Operations are seen as “mediocre” jobs in the country, as many prefer to work in Development engagements. I wholeheartedly disagree with this notion and believe SREs must be Engineers who can effect changes to design while appreciating the role of operations; basically, those who can go the extra mile over a regular Developer. Third, due to this lack of understanding, while the country is known for its ingenuity with regard to innovation lab and startup culture, the digital services offered from Sri Lanka are almost always Development focused or BPO-centric. With this in mind, the primary purpose was clear: dispel the false veil cast around SRE and show new entrants that this role is indeed Engineering and not simple administration. This is more laser-focused than the lofty, longer-term goal of putting Sri Lanka on the map for its SRE capability. To get there, we decided to run this event annually, with each edition having a slightly different objective and theme.
The core
With the main objective out of the way, I had to think of how we’d run this at its core. It couldn’t just be an ideathon to solve operational problems, as I needed participants to get a feel for hands-on design and coding. A mere hackathon wouldn’t work either, since it wasn’t a pure software development contest. I had to merge the two and craft challenges that would entail elements of software development as well as operational problem solving. In light of this, I thought up 3 different challenges, each covering a separate theme under SRE:
- Incident response, root cause analysis and debugging
- Toil elimination and automation
- Observability and monitoring
For each, we decided to have goals or milestone markers that teams would need to achieve in order to earn points. Examples include successfully deploying the application container, running a causal analysis and instrumenting the code.
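To make the "instrumenting the code" goal a little more concrete, here is a minimal sketch of what adding basic telemetry to an uninstrumented function can look like. It uses only Python's standard `logging` module, not the tooling from the event, and the `traced` decorator and `lookup_order` function are hypothetical names invented for illustration.

```python
import functools
import logging
import time

logger = logging.getLogger("instrumented")


def traced(func):
    """Wrap func so that every call logs its outcome and duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # logger.exception records the message plus the stack trace.
            logger.exception("%s failed after %.1f ms", func.__name__, elapsed_ms)
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("%s ok in %.1f ms", func.__name__, elapsed_ms)
        return result
    return wrapper


@traced
def lookup_order(order_id):
    """Hypothetical service function standing in for real business logic."""
    if order_id < 0:
        raise ValueError("order_id must be non-negative")
    return {"order_id": order_id, "status": "shipped"}
```

The point of the decorator is that the business logic stays untouched while each call starts emitting a signal, which is roughly the kind of change a team would have to demonstrate to earn such a milestone.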
In each case (except for the 2nd challenge), we built a custom application that had to be deployed, and published its source code on GitHub. The application for challenge 1 had faults intentionally injected into its design and code, with appropriate signals being produced via telemetry. The 3rd challenge had a series of microservices without any logging or tracing incorporated. As for the 2nd challenge, this was a well-curated scenario around SRE team email and alert management: mailboxes were set to receive automated mails simulating end users and monitoring systems. As one of the goals, the team had to build a system to automatically categorize and track these inbound mails.
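As a rough illustration of the categorize-and-track goal in the 2nd challenge, the sketch below sorts inbound mails by simple keyword association and counts them per category. It is not any team's actual solution: the `Mail` type, the category names and the keyword sets are all assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical categories and keyword associations, invented for illustration.
KEYWORDS = {
    "alert": {"alert", "critical", "threshold", "cpu", "down", "timeout"},
    "user_request": {"please", "unable", "login", "password", "access", "help"},
}


@dataclass
class Mail:
    sender: str
    subject: str
    body: str


def categorize(mail: Mail) -> str:
    """Return the category with the most keyword hits, else 'uncategorized'."""
    words = (mail.subject + " " + mail.body).lower().split()
    tokens = [w.strip(".,:;!?()[]") for w in words]
    scores = {
        category: sum(1 for t in tokens if t in keywords)
        for category, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorized"


def track(mails):
    """Count inbound mails per category."""
    return Counter(categorize(m) for m in mails)
```

A real solution would need something sturdier than keyword counting, but even this much turns a noisy shared mailbox into numbers an SRE team can act on.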
We didn’t want teams to waste time procuring environments, so we purchased and provided individual cloud instances for this. Of course, we conducted a trial run of the entire series of challenges ourselves to prove they worked perfectly before wiping the slate clean. Guidance on deploying the applications and details about the challenges were provided via video as well as README files on GitHub.
Execution
Each team could pick their challenge ahead of the start date, as these were announced in advance. We also had to run the hackathon remotely due to COVID-19 concerns, so we used a cohesive virtual hackathon platform (Hubilo in this instance) rather than a nuts and bolts setup of Slack, Zoom, etc. On the day of the event, we replayed details of the challenges and the goals to be achieved.
We gave all teams an opportunity to present their approach and replay their understanding of the challenges picked before starting on the solution. As required, the judging panel provided advice and assistance to guide them.
After this, each team had approximately 6 hours to complete the selected challenge. Once done, they had to check in their code and present to the judging panel. The judges inspected the design, code and function, and questioned the teams on different aspects to validate their understanding. Points were awarded based on a pre-defined scoring matrix.
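For readers wondering what a pre-defined scoring matrix might look like in practice, here is a minimal sketch. The goal names and point values are hypothetical and chosen only for illustration; they are not the matrix we used.

```python
# Hypothetical goals and point values; not the actual matrix from the event.
SCORING_MATRIX = {
    "deploy_container": 10,
    "causal_analysis": 20,
    "instrument_code": 20,
}


def score_team(achieved_goals):
    """Sum the points for each distinct goal a team achieved."""
    goals = set(achieved_goals)  # a goal earns points only once
    unknown = goals - SCORING_MATRIX.keys()
    if unknown:
        raise ValueError(f"Goals not in the scoring matrix: {sorted(unknown)}")
    return sum(SCORING_MATRIX[goal] for goal in goals)
```

Fixing the matrix before the event keeps judging consistent across teams, since every judge awards the same points for the same milestone.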
Of importance is that the judging panel (made up of the SMEs who built the challenges) provided guidance in each team’s virtual room throughout the event. This ranged from clarifications to proactive guidance and course correction as required. We took turns in each of these virtual rooms so the guidance was balanced and the teams felt supported. This also enabled us to gauge the performance of each participant and team well ahead of the final judging session. After all, we wanted this to be a learning experience rather than being focused only on the end product delivered (I told you this wasn’t a regular hackathon).
Outcomes
While we encourage the teams to continue building on what was developed during the hackathon, we also intend to work with them to enhance it further once they’ve joined our organization. More important than this, however, was the learning outcome for the participants. In a nutshell, I believe the experience helped them learn the following technical aspects with hands-on practice:
- Systems design
- GitHub usage
- Cloud instances
- Environment setup
- Deployment with Docker containers
- Troubleshooting and diagnosis via logs
- Debugging code
- Distributed tracing with OpenTelemetry
- APM with SigNoz
- Toil identification
- NLP and word associations
- Service level monitoring
- Identifying SLIs
- Security vulnerability diagnostics
- PHP and Java development
- Microservices and monoliths
- Application testing
In addition to the above, they also experienced many non-tech elements that are key to working in the industry:
- Leadership and team structure
- Coordination
- Time management
- Problem / requirement comprehension and elaboration
- Teamwork in a remote setting
- Communication
- Corporate presentation
In closing
I believe any hackathon should be curated with a clear intent that serves the betterment of the people rather than mere organizational goals. This was clearly one such event and I’m glad I was given the opportunity to orchestrate it. As tiring as it was, I thoroughly enjoyed every moment of design and execution. The adrenaline boost during the event especially was pretty awesome. Hey, it’s no wonder I love SRE all the same!
My team and I have successfully trail-blazed Application Support in Sri Lanka through our work over the years. To see another pioneering achievement in the form of the first SRE-themed hackathon in the country? That’s mind-blowing. Await the second run in 2022; I’ve got a few more fancy ideas up my sleeve!