WHAT YOU'LL DO
Platform NOC (Network Operations Center) Engineers (PNEs) are the first line of defense in ensuring the health and availability of the organization’s platform and systems. As part of the global triage team, they actively monitor system performance, respond to alerts, and execute runbooks, SOPs (Standard Operating Procedures), and MOPs (Maintenance Operating Procedures) to address operational issues.
This role requires a strong focus on maintaining uptime and reliability, collaborating with engineers to escalate complex issues, and contributing to the continuous improvement of operational processes. It is ideal for someone with foundational experience in Kubernetes-based systems and a passion for reliability and operational excellence.
Key Differentiators from Other Engineering Roles:
- Focus on Execution: Platform NOC Engineers are hands-on operators, focused on monitoring, incident response, and executing predefined procedures. They are not responsible for designing systems but work closely with SREs and Platform Engineers to ensure the stability of those systems
- Pathway for Growth: This role provides exposure to the organization’s systems and operations, serving as a stepping stone to more advanced roles in platform engineering or SRE
This position is perfect for those starting their journey in cloud and platform operations, offering the opportunity to gain experience with cutting-edge systems and tools while contributing directly to the organization’s reliability and performance.
Responsibilities:
- Active System Monitoring
  - Use monitoring tools (e.g., Datadog, Prometheus, or similar) to observe the health of platform systems and services continuously
  - Proactively identify and respond to performance anomalies, outages, or unusual system behavior
  - Maintain awareness of ongoing incidents and collaborate with relevant teams to ensure timely resolution
- Incident Response and Triage
  - Act as the first responder to system alerts, determining the severity and scope of issues
  - Execute predefined runbooks, SOPs, and MOPs to mitigate incidents and restore services
  - When incidents exceed the scope of triage procedures, escalate issues to appropriate engineering teams (e.g., SREs or Platform Engineers)
- Operational Procedures
  - Follow and improve operational processes for incident management, system health checks, and routine maintenance tasks
  - Maintain and update runbooks, ensuring accuracy and relevance to current systems and practices
  - Participate in post-incident reviews to improve documentation and operational readiness
- Collaboration and Communication
  - Provide clear, concise communication during incidents, ensuring stakeholders know the status and progress of the resolution
  - Collaborate with SREs, Platform Engineers, and other teams to enhance monitoring, alerting, and operational tools
  - Actively participate in training sessions to stay current on new systems and tools introduced by engineering teams
- Continuous Improvement
  - Identify monitoring, documentation, and procedure gaps and suggest improvements to enhance efficiency and effectiveness
  - Assist in testing new runbooks, tools, and processes to improve incident response times
  - Contribute to the automation of routine tasks to reduce manual toil
WHO YOU ARE
Experience:
- 1-3 years of experience in technical operations, system administration, or entry-level cloud engineering roles
- Familiarity with cloud platforms (AWS, GCP, Azure), Kubernetes, and basic compute, storage, and networking concepts
- Experience with monitoring and alerting tools (e.g., Datadog, Prometheus, Grafana) is a plus
Skills:
- Strong troubleshooting and problem-solving skills, with the ability to follow processes and escalate appropriately
- Proficiency in scripting or automation languages (e.g., Python, Bash) is a bonus
- Familiarity with incident management processes and ITIL practices
Mindset:
- Detail-oriented and committed to maintaining system health and uptime
- Eager to learn and grow, with a passion for operational excellence
- Collaborative and communicative, able to work effectively in a global, distributed team