Platform NOC Engineer

Published 11.02.2025 | Expires 05.04.2025

Job description

WHAT YOU'LL DO

Platform NOC (Network Operations Center) Engineers (PNEs) are the first line of defense in ensuring the health and availability of the organization’s platform and systems. As part of the global triage team, they actively monitor system performance, respond to alerts, and execute runbooks, SOPs (Standard Operating Procedures), and MOPs (Maintenance Operating Procedures) to address operational issues.

This role requires a strong focus on maintaining uptime and reliability, collaborating with engineers to escalate complex issues, and contributing to the continuous improvement of operational processes. It is ideal for someone with foundational experience in Kubernetes-based systems and a passion for reliability and operational excellence.

Key Differentiators from Other Engineering Roles:

Focus on Execution: Platform NOC Engineers are hands-on operators, focused on monitoring, incident response, and executing predefined procedures. They are not responsible for designing systems but work closely with SREs and Platform Engineers to ensure the stability of those systems.
Pathway for Growth: This role provides exposure to the organization’s systems and operations, serving as a stepping stone to more advanced roles in platform engineering or SRE.

This position is perfect for those starting their journey in cloud and platform operations, offering the opportunity to gain experience with cutting-edge systems and tools while contributing directly to the organization’s reliability and performance.

Responsibilities:

Active System Monitoring
- Use monitoring tools (e.g., Datadog, Prometheus, or similar) to observe the health of platform systems and services continuously
- Proactively identify and respond to performance anomalies, outages, or unusual system behavior
- Maintain awareness of ongoing incidents and collaborate with relevant teams to ensure timely resolution
Incident Response and Triage
- Act as the first responder to system alerts, determining the severity and scope of issues
- Execute predefined runbooks, SOPs, and MOPs to mitigate incidents and restore services
- When incidents exceed the scope of triage procedures, escalate issues to appropriate engineering teams (e.g., SREs or Platform Engineers)
Operational Procedures
- Follow and improve operational processes for incident management, system health checks, and routine maintenance tasks
- Maintain and update runbooks, ensuring accuracy and relevance to current systems and practices
- Participate in post-incident reviews to improve documentation and operational readiness
Collaboration and Communication
- Provide clear, concise communication during incidents, ensuring stakeholders know the status and progress of the resolution
- Collaborate with SREs, Platform Engineers, and other teams to enhance monitoring, alerting, and operational tools
- Actively participate in training sessions to stay current on new systems and tools introduced by engineering teams
Continuous Improvement
- Identify monitoring, documentation, and procedure gaps and suggest improvements to enhance efficiency and effectiveness
- Assist in testing new runbooks, tools, and processes to improve incident response times
- Contribute to the automation of routine tasks to reduce manual toil

WHO YOU ARE

Experience:

1-3 years of experience in technical operations, system administration, or entry-level cloud engineering roles.
Familiarity with cloud platforms (AWS, GCP, Azure), Kubernetes, and basic computing, storage, and networking concepts.
Experience with monitoring and alerting tools (e.g., Datadog, Prometheus, Grafana) is a plus.

Skills:

Strong troubleshooting and problem-solving skills, with the ability to follow processes and escalate appropriately
Proficiency in scripting or automation tools (e.g., Python, Bash) is a bonus
Familiarity with incident management processes and ITIL practices

Mindset:

Detail-oriented and committed to maintaining system health and uptime
Eager to learn and grow, with a passion for operational excellence
Collaborative and communicative, able to work effectively in a global, distributed team

Braze Romania

Job criteria

Job type	Full-time
Cities	Bucharest