[NI942] - Senior Site Reliability Engineer - ELK Expert

Category: IT Engineer & Developer Jobs

Senior Site Reliability Engineer (SRE) – ELK Expert | Platform Engineering Practice

Location: India (Remote). Must be available to work in the EST (US/Canada) time zone.

Role Summary:

Are you a Senior Site Reliability Engineer (SRE) with deep ELK expertise, ready to take ownership of large-scale observability infrastructure?

We're looking for an SRE with 7+ years of experience, including 4+ years specializing in the ELK stack (Elasticsearch, Logstash, Kibana), to join our Platform Engineering Practice. In this role, you’ll design, manage, and scale ELK clusters ingesting 2–3+ TB/day, enhance reliability across distributed systems, and drive automation within Azure cloud environments. This is a high-impact engineering opportunity focused on performance, observability, and operational excellence at scale.

Why Join Us

- Career Growth: Work alongside industry experts on cutting-edge cloud technologies

- Competitive Compensation and Benefits: We recognize and reward top talent

- Exciting, Impactful Work: Design and build scalable, resilient cloud environments

- Strategic Platform Role: Contribute to the foundation of next-gen observability and reliability infrastructure

What You Will Do

- Design and Optimize Cloud Infrastructure: Architect scalable, fault-tolerant systems on Microsoft Azure

- Automate Everything: Use Terraform, Ansible, and GitHub Actions to streamline deployment and configuration

- Ensure Reliability and Performance: Proactively monitor, troubleshoot, and resolve production issues using Prometheus, Grafana, and Azure Monitor

- Enhance Security and Compliance: Implement security best practices across DevOps workflows

- Collaborate and Innovate: Work closely with engineering, security, and operations teams to drive automation and efficiency

- Manage and Scale ELK Clusters: Operate large clusters handling 2–3+ TB/day of log volume, ensuring high availability and performance

- Optimize ELK Architecture: Implement efficient index lifecycle policies, shard strategies, and hot-warm-cold tiered storage

- Build and Tune Log Pipelines: Scale Logstash and Beats pipelines across distributed environments

- Support Kibana Observability Layers: Create dashboards, visualizations, and custom alerting frameworks (e.g., Watcher, ElastAlert)

What You Bring

- 7+ years of experience in Site Reliability Engineering, DevOps, or Cloud Engineering

- 4+ years of dedicated, hands-on experience with ELK (Elasticsearch, Logstash, Kibana)

- Strong experience managing large-scale ELK clusters in production with heavy ingestion (multi-TB/day)

- Deep knowledge of index tuning, shard allocation, ILM policies, and scaling ELK components

- Expertise in Infrastructure as Code (IaC) and CI/CD automation using Terraform, Ansible, and GitHub Actions

- Proficiency in Python, Go, or Bash for automation and scripting

- Deep understanding of Kubernetes, Docker, and cloud-native architectures

- Experience with observability tools such as Prometheus, Grafana, and Azure Monitor

- Ability to work in a fast-paced, collaborative environment and solve complex operational issues

Education

- Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field

Certifications (Nice to Have)

- Microsoft Azure certifications: AZ-104, AZ-400
