Job description for Hadoop SRE:

• Carry out SRE duties for Big Data on various open-source platforms such as Hadoop, Spark, and HBase.

• Keep an eye on the platforms and adhere to runbooks/SOPs to manage platform and application problems.

The prime responsibility of Site Reliability Engineering is to make sure that the environment is secure and safe.
All security findings should be remediated by the resolution date defined by governance.
We do not tolerate outages, even for a second. If any issue occurs, as owners of the environment we do whatever is necessary to keep those environments up and running.
Root cause analysis should be completed within hours. We make sure that findings are remediated in the production environment after all tests and checks have passed in lower environments.
As owners of the environment, we keep track of all activities planned or happening in our environments.
We are responsible for deploying new code in the environment.
We review and analyze our environment regularly. If a task is manual, we automate it.
We are increasing self-healing capabilities and will continue to do so until the environments heal automatically.
If a new service is coming under our support, or if an old environment is being migrated to new technologies, we engage with the product developers to plan for production.
As our business runs round the clock, we work in shifts and synchronize with multiple locations and multiple tracks (sub-teams).
We make sure that every activity is recorded as per the incident or change management process. Technical and related runbooks need to be prepared and shared with the team.
Engineering degree in IT or Computer Science
4+ years of IT experience with expertise in DevOps, build and release engineering, cloud infrastructure and automation, and tech support.
Ability to work as a team player
Good written and verbal communication skills
Punctual with office hours and work
Strong problem-solving and troubleshooting skills
Ability to effectively prioritize and coordinate
Ability to learn fast and implement the latest technology trends in the industry
Good understanding of CI/CD technologies.
Core skills in Docker, DevOps, and Linux.
DevOps experience with Jenkins, Ansible, Docker, and Kubernetes.
Good experience with Java-based web applications.
Experience implementing CI/CD processes.
Expertise in troubleshooting Java applications in Tomcat services and web applications in Apache.
Good exposure to virtualization and containers (Docker).
Ability to build deployment scripts, build scripts, and automated solutions using scripting languages such as shell (Bash), JavaScript, Python, or others.
Experience working with Docker, creating multiple containers and images, and writing Dockerfiles.
Experience creating deployments, services, and ingress flows for application setup in a Kubernetes cluster.
Experience participating in release-level discussions and working through the full SDLC and Agile methodology.

Provide on-call support for all DevOps activities.

• Familiarize yourself with the cluster maintenance processes and implement changes as per the documented installation and validation plans.

• Showcase robust troubleshooting and debugging skills, aiming to pinpoint and rectify the issue, while also offering advice on how to prevent such problems in the future.

• Conduct thorough root cause analysis of major production incidents, document for future reference, and put in place proactive measures to enhance system reliability.

• Automate routine tasks using scripts or automation tools to lessen manual work, decrease the chance of human errors, and boost system reliability.

Technical Skills required:

At least 2-3 years of experience for a junior-level role and 5+ years for a mid-level or senior-level role, working as a Hadoop Site Reliability Engineer.
High-level knowledge of Hadoop platforms and core Hadoop components.
Troubleshooting both Hadoop platform service problems and application problems, and identifying the root cause.
Writing Ansible playbooks and automating manual tasks using Ansible, shell scripting, and Python scripting.
Should be familiar with Unix/Linux system internals, networking, and distributed systems.

For more details on the job description, client, or rate, please message us or send your CV to careers@zetazsystems.com
