Site Reliability Engineer
Listed on 2026-06-18
IT/Tech
Cloud Computing: Infrastructure & Operations, Systems Engineer, SRE/Site Reliability
Meet the Team
We are an agile team inside Cisco IT, building the next generation of NoSQL and Vector Databases on cloud platforms that will be used by all of Cisco as we move to cloud native applications. This is a small team of highly motivated individuals following the Agile Scrum methodology. Our team is responsible for building and operating Hybrid Cloud Database services in a DevSecOps model. We move at a fast pace and are passionate about cloud and automation.
We have a history of building clouds at a large scale and are looking for someone who is as passionate about it as we are.
In this role, you will ensure production services are scalable, resilient, high-performing, and secure. You’ll support uptime through an On-Call rotation, monitoring, and alerting to meet SLOs and SLAs. Reliability is strengthened by conducting Disaster Recovery drills and managing incidents—investigating root causes, applying remediation, and driving continuous improvement. You’ll define reliability and security requirements for systems and components to meet company, customer, and regulatory objectives.
Operational efficiency is enhanced by automating repetitive tasks and mitigating failure points. You’ll also develop tools and techniques for early detection of issues in products, packaging, processes, and product reliability.
- Serves as an experienced professional resource, independently applying best practices and business knowledge to improve products or services while guiding and supporting less experienced colleagues.
- Understands project and/or department needs and establishes relationships with appropriate cross‑functional partners to gather input, collect information, and complete work steps.
- Designs and deploys small to mid-size or moderately complex solutions to optimize reliability, availability, latency, and performance.
- Builds automated platforms and applies design, deployment, and coding expertise to enhance reliability, scalability, and velocity; designs and tests high availability and disaster recovery measures across regions and customers.
- Forecasts and builds reports to determine at what point resources will be at capacity.
- Designs and implements tools to monitor and provide transparency into the performance and reliability of our infrastructure; collaborates with Developers and Ops to identify issues, serves as on‑call SRE, and leads post‑mortems and root cause analyses.
- Builds security controls and ensures they are in place in architectural design, collaborates with security in designing or reviewing security controls, and may actively contribute to security incident response.
- Bachelor’s degree in Computer Science or a related field
- 5+ years of technical expertise with cloud databases and experience with Vector databases (such as Pinecone, Weaviate, or Milvus) and/or with at least two of the following: PostgreSQL, MySQL, or MongoDB
- Experience with AI frameworks (OpenAI API, LangChain, etc.)
- Experience with designing, administering, and maintaining Vector DB or Cloud DB architecture, including provisioning, upgrades, operations, backups, security, and performance
- Experience with CI/CD frameworks and tools such as Git/GitHub and Jenkins
- Experience with automating DB tasks using Python and with Database Lifecycle Management
- Experience with public clouds such as AWS, GCP, or Azure, or container technologies such as Kubernetes and OpenShift
- Backend development experience with Python or another programming language
- Demonstrated experience in building scalable databases on hybrid cloud infrastructure