Senior Site Reliability Engineer (Production Excellence), Redwood City area, California, USA, IT/Tech

Position: Senior Site Reliability Engineer (Production Excellence)

About Poshmark

Poshmark is a leading fashion resale marketplace powered by a vibrant, highly engaged community of buyers and sellers and real-time social experiences. Designed to make online selling fun, more social and easier than ever, Poshmark empowers its sellers to turn their closet into a thriving business and share their style with the world. Since its founding in 2011, Poshmark has grown its community to over 130 million users and generated over $10 billion in GMV, helping sellers realize billions in earnings, delighting buyers with deals and one-of-a-kind items, and building a more sustainable future for fashion.

For more information, please visit , and for company news, visit

Senior Site Reliability Engineer (Production Excellence)

We are looking for a Senior Site Reliability Engineer to serve as the guardian of our complex, web-scale ecosystem. You won't just be "managing" systems; you will be the architect of their health, ensuring they are monitored, automated, and designed to scale flawlessly. The ideal candidate is an SRE purist who believes that automation is the antidote to toil and that deep application knowledge is the key to operating large-scale systems.

6-Month Accomplishments

Audit & Observe: Deep-dive into the Poshmark tech stack and infrastructure requirements.
Automate Toil: Master and improve existing automation tools/frameworks within the Cloud Ops organization.
Primary Integration: Transition from secondary on-call support to a primary contributor on small to medium-scale architectural projects.

12+ Month Accomplishments

System Ownership: Execute complex communications and infrastructure projects independently.
Precision Alerting: Engineer meaningful alerts and high-fidelity dashboards that reduce "alert fatigue" and focus on system health.
Architectural Evolution: Identify systemic gaps and lead the implementation of infrastructure improvements to bolster uptime.
Incident Leadership: Serve as a core pillar of the on-call rotation, leading incident response and blameless post-mortems.

Responsibilities

Serve as the primary point of accountability for the health, performance, and capacity of mission-critical, internet-facing services.
Partner with development teams beginning at the design phase to ensure all platforms are built with "operability" and "recoverability" at their core.
Improve and extend tools that automate the deployment and monitoring of custom applications in large-scale UNIX environments.
Thrive in a fast-paced environment where you bridge the gap between "moving fast" and "staying up."
Participate in a structured 12x7 on-call rotation designed to maintain 24/7 support for production environments.

Desired Skills

Battle-Proven Experience: 5–8+ years in a Systems Engineering or Site Reliability role, specifically within a startup or fast-growing environment.
Scale Mastery: Proven track record in a UNIX-based, large-scale web operations role.
Production Support Mindset: Extensive experience providing 24/7 support for high-traffic production environments.
Cloud Architecture: Expert-level experience with AWS, GCP, or Azure.
The SRE Toolkit:
- CI/CD & Config: Jenkins, Ansible, and Terraform.
- Observability: Hands-on experience with Datadog, New Relic, Graphite, or Nagios.
- Orchestration: Deep knowledge of Kubernetes and Docker.
- Code: Strong scripting/coding skills used for infrastructure-as-code and automation.

Technologies we use:

Languages/Servers: Ruby, JavaScript, Node.js, Tomcat, Nginx, HAProxy.
Data & Messaging: MongoDB, RabbitMQ, Redis, Elasticsearch.
Infrastructure: AWS (EC2, RDS, CloudFront, S3), Kubernetes, Docker.

Note:
1) Poshmark is currently unable to provide visa sponsorship for this position.
2) This is a hybrid role based out of Redwood City, CA.
