
Principal Simulation & Reliability Architect
Added: 12/15/2025
This Role is Closed
Note: We've kept the name of the company private. If you'd like to know the company before requesting an intro, just email us at hello [at] fractionaljobs.io
Architect simulation, evaluation, and reliability systems (from frameworks to workflows to tooling) so AI teams can model, test, and operate complex agentic architectures reliably at scale.
Role Overview
The Principal Simulation & Reliability Architect will lead the design of modular simulation environments, reliability tooling, and observability patterns that help teams understand and improve multi-step agentic AI workflows. This role is both architectural and hands-on: you will prototype internal tools, establish foundational patterns, and collaborate closely with the founder, data scientist, and synthetic data teams.
Responsibilities
- Design modular simulation environments for multi-step agent workflows and decision policies.
- Model interactions among agents, tools, and document flows to surface behavior and failure modes.
- Define evaluation patterns for agentic systems (task success, factuality, procedure adherence, suitability).
- Build regression, validation, and inspection tooling for simulation outputs.
- Identify and instrument key events and metrics for monitoring, triage, and investigation workflows.
- Integrate simulations with modern observability tooling (OpenTelemetry, Arize, Grafana).
- Develop trace schemas and system health signals to support reliability insights.
- Establish architectural patterns and internal frameworks for future engineering hires.
- Contribute to the roadmap and technical foundations of Reins AI’s simulation and reliability platform.
Qualifications
- 6+ years architecting or building complex ML, simulation, workflow, or observability systems.
- Strong Python engineering fundamentals and experience developing internal tooling or frameworks.
- Ability to design abstractions and end-to-end technical architectures.
- Familiarity with multi-step AI workflows or agentic patterns (any framework).
- Strong debugging intuition and systems-thinking mindset.
- Excellent communication skills and comfort working in a fast-moving, founder-led environment.
Preferred Skills
- Experience with simulation frameworks, synthetic data workflows, or agentic evaluation.
- Background in reliability engineering, monitoring, or triage system design.
- Exposure to regulated domains (audit, finance, healthcare).
- Knowledge of distributed systems or ML pipeline design.
- Experience with observability tooling (OpenTelemetry, Arize, Grafana, Datadog).
- Familiarity with agentic frameworks such as LangGraph, Semantic Kernel, or CrewAI.
Employment Details
This will start as a 4-6 month contract engagement (20 hours/week) with a clear path to full-time employment as we finalize 2026 project scopes. We’ll jointly evaluate fit, scope, and structure during that period.
Optimal start date: December 19, 2025
How to Apply
Note: This is a syndicated job post. Fractional Jobs found it on the web, but we are not working with the client directly, so we don't have control over or knowledge of the application process. To apply, click on the "View Application" button and follow the application's instructions. Let us know how it goes!
How to Get in Touch
Hit that "Request Intro" button below. Include any relevant links so we can get to know you better.
Your brief intro note should clearly address:
If we think there's a fit, we'll reach out to schedule an intro call. Looking forward!