AI Operations Specialist
Responsible for ensuring the stability, reliability, and performance of AI systems in production. This role blends Machine Learning Operations (MLOps), Bot Operations, and Quality Assurance (QA) practices, supporting both backend models and customer-facing bots.
The specialist will manage deployments, monitor system health, conduct testing, and validate quality to guarantee seamless AI operations and a consistent customer experience.
Key Responsibilities
- System Monitoring & Reliability
Monitor AI systems to ensure uptime, performance, and error-free operation.
Implement automated monitoring and alerting systems for AI performance.
Define SLOs and error budgets for AI features.
- Testing & Quality Assurance
Design, develop, and maintain automated test scripts for web, mobile, and API testing.
Execute manual test cases for exploratory and non-automatable scenarios.
Identify, log, and track bugs to closure using Jira, Azure DevOps, or equivalent tools.
Ensure adherence to QA methodologies, tools, and best practices.
- AI & Bot Operations
Manage version control, deployment, and release cycles of AI models and bots.
Conduct offline/online evaluations and A/B tests for prompts, models, and policies.
Track and troubleshoot production issues (latency, escalations, fallback errors).
Design and orchestrate conversations using Voiceflow and related tools.
- Observability, Security & Compliance
Work with observability/evaluation tools (Langfuse, Arize/Phoenix, W&B, Prometheus, Grafana).
Implement guardrails, safety red-teaming, and prompt-injection defenses.
Manage PII handling, content safety filters, and data loss prevention (DLP).
Document incidents, perform RCA reporting, and ensure compliance with data privacy and security policies.
- Collaboration & Continuous Improvement
Collaborate with developers, data scientists, and business teams to resolve operational issues.
Participate in incident on-call rotations, maintain runbooks, and conduct disaster recovery (DR) tests.
Recommend improvements to enhance system resilience, reduce downtime, and optimize costs.
Requirements
Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience).
3+ years of experience in operations, DevOps, or AI/ML support roles.
Proven track record managing large, complex, multi-stakeholder projects.
Strong knowledge of Machine Learning practices (model monitoring, retraining, pipelines).
Familiarity with conversational AI platforms (Amazon Lex, Salesforce Einstein Bots, ElevenLabs).
Integration experience with Amazon Connect, Genesys Cloud CX, NICE CXone, or similar.
Proficiency with test automation tools (Selenium, Playwright, Cypress, Appium, or equivalent).
Experience with API testing tools (Postman, RestAssured, Karate).
Strong scripting and automation skills (Python, Bash, CI/CD pipelines).
C1 English proficiency.
Nice to Have
Experience with performance/load testing tools (JMeter, Locust, Gatling).
Knowledge of cloud platforms (AWS, Azure, GCP).
Familiarity with Git or other version control systems.
QA certification (ISTQB or equivalent).