Alarms and Error States

I'm seeing alarms or red error states in the Operations view.

[Screenshot: Alarm Center showing 513 active alarms with severity breakdown and Workflow.HTTP-ServiceInputWorkflow.Rollback error details]

Common Symptoms

  • Red/Orange badges in Engine State
  • Alarms appearing in Alarm Center
  • Error states on workflows, services, or connections
  • Notifications about system issues

Understanding Alarm Severity

Severity | Color | Meaning | Action Required
CRITICAL | 🔴 Red | System failure, data loss risk | Immediate attention
MAJOR | 🟠 Orange | Significant impact, degraded service | Address soon
MINOR | 🟡 Yellow | Limited impact, workaround available | Address when convenient
WARNING | 🟡 Yellow | Potential issue, monitoring recommended | Review
INFO | 🔵 Blue | Informational only | None
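
When working through a long alarm list, it can help to sort by severity so the most urgent items surface first. The following is a minimal sketch, assuming the table above is ordered from most to least urgent; the `triage` helper and the sample alarm messages are hypothetical and do not come from the product.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the table; a lower value is assumed more urgent."""
    CRITICAL = 0
    MAJOR = 1
    MINOR = 2
    WARNING = 3
    INFO = 4


# "Action Required" column from the table, keyed by severity.
ACTION = {
    Severity.CRITICAL: "Immediate attention",
    Severity.MAJOR: "Address soon",
    Severity.MINOR: "Address when convenient",
    Severity.WARNING: "Review",
    Severity.INFO: "None",
}


def triage(alarms):
    """Sort (severity_name, message) pairs so the most urgent come first."""
    return sorted(alarms, key=lambda alarm: Severity[alarm[0]])


# Hypothetical sample data, not real product output.
sample = [
    ("INFO", "Deployment finished"),
    ("CRITICAL", "Sink unreachable"),
    ("MINOR", "Retry succeeded after delay"),
]

for name, message in triage(sample):
    print(f"{name:<8} {ACTION[Severity[name]]:<24} {message}")
```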

Diagnosis Checklist

1. Check the Alarm Center

  1. Go to Operations → Alarm Center
  2. Review active alarms
  3. Click on an alarm for details

Key information to note:

  • Source component (which workflow/service)
  • Alarm message
  • First occurrence time
  • Count (how many times it fired)
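
If you record these details outside the UI (for a ticket or a handover note), a small structure keeps them consistent. This is only a sketch: the `AlarmNote` class and its field names are assumptions made for illustration, and the message, timestamp, and count values are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AlarmNote:
    """The four details worth noting for each alarm (hypothetical structure)."""
    source: str            # which workflow/service raised the alarm
    message: str           # alarm message text
    first_seen: datetime   # first occurrence time
    count: int             # how many times it fired

    def summary(self) -> str:
        """One-line summary suitable for a ticket title."""
        return (f"[{self.first_seen:%Y-%m-%d %H:%M}] "
                f"{self.source}: {self.message} (x{self.count})")


# Illustrative values only.
note = AlarmNote(
    source="Workflow.HTTP-ServiceInputWorkflow.Rollback",
    message="connection refused",
    first_seen=datetime(2024, 1, 15, 9, 30),
    count=12,
)
print(note.summary())
```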

2. Check Engine State

[Screenshot: Engine State overview showing color-coded component status: 19 workflows, 6 services, 16 sources, 18 sinks with warning indicator on Sharepoint-Source-For-Copy]

In Operations → Engine State:

  1. Look for red/orange status indicators
  2. Expand workflows to see processor-level states
  3. Check the middle panel for node-specific issues

3. Review Component Logs

[Screenshot: Expanded alarm detail view for Workflow.HTTP-ServiceInputWorkflow.Rollback showing LAY-04047 rollback error, LAY-02603 input processor error, and LAY-00100 connection refused stack trace]

For any component showing errors:

  1. Select the component in Engine State
  2. Click the Log tab
  3. Look for error messages around the alarm time
  4. Check for stack traces or exception details
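
If you export a component log to a text file, steps 3 and 4 can be approximated offline by filtering for lines near the alarm time. The sketch below assumes each log entry starts with a `YYYY-MM-DD HH:MM:SS` timestamp; that format, the `lines_near` helper, and the sample lines are assumptions, so adjust the parsing to match your actual log output.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, alarm_time, window=timedelta(minutes=5)):
    """Return log lines whose timestamp lies within `window` of `alarm_time`.

    Lines without a parseable timestamp (for example stack-trace
    continuation lines) are kept when they follow a matching entry.
    """
    selected = []
    keep_following = False
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            # No timestamp: treat as a continuation of the previous entry.
            if keep_following:
                selected.append(line)
            continue
        keep_following = abs(stamp - alarm_time) <= window
        if keep_following:
            selected.append(line)
    return selected


# Hypothetical log excerpt in the assumed format.
log = [
    "2024-01-15 09:10:00 INFO  workflow started",
    "2024-01-15 09:29:58 ERROR connection refused",
    "    stack trace detail line (illustrative)",
    "2024-01-15 09:50:00 INFO  heartbeat",
]
alarm_time = datetime(2024, 1, 15, 9, 30)

for line in lines_near(log, alarm_time):
    print(line)
```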

Common Alarm Types

Runtime Errors

Symptoms: Processor failures, script errors, exceptions

Resolution:

  1. Check processor logs for the specific error
  2. Fix JavaScript/Python code issues
  3. Verify resource availability (memory, disk)
  4. Restart the component if needed

Connection Failures

Symptoms: Source/sink connection alarms, timeout errors

Resolution:

  1. Check Connection Asset configuration
  2. Verify network connectivity
  3. Confirm external service availability
  4. See Connection Issues
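
Steps 2 and 3 can be checked quickly from the affected node with a plain TCP connection attempt. This is a minimal sketch using only the Python standard library; the `can_connect` helper is hypothetical, and a successful TCP handshake only shows the port is reachable, not that the service behind it is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Call it with the host and port from the Connection Asset, for example `can_connect("your-host", 443)`, where the host and port are placeholders for your own values.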

Resource Exhaustion

Symptoms: Disk full, memory low, thread pool exhausted

Resolution:

  1. Check cluster node resources
  2. Free disk space or add storage
  3. Adjust memory settings
  4. Review processing load and scale if needed
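
For step 1, free disk space on a node can be checked with the standard library. The sketch below is an illustration only: the `disk_report` helper and the 10% free-space threshold are assumptions, not product defaults, so pick a threshold that suits your environment.

```python
import shutil


def disk_report(path=".", min_free_fraction=0.10):
    """Report free space for the filesystem containing `path`.

    `min_free_fraction` is an assumed threshold, not a product setting.
    """
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    return {
        "path": path,
        "free_gib": round(usage.free / 2**30, 1),
        "free_fraction": free_fraction,
        "ok": free_fraction >= min_free_fraction,
    }


print(disk_report())
```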

State Synchronization Issues

Symptoms: CLUSTER_ROLE_MISMATCH, deployment sync failures

Resolution:

  1. Check cluster node health
  2. Verify network between nodes
  3. Review cluster configuration
  4. Restart the cluster if the issue persists (severe cases only)

Engine State Reference

Workflow States

State | Color | Meaning
HEALTHY | 🟢 Green | Running normally
PROCESSING | 🟢 Green | Actively processing messages
STARTING | 🟡 Yellow | Initializing
STOPPING | 🟡 Yellow | Shutting down
INITIALIZATION_FAILED | 🔴 Red | Failed to start
ERROR | 🔴 Red | Runtime error

Service States

State | Color | Meaning
UNUSED | 🟢 Green | Available but not used by any workflow
USED | 🟢 Green | Active and in use
VERIFYING_CONFIGURATION | 🟡 Yellow | Checking config
INITIALIZATION_FAILED | 🔴 Red | Failed to initialize
DEPENDENCY_FAILURE | 🔴 Red | Required dependency failed

Resource States

State | Color | Meaning
USABLE | 🟢 Green | Available and working
UNUSABLE | 🔴 Red | Failed or unavailable
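
The red states from the three tables above can be collected into a lookup for flagging components that need attention, for instance when reviewing a manually exported status list. This is a sketch under assumptions: the `kind` labels, the `needs_attention` helper, and the component names in the sample are hypothetical; only the state names come from the tables.

```python
# Red states per component kind, taken from the reference tables.
RED_STATES = {
    "workflow": {"INITIALIZATION_FAILED", "ERROR"},
    "service": {"INITIALIZATION_FAILED", "DEPENDENCY_FAILURE"},
    "resource": {"UNUSABLE"},
}


def needs_attention(components):
    """Return the (kind, name, state) tuples whose state is a red state."""
    return [
        (kind, name, state)
        for kind, name, state in components
        if state in RED_STATES.get(kind, set())
    ]


# Hypothetical component list for illustration.
components = [
    ("workflow", "OrderImport", "PROCESSING"),
    ("workflow", "InvoiceExport", "ERROR"),
    ("service", "LookupService", "DEPENDENCY_FAILURE"),
    ("resource", "SharedConfig", "USABLE"),
]

for kind, name, state in needs_attention(components):
    print(f"{kind}: {name} is {state}")
```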

Responding to Alarms

Confirming Alarms

When you've started working on an issue:

  1. Select the alarm in Alarm Center
  2. Click Confirm Alarm
  3. This silences notifications but keeps the alarm visible

Clearing Alarms

Alarms typically clear automatically when:

  • The underlying issue is resolved
  • The component recovers
  • The alarm condition no longer exists

Some alarms may need manual clearing after verification.


Prevention

Proactive Monitoring

  1. Regular review: Check Alarm Center daily
  2. Dashboard setup: Use Engine State for at-a-glance health
  3. Trend analysis: Look for recurring issues
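
For trend analysis, counting alarms per source over a period makes recurring problems stand out. The following is a minimal sketch, assuming you keep your own record of (source, date) pairs; the `recurring_sources` helper, the threshold of three occurrences, and the sample history are all hypothetical.

```python
from collections import Counter


def recurring_sources(alarm_history, min_occurrences=3):
    """Return (source, count) pairs seen at least `min_occurrences` times,
    most frequent first."""
    counts = Counter(source for source, _day in alarm_history)
    return [(source, n) for source, n in counts.most_common()
            if n >= min_occurrences]


# Hypothetical alarm history: (source component, date).
history = [
    ("InvoiceExport", "2024-01-10"),
    ("OrderImport", "2024-01-10"),
    ("InvoiceExport", "2024-01-11"),
    ("InvoiceExport", "2024-01-12"),
    ("LookupService", "2024-01-12"),
]

for source, n in recurring_sources(history):
    print(f"{source}: {n} alarms")
```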

Configuration Best Practices

  1. Validate before deploy: Always run validation
  2. Test in dev: Verify changes in development first
  3. Monitor resources: Watch disk, memory, CPU trends
  4. Set up alerts: Configure notifications for critical alarms

See Also