Fix Ansible SSM 'stty -echo' Timeout Failures in GitHub Actions
Troubleshoot and resolve intermittent 'DISABLE ECHO command' timeout failures when running Ansible playbooks via AWS SSM in GitHub Actions. Learn timeout configurations, pipelining settings, and retry strategies for stable deployments.
How to troubleshoot and fix intermittent ‘DISABLE ECHO command 'stty -echo' timeout’ failures when running Ansible playbook via AWS SSM in GitHub Actions?
Issue Description
- Ansible deployment to EC2 instances via SSM fails intermittently with timeout on DISABLE ECHO command ‘stty -echo’.
- Error occurs at task path: ansible/playbooks/deploy_app.yml:2.
- SSM connection sometimes shows ‘Connection Lost’ and gets stuck.
- Increasing timeout from 60s to 120s and testing SSM connection helps partially, but issue persists.
- Deployment sometimes succeeds.
GitHub Actions Workflow Step
```yaml
- name: Run Ansible deployment via SSM
  env:
    AWS_DEFAULT_REGION: ${{ env.AWS_REGION }}
    AWS_REGION: ${{ env.AWS_REGION }}
  run: |
    ansible-playbook \
      -i ansible/inventory/hosts.ini \
      ansible/playbooks/deploy_app.yml \
      -e "ecr_registry=${{ steps.deploy-vars.outputs.ecr-registry }}" \
      -e "ecr_repository=${{ steps.deploy-vars.outputs.ecr-repository }}" \
      -e "deploy_environment=${{ needs.determine-environment.outputs.environment }}" \
      -e "image_tag=latest" \
      -e "aws_region=eu-west-2" \
      -vvv
```
Ansible Playbook (deploy_app.yml)
The playbook pulls Docker images from ECR, logs in using SSM parameters, fetches env vars from SSM Parameter Store, stops old containers, starts new ones (ec2server_app, celery worker/beat, nginx), and cleans up.
Key sections:
- ECR Login:

```yaml
- name: Get ECR login password
  shell: aws ecr get-login-password --region {{ aws_region }}
  register: ecr_password
  changed_when: false
  no_log: true
  become: false
  retries: 3
  delay: 5
  until: ecr_password.rc == 0

- name: Log in to Docker with ECR credentials
  shell: |
    echo "{{ ecr_password.stdout }}" | docker login --username AWS --password-stdin {{ ecr_registry }}
# ...
```

- SSM Parameters:

```yaml
- name: Fetch environment variables from SSM
  shell: |
    aws ssm get-parameters-by-path \
      --path "/{{ deploy_environment | upper }}/BE/DJ" \
      --region {{ aws_region }} \
      --recursive \
      --with-decryption \
      --query 'Parameters[*].[Name,Value]' \
      --output json
# ...
```

- Docker containers are started with `become: true` on `webservers` hosts.
Full playbook involves Docker network creation, image pull, container management (nginx:latest and app image), crontab, celery, and image pruning.
Questions
- Why does the ‘stty -echo’ timeout occur during SSM Ansible execution (likely related to sudo/become)?
- How to stabilize SSM connections for reliable Ansible runs?
- Recommended timeouts, retries, or configuration for AWS SSM with Ansible in CI/CD?
- Any Ansible.cfg or SSM plugin settings to prevent echo disable timeouts?
Ansible deployments via AWS SSM intermittently fail with ‘stty -echo’ timeouts because privilege escalation (become) commands stall over high-latency or throttled SSM sessions. The connection drops before the stty -echo command that precedes a sudo prompt can complete, so the task times out even though the playbook is otherwise sound. To fix this, you’ll need to extend timeouts, enable pipelining, configure retries, and optimize GitHub Actions network settings for consistent SSM connectivity.
Contents
- Understanding the ‘stty -echo’ Timeout Issue
- Why the Timeout Occurs During SSM Ansible Execution
- Stabilizing SSM Connections for Reliable Ansible Runs
- Recommended Timeouts and Retries for AWS SSM
- Ansible Configuration Settings to Prevent Echo Disable Timeouts
- GitHub Actions Workflow Optimizations
- Advanced Troubleshooting Techniques
- Best Practices for Long-Term Stability
Understanding the ‘stty -echo’ Timeout Issue
The “DISABLE ECHO command ‘stty -echo’ timeout” error in Ansible SSM deployments occurs when privilege escalation (become/sudo) operations hang during terminal setup. This specifically affects become tasks running over SSM connections where the remote host executes stty -echo to disable command echo during sudo password prompts. When network latency or SSM throttling delays this command, Ansible’s default timeout (10-30 seconds) expires before the operation completes, causing connection loss and playbook failure.
The intermittent nature of these failures points to underlying network instability between GitHub Actions and your EC2 instances, or SSM throttling during peak usage. The error typically manifests during become tasks in your playbook, especially those requiring sudo privileges for Docker operations. Understanding this root cause is crucial for applying targeted fixes rather than simply increasing timeouts blindly.
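Before changing any timeouts, it can help to confirm that privilege escalation itself is the failing path rather than the Docker tasks. The play below is a minimal sketch, not part of the original playbook; the `webservers` group name is taken from the playbook description, and everything else is an assumption.

```yaml
# Hypothetical smoke test: exercises sudo (and the stty -echo step) over SSM
# without touching Docker. If this play also times out intermittently, the
# problem is the connection/become path, not the deployment tasks.
- hosts: webservers
  gather_facts: false
  become: true
  tasks:
    - name: Run a trivial privileged command
      command: /bin/true
      changed_when: false
```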
Why the Timeout Occurs During SSM Ansible Execution
The ‘stty -echo’ timeout specifically occurs during privilege escalation operations when Ansible attempts to disable terminal echo for security purposes. This happens because:
- Sudo Interactions: When tasks use `become: true`, Ansible must create an interactive shell session where sudo prompts may appear. The `stty -echo` command disables command visibility during these prompts, but requires immediate execution.
- SSM Connection Delays: AWS SSM connections rely on persistent sessions that can drop under network pressure. GitHub Actions runners experience intermittent latency between your CI environment and AWS regions, causing stty commands to time out.
- Docker Container Operations: Tasks starting or stopping containers with `become: true` are particularly vulnerable, as they often trigger longer privilege escalation sequences with multiple sudo interactions.
According to Nick vs Networking, this becomes problematic when Ansible-managed machines are being set up, as timeouts occur during hostname changes and privilege escalations. The same mechanism affects your Docker operations during container lifecycle management.
Stabilizing SSM Connections for Reliable Ansible Runs
To stabilize SSM connections for consistent Ansible execution:
- Enable Pipelining: Add `pipelining=True` to the `[ssh_connection]` section of ansible.cfg (see the sketch below). This reduces round-trips by sending multiple commands in a single session, which is critical for SSM connections where each new connection adds overhead.

- Implement Connection Retries: Configure your GitHub Actions workflow to retry failed SSM connections automatically. Add a retry step with exponential backoff before your Ansible playbook execution.

- Use Wait for Connection: Include a wait task before playbook execution to ensure SSM agents are responsive:

```yaml
- name: Wait for SSM connection
  wait_for_connection:
    timeout: 120
    delay: 10
  retries: 5
```

- Optimize Network Path: Ensure GitHub Actions runners use direct, low-latency paths to your AWS region. Consider deploying runners in the same region as your EC2 instances to minimize transit delays.
The AWS SSM documentation emphasizes using wait_for_connection to handle intermittent agent readiness issues before executing playbooks.
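As a concrete starting point, the pipelining change from the first item above is a small ansible.cfg addition. This is a minimal sketch, assuming a stock ansible.cfg; the `[connection]` entry is included on the assumption that it covers non-SSH connection plugins as well.

```ini
# ansible.cfg - enable pipelining to cut per-task round-trips
[ssh_connection]
pipelining = True

[connection]
pipelining = True
```

Note that pipelining only works when `Defaults requiretty` is not set in sudoers on the target hosts.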
Recommended Timeouts and Retries for AWS SSM
For reliable SSM connections in CI/CD environments, implement these timeout configurations:
- Connection Timeout: Set `ansible_ssh_timeout=120` (or the equivalent ansible.cfg entry) to extend the connection window for each task:

```ini
[defaults]
timeout = 120
```

- Command Timeout: Configure `ansible_command_timeout=120` for long-running commands like Docker operations; in ansible.cfg this lives under the persistent-connection section:

```ini
[persistent_connection]
command_timeout = 120
```

- Become Timeout: Set `ansible_become_timeout=120` to handle sudo privilege escalation delays:

```ini
[privilege_escalation]
become_timeout = 120
```

- Task Retries: Add retry parameters directly to vulnerable tasks, registering the task's own result so `until` checks the right variable:

```yaml
- name: Log in to Docker with ECR credentials
  shell: >
    echo "{{ ecr_password.stdout }}" |
    docker login --username AWS --password-stdin {{ ecr_registry }}
  register: docker_login
  retries: 3
  delay: 10
  until: docker_login.rc == 0
```

- SSM Plugin Timeout: The community.aws.aws_ssm connection plugin has its own command timeout and retry options; raise them through inventory or group variables (for example `ansible_aws_ssm_timeout` and `ansible_aws_ssm_retries`, per the plugin documentation) instead of relying only on the SSH settings above. See the group_vars sketch below.
As noted in the Server Fault discussion of the Mitogen plugin configuration, extending ConnectTimeout to 120s handles bastion delays and SSM throttling effectively.
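To apply the SSM plugin options per host group rather than globally, a group_vars file alongside the existing hosts.ini inventory can carry them. This is a minimal sketch, assuming the webservers group is reached through the community.aws.aws_ssm connection plugin; the file path, bucket name, and the option names marked below are assumptions to verify against the plugin documentation for your installed collection version.

```yaml
# ansible/inventory/group_vars/webservers.yml  (hypothetical path)
# Connection and timeout settings for the SSM-managed hosts.
ansible_connection: community.aws.aws_ssm
ansible_aws_ssm_region: eu-west-2                     # matches the aws_region extra-var in the workflow
ansible_aws_ssm_bucket_name: my-ssm-transfer-bucket   # placeholder bucket used by the plugin for file transfers
ansible_aws_ssm_timeout: 120                          # assumed option name: per-command timeout in seconds
ansible_aws_ssm_retries: 3                            # assumed option name: connection retry attempts
ansible_python_interpreter: /usr/bin/python3          # avoids interpreter discovery delays
```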
Ansible Configuration Settings to Prevent Echo Disable Timeouts
Prevent ‘stty -echo’ timeouts with these Ansible-specific configurations:
- Override Default SSH Arguments: Modify ansible.cfg to add connection keep-alive parameters:

```ini
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=30 -o ServerAliveCountMax=3
```

- Disable Interactive Prompts: Add `become_flags: '-n'` to skip sudo password prompts in your playbook. Note that `-n` only works when the remote user has passwordless sudo; otherwise sudo fails immediately instead of hanging:

```yaml
- hosts: webservers
  become: true
  become_flags: '-n'
  tasks:
    # Docker operations
```

- Use a Non-Interactive Shell: Set `ansible_shell_type=sh` (for example as an inventory or group variable) to avoid bash-specific stty issues.

- Use Async with Polling: Run long operations asynchronously so a single command does not hold the session open past the timeout:

```yaml
- name: Long-running Docker operation
  shell: long_command
  async: 300
  poll: 10
```

- Disable Python Interpreter Discovery: Set `ansible_python_interpreter=/usr/bin/python3` to avoid interpreter discovery delays that trigger become timeouts.
The Stack Overflow discussion highlights these ssh_args configurations and explains how async_status polling can extend wait times during command execution.
GitHub Actions Workflow Optimizations
Optimize your GitHub Actions workflow to reduce SSM connection failures:
- Add Network Resilience: Before the Ansible step, check that the instance's SSM agent is reachable:

```yaml
- name: Test SSM agent reachability
  run: |
    aws ssm describe-instance-information --region ${{ env.AWS_REGION }} --filters Key=InstanceIds,Values=$INSTANCE_ID
  env:
    INSTANCE_ID: ${{ ec2_instance_id }}
```

- Use Self-Hosted Runners: Deploy GitHub Actions runners in the same AWS region as your EC2 instances to minimize latency.

- Implement Circuit Breakers: Wrap Ansible execution in a script that retries on failure:

```bash
#!/bin/bash
MAX_RETRIES=3
RETRY_DELAY=30

for i in $(seq 1 $MAX_RETRIES); do
  ansible-playbook ... && exit 0
  if [[ $i -eq $MAX_RETRIES ]]; then
    echo "Final attempt failed"
    exit 1
  fi
  sleep $RETRY_DELAY
done
```

- Configure Regional Endpoints: Explicitly point the AWS CLI's SSM calls at the regional endpoint (recent AWS CLI v2 versions support service-specific endpoint variables):

```yaml
env:
  AWS_ENDPOINT_URL_SSM: https://ssm.${{ env.AWS_REGION }}.amazonaws.com
```
Cache Dependencies: Cache Docker images and Ansible collections to reduce SSM operations during runs.
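For the dependency-caching point above, a minimal sketch using the standard actions/cache action is shown below; the `ansible/requirements.yml` path and cache key are assumptions about a typical layout, not something from the original workflow.

```yaml
- name: Cache Ansible collections
  uses: actions/cache@v4
  with:
    path: ~/.ansible/collections
    key: ansible-collections-${{ hashFiles('ansible/requirements.yml') }}
    restore-keys: |
      ansible-collections-

- name: Install Ansible collections
  run: ansible-galaxy collection install -r ansible/requirements.yml
```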
Advanced Troubleshooting Techniques
When standard timeout adjustments fail, use these advanced techniques:
- Enable Debug Logging: Add `-vvv` to your ansible-playbook command to capture detailed SSM connection traces. Look for "stty -echo" commands in the output.

- Monitor SSM Sessions: Use AWS Systems Manager Session Manager to manually test privilege escalation commands:

```bash
aws ssm start-session --target i-1234567890abcdef0 --document-name AWS-StartInteractiveCommand --parameters command="sudo echo test"
```

- Check S3 Bucket Permissions: Ensure your S3 bucket (specified by `ansible_aws_ssm_bucket_name`) has proper permissions for script downloads. Add this to your playbook:

```yaml
- name: Verify SSM bucket access
  shell: curl -s https://{{ ansible_aws_ssm_bucket_name }}.s3.amazonaws.com/ansible-connection-test
```

- Capture SSM Error Details: Parse Ansible output for specific error patterns in GitHub Actions (this assumes a log file is written; see the logging snippet after this list):

```yaml
- name: Parse SSM errors
  if: failure()
  run: |
    grep -i "stty.*timeout" ansible.log && echo "Detected echo timeout"
```

- Use SSM Session Logging: Configure Session Manager preferences (the session document, typically SSM-SessionManagerRunShell) to stream session output to S3 or CloudWatch Logs so that failed stty commands are captured for later inspection.
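The error-parsing step above needs Ansible to write a log file. One way to produce ansible.log in the workspace is to set the standard ANSIBLE_LOG_PATH environment variable on the deployment step; this is a sketch based on the workflow step shown earlier, trimmed to the relevant parts.

```yaml
- name: Run Ansible deployment via SSM
  env:
    ANSIBLE_LOG_PATH: ${{ github.workspace }}/ansible.log   # makes the -vvv output greppable after a failure
    AWS_REGION: ${{ env.AWS_REGION }}
  run: |
    ansible-playbook -i ansible/inventory/hosts.ini ansible/playbooks/deploy_app.yml -vvv
```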
The GitHub issue discussion reveals that malformed echo commands in stty/sudo cause failures, and enabling pipelining helps bypass these issues.
Best Practices for Long-Term Stability
Ensure reliable SSM-based Ansible deployments with these best practices:
- Implement Health Checks: Regularly test SSM connectivity with a dedicated playbook:

```yaml
- name: SSM health check
  wait_for_connection:
    timeout: 60
    delay: 5
```

- Optimize SSM Agent Configuration: Keep SSM agents updated to the latest version and restart them after configuration changes:

```bash
sudo systemctl restart amazon-ssm-agent
```

- Use Connection Plugins: Prefer the community.aws.aws_ssm connection plugin with tuned timeout and retry options (see the group_vars sketch earlier); for any hosts still reached over SSH, keep sessions alive with:

```ini
[ssh_connection]
ssh_common_args = -o ControlMaster=auto -o ControlPersist=600 -o ServerAliveInterval=30
```

- Implement Canary Deployments: Test privilege escalation tasks on a staging instance before production deployment.

- Monitor SSM Metrics: Set up CloudWatch alarms for SSM session failures and connection latency.

- Document Timeout Settings: Maintain a central configuration repository with timeout standards that align with your network conditions.
According to Bobcares, combining increased timeouts with retries and non-interactive sudo flags resolves 90% of become timeout issues in SSM environments.
Sources
- Stack Overflow: Increase timeout of SSH command in Ansible
- Reddit: Ansible does not respect timeout in playbook
- Reddit: Having trouble using aws ssm for connection
- Server Fault: SSH timeout with Ansible Mitogen plugin
- Ansible Docs: Network connection options
- Nick vs Networking: Ansible – Timeout on Become
- Bobcares: Ansible Privilege Escalation Timeout
- Ansible Docs: AWS SSM connection
- GitHub Issue: ssm connection plugin fails at gathering facts
- Stack Overflow: Async_status delay for polling
Conclusion
Resolving intermittent ‘stty -echo’ timeouts in Ansible SSM deployments requires a multi-layered approach addressing connection stability, privilege escalation delays, and CI/CD resilience. By implementing extended timeouts (120s), enabling pipelining, configuring retries for become operations, and optimizing GitHub Actions network paths, you can eliminate most SSM timeout failures. The most critical fixes include adding ansible_become_timeout=120, using become_flags='-n' to skip interactive prompts, and setting up proper SSM bucket configurations. For long-term stability, establish health checks and monitoring of SSM sessions, as these timeouts often indicate deeper network or infrastructure issues that require ongoing attention.