NeuroAgent

How to Prevent AWS SSO Token Expiration in Long-Running Jobs

Learn how to programmatically refresh AWS SSO credentials to prevent token expiration during long-running Metaflow jobs. Complete guide with scripts and best practices.

Question

How to force refresh AWS SSO login credentials to prevent token expiration during long-running jobs?

I’m using Metaflow with S3 and need to authenticate via AWS CLI single sign-on (aws sso login). The issue is that I can’t programmatically refresh my SSO login credentials when they expire.

Problem scenario:

  • SSO timeout is set to 24 hours
  • Day 1: Login at 10am using aws sso login
  • Day 2: Start a program at 9am using aws sso login again
  • The token is still valid at login, but expires at 10am
  • My Metaflow flow crashes when it can no longer access S3 after token expiration

How can I prevent this issue? I haven’t found relevant information through searches. Should I manually delete tokens in ~/.aws/sso/cache?

I also have long-running jobs that exceed the 24-hour timeout period.

NeuroAgent

AWS SSO tokens expire at the end of the configured session duration (24 hours in your case) by design, but you can refresh them programmatically by scripting aws sso login, exchanging the cached token for fresh role credentials, or leveraging the AWS SSO cache mechanism. For long-running jobs that exceed the session duration, implement automatic credential refresh logic or use alternative authentication methods such as IAM roles or MFA-backed STS session tokens.

Understanding AWS SSO Token Caching

When you authenticate using aws sso login, the AWS CLI performs an OIDC/OAuth flow and stores a JSON token file in the SSO cache directory, typically located at ~/.aws/sso/cache/. This cache contains critical information:

json
{
  "accessToken": "Bearer_token_value",
  "expiresAt": "2025-01-15T10:00:00Z",
  "clientId": "client_id_value",
  "clientSecret": "secret_value",
  "region": "us-east-1",
  "startUrl": "https://your-sso-start-url"
}

The expiresAt timestamp determines when the token becomes invalid. With the legacy SSO profile configuration you cannot silently refresh this token programmatically: it is designed to be renewed by running aws sso login again once it expires (newer AWS CLI versions configured with an sso-session section can renew silently via a refresh token). According to the AWS documentation, SSO tokens are deliberately short-lived, which reduces the risk from leaked credentials compared to long-lived access keys.
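As an illustration, a short Python sketch can report when the newest cached token expires. The helper name is an invention for this example, and the default cache path assumes the standard AWS CLI location:

```python
import glob
import json
import os
from datetime import datetime, timezone

def newest_token_expiry(cache_dir="~/.aws/sso/cache"):
    """Return (path, expiry datetime) of the newest SSO token file, or None."""
    pattern = os.path.join(os.path.expanduser(cache_dir), "*.json")
    for path in sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True):
        with open(path) as f:
            data = json.load(f)
        if "expiresAt" not in data:
            continue  # client-registration files in the cache have no expiresAt
        # fromisoformat() in older Pythons rejects a trailing "Z"
        expiry = datetime.fromisoformat(data["expiresAt"].replace("Z", "+00:00"))
        return path, expiry
    return None

result = newest_token_expiry()
if result:
    path, expiry = result
    print(f"{path} expires at {expiry}; "
          f"{expiry - datetime.now(timezone.utc)} remaining")
else:
    print("No SSO token found in cache")
```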


Programmatic Credential Refresh Methods

Method 1: AWS SSO Token Refresh Script

Create a script that checks token expiration and refreshes when needed:

bash
#!/bin/bash
CACHE_DIR="$HOME/.aws/sso/cache"
PROFILE="your-sso-profile"

# Newest cache file; the cache also holds client-registration files
# without expiresAt, so treat a missing field as "expired"
TOKEN_FILE=$(ls -t "$CACHE_DIR"/*.json 2>/dev/null | head -n 1)

if [ -f "$TOKEN_FILE" ]; then
    EXPIRES_AT=$(jq -r '.expiresAt // empty' "$TOKEN_FILE")
    CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

    # ISO 8601 UTC timestamps sort lexicographically, so a plain
    # string comparison is sufficient here
    if [ -z "$EXPIRES_AT" ] || [[ "$CURRENT_TIME" > "$EXPIRES_AT" ]]; then
        echo "Token expired, refreshing..."
        aws sso login --profile "$PROFILE"
    else
        echo "Token still valid"
    fi
else
    echo "No token found, logging in..."
    aws sso login --profile "$PROFILE"
fi

Method 2: Python Script with Boto3

For programmatic access, use the AWS SDK to get role credentials:

python
import boto3
import glob
import json
import os

def get_sso_credentials(sso_region, sso_account_id, sso_role_name):
    """Get role credentials using the cached AWS SSO access token"""
    sso_client = boto3.client('sso', region_name=sso_region)
    
    # open() does not expand globs, so resolve the newest cache file
    # explicitly with the glob module
    cache_files = glob.glob(os.path.expanduser('~/.aws/sso/cache/*.json'))
    token_data = None
    for path in sorted(cache_files, key=os.path.getmtime, reverse=True):
        with open(path, 'r') as f:
            data = json.load(f)
        if 'accessToken' in data:  # skip client-registration files
            token_data = data
            break
    if token_data is None:
        raise RuntimeError("No SSO token in cache; run 'aws sso login' first")
    
    # Exchange the SSO token for short-lived role credentials
    response = sso_client.get_role_credentials(
        accessToken=token_data['accessToken'],
        accountId=sso_account_id,
        roleName=sso_role_name
    )
    
    credentials = response['roleCredentials']
    return {
        'aws_access_key_id': credentials['accessKeyId'],
        'aws_secret_access_key': credentials['secretAccessKey'],
        'aws_session_token': credentials['sessionToken'],
        'expiration': credentials['expiration']  # epoch milliseconds
    }

def check_and_refresh_credentials():
    """Check if credentials need refresh and return new ones"""
    # Your SSO configuration
    config = {
        'sso_region': 'us-east-1',
        'sso_account_id': '123456789012',
        'sso_role_name': 'YourRoleName'
    }
    
    return get_sso_credentials(**config)
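One detail worth noting: the expiration field in the get_role_credentials response is an epoch timestamp in milliseconds. A small helper like the following (an illustrative addition, not part of the AWS SDK) converts it and decides when a refresh is due:

```python
from datetime import datetime, timezone

def creds_expiry(expiration_ms):
    """Convert the epoch-millisecond expiration into an aware datetime."""
    return datetime.fromtimestamp(expiration_ms / 1000, tz=timezone.utc)

def needs_refresh(expiration_ms, margin_seconds=300):
    """True when the credentials expire within the safety margin."""
    remaining = creds_expiry(expiration_ms) - datetime.now(timezone.utc)
    return remaining.total_seconds() < margin_seconds

# Example: refresh when fewer than 5 minutes remain
# if needs_refresh(creds['expiration']):
#     creds = check_and_refresh_credentials()
```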

Method 3: Using AWS STS with MFA

For longer-lived credentials, use AWS STS GetSessionToken with MFA (note that this call requires long-term IAM user credentials; it cannot be made from an SSO-based profile):

bash
#!/bin/bash
MFA_ARN="arn:aws:iam::123456789012:mfa/your-user"
DURATION=86400  # 24 hours
PROFILE="your-profile"

echo -n "Enter MFA code: "
read TOKEN_CODE

# Get temporary credentials
CREDS=$(aws sts get-session-token \
    --serial-number "$MFA_ARN" \
    --token-code "$TOKEN_CODE" \
    --duration-seconds "$DURATION" \
    --profile "$PROFILE" \
    --output json)

# Extract credentials
AWS_ACCESS_KEY_ID=$(echo "$CREDS" | jq -r '.Credentials.AccessKeyId')
AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | jq -r '.Credentials.SecretAccessKey')
AWS_SESSION_TOKEN=$(echo "$CREDS" | jq -r '.Credentials.SessionToken')

# Export for use in scripts
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

Solutions for Long-Running Jobs

Solution 1: Automatic Credential Refresh in Metaflow

Modify your Metaflow script to handle credential refresh:

python
import os
import subprocess
from datetime import datetime, timedelta

import boto3
from metaflow import FlowSpec, step

class RefreshableCredentials:
    def __init__(self):
        self.credentials = None
        self.expiration = datetime.now()
        self.refresh_threshold = timedelta(minutes=30)
    
    def refresh_if_needed(self):
        """Refresh credentials if they're about to expire"""
        if datetime.now() > (self.expiration - self.refresh_threshold):
            self.refresh_credentials()
    
    def refresh_credentials(self):
        """Refresh AWS credentials by re-running aws sso login"""
        try:
            # Note: aws sso login normally opens a browser for sign-in,
            # so this step needs a human (or a pre-authorized device
            # flow) available when the refresh fires
            subprocess.run(['aws', 'sso', 'login'], check=True)
            
            # Create a new boto3 session backed by the fresh token
            session = boto3.Session()
            self.credentials = session.get_credentials()
            # Role credentials minted from the SSO token typically last
            # 1 hour; track a conservative expiry and refresh early
            self.expiration = datetime.now() + timedelta(hours=1)
            
        except subprocess.CalledProcessError as e:
            print(f"Failed to refresh credentials: {e}")
            raise

# Usage in Metaflow flow
class MyFlow(FlowSpec):
    @step
    def start(self):
        self.credentials = RefreshableCredentials()
        self.next(self.process_data)
    
    @step
    def process_data(self):
        # Refresh credentials before long operation
        self.credentials.refresh_if_needed()
        
        # Set environment variables for boto3
        os.environ['AWS_ACCESS_KEY_ID'] = self.credentials.credentials.access_key
        os.environ['AWS_SECRET_ACCESS_KEY'] = self.credentials.credentials.secret_key
        os.environ['AWS_SESSION_TOKEN'] = self.credentials.credentials.token
        
        # Your S3 operations here
        s3 = boto3.client('s3')
        # ... your code ...
        
        self.next(self.end)
    
    @step
    def end(self):
        pass

Solution 2: IAM Roles for EC2/ECS

If running on AWS infrastructure, use IAM roles instead of SSO credentials:

bash
# For EC2 instances
aws ec2 associate-iam-instance-profile \
    --instance-id i-1234567890abcdef0 \
    --iam-instance-profile Name=YourIAMRoleName

# For ECS tasks
aws ecs run-task \
    --cluster your-cluster \
    --task-definition your-task-definition \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-12345678],securityGroups=[sg-123456]}"

Solution 3: Long-Lived Credentials with MFA

For truly long-running jobs, use AWS STS GetSessionToken, which with MFA can issue credentials valid for up to 36 hours for IAM users:

python
import boto3
from datetime import datetime, timedelta

def get_long_lived_credentials():
    """Get credentials valid for up to 12 hours with MFA"""
    sts_client = boto3.client('sts')
    
    # Get MFA device ARN
    iam_client = boto3.client('iam')
    mfa_devices = iam_client.list_mfa_devices(UserName='your-username')
    mfa_arn = mfa_devices['MFADevices'][0]['SerialNumber']
    
    # Prompt for MFA code
    mfa_code = input("Enter MFA code: ")
    
    # Get session token
    response = sts_client.get_session_token(
        DurationSeconds=43200,  # 12 hours
        SerialNumber=mfa_arn,
        TokenCode=mfa_code
    )
    
    credentials = response['Credentials']
    expiration = credentials['Expiration']
    
    return {
        'access_key': credentials['AccessKeyId'],
        'secret_key': credentials['SecretAccessKey'],
        'session_token': credentials['SessionToken'],
        'expiration': expiration
    }
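To hand the returned credentials to other tools (including the AWS CLI and Metaflow), one option is to write them into a named profile. The following helper is an illustrative addition; the profile name is arbitrary, and the creds argument is the dict returned by get_long_lived_credentials() above:

```python
import configparser
import os

def write_profile(creds, profile="long-running-job",
                  path="~/.aws/credentials"):
    """Persist STS credentials as a named profile in the credentials file."""
    path = os.path.expanduser(path)
    config = configparser.ConfigParser()
    config.read(path)  # keep any existing profiles intact
    if not config.has_section(profile):
        config.add_section(profile)
    config.set(profile, "aws_access_key_id", creds["access_key"])
    config.set(profile, "aws_secret_access_key", creds["secret_key"])
    config.set(profile, "aws_session_token", creds["session_token"])
    with open(path, "w") as f:
        config.write(f)
    os.chmod(path, 0o600)  # keep the file private
    return path

# Usage: write_profile(get_long_lived_credentials()),
# then run jobs with AWS_PROFILE=long-running-job
```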

Metaflow Integration Best Practices

Configuration Approach 1: Environment Variables

Set up Metaflow to use environment variables for AWS credentials:

bash
# In your .bashrc or profile
export AWS_PROFILE=your-sso-profile
export AWS_CONFIG_FILE=~/.aws/config

# Metaflow configuration
export METAFLOW_DEFAULT_DATASTORE=s3
export METAFLOW_DATASTORE_SYSROOT_S3=s3://your-bucket/metaflow
export METAFLOW_DEFAULT_METADATA=service
export METAFLOW_SERVICE_URL=https://your-metadata-service

Configuration Approach 2: Metaflow AWS Plugin

Use the Metaflow AWS plugin for better integration:

python
from metaflow import FlowSpec, step, S3

class AWSFlow(FlowSpec):
    @step
    def start(self):
        # Metaflow's S3 helper is used as a context manager and resolves
        # credentials through the standard boto3 chain
        with S3(s3root='s3://your-bucket/your-prefix') as s3:
            obj = s3.get('your-file.txt')
            self.contents = obj.text
        
        self.next(self.process)
    
    @step
    def process(self):
        # S3 calls resolve credentials at call time, so a profile
        # refreshed between steps is picked up automatically
        self.next(self.end)
    
    @step
    def end(self):
        pass

Configuration Approach 3: Custom Credential Provider

Create a custom credential provider for Metaflow:

python
import os
from datetime import datetime, timezone

import boto3

class SSOCredentialProvider:
    """Hand out boto3 credentials, rebuilding the session near expiry."""
    
    REFRESH_MARGIN = 300  # seconds
    
    def __init__(self):
        self._credentials = None
    
    def get_credentials(self):
        # Reuse cached credentials while they still have time left;
        # boto3 Sessions cannot be refreshed in place, so near expiry
        # we build a new Session, which re-resolves the credential chain
        if self._credentials is None or self._expiring_soon():
            session = boto3.Session()
            self._credentials = session.get_credentials()
            if self._credentials is None:
                raise RuntimeError("No AWS credentials available")
        return self._credentials
    
    def _expiring_soon(self):
        # Refreshable credential objects carry a private _expiry_time;
        # static credentials never expire from boto3's point of view
        expiry = getattr(self._credentials, '_expiry_time', None)
        if expiry is None:
            return False
        remaining = expiry - datetime.now(timezone.utc)
        return remaining.total_seconds() < self.REFRESH_MARGIN

# Use in Metaflow
provider = SSOCredentialProvider()
creds = provider.get_credentials()
os.environ['AWS_ACCESS_KEY_ID'] = creds.access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = creds.secret_key
os.environ['AWS_SESSION_TOKEN'] = creds.token

Manual Token Management

When to Delete Cache Files

You should manually delete cache files in these scenarios:

bash
# Delete all SSO cache files
rm -rf ~/.aws/sso/cache/*

# Delete specific expired tokens
find ~/.aws/sso/cache -name "*.json" -mtime +1 -delete

# Check token expiration dates
ls -la ~/.aws/sso/cache/
cat ~/.aws/sso/cache/*.json | jq '.expiresAt'

Automated Cache Cleanup

Create a cron job to clean up old tokens:

bash
# Edit crontab
crontab -e

# Add this line to clean tokens older than 24 hours
0 2 * * * find ~/.aws/sso/cache -name "*.json" -mtime +1 -delete >/dev/null 2>&1

However, manually deleting tokens is not the recommended solution for your use case. It’s a temporary fix that requires manual intervention and doesn’t solve the underlying problem of long-running jobs exceeding token expiration.


Security Considerations

Risk of Token Storage

Be aware that AWS SSO cache files are stored in plain text, which poses security risks:

bash
# Check file permissions
ls -la ~/.aws/sso/cache/
# Should be: -rw------- (600 permissions)

Best Practices

  1. Know the two lifetimes - SSO session tokens last for the configured session duration, while the role credentials they mint typically last 1 hour
  2. Enable MFA - Always require multi-factor authentication for SSO access
  3. Use IAM roles - Favor instance roles over long-lived credentials
  4. Monitor token usage - Regularly review token expiration logs
  5. Implement credential rotation - Automatically refresh credentials before expiration

Alternative Approaches

For truly long-running jobs, consider these alternatives:

  1. Use AWS IAM Roles - Assign roles to EC2 instances, ECS tasks, or Lambda functions
  2. Use AWS STS AssumeRole - Get temporary credentials with custom durations
  3. Use OIDC Federation - Connect GitHub Actions or other CI/CD systems directly to AWS
  4. Use AWS Credentials File - Generate long-lived keys with proper security measures
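For the AssumeRole option, a minimal sketch follows. The function name and the injectable sts_client parameter are assumptions made so the code can be exercised without real AWS access; note that sessions longer than 1 hour require raising the role's maximum session duration in IAM (12 hours at most), and role chaining caps sessions at 1 hour:

```python
def assume_role_credentials(role_arn, session_name="long-job",
                            duration_seconds=3600, sts_client=None):
    """Assume a role for up to its configured maximum session duration.

    sts_client is injectable for testing; by default a real STS client
    is created.
    """
    if sts_client is None:
        import boto3  # deferred import: only needed for real AWS calls
        sts_client = boto3.client("sts")

    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=duration_seconds,
    )
    c = response["Credentials"]
    return {
        "access_key": c["AccessKeyId"],
        "secret_key": c["SecretAccessKey"],
        "session_token": c["SessionToken"],
        "expiration": c["Expiration"],
    }
```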

Conclusion

To prevent AWS SSO token expiration during long-running jobs:

  1. Implement automatic credential refresh using Python scripts that check token expiration and run aws sso login when needed
  2. Use AWS STS with MFA for session credentials lasting up to 36 hours (GetSessionToken for IAM users)
  3. Leverage IAM roles for infrastructure-based authentication when possible
  4. Integrate credential refresh directly into Metaflow flows using environment variable management
  5. Avoid manual token deletion as it’s not scalable and requires constant monitoring

The most sustainable solution is to implement programmatic credential refresh in your Metaflow workflows, ensuring seamless access to AWS resources without interruption. For production environments, consider using IAM roles or STS AssumeRole for better security and easier management.

Sources

  1. Using AWS Identity Center (SSO) tokens to script across multiple accounts - DEV Community
  2. What is SSO? - Single Sign-On Explained - AWS
  3. Automatically refresh temporary AWS CLI credentials with MFA - Medium
  4. AWS Security Token Service STS and usage - DEV Community
  5. AWS SSO | DBeaver Documentation