
CTE with DML: Updating Data in Queries

Data Processing Optimization: How to use CTE with UPDATE/INSERT operations, ensure transaction atomicity, and avoid locks when processing large volumes of data.

How can I retrieve data from source tables, process it, and update records in those tables?

I have two source tables, each with a flag indicating whether a row has been processed (initially false after insertion). I need to:

  1. Select data from these tables
  2. Process it (perform joins and other operations)
  3. Write the results to a separate table with calculated data
  4. Return the results to the client
  5. Update the processing flag for rows in the source tables
  6. Ensure rollback of changes if issues occur

Here’s my current approach without considering updates and inserts:

sql
with t1 as (
    -- first data table
    select * from table1
    where "IsProcessed" = false
), t2 as (
    -- second data table
    select * from table2
    where "IsProcessed" = false
),
-- data processing happens here (joins and other operations)
result as (
    -- this is the final result
)
select * from result;

Question: How bad of a practice would it be to include updates and inserts in CTEs? As I understand from the documentation, this is possible:

sql
with t1 as (
    -- first data table
    select * from table1
    where "IsProcessed" = false
), t2 as (
    -- second data table
    select * from table2
    where "IsProcessed" = false
),
-- data processing happens here (joins and other operations)
result as (
    -- this is the final result
),
upd1 as(
    update table1
    set "IsProcessed" = true
    from t1
    where t1.id = table1.id
),
upd2 as(
    update table2
    set "IsProcessed" = true
    from t2
    where t2.id = table2.id
),
ins1 as(
    insert into result_table
    select * from result
)
select * from result

If this approach is indeed problematic, what are better alternatives for solving this kind of task?

NeuroAgent

Your approach of using CTEs for data processing is viable, but it carries risks related to transaction management and error handling. PostgreSQL, whose syntax your example uses, does support data-modifying statements (UPDATE, INSERT) inside a WITH clause; however, this approach requires careful planning and additional precautions to ensure data integrity.


Analysis of Your Current Approach

Your current CTE-based approach has both strengths and potential issues:

Strengths:

  • Readability: CTEs allow breaking down complex logic into logical blocks
  • Modularity: Each processing stage is isolated and understandable
  • Reusability: A CTE can be referenced multiple times within the single query that defines it

Problematic Aspects:

sql
-- Problem: the DML sub-statements in a single WITH all see the same snapshot,
-- execute in no guaranteed order, and hold row locks until the statement completes
with t1 as (select * from table1 where "IsProcessed" = false),
     t2 as (select * from table2 where "IsProcessed" = false),
     result as (/* data processing */),
     upd1 as (update table1 set "IsProcessed" = true from t1 where t1.id = table1.id),
     upd2 as (update table2 set "IsProcessed" = true from t2 where t2.id = table2.id),
     ins1 as (insert into result_table select * from result)
select * from result;

As noted in the official Microsoft documentation, when an error occurs, all changes made up to that point in the current transaction are rolled back, including changes made by triggers.
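
If you do decide to keep everything in a single statement, PostgreSQL's data-modifying CTEs can at least be arranged so that the UPDATEs themselves supply the rows to process via RETURNING, and the statement as a whole is atomic. The following is only a minimal sketch: the columns column1/column2, the calculated_value expression, and the layout of result_table are assumptions, and the WHERE clauses flag every unprocessed row even if it has no join partner, which you may need to restrict.

sql
with upd1 as (
    -- mark table1 rows as processed and hand them on through RETURNING
    update table1
    set "IsProcessed" = true
    where "IsProcessed" = false
    returning *
), upd2 as (
    update table2
    set "IsProcessed" = true
    where "IsProcessed" = false
    returning *
), result as (
    -- data processing happens here (joins and other operations)
    select u1.id, u1.column1 + u2.column2 as calculated_value
    from upd1 u1
    join upd2 u2 on u2.id = u1.id
), ins as (
    insert into result_table (id, calculated_value)
    select id, calculated_value from result
)
select * from result;

Because this is one statement, a failure anywhere rolls back the flag updates and the insert together, without any explicit transaction handling for this step.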


Advantages and Disadvantages of CTE with DML Operations

Advantages:

  1. Atomicity: All operations are performed within a single transaction
  2. Readability: Processing logic is encapsulated in one place
  3. Complexity Management: Division into stages simplifies debugging

Disadvantages:

  1. Limited Scope: As noted on AlmaBetter Bytes, a CTE can only be referenced by the statement that immediately follows its definition and is only visible within the SELECT, INSERT, UPDATE, or DELETE statement in which it is defined (a short illustration follows this list).

  2. Locking Risks: Sequential UPDATE operations can cause long table locks

  3. Error Handling: Difficulty in implementing a reliable rollback mechanism for partial failures

  4. Performance: Packing all of the DML into one large statement can be harder to optimize and tune than running separate, simpler statements
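
The scope limitation from point 1 is easy to hit once the logic is split into several statements. A hypothetical two-statement script illustrates the failure mode:

sql
-- The CTE is attached to the INSERT and ceases to exist once it completes
WITH unprocessed AS (
    SELECT id FROM table1 WHERE "IsProcessed" = false
)
INSERT INTO result_table (id)
SELECT id FROM unprocessed;

UPDATE table1
SET "IsProcessed" = true
WHERE id IN (SELECT id FROM unprocessed);  -- fails: "unprocessed" is not visible here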


Best Practices for Transactions and Error Handling

1. Using Explicit Transactions with TRY-CATCH

sql
BEGIN TRY
    BEGIN TRANSACTION;
    
    -- Process data and calculate results; the CTEs are attached to the
    -- INSERT that follows and are not visible to later statements
    with t1 as (select * from table1 where "IsProcessed" = false),
         t2 as (select * from table2 where "IsProcessed" = false),
         result as (/* data processing */)
    -- Insert results
    INSERT INTO result_table
    SELECT * FROM result;
    
    -- Update processing flags (t1/t2 are out of scope here, so the filter is
    -- repeated; this assumes no new unprocessed rows arrive mid-transaction)
    UPDATE table1
    SET "IsProcessed" = true
    WHERE "IsProcessed" = false;
    
    UPDATE table2
    SET "IsProcessed" = true
    WHERE "IsProcessed" = false;
    
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    
    -- Error logging
    DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE();
    DECLARE @ErrorSeverity INT = ERROR_SEVERITY();
    DECLARE @ErrorState INT = ERROR_STATE();
    
    -- You can add logging to an error table
    INSERT INTO error_log (error_message, error_time)
    VALUES (@ErrorMessage, GETDATE());
    
    -- Re-raise error to client
    RAISERROR (@ErrorMessage, @ErrorSeverity, @ErrorState);
END CATCH

2. Batch Data Processing

As recommended on GeeksforGeeks, when processing large volumes of data, operations should be divided into smaller transactions or batches to avoid overloading the system:

sql
-- Process in batches of 1000 records
DECLARE @BatchSize INT = 1000;
DECLARE @ProcessedRows INT = 0;

WHILE EXISTS (SELECT 1 FROM table1 WHERE "IsProcessed" = false)
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;
        
        INSERT INTO result_table
        SELECT TOP (@BatchSize) *
        FROM table1 t1
        JOIN table2 t2 ON t1.id = t2.id
        WHERE t1."IsProcessed" = false
        ORDER BY t1.id;
        
        -- Note: this flags the first @BatchSize unprocessed rows of table1,
        -- including rows that had no match in table2 above
        UPDATE table1
        SET "IsProcessed" = true
        WHERE id IN (
            SELECT TOP (@BatchSize) id
            FROM table1
            WHERE "IsProcessed" = false
            ORDER BY id
        );
        
        COMMIT TRANSACTION;
        SET @ProcessedRows = @ProcessedRows + @BatchSize;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
        
        -- Handle error
        BREAK;
    END CATCH
END

3. Setting XACT_ABORT

So that any run-time error aborts and rolls back the entire transaction, it’s recommended to set SET XACT_ABORT ON:

sql
SET XACT_ABORT ON;
BEGIN TRY
    BEGIN TRANSACTION;
    -- your processing logic
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    -- handle error
END CATCH

Alternative Approaches to Your Task

1. Separation into Stored Procedures

sql
CREATE PROCEDURE ProcessData
AS
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;
        
        -- Step 1: Process and save results
        INSERT INTO result_table
        SELECT * FROM table1 t1
        JOIN table2 t2 ON t1.id = t2.id
        WHERE t1."IsProcessed" = false;
        
        -- Step 2: Update flags
        UPDATE table1
        SET "IsProcessed" = true
        WHERE id IN (
            SELECT DISTINCT t1.id 
            FROM table1 t1
            JOIN table2 t2 ON t1.id = t2.id
            WHERE t1."IsProcessed" = false
        );
        
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
        
        -- Log error
        INSERT INTO error_log (error_message, error_time)
        VALUES (ERROR_MESSAGE(), GETDATE());
        
        RAISERROR ('Error processing data', 16, 1);
    END CATCH
END;

2. Using Temporary Tables

sql
BEGIN TRY
    BEGIN TRANSACTION;
    
    -- Create temporary tables for unprocessed data
    SELECT * INTO #temp_table1 
    FROM table1 
    WHERE "IsProcessed" = false;
    
    SELECT * INTO #temp_table2 
    FROM table2 
    WHERE "IsProcessed" = false;
    
    -- Process data
    INSERT INTO result_table
    SELECT * FROM #temp_table1 t1
    JOIN #temp_table2 t2 ON t1.id = t2.id;
    
    -- Update original tables
    UPDATE t1
    SET t1."IsProcessed" = true
    FROM table1 t1
    JOIN #temp_table1 temp ON t1.id = temp.id;
    
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    
    -- Clean up temporary tables if they were created before the error
    DROP TABLE IF EXISTS #temp_table1, #temp_table2;
    
    RAISERROR ('Data processing error', 16, 1);
END CATCH

3. Pattern Using OUTPUT Clause

sql
BEGIN TRY
    BEGIN TRANSACTION;
    
    -- Capture the ids of updated records via OUTPUT ... INTO
    DECLARE @updated1 TABLE (id INT);
    DECLARE @updated2 TABLE (id INT);
    
    UPDATE table1
    SET "IsProcessed" = true
    OUTPUT inserted.id INTO @updated1
    WHERE "IsProcessed" = false;
    
    -- Similarly for table2
    UPDATE table2
    SET "IsProcessed" = true
    OUTPUT inserted.id INTO @updated2
    WHERE "IsProcessed" = false;
    
    -- @updated1 / @updated2 now drive the insert into result_table
    -- and record exactly which rows were processed
    
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    
    RAISERROR ('Flag update error', 16, 1);
END CATCH

Recommended Approach

Based on the analysis, I recommend the following approach:

1. Main Query with Logic Separation

sql
-- Main stored procedure
CREATE PROCEDURE ProcessAndReturnData
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;
    
    BEGIN TRY
        BEGIN TRANSACTION;
        
        -- Step 1: Calculate results (CTEs are only visible to the statement
        -- that follows them, so materialize the result into a temp table)
        with unprocessed_data as (
            SELECT t1.id, t1.column1, t2.column2
            FROM table1 t1
            JOIN table2 t2 ON t1.id = t2.id
            WHERE t1."IsProcessed" = false AND t2."IsProcessed" = false
        ),
        calculated_results as (
            SELECT 
                id,
                column1 + column2 as calculated_value
                /* other calculations */
            FROM unprocessed_data
        )
        SELECT id, calculated_value /* other fields */
        INTO #calculated_results
        FROM calculated_results;
        
        -- Step 2: Insert results
        INSERT INTO result_table (id, calculated_value /* other fields */)
        SELECT id, calculated_value /* other fields */
        FROM #calculated_results;
        
        -- Step 3: Update processing flags
        UPDATE table1
        SET "IsProcessed" = true
        WHERE id IN (SELECT id FROM #calculated_results);
        
        UPDATE table2
        SET "IsProcessed" = true
        WHERE id IN (SELECT id FROM #calculated_results);
        
        -- Step 4: Return result to client
        SELECT * FROM #calculated_results;
        
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
        
        -- Log error
        DECLARE @Error NVARCHAR(4000) = ERROR_MESSAGE();
        INSERT INTO error_log (error_message, error_time)
        VALUES (@Error, GETDATE());
        
        -- Re-raise error
        RAISERROR ('Data processing error: %s', 16, 1, @Error);
    END CATCH
END;

2. Additional Recommendations:

  1. Performance Monitoring:

    • Add timestamps to measure execution time for each stage
    • Use statistics on batch sizes for optimization
  2. Retry on Errors:

    • Implement a retry mechanism for transient errors (see the sketch after this list)
    • Keep track of failed operations for manual analysis
  3. Security:

    • Use parameterized queries instead of concatenating user input into SQL, to prevent SQL injection
    • Implement access rights at table and procedure levels
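
For the retry point above, a common pattern is to re-run the whole operation a limited number of times when the failure is transient, such as being chosen as a deadlock victim (error 1205). This is only a sketch: it assumes the ProcessAndReturnData procedure from the recommended approach, and the attempt count and delay are placeholders.

sql
DECLARE @Attempt INT = 1, @MaxAttempts INT = 3;

WHILE @Attempt <= @MaxAttempts
BEGIN
    BEGIN TRY
        EXEC ProcessAndReturnData;   -- assumed procedure from above
        BREAK;                       -- success, stop retrying
    END TRY
    BEGIN CATCH
        -- retry only transient errors; re-raise everything else
        IF ERROR_NUMBER() = 1205 AND @Attempt < @MaxAttempts
        BEGIN
            SET @Attempt = @Attempt + 1;
            WAITFOR DELAY '00:00:05';  -- back off before the next attempt
        END
        ELSE
            THROW;
    END CATCH
END

Note that if the procedure re-raises errors with RAISERROR and a text message, the outer CATCH sees error number 50000 rather than the original one, so the transient-error check may be better placed inside the procedure’s own CATCH block.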

Performance Optimization

1. Indexing

Ensure you have appropriate indexes (see the sketch after this list):

  • On columns used in JOIN conditions
  • On the filter column ("IsProcessed")
  • On columns used in ORDER BY
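
Since queries only ever read the unprocessed rows, a filtered (partial) index on the flag keeps the index small and focused. A minimal sketch in PostgreSQL partial-index syntax (the index names are made up; SQL Server offers the same idea as filtered indexes):

sql
-- only rows still waiting to be processed are kept in the index
CREATE INDEX idx_table1_unprocessed
    ON table1 (id)
    WHERE "IsProcessed" = false;

CREATE INDEX idx_table2_unprocessed
    ON table2 (id)
    WHERE "IsProcessed" = false;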

2. Batch Processing

For large tables, implement batch processing:

sql
DECLARE @BatchSize INT = 5000;
DECLARE @MaxID INT = (SELECT MAX(id) FROM table1 WHERE "IsProcessed" = false);
DECLARE @CurrentID INT = 0;

WHILE @CurrentID < @MaxID
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;
        
        -- Process batch
        INSERT INTO result_table
        SELECT * FROM table1 t1
        JOIN table2 t2 ON t1.id = t2.id
        WHERE t1."IsProcessed" = false
        AND t1.id > @CurrentID
        AND t1.id <= @CurrentID + @BatchSize;
        
        -- Update flags
        UPDATE table1
        SET "IsProcessed" = true
        WHERE id > @CurrentID AND id <= @CurrentID + @BatchSize
        AND "IsProcessed" = false;
        
        COMMIT TRANSACTION;
        SET @CurrentID = @CurrentID + @BatchSize;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
        
        -- Log error and exit
        BREAK;
    END CATCH
END

3. Parallel Processing

For very large data volumes, consider using parallel queries or distributed processing.
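
One way to let several workers run in parallel without blocking each other is PostgreSQL's FOR UPDATE SKIP LOCKED, which effectively turns the source table into a work queue: each worker claims its own batch, rows already locked by another worker are skipped, and the flag changes become visible only when the batch commits. A minimal sketch with the same assumptions as before (an id key, a result_table with an id column, the real calculations omitted):

sql
with claimed as (
    -- claim a batch; rows locked by concurrent workers are skipped, not waited on
    select id
    from table1
    where "IsProcessed" = false
    order by id
    limit 1000
    for update skip locked
), upd as (
    update table1
    set "IsProcessed" = true
    where id in (select id from claimed)
)
insert into result_table (id)
select id from claimed;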


Sources

  1. Inserts and Updates with CTEs in SQL Server
  2. Multiple DML Operations Inside CTE
  3. SQL Data Manipulation Guide
  4. Best Practices for DML in Google Spanner
  5. CTE in SQL on GeeksforGeeks
  6. Update Records from CTE in SQL
  7. SQL Transactions and Management
  8. SQL Server Transaction Overview
  9. Effective Transaction Management in MS SQL Stored Procedures
  10. Batching Transactions

Conclusion

  1. Your approach with CTE and DML operations is technically possible but requires careful planning for error handling and transaction management.

  2. Main recommendations:

    • Always use explicit transactions with TRY-CATCH blocks
    • Design rollback logic for partial failures
    • Implement batch processing for large data volumes
    • Use SET XACT_ABORT ON so that any run-time error aborts and rolls back the whole transaction
  3. Alternative approaches to consider:

    • Separating logic into multiple stored procedures
    • Using temporary tables for operation isolation
    • OUTPUT clause pattern for change tracking
    • Asynchronous processing using queues
  4. Additional considerations:

    • Monitor performance and indexing
    • Implement a retry mechanism for temporary errors
    • Maintain logging for failure analysis

Ultimately, the choice of approach depends on your specific requirements for performance, reliability, and system scalability. For critical operations, I recommend using more conservative approaches with clear separation of processing stages.