From Ad-Hoc to Engineering-Grade: A Practical Guide to Integrating dbt into Your Data Team’s Workflow

Introduction

So, your team has adopted dbt. You’ve run through the tutorials, built a few models, and seen the magic of turning raw data into clean, analysis-ready tables with simple SQL SELECT statements. But what comes next? The path from your first dbt run to managing a complex, production-grade data pipeline used by the entire company is often unclear. Many guides teach you the what of dbt’s commands, but not the how of integrating it into your team’s day-to-day operations.

This gap is where data projects often falter. Without established standards, a dbt project can quickly devolve into a “wild west” of inconsistent naming, untested logic, and undocumented models. The result is a system that no one trusts and everyone is afraid to touch.

The true power of dbt lies not just in its ability to transform data, but in its capacity to bring software engineering best practices to analytics. It provides the framework for version control, automated testing, documentation, and collaboration. To unlock this potential, your team needs more than just a tool; it needs a workflow.

This article provides a practical, opinionated blueprint for integrating dbt into your data team’s processes. We will move beyond basic commands and provide a series of actionable checklists and real-world examples covering the entire lifecycle of a dbt project. From establishing clear naming conventions and a consistent SQL style to defining a robust pull request process, a safe model deprecation strategy, and a comprehensive testing framework, you will learn how to build a dbt workflow that is reliable, scalable, and collaborative.

Part 1: The Blueprint — Establishing a Foundation with Clear Naming Conventions

Before you write a single line of complex business logic, the most impactful decision you can make is establishing a consistent naming convention. A well-defined structure for your folders, models, and columns is the bedrock of a scalable and maintainable dbt project. Without it, your project becomes a tangled web where orders, order_details, and final_orders might mean the same thing, or entirely different things, depending on who wrote them.

This section provides a prescriptive guide to organizing your dbt project. By following these conventions, new team members can quickly understand the data flow, and veterans can navigate the project with ease.

1. Folder Organization: From Raw to Ready

The first step is to structure your models directory to reflect the flow of data. The most common and effective pattern uses three layers: staging, intermediate, and marts.

  • staging: This layer is for cleaning and standardizing raw data from your sources. Each model should correspond one-to-one with a source table. Transformations here are light: renaming columns, casting data types, and basic cleaning.
  • intermediate: This is the “workshop” layer where you build modular pieces of business logic. Models here often join different staging models or perform complex aggregations. They are not meant for end-users but serve as building blocks for your final mart models.
  • marts: This is the final, business-facing layer. These models are clean, well-documented, and ready for consumption by BI tools, data scientists, or other stakeholders. They represent distinct business entities or concepts, like customers or monthly_revenue.

A well-organized project directory should look like this:

models/
├── staging/
│   ├── salesforce/
│   │   ├── stg_salesforce__accounts.sql
│   │   └── stg_salesforce__opportunities.sql
│   └── stripe/
│       └── stg_stripe__payments.sql
├── intermediate/
│   ├── finance/
│   │   └── int_orders__with_taxes.sql
│   └── marketing/
│       └── int_campaigns__performance.sql
└── marts/
    ├── finance/
    │   └── mart_monthly_revenue.sql
    └── core/
        ├── mart_customers.sql
        └── mart_orders.sql

2. Model Naming: Prefixes for Predictability

Model file names should tell you exactly what layer they belong to and what they represent.

  • Staging Models (stg_): Prefix with stg_ followed by the source name and the entity, separated by double underscores. The __ is a dbt naming convention that separates the source system from the entity, making lineage easy to scan in the DAG.
    • Example: stg_stripe__payments.sql clearly indicates it’s the staging model for the payments table from stripe.
  • Intermediate Models (int_): Prefix with int_ and name them based on the logical transformation they perform.
    • Example: int_customers__first_and_last_orders.sql is an intermediate model that calculates the first and last order dates for each customer.
  • Mart Models (mart_): Prefix with mart_ and give them clear, business-friendly names.
    • Example: mart_monthly_revenue.sql is a final model ready for financial reporting.

3. Column Naming: The Language of Your Data

Consistent column naming is critical for making your final data models intuitive. When a data scientist sees order_amount_usd and created_at in every mart, they don’t have to guess the currency or format.

Key Rules:

  • IDs: Be specific. Use customer_id and order_id, not a generic id. This makes joins unambiguous.
  • Timestamps: Use the _at suffix (e.g., created_at, updated_at).
  • Booleans: Use prefixes like is_ or has_ (e.g., is_active, has_ordered).
  • Amounts: Include units or currency (e.g., order_amount_usd, shipping_weight_kg).

A key task in staging is to enforce these conventions. For example, when staging data from two different sources, you should standardize their column names immediately.

-- models/staging/salesforce/stg_salesforce__opportunities.sql
SELECT
    id AS opportunity_id,
    created_date AS created_at,
    amount AS opportunity_amount_usd,
    is_won AS is_won_deal
FROM {{ source('salesforce', 'opportunities') }}

-- models/staging/stripe/stg_stripe__payments.sql
SELECT
    id AS payment_id,
    created AS created_at,
    -- Convert from cents to dollars
    amount / 100 AS payment_amount_usd,
    status = 'succeeded' AS is_successful_payment
FROM {{ source('stripe', 'payments') }}
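These source() calls assume the raw tables are declared as dbt sources in a YAML file. A minimal sketch of that declaration (the file location and schema names are assumptions; adjust them to your warehouse):

# models/staging/sources.yml (illustrative; teams often split this per source system)
version: 2

sources:
  - name: salesforce
    schema: raw_salesforce
    tables:
      - name: accounts
      - name: opportunities
  - name: stripe
    schema: raw_stripe
    tables:
      - name: payments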

Your Quick Reference: Naming Convention Checklist

Use this checklist to ensure every contribution to your dbt project is consistent and clean.

📋 dbt Project Naming Convention Checklist


1. Folder Organization & Asset Structure

Directory Structure

  • Did I organize models by layer and business domain?

    models/
    ├── staging/
    │   ├── salesforce/
    │   │   ├── stg_salesforce__accounts.sql
    │   │   └── stg_salesforce__opportunities.sql
    │   └── stripe/
    │       └── stg_stripe__payments.sql
    ├── intermediate/
    │   ├── finance/
    │   │   └── int_orders__with_taxes.sql
    │   └── marketing/
    │       └── int_campaigns__performance.sql
    └── marts/
        ├── finance/
        │   └── mart_monthly_revenue.sql
        └── core/
            ├── mart_customers.sql
            └── mart_orders.sql
    

Asset Placement Rules

  • Did I place raw data transformations in staging/?
  • Did I place business logic and calculations in intermediate/?
  • Did I place final business-ready models in marts/?

2. Model Naming Standards

Staging Models

  • Did I prefix staging models with stg_ + source system?

    -- ✅ Good
    stg_salesforce__accounts.sql
    stg_stripe__payments.sql
      
    -- ❌ Bad
    salesforce_accounts.sql
    raw_payments.sql
    

Intermediate Models

  • Did I prefix intermediate models with int_ and use double underscores?

    -- ✅ Good
    int_orders__daily_summary.sql
    int_customers__lifetime_value.sql
      
    -- ❌ Bad
    daily_orders.sql
    intermediate_orders.sql
    

Mart Models

  • Did I prefix mart models with mart_ for consistency?

    -- ✅ Good (marts/)
    mart_customers.sql
    mart_monthly_revenue.sql
      
    -- ❌ Bad
    customers.sql              -- Missing prefix
    final_orders.sql           -- Wrong prefix
    

3. Column Naming Standards

Primary Keys & Foreign Keys

  • Did I use consistent ID naming patterns?

    -- ✅ Good
    customer_id          -- Primary key in customers table
    order_id            -- Primary key in orders table
      
    -- ❌ Bad
    id                  -- Too generic
    cust_id            -- Abbreviations
    

Timestamps & Booleans

  • Did I use _at suffix for timestamps and is_/has_/can_ prefix for booleans?

    -- ✅ Good
    created_at
    updated_at
    is_active
    has_orders
      
    -- ❌ Bad
    created_date
    active
    

Amounts & Measurements

  • Did I include units/currency in column names?

    -- ✅ Good
    order_amount_usd
    shipping_weight_kg
    duration_minutes
      
    -- ❌ Bad
    amount             -- What currency?
    weight             -- What unit?
    

Standardized Column Transformations

  • Did I standardize column names from different sources?

    -- From Salesforce staging
    select
      id as opportunity_id,
      name as opportunity_name,
      created_date as created_at,
      amount as opportunity_amount_usd
    from {{ source('salesforce', 'opportunities') }}
    

4. Naming Macros for Consistency (Advanced)

Column Standardization Macros

  • Did I create macros for common column transformations?

    -- macros/standardize_columns.sql
    {% macro standardize_id_column(source_column, entity_name) %}
      {{ source_column }} as {{ entity_name }}_id
    {% endmacro %}
    

Using Naming Macros with Examples

  • Did I apply naming macros consistently across staging models?

    Input Code:

    -- stg_salesforce__opportunities.sql
    select
      {{ standardize_id_column('id', 'opportunity') }},
      ...
    from {{ source('salesforce', 'opportunities') }}
    

    Compiled Output:

    -- After dbt compile
    select
      id as opportunity_id,
      ...
    from raw_salesforce.opportunities
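The same pattern can extend to other recurring transformations. For example, a hypothetical macro (not part of the original project) could standardize the cents-to-dollars conversion used in the Stripe staging model:

    -- macros/cents_to_usd.sql (illustrative sketch)
    {% macro cents_to_usd(column_name) %}
        ({{ column_name }} / 100.0)
    {% endmacro %}

    -- usage in stg_stripe__payments.sql
    {{ cents_to_usd('amount') }} as payment_amount_usd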
    

Part 2: Writing Readable Code — Your Team’s SQL Style Guide

Now that your project has a logical folder structure and consistent naming, the next step is to standardize the code inside each .sql file. If naming conventions provide the blueprint for your data warehouse, a style guide provides the clean, consistent handwriting that makes the blueprint legible for everyone.

An inconsistent style—where one developer uses trailing commas and lowercase keywords, while another prefers leading commas and uppercase—creates friction. The code becomes harder to read, review, and debug. By adopting a unified SQL style, you reduce the cognitive load on your team, allowing them to focus on the business logic rather than deciphering formatting quirks.

The cornerstone of modern, readable dbt SQL is the use of Common Table Expressions (CTEs). CTEs allow you to break down complex transformations into a series of logical, sequential steps. Each CTE acts as a temporary, named table, making the flow of data from source to final output easy to follow. This approach is vastly superior to nested subqueries or monolithic scripts, which quickly become unmanageable.

The Standard dbt Model Structure

A clean dbt model should follow a predictable pattern:

  1. Import CTEs: Start with CTEs that select from your source or ref models. These are your building blocks.
  2. Logical CTEs: Perform transformations in a series of intermediate CTEs. Each CTE should have a single, clear purpose, such as joining two tables, calculating an aggregation, or pivoting data.
  3. Final SELECT: Conclude with a final SELECT statement that pulls from your last logical CTE, renames columns for clarity, and presents the final output of the model.

Here’s a practical example of a well-structured model that calculates customer lifetime value:

/*
This model calculates the lifetime value for each customer.
It joins orders with payments to get the total amount per customer.
*/

-- 1. Import CTEs from staging models
WITH stg_orders AS (
    SELECT
        order_id,
        customer_id
    FROM {{ ref('stg_orders') }}
),

stg_payments AS (
    SELECT
        order_id,
        payment_amount_usd
    FROM {{ ref('stg_payments') }}
),

-- 2. Logical CTE to join orders and payments
order_payments AS (
    SELECT
        stg_orders.order_id,
        stg_orders.customer_id,
        stg_payments.payment_amount_usd
    FROM stg_orders
    LEFT JOIN stg_payments
        ON stg_orders.order_id = stg_payments.order_id
),

-- 3. Logical CTE to aggregate at the customer level
customer_lifetime_value AS (
    SELECT
        customer_id,
        SUM(payment_amount_usd) AS total_revenue_usd,
        COUNT(DISTINCT order_id) AS total_orders
    FROM order_payments
    GROUP BY 1 -- Use ordinal position for conciseness
)

-- 4. Final SELECT statement to clean up for the mart
SELECT
    customer_id,
    total_revenue_usd,
    total_orders
FROM customer_lifetime_value

Your Quick Reference: SQL Style Guide Checklist

To enforce these best practices, provide your team with a clear checklist. This guide ensures that every model, regardless of author, is formatted consistently.

📜 dbt SQL Style Guide & Best Practices


This guide outlines the standard format for writing SQL within our dbt models. Following these rules ensures that our code is readable, consistent, and easy to maintain for everyone on the team.

1. General Formatting Rules

Keywords and Identifiers

  • Did I use uppercase for all SQL keywords?

    • Why? It visually separates the SQL language’s structure from our specific table and column names.
    -- ✅ Good
    SELECT
        customer_id,
        order_date
    FROM stg_orders
    WHERE is_active = true;
      
    -- ❌ Bad
    select
        customer_id,
        order_date
    from stg_orders
    where is_active = true;
    

Indentation and Commas

  • Did I use leading commas and consistent indentation (2 or 4 spaces)?

    • Why? Leading commas make it easier to add, remove, or comment out columns without causing syntax errors. It also improves readability.
    -- ✅ Good: Easy to comment out a column
    SELECT
      order_id
      , customer_id
    --  , order_status
      , order_date
    FROM stg_orders
      
    -- ❌ Bad: Commenting out requires fixing the previous line
    SELECT
      order_id,
      customer_id, -- Now I have to remove the comma here if I comment out order_status
      order_status,
      order_date
    FROM stg_orders
    

2. Query Structure with CTEs

  • Did I structure my model using Common Table Expressions (CTEs)?
    • Why? CTEs break down complex logic into clean, sequential, and debuggable steps. Each CTE should perform a single, logical transformation.

The Standard CTE Flow

  1. Import CTEs: Start by selecting from source() or ref() models.
  2. Logical CTEs: Perform transformations (joins, aggregations, window functions) in intermediate steps.
  3. Final CTE and SELECT: End with a final CTE that selects and renames the output columns, followed by a simple SELECT * FROM final.

Complete Example Structure

-- ✅ Good: A full, readable model structure
WITH stg_orders AS (
    SELECT ... FROM {{ ref('stg_orders') }}
),

stg_payments AS (
    SELECT ... FROM {{ ref('stg_payments') }}
),

order_payments AS (
    SELECT ... FROM stg_orders JOIN stg_payments ...
),

final AS (
    SELECT ... FROM order_payments
)

SELECT * FROM final

3. Formatting Specific Clauses

JOINs

  • Did I format JOINs for clarity?

    • Why? Placing each clause on a new line with clear aliases makes complex joins much easier to read and understand.
    -- ✅ Good
    FROM stg_orders AS o
    LEFT JOIN stg_customers AS c
        ON o.customer_id = c.customer_id
      
    -- ❌ Bad
    FROM stg_orders o LEFT JOIN stg_customers c ON o.customer_id = c.customer_id
    

CASE Statements

  • Did I properly indent my CASE statements?

    • Why? Indentation reveals the logical flow of the conditions and makes the statement scannable.
    -- ✅ Good
    CASE
        WHEN order_status = 'shipped' AND is_delivered THEN 'Completed'
        WHEN order_status = 'returned' THEN 'Returned'
        ELSE 'In Progress'
    END AS order_category
      
    -- ❌ Bad
    CASE WHEN order_status = 'shipped' AND is_delivered THEN 'Completed' WHEN order_status = 'returned' THEN 'Returned' ELSE 'In Progress' END AS order_category
    

Comments

  • Did I document the model’s purpose and any tricky logic?

    • Why? Good comments help future developers (including your future self!) understand the business context and the reason behind specific code choices.
    -- ✅ Good
    /*
    Model: mart_customer_segments
    Description: This model segments customers based on their purchasing frequency
    and monetary value (RFM analysis).
    */
    ...
    -- This logic handles a known data quality issue where legacy orders have null dates.
    -- We default them to the start of the fiscal year.
    COALESCE(order_date, '2020-01-01') AS cleaned_order_date
    

Part 3: Extending Your Toolkit — A Style Guide for Python Models

While SQL is the foundation of dbt, Python models open the door to advanced statistics, machine learning, and complex data manipulations that are cumbersome in SQL. However, this flexibility can come at a cost. Without clear standards, your dbt project can become littered with inconsistent, difficult-to-debug Python scripts that operate as “black boxes.”

To prevent this, it’s crucial to apply the same level of discipline to your Python code as you do to your SQL. A consistent style guide ensures that your Python models are readable, maintainable, and seamlessly integrated into your dbt workflow, regardless of who writes them.

The Anatomy of a Well-Styled Python Model

A clean Python model is predictable. It has a clear entry point, separates concerns, and is easy for a teammate to understand at a glance. The fundamental structure revolves around a single model function that ingests data via dbt.ref() and returns a DataFrame.

Let’s look at a simple example that calculates the ratio of a customer’s revenue to the average revenue of all customers. This model, while simple, demonstrates all the elements of good style.

# models/intermediate/int_customer_revenue_ratio.py

# 1. Imports are clean and at the top
import pandas as pd

# 2. Function is well-defined with type hints and a docstring
def model(dbt, session) -> pd.DataFrame:
    """
    Calculates each customer's revenue as a ratio to the
    overall average customer revenue.

    Inputs:
      - ref('mart_customers'): A model with customer-level revenue.

    Outputs:
      - A DataFrame with customer_id and their revenue ratio.
    """
    
    # 3. dbt config is the first line inside the function
    dbt.config(materialized="table")

    # 4. Load data using dbt.ref()
    customers_df = dbt.ref("mart_customers").to_pandas()

    # 5. Logic is clear, with descriptive variable names
    # Avoids "magic numbers" by calculating the average first
    average_revenue = customers_df["total_revenue_usd"].mean()

    customers_df["revenue_to_avg_ratio"] = (
        customers_df["total_revenue_usd"] / average_revenue
    )
    
    # 6. Select and rename final columns for a clean output
    final_df = customers_df[["customer_id", "revenue_to_avg_ratio"]]

    # 7. Return the final DataFrame
    return final_df

This structure is a blueprint. It makes the code easy to test, debug, and for new team members to contribute to confidently.

Your Quick Reference: Python Model Style Checklist

Use this checklist during development and code reviews to enforce a high standard of quality for all Python models in your project.

🐍 dbt Python Model Style Checklist


1. Structure & Readability

  • Did I use a clear function signature with type hints and a docstring?

    • Why? It explains the model’s purpose and makes it clear what kind of object is expected and returned, which helps prevent runtime errors.
    # ✅ Good
    def model(dbt, session) -> pd.DataFrame:
        """Calculates user engagement scores based on recent activity."""
        ...
        return final_df
      
    # ❌ Bad
    def model(dbt, session): # No type hints or docstring
        # ... what does this do?
        return df
    
  • Is dbt.config() the first line inside the model function?

    • Why? It makes the model’s materialization strategy immediately obvious to anyone reading the code.
    # ✅ Good
    def model(dbt, session) -> pd.DataFrame:
        dbt.config(materialized="table")
        customers_df = dbt.ref("mart_customers").to_pandas()
        ...
      
    # ❌ Bad
    def model(dbt, session) -> pd.DataFrame:
        customers_df = dbt.ref("mart_customers").to_pandas()
        # ... some logic ...
        dbt.config(materialized="table") # Config is hidden at the end
        ...
    
  • Are all imports located at the top of the file?

    • Why? It provides a clear, upfront summary of all external dependencies.
    # ✅ Good
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
      
    def model(dbt, session):
        ...
      
    # ❌ Bad
    def model(dbt, session):
        import pandas as pd # Hidden import
        ...
    

2. Code Quality & Formatting

  • Does the code follow PEP 8 style guidelines (e.g., snake_case for variables and functions)?

    • Why? It’s the community standard for Python code, making it universally readable. Use a tool like Black or Ruff to automate this (example commands at the end of this section).
    # ✅ Good
    daily_revenue_df = dbt.ref("mart_daily_revenue").to_pandas()
      
    def calculate_moving_average(df, window_size):
        ...
      
    # ❌ Bad
    DailyRevenueDF = dbt.ref("mart_daily_revenue").to_pandas() # Not snake_case
      
    def Calculate_MA(df, windowSize): # Not snake_case
        ...
    
  • Are complex steps or “magic numbers” explained with comments?

    • Why? It clarifies the business logic or rationale behind a technical choice.
    # ✅ Good
    # Calculate churn if no activity in the last 90 days
    churn_threshold_days = 90
    df["is_churned"] = df["days_since_last_login"] > churn_threshold_days
      
    # ❌ Bad
    df["is_churned"] = df["days_since_last_login"] > 90 # Why 90?
    
  • Does the model avoid hardcoding file paths, model names, or credentials?

    • Why? Hardcoding makes models brittle and difficult to run in different environments (dev, prod). Use dbt.ref() and environment variables instead.
    # ✅ Good
    customers_df = dbt.ref("mart_customers").to_pandas()
      
    # ❌ Bad
    # This will break when `mart_customers` changes or in CI/CD
    customers_df = pd.read_csv("target/mart_customers.csv")
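As noted above, formatting is best automated rather than enforced by hand. A minimal sketch of the commands (the models/ path is an assumption; point the tools at wherever your Python models live):

    # Format Python models with Black
    black models/

    # Or lint and format with Ruff
    ruff check models/ --fix
    ruff format models/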
    

3. dbt Integration & Documentation

  • Is the Python model documented in a schema.yml file?

    • Why? Python models are first-class citizens in your DAG. They should have descriptions, and their columns should be documented and tested just like any SQL model.
    # ✅ Good
    models:
      - name: int_customer_revenue_ratio
        description: "Python model that calculates each customer's revenue ratio to the mean."
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
          - name: revenue_to_avg_ratio
            description: "Ratio of the customer's total revenue to the average of all customers."
    
  • Are all required Python packages declared for the model’s execution environment?

    • Why? This ensures that the model can run reliably in any environment (local, CI, production) by defining its dependencies. For warehouse-executed Python models, packages are typically declared in dbt.config(); keep any separate dependency file your platform uses in sync.
    # Declare packages inside the model function
    dbt.config(
        materialized="table",
        packages=["pandas", "scikit-learn", "statsmodels"]
    )
    

I’ve also created a comprehensive Python code styling guide for data teams, which you can explore after finishing this piece.

Part 4: The Gateway to Production — A Robust Pull Request Workflow

You’ve established clear naming conventions and a consistent style guide. Now, how do you enforce them? The answer lies in the Pull Request (PR). In modern data teams, the PR is the single most important control point for ensuring quality, collaboration, and stability in your dbt project. It transforms analytics development from a solo, ad-hoc activity into a transparent, peer-reviewed engineering discipline.

A successful PR workflow relies on a clear, repeatable set of steps that every team member follows. The developer is responsible for ensuring their changes are complete, tested, and well-documented. The reviewer is responsible for validating the logic, checking for adherence to standards, and providing constructive feedback.

The Anatomy of a Great Pull Request

A great PR tells a story. It clearly explains what was changed, why it was changed, and how it was tested. This makes the reviewer’s job easier and serves as a permanent historical record for future developers.

A well-structured PR should always include:

  1. A Descriptive Title: Summarize the change concisely (e.g., “feat: Add mart_customer_churn model”).
  2. A Link to the Ticket: Connect the work to a business request (e.g., a JIRA or Asana ticket).
  3. A Clear Description: Explain the context and the approach taken.
  4. Proof of Testing: Show that all relevant dbt tests have passed.

Your Quick Reference: Pull Request Checklist

This checklist serves as a “pre-flight” guide for any developer preparing to submit their work. By completing these steps, you ensure that your PR is ready for an efficient and effective review.

📝 dbt Pull Request (PR) Development Checklist with Code Examples


1. Branch Management

  • Create a new branch from main or develop.

    git checkout main               # switch to base branch
    git pull                        # ensure it is up-to-date
    git checkout -b feature/new-churn-model
    

2. Development & Style

  • Add or update models, tests, seeds, or macros as required (consult your organization’s testing guide).

  • Ensure all new and modified assets follow your team’s standards.

  • Compile and check code for syntax errors.

    dbt compile
    

3. Test Coverage

  • Add appropriate schema and logic tests for any new or changed logic.

    # models/marts/schema.yml
    models:
      - name: mart_customer_churn
        columns:
          - name: customer_id
            tests: [not_null, unique]
          - name: is_churned
            description: "True if the customer has not made a purchase in 90 days."
            tests:
              - accepted_values:
                  values: [true, false]
    
  • Run all dbt tests locally and resolve any failures.

    dbt build --select +mart_customer_churn # Build the model plus its upstream dependencies and run its tests
    

4. Documentation

  • Add or update model and column descriptions in the relevant schema.yml file.

    columns:
      - name: churn_date
        description: "The first date the customer was considered churned. Null if not churned."
    
  • Summarize your changes and rationale in the PR description.


5. Repo Cleanliness

  • Remove temporary models, debugging statements (e.g., stray LIMIT 100 clauses), or commented-out code blocks.

  • Regenerate docs locally to ensure your descriptions render correctly.

    dbt docs generate
    
  • Run a linter/formatter (e.g., sqlfluff, black) if your project uses one.
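For example, with sqlfluff (the dialect flag is an assumption; set it to your warehouse):

    # Report style violations
    sqlfluff lint models/ --dialect snowflake

    # Auto-fix the violations that can be fixed safely
    sqlfluff fix models/ --dialect snowflake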


6. Git Workflow: Add, Commit, Push

  • Stage, commit, and push your changes with a clear commit message.

    git add models/marts/mart_customer_churn.sql models/marts/schema.yml
    git commit -m "feat: Create mart_customer_churn model with tests"
    git push origin feature/new-churn-model
    

7. Review & Collaboration

  • Open a Pull Request using your team’s template.

    Example PR Description:

    Summary:

    This PR introduces a new model, mart_customer_churn, which flags customers as “churned” if they haven’t made a purchase in the last 90 days.

    Rationale:

    This is required for the new executive retention dashboard. Ticket: PROJ-1234

    Tests:

    • Added not_null, unique, and accepted_values tests for the new model.
    • All tests pass locally via dbt build.

    Impact:

    • This is a new mart model with no downstream dependencies yet.
  • Request a review from at least one other team member.


When is a Review Most Critical?

While all changes should be reviewed, pay special attention when the PR includes:

  • Significant Business Logic: Implementing a key metric (e.g., ARR, LTV) or a complex rule that has financial or strategic importance.
  • Complex Changes: Introducing multi-stage CTEs, complex joins, window functions, or custom macros.
  • Changes to Core Models: Modifying a foundational model (like mart_customers or mart_orders) that has many downstream dependencies.
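One lightweight way to guarantee a review on these high-impact areas is a CODEOWNERS file, which GitHub and GitLab use to request reviewers automatically. The paths and team handles below are illustrative assumptions:

    # .github/CODEOWNERS (illustrative)
    models/marts/core/   @data-team-leads
    macros/              @analytics-engineering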

Part 5: Sunsetting Models Gracefully — The Deprecation Workflow

As your data project matures, models will inevitably become outdated. Business logic changes, sources are replaced, or you discover a more efficient way to transform data. Simply deleting an old model is a recipe for disaster, as it can silently break downstream dashboards, data science models, or ad-hoc queries that depend on it.

A formal deprecation process is essential for maintaining a clean, trustworthy, and modern dbt project. It provides a clear, communicative path for retiring models without causing chaos. This workflow ensures that all stakeholders are aware of upcoming changes, have a clear migration path, and are given adequate time to adapt.

The goal is to make model retirement a non-event—a routine maintenance task rather than a source of fire drills and broken reports. This process can be broken down into five phases: Announce, Analyze, Migrate, Execute, and Clean Up.

The Deprecation Lifecycle: A Step-by-Step Guide

  1. Announce & Document: The first step is always communication. You must clearly signal that a model is slated for removal. dbt provides built-in features for this. Add a deprecation_date to the model’s schema.yml file. This is the official flag that a model is on its way out.

    # models/core/schema.yml
    models:
      - name: orders_legacy
        description: |
          DEPRECATED: This model will be removed after 2024-08-01.
          Please use `mart_orders_v2` instead.
        deprecation_date: '2024-08-01'
    

    You should also broadcast this change in your team’s communication channels (e.g., Slack or Teams).

  2. Analyze Dependencies: Before you can remove a model, you must understand its “blast radius.” Who and what depends on it? This analysis has two parts:

    • Internal Dependencies: Check which other dbt models reference the one you’re deprecating.
    • External Dependencies: This is the crucial part. Check your data warehouse’s query history to find BI tools, ad-hoc queries, or other systems that query the model’s table directly.
  3. Migrate & Monitor: Proactively help users move to the new model. Provide clear instructions, code snippets, and support. As the deprecation date approaches, continue to monitor usage and send reminders to any remaining users.

  4. Execute the Deprecation: On the deprecation date, don’t delete the model file immediately. First, “soft-deprecate” it by disabling it in dbt. This prevents it from being built but keeps the code in the repository, making it easy to reverse if needed.

    # models/core/schema.yml
    models:
      - name: orders_legacy
        config:
          enabled: false
        description: "DEPRECATED: Disabled after 2024-08-01. Use mart_orders_v2."
    

    After disabling the model in dbt, you can drop the table from your data warehouse.

  5. Clean Up: After a safe waiting period (e.g., one or two sprints) where no issues have been reported, you can complete the process by removing the model’s .sql and .yml files from your project.

Your Quick Reference: Model Deprecation Checklist

This checklist provides a systematic, step-by-step process to ensure nothing is missed when retiring a model.

📋 dbt Model Deprecation Checklist


1. Pre-Deprecation: Communication & Documentation

  • Did I add a deprecation_date and rationale to the model’s YAML?

    models:
      - name: orders_legacy
        description: |
          DEPRECATED: This model will be removed after 2024-08-01.
          Please use `mart_orders_v2` instead.
        deprecation_date: '2024-08-01'
    
  • Did I communicate the plan to stakeholders (e.g., via Slack)? Sample Message:

    :warning: dbt Model Deprecation Notice :warning: The model orders_legacy will be removed on August 1, 2024. Please migrate all dependencies to mart_orders_v2. Contact #data-team for support.


2. Analysis: Downstream Dependencies & Usage

  • Did I check which dbt models depend on it?

    # This command searches for references within your dbt project files.
    grep -r "ref('orders_legacy')" models/
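    # Alternatively, list the model and everything downstream of it with dbt's node selection
    dbt ls --select orders_legacy+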
    
  • Did I check warehouse query history for external usage (BI tools, ad-hoc queries)?

    -- Example for Snowflake
    SELECT query_text, user_name, start_time
    FROM snowflake.account_usage.query_history
    WHERE query_text ILIKE '%orders_legacy%'
      AND start_time > dateadd(day, -30, current_timestamp());
    

3. Migration: Plan & Action

  • Did I inform affected users about the migration deadline and alternatives?

  • Did I provide a clear code sample for the update?

    -- Before
    SELECT * FROM {{ ref('orders_legacy') }}
      
    -- After
    SELECT * FROM {{ ref('mart_orders_v2') }}
    

4. Transition: Monitoring & Warnings

  • Did I monitor usage as the deprecation date approaches? (Re-run the query history check).

  • Did I send final reminders to any remaining users?


5. Execution: Deprecation in dbt

  • Did I disable the model in dbt to prevent it from being built?

    # in schema.yml
    models:
      - name: orders_legacy
        config:
          enabled: false
        description: "DEPRECATED: Disabled on 2024-08-01. Use mart_orders_v2."
    
  • Did I drop the table/view from the data warehouse?

    -- Replace <database> and <schema> with your warehouse's values
    DROP TABLE IF EXISTS <database>.<schema>.orders_legacy;
    

6. Post-Deprecation: Final Housekeeping

  • After a safe waiting period (e.g., 1-2 weeks), did I remove the model’s .sql and schema files from the repository?

    git rm models/core/orders_legacy.sql
    git rm models/core/schema.yml # Or just the entry from the file
    git commit -m "chore: Remove deprecated model orders_legacy"
    
  • Did I clean up any references in project documentation (e.g., README)?
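    A quick way to catch stray references in documentation and project files:

    # Search markdown and YAML files for the old model name
    grep -rn "orders_legacy" --include="*.md" --include="*.yml" .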

Part 6: Building Trust — A Comprehensive Testing Framework

Data without tests is just a rumor. You can have perfectly named models, beautifully styled code, and a flawless PR process, but if the data itself is wrong, none of it matters. A single NULL in a key column or an incorrect join can silently corrupt every downstream analysis and dashboard, eroding the trust you’ve worked so hard to build.

dbt’s built-in testing framework is its most powerful feature for preventing this. It transforms data quality from a manual, reactive task into an automated, proactive part of your development cycle. Tests are assertions you make about your data that run every time you build your project, acting as a permanent guard against regressions and bad data.

But what should you test, and when? The answer isn’t to test everything all the time. A smart testing strategy is pragmatic, risk-based, and adapts to the maturity of your data pipeline.

Choosing Your Testing Strategy: From Greenfield to Legacy

Not all development scenarios are the same. Your approach to testing should be tailored to the context of your work. We can think of these as different workflows.

  1. Greenfield (“Design Before Build”): When starting a brand-new project, you can design your tests upfront. You define the business rules and data contracts before writing the code, ensuring full coverage from day one. This is ideal but rare.

  2. Agile/Iterative (“Test-As-You-Build”): This is the most common workflow for modern data teams. You build a model and immediately add foundational tests (not_null, unique). As you add more complex logic or joins, you add more specific unit and integration tests in parallel. This keeps testing in sync with development.

  3. Legacy Retrofit (“Test the Past”): When you inherit a project with little to no test coverage, you can’t test everything at once. You must prioritize. Start by adding basic schema tests to your most critical models. Then, add integration tests to high-risk joins and write specific unit tests to address known bugs or problem areas.

  4. Hybrid/Continuous Improvement (Recommended): For most teams, the reality is a mix of all three. The rule is simple:

    • All new models must have foundational tests from day one.

    • Any change to an existing model requires reviewing and adding relevant tests (see the CI sketch after this list).

    • Continuously identify critical, untested legacy models and add coverage over time.
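In CI, the rule of testing whatever changed can be automated with dbt’s state-based selection, which compares your branch against production artifacts. A minimal sketch (the artifact path is an assumption):

    # Build and test only modified models and everything downstream of them
    dbt build --select state:modified+ --state ./prod-run-artifacts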

Regardless of the workflow you choose, the goal is the same: to have a suite of tests that gives you confidence in your data.

The Testing Pyramid: What to Test and How

The types of tests you write can be categorized to cover different aspects of data quality.

  • Schema & Data Quality Tests: These are the “must-haves” for every model. They are assertions about the shape and integrity of your data, like checking for nulls, uniqueness, or accepted values.
  • Unit Tests: These verify a specific piece of business logic within a single model, like a complex CASE statement or a calculation.
  • Integration / Referential Integrity Tests: These check the relationships between models, ensuring that joins are complete and that no orphaned records are created.
  • Business/Custom Data Tests: These are specific rules that encode business knowledge, such as “revenue can never be negative” or “an order’s ship date cannot be before its creation date.”

Your Quick Reference: Model Test Coverage Checklist

This checklist is your tactical guide to implementing a robust testing strategy. Use it for every new model you create and every existing model you modify.

📝 dbt Model Test Coverage Checklist (with Examples)


1. Schema & Data Quality (“Must-Have” for Every Model)

  • Did I add not_null tests for required columns?

    # in schema.yml
    models:
      - name: mart_customers
        columns:
          - name: customer_id
            tests:
              - not_null
    
  • Did I add unique tests for primary key columns?

    - name: mart_customers
      columns:
        - name: customer_id
          tests:
            - unique
    
  • Pro tip: Apply both unique and not_null tests on the primary key columns

  • Did I add accepted_values tests for columns with a defined set of values (e.g., statuses)?

    - name: mart_orders
      columns:
        - name: order_status
          tests:
            - accepted_values:
                values: ['placed', 'shipped', 'completed', 'returned']
    

2. Unit Testing Logic

  • Does this model contain non-trivial logic (e.g., complex CASE statements, window functions)? If yes:

  • Did I write a custom data test to validate that logic?

    Example: Your model has a CASE statement to define a customer segment.

    -- models/marts/mart_customers.sql
    SELECT
      customer_id,
      CASE
        WHEN lifetime_revenue > 1000 THEN 'High Value'
        WHEN lifetime_revenue > 100 THEN 'Medium Value'
        ELSE 'Low Value'
      END AS customer_segment
    FROM {{ ref('int_customers__lifetime_value') }}
    

    Unit Test: Write a SQL query that returns rows only if the logic is incorrect.

    -- tests/assert_customer_segment_logic_is_correct.sql
    SELECT *
    FROM {{ ref('mart_customers') }}
    WHERE
      (lifetime_revenue > 1000 AND customer_segment != 'High Value') OR
      (lifetime_revenue > 100 AND lifetime_revenue <= 1000 AND customer_segment != 'Medium Value') OR
      (lifetime_revenue <= 100 AND customer_segment != 'Low Value')
    

    dbt will pass this test only if the query returns zero rows.
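    To iterate on just this assertion locally, you can select the test by its file name:

    dbt test --select assert_customer_segment_logic_is_correct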


3. Integration / Referential Integrity

  • Does this model join or depend on other models? If yes:

  • Did I write relationships tests to ensure foreign keys are valid?

    # in models/marts/schema.yml
    - name: mart_orders
      columns:
        - name: customer_id
          tests:
            - relationships:
                to: ref('mart_customers')
                field: customer_id
    

    This test ensures every customer_id in mart_orders exists in mart_customers.


4. Business/Data-Specific Validations

  • Are there critical business rules this model must follow? If yes:

  • Did I add a custom data test to enforce that rule?

    Example: An invoice amount should never be negative.

    -- tests/assert_invoice_amount_is_positive.sql
    SELECT *
    FROM {{ ref('mart_invoices') }} -- model name is illustrative
    WHERE invoice_amount_usd < 0
    

    Example: A subscription’s end date must be after its start date.

    # Using the dbt_utils package for a more concise test
    - name: mart_subscriptions
      tests:
        - dbt_utils.expression_is_true:
            expression: "ended_at >= started_at"
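    Using dbt_utils assumes the package is installed via packages.yml (the version below is only an example; pin whichever release you have tested):

    # packages.yml
    packages:
      - package: dbt-labs/dbt_utils
        version: 1.1.1

    # then install with: dbt deps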
    

5. Test Maintenance

  • If I changed a model’s logic, did I update or add tests to cover the change?
  • If I removed columns or logic, did I remove the corresponding obsolete tests?

Conclusion: From Rules to Rituals

We’ve journeyed through a comprehensive blueprint for integrating dbt into your team’s workflow, covering everything from foundational naming conventions and style guides to the daily rituals of pull requests, model deprecation, and automated testing. Each section provided a detailed checklist, designed to bring structure and predictability to the art of data transformation.

However, the goal of this guide is not to impose a rigid, one-size-fits-all dogma. Your data team is unique—with its own size, maturity level, and specific challenges. A five-person startup team may find a lightweight process is sufficient, while a fifty-person enterprise team will require more stringent controls.

Think of these checklists less as immutable laws and more as a comprehensive toolbox. The true value lies not in blindly adopting every single rule, but in using them to facilitate a crucial conversation with your team. The important questions are:

  • What does a “good” model look like for us?
  • How will we ensure our data is trustworthy?
  • What is the standard of quality we agree to uphold?

The most critical step is to make a conscious, informed decision about how your team will operate and then document that decision. Whether you choose leading or trailing commas is less important than the fact that you’ve chosen one and committed to it as a team. This shared agreement is what eliminates ambiguity, reduces friction in code reviews, and enables new members to contribute confidently from day one.

By establishing and documenting these standards, you create a system of collective ownership. You move from a world of individual preferences and ad-hoc practices to one of shared language and engineering discipline. This is how you build a data platform that is more than just a collection of scripts—it becomes a reliable, scalable, and trusted asset that empowers the entire organization. It’s how you turn a powerful tool like dbt into the backbone of a truly data-driven culture.