buildfrm

For Databricks data engineers

Data Pipeline code from whatever you have.

kickoff meetings docs sample data architecture diagrams images ...

buildfrm is an AI tool that lives where your team already works. During development, it ships pipelines from a spec across VS Code and Browser Extension on Databricks UI, with the same context flowing through both. After deployment, it stays attached monitoring runs, reading the team's Slack thread, proposing fixes when things fail, and keeping every architectural decision searchable for the next change.

The fix discussed at standup becomes the change applied to the pipeline. The diagram in last quarter's spec is still searchable today. The pipeline that paged you at 3am is already healing itself by the time you open your laptop.

Databricks-native

PySpark / Delta
DLT pipelines
Unity Catalog
Notebooks & jobs

Snowflake · dbt · Airflow · Dagster — coming.

Where you already work.

Data engineers don't live in one tool. buildfrm shows up in places where most of the day happens.

VS Code extension

In your favourite editor.

Activity-bar panel, Google sign-in, inline fixes, writes generated files into your repo. Streams responses while it works.

  • Drop a kickoff recording → extracted requirements + table summary
  • Databricks Connect detected and used
  • Works against your open repo, not a sandbox
buildfrm panel inside VS Code: a 113-second kickoff recording has been processed and a requirements document is rendered on the right, with the extracted summary listing Unity Catalog tables from samples.tpch.
kickoff mp4 → requirements doc + tpch table summary, Databricks Connect enabled

Chrome extension

On Databricks UI.

Side panel on *.cloud.databricks.com and Azure Databricks. Injects code cells straight into the notebook you're looking at.

  • Generated @dlt.table cells injected into the active notebook
  • Reasons over dashboard images pulled from the source video
  • No copy-paste between tools
buildfrm side panel on a Databricks notebook showing a generated PySpark DLT cell (@dlt.table bronze.lineitem_bronze) alongside a chat that reasoned over a dashboard image extracted from the source video.
@dlt.table bronze layer + side-panel dashboard-image reasoning

From this, to this.

The shapes of input we see most. Drop the real thing and get running code — not a suggestion, not a snippet to paste somewhere else.

Input Video + doc

A 45-minute kickoff recording.

A Microsoft Teams kick-off meeting invite for the Supplier Analytics Dashboard project, with agenda, objectives, expected outcomes, and a dashboard preview.
supplier_analytics_kickoff.mp4 45:12
dashboard_spec.pdf 3 pages
Output orders_silver.py Databricks · DLT
import dlt
from pyspark.sql.functions import col

@dlt.table(
  name="orders_silver",
  partition_cols=["region"],
  comment="Deduped orders joined to customer dim, daily grain",
)
def orders_silver():
    orders = (
        dlt.read("orders_bronze")
           .dropDuplicates(["order_id"])
    )
    customers = spark.read.table("main.sales.dim_customer")
    return orders.join(customers, "customer_id", "left")
Input Spec + sample

A spec doc and 500 rows of sample data.

A Data Transformation Mapping Document for Databricks: project info, target tables, attribute mapping with source/target columns and transformations, business rules, and data quality checks.
mapping_document.xlsx attribute mapping
sample_lineitem.csv 500 rows
Output models/marts/customer_ltv.sql dbt · coming
{{ config(materialized='incremental', unique_key='customer_id') }}

with orders as (
  select * from {{ ref('stg_orders') }}
  {% if is_incremental() %}
    where order_ts > (select max(last_order_ts) from {{ this }})
  {% endif %}
),
customers as (
  select * from {{ ref('dim_customer') }}
)
select
  c.customer_id,
  sum(o.gross_amount)       as lifetime_value,
  count(distinct o.order_id) as order_count,
  max(o.order_ts)           as last_order_ts
from customers c
left join orders o using (customer_id)
group by 1
Input Image + prompt

A dashboard screenshot and one sentence.

A Fintech Reconciliation and Revenue Overview dashboard with KPI tiles for gross transaction volume, settled amount, settlement coverage, revenue variance, and chargeback rate, plus charts for settlement delay, aging, and revenue by currency.
fintech_recon_dashboard.png 5 KPIs, 4 charts
"build the backing queries for these tiles"
Output supplier_metrics.py Snowflake · coming
from snowflake.snowpark.functions import col, sum_, avg, count_distinct

def supplier_kpis(session, region: str | None = None):
    df = session.table("sales.fact_shipment")
    if region:
        df = df.filter(col("region") == region)
    return df.agg(
        sum_("line_total").alias("gross_revenue"),
        avg("discount").alias("avg_discount"),
        count_distinct("supplier_id").alias("suppliers"),
    )

def on_time_trend(session):
    return (
        session.table("sales.fact_shipment")
               .select("ship_date", "is_on_time")
               .group_by("ship_date")
               .agg(avg(col("is_on_time").cast("int")).alias("rate"))
    )
Input Architecture diagram

A medallion architecture diagram.

Data engineering pipeline architecture: Postgres source ingested via Databricks JDBC into Delta Lake on S3, processed through bronze, silver, and gold layers using Apache Spark, with Unity Catalog governance, lineage, and alerting.
postgres_to_delta_medallion.png 3 layers, 7 entities
Output sales_medallion/ Databricks · DLT
sales_medallion/
├── ingestion/
│   ├── postgres_to_bronze.py     # JDBC + incremental
│   └── jdbc_config.yml
├── bronze/
│   ├── orders.py                 @dlt.table
│   ├── customers.py              @dlt.table
│   └── products.py               @dlt.table
├── silver/
│   ├── orders_cleansed.py        @dlt.table  # dedupe, expectations
│   └── customers_scd2.py         @dlt.table  # SCD type 2
├── gold/
│   ├── revenue_daily.py          @dlt.table  # currency norm + agg
│   └── orders_by_region.py       @dlt.table
├── jobs/
│   └── medallion_pipeline.yml    # Databricks Workflow
└── observability/
    ├── lineage.json              # Unity Catalog
    └── alerts.yml

Mockup. Real examples from your own inputs on the demo call.

Built For Data Engineers Solo or Teams.

Shared sessions

Hand off with the context intact.

Share a session by email with view or admin access. The next engineer opens the same session, the same chat, the same inputs — and the same decisions your team already made.

  • Chat history and uploaded artifacts carry over
  • Design decisions are tracked explicitly, not buried in the transcript
  • No re-litigating the schema on Monday
  • Open the same session on vs code or chrome

Self-hosted

For teams

In your infra, not ours.

For teams that don't want to send data to external applications. Deploy the full stack — inside your VPC.

  • Bring your own LLM keys (Anthropic, OpenAI and Open Source models)
  • Your data stays on your network and in your audit trail
  • SSO, role-based access, and a named point of contact

Talk to us → [email protected]

Custom

For teams

Have a custom requirement?

Maybe your org is on AWS, Azure, or GCP — Glue, Synapse, BigQuery, Dataform, or some mix of all three. We can deliver solutions made for your stack, not adapted from someone else's.

  • Connectors for your warehouse, lakehouse, or orchestrator
  • Code generation tuned to your idioms — internal style guides, naming, lineage tags
  • Single point of contact, scoped engagement

Talk to us → [email protected]

Early access For Solo Engineers

₹ 2,500 to get started.

  • Get 2 month of access
  • AI tokens are on us
  • Fully refundable
  • Priority Assistance
  • Personal walk through of capabilities

Early access For Teams

Pilot with your team.

  • Scoped 4-week pilot on your data, your platform with forward deployed engineer
  • Dedicated point of contact, named engagement
  • Custom set up for your workflows and code-gen tuned to your stack

The bet

Why we're building this.

At our last data jobs, every new pipeline started the same way: a 45-minute kickoff call, a Confluence page that took two days to write, then four days of someone translating it all into PySpark. By the time the code shipped, half the decisions from the kickoff had been forgotten. Six months later, when something broke at 3am, the on-call engineer had no idea why a column was named the way it was.

AI coding tools have gotten very good at the file in front of you. None of them remember the kickoff call from last quarter. None of them know what your team decided in Slack on Monday. None of them carry context across the editor, the notebook, and the next engineer who opens the same pipeline. For most software work, that's fine. For data engineering, that gap is where the bugs live.

buildfrm is our attempt at the AI coding tool data engineers actually need. It lives across the surfaces your team already uses — VS Code, the Databricks UI, eventually a CLI and a Slack bot — and the same context flows through all of them. The kickoff video, the spec doc, the design decisions, the on-call thread: it remembers all of it, and so does the engineer who picks up the work next.

We're early. The product works, but the surface area we want to cover is large. If this matches your problem, the early-access form above is how to start using it now — and how to tell us where we're getting it wrong. Five engineers who push back are worth more to us than fifty who nod.

The buildfrm team

Questions.

The ones we hear most.

What does buildfrm do? +

buildfrm reads kickoff recordings, specs, sample data, and architecture diagrams and writes Databricks pipeline code — PySpark, Delta Lake, Delta Live Tables, Unity Catalog three-level namespace — that runs in your workspace. It lives as a VS Code extension and a Chrome side panel on the Databricks UI.

Which platforms does buildfrm generate code for? +

Databricks first: PySpark, Delta Lake, Delta Live Tables (DLT), Unity Catalog, Databricks Workflows, notebooks and jobs. Snowflake, dbt, Airflow, and Dagster support is coming.

What inputs does buildfrm accept? +

Video and audio recordings (mp4, mov, webm), transcripts, documents (pdf, md, xlsx, csv), data samples (parquet, json, xml, yaml), existing code (.py, .sql, .ipynb, .scala, .r), and images (jpg, png, gif, webp) including dashboard mockups and architecture diagrams.

Can buildfrm be self-hosted? +

Yes. The full stack — backend, Postgres, vector store, object store — deploys via Docker Compose inside your VPC. Bring your own LLM keys (Anthropic, OpenAI) or run local via Ollama. Available for teams; contact sales.