Utility Bill Data Extraction: From PDF to Pro-Forma, Proposal, or Audit

A guide for project finance analysts, energy consultants, and C&I solar sales engineers who turn utility bills into decisions — and pay a hand-keying tax every time they do.

A project finance analyst is closing a community solar deal at 11pm. Her pro-forma assumes a $0.087/kWh blended baseline rate. The number came from a junior associate who hand-keyed eighteen months of bills into a spreadsheet last week. Two of those bills had a demand-response credit the associate folded into "other charges." The real baseline is $0.091. The deal still pencils, but the investment committee (IC) that meets the next morning will not be told the baseline is off.

A consultant in Denver bills $200/hr for client work. Three days of every month go to typing utility bills into Excel before the analysis his clients actually pay for can begin. He has stopped pitching prospects whose bills look "messy" because the intake cost is too high.

A C&I solar sales engineer in Houston loses a 1.4 MW rooftop deal because his proposal lands on a Thursday. The customer's CFO made the decision Tuesday based on a competitor's slightly worse number that arrived two days earlier. The engineer spent those two days reconstructing the customer's tariff from a stack of PDFs.

All three are doing the same job — turning utility bills into decisions. This guide is about doing that job without the hand-keying tax.

#What "utility bill data extraction" actually means

Utility bill data extraction is the process of converting commercial and industrial electricity bills — typically arriving as PDFs from the utility — into structured data that finance models, proposal tools, audit deliverables, and reporting systems can consume directly.

The inputs are bills from any US or Canadian utility: PG&E, Con Edison, Duke Energy, Xcel, AEP, retail electric providers (REPs) in ERCOT, Hydro-Québec, and hundreds of municipal and cooperative utilities. Each has its own format. Each format changes when the utility updates its billing system. Some are clean digital PDFs; many are scanned images of printed bills mailed to a property manager who then forwarded the scan.

The outputs — when extraction is done correctly — are not a flat dump of text. They are line-itemized records where every charge on the bill is mapped to its role in the tariff: an energy charge in $/kWh for a specific TOU period, a demand charge in $/kW for a specific window, a fixed monthly service charge, a rider, an adjustment, a tax. Each line carries its quantity, its rate, its subtotal, and a pointer back to where it appeared on the source PDF.

The anatomy of a C&I bill is the part most people miss. A residential bill has maybe four lines. A C&I bill on a TOU demand tariff can have thirty to fifty: summer-on-peak energy, summer-mid-peak energy, summer-off-peak energy, winter equivalents, coincident demand, non-coincident demand, ratchet adjustments, power factor adjustments, transmission service, distribution service, public benefit charges, franchise fees, state taxes, city taxes. (For a plain-English walk-through of each of these, see "How to read a commercial utility bill.") Extraction means understanding which of those lines is what, not just reading the numbers.

#Why generic OCR fails on utility bills

This is where most "we'll just use AI" plans break. ("Why generic OCR fails on utility bills" covers each failure in depth.)

One bill format per utility — and they change. There is no single utility bill schema. PG&E's E-19 bill looks nothing like Con Edison's SC-9, which looks nothing like Duke's GS-T. Within one utility, the same tariff may print differently depending on whether the account has a smart meter, whether the customer is on a legacy plan, and whether the utility migrated to a new billing system in Q3. A generic OCR pipeline trained on tax forms or invoices has no model of this.

Estimated reads. When the meter wasn't read this cycle, the utility prints an "EST" flag and bills against an estimated kWh. The next bill trues up — often with a negative adjustment that makes the totals look wrong. An extraction pipeline that doesn't recognize estimated reads will produce a baseline that drifts every other month.
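One way to keep estimated reads from distorting a baseline is to fold each estimated period into the actual read that trues it up, so usage is measured over the combined span. The sketch below assumes the true-up bill's kWh already nets out the earlier estimate (common, but not universal); the `Read` record and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Read:
    days: int        # length of the billing period
    kwh: int         # billed usage for the period
    estimated: bool  # True when the bill carries an "EST" flag

def merge_estimated(reads):
    """Fold each run of estimated reads into the next actual read."""
    merged = []
    pending_days, pending_kwh = 0, 0
    for r in reads:
        pending_days += r.days
        pending_kwh += r.kwh
        if not r.estimated:
            # An actual read closes the span; combined usage is trustworthy.
            merged.append(Read(pending_days, pending_kwh, False))
            pending_days, pending_kwh = 0, 0
    if pending_days:
        # Trailing estimate with no true-up yet: keep it, still flagged.
        merged.append(Read(pending_days, pending_kwh, True))
    return merged

# Illustrative numbers only.
reads = [Read(30, 40000, False), Read(31, 45000, True), Read(30, 33000, False)]
merged = merge_estimated(reads)
```

Here the estimated month and its true-up collapse into a single 61-day span, which is the unit a baseline can safely be built on.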

Mid-cycle rate changes. When the utility files a new tariff with the PUC, the rate changes effective on a specific date. A bill that spans that date is pro-rated: the same charge type appears twice, once at the old rate, once at the new. A generic extractor sees two "Energy Charge" lines and assumes the bill is malformed.
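A minimal sketch of why two "Energy Charge" lines are legitimate, assuming the utility splits usage by calendar days on either side of the effective date (some split on interval data instead). The function name, dates, and rates are all illustrative.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP

def prorate_energy_charge(period_start, period_end, rate_change,
                          kwh, old_rate, new_rate):
    """Split one period's kWh across a rate change by day count.

    period_end is exclusive; rate_change is the first day the new
    rate applies. Day-count proration is an assumption here.
    """
    total_days = (period_end - period_start).days
    old_days = (rate_change - period_start).days
    new_days = total_days - old_days
    cent = Decimal("0.01")
    lines = []
    for label, days, rate in (("old rate", old_days, old_rate),
                              ("new rate", new_days, new_rate)):
        qty = kwh * days / total_days
        lines.append({
            "label": label,
            "days": days,
            "kwh": qty,
            "rate": rate,
            "subtotal": (qty * rate).quantize(cent, rounding=ROUND_HALF_UP),
        })
    return lines

# A June bill with a new rate effective June 16 (hypothetical values).
lines = prorate_energy_charge(
    date(2024, 6, 1), date(2024, 7, 1), date(2024, 6, 16),
    kwh=Decimal("30000"),
    old_rate=Decimal("0.1000"), new_rate=Decimal("0.1100"),
)
```

An extractor that expects this shape can check that the two lines' day counts add up to the service period rather than rejecting the bill.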

Line-item ambiguity. "ENERGY CHG-SUMMER-ON-PEAK" on a PG&E bill and "Summer On-Peak Energy" on an SDG&E bill are the same charge type. To a regex, they are different strings. Mapping them to a canonical tariff component is not OCR — it is domain knowledge.
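As a toy illustration of the mapping problem, the sketch below normalizes labels into a token set and looks them up in a small alias table. The table, the component names, and the token-sorting trick are all hypothetical; a real mapping also needs the tariff schedule as context, which plain string normalization cannot supply.

```python
import re

# Filler tokens that carry no meaning for classification (assumed list).
FILLER = {"chg", "charge"}

# Hypothetical alias table: normalized label -> canonical tariff component.
CANONICAL_COMPONENTS = {
    "energy on peak summer": "energy.summer.on_peak",
    "energy off peak summer": "energy.summer.off_peak",
}

def normalize_label(raw):
    """Lowercase, split on non-alphanumerics, drop filler, sort tokens."""
    tokens = re.split(r"[^a-z0-9]+", raw.lower())
    kept = [t for t in tokens if t and t not in FILLER]
    return " ".join(sorted(kept))

def map_to_component(raw_label):
    """Return the canonical component, or None if the label is unknown."""
    return CANONICAL_COMPONENTS.get(normalize_label(raw_label))

pge = map_to_component("ENERGY CHG-SUMMER-ON-PEAK")
sdge = map_to_component("Summer On-Peak Energy")
unknown = map_to_component("Franchise Fee")  # not in the table
```

Both utility labels resolve to the same component, and the unknown label returns `None` so it can be routed to review instead of being guessed.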

Totals that don't tie out. Utility bills round at the line-item level and again at the subtotal level. Sometimes a $0.02 rounding adjustment appears as an unprinted reconciliation. Sometimes a credit was applied off-bill and reduces the total without appearing as a line. An extractor that doesn't reconcile to the printed total will silently produce data that's off by single dollars — exactly the kind of error a CFO finds in week two.

This is why every Tariform extraction is reconciled against the bill total. If the sum of extracted line items doesn't tie to the printed total to the penny, the bill is flagged for review — not silently corrected, not hidden. The auditor's first question is "does this tie out?" and the answer has to be yes, or has to be a documented exception.
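The tie-out check itself is simple to express. The following is a sketch of the general technique, not Tariform's implementation: the function name, status strings, and the optional `tolerance` parameter are assumptions. It uses `Decimal` so that cents are compared exactly rather than as floats.

```python
from decimal import Decimal

def reconcile(subtotals, printed_total, tolerance=Decimal("0.00")):
    """Compare the sum of extracted line items to the printed bill total.

    With the default tolerance of zero, anything that does not tie
    to the penny is flagged rather than adjusted.
    """
    extracted = sum(subtotals, Decimal("0"))
    delta = printed_total - extracted
    if delta == 0:
        status = "exact_match"
    elif abs(delta) <= tolerance:
        status = "within_tolerance"
    else:
        status = "flagged_for_review"
    return {
        "extracted_total": extracted,
        "printed_total": printed_total,
        "delta": delta,
        "status": status,
    }

# Illustrative values only.
items = [Decimal("100.10"), Decimal("50.05")]
exact = reconcile(items, Decimal("150.15"))
off_by_a_dollar = reconcile(items, Decimal("151.15"))
```

The important design choice is that the function only reports the delta. It never rewrites a line item to force the totals to agree.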

#What good structured bill data looks like

A useful extraction output has four parts.

Header. Account number, service address, meter ID, service period start and end, utility, tariff schedule code, rate class, customer class. Without these, the line items mean nothing — you can't tell whether a $14,000 demand charge is reasonable or absurd without knowing it's on a 2 MW account on a Tier 2 industrial tariff.

Line items. Each charge as a structured record: charge type (energy / demand / fixed / rider / tax / adjustment / credit), tariff component the charge maps to, time-of-use period if applicable, quantity, unit, rate, subtotal, and the page and line number in the source PDF where this value was found.

Totals and reconciliation. Bill total as printed. Sum of extracted line items. Reconciliation delta. Reconciliation status — exact match, within rounding tolerance, or flagged for review.

Source provenance. For every value, a pointer back to its origin on the source PDF. This is the difference between data you can use in an audited deliverable and data you can only use as a draft.

That last point is what separates an extraction product from an LLM-on-a-PDF script. A script can produce JSON. It cannot produce JSON that an auditor will accept in week six.
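To make the four parts concrete, here is one possible shape for the record, written as Python dataclasses. Every class and field name is hypothetical and chosen to mirror the description above; it is not Tariform's actual schema, and the sample values are invented.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import List, Optional

@dataclass(frozen=True)
class Provenance:
    page: int  # 1-based page in the source PDF
    line: int  # line number on that page

@dataclass(frozen=True)
class LineItem:
    charge_type: str           # energy / demand / fixed / rider / tax / adjustment / credit
    tariff_component: str
    tou_period: Optional[str]  # None when no TOU period applies
    quantity: Decimal
    unit: str
    rate: Decimal
    subtotal: Decimal
    source: Provenance

@dataclass(frozen=True)
class Header:
    account_number: str
    service_address: str
    meter_id: str
    period_start: date
    period_end: date
    utility: str
    tariff_schedule: str
    rate_class: str
    customer_class: str

@dataclass
class ExtractedBill:
    header: Header
    line_items: List[LineItem]
    printed_total: Decimal

    @property
    def extracted_total(self) -> Decimal:
        return sum((li.subtotal for li in self.line_items), Decimal("0"))

    @property
    def reconciliation_delta(self) -> Decimal:
        return self.printed_total - self.extracted_total

# Invented sample bill with two line items.
bill = ExtractedBill(
    header=Header("0000000000", "1 Example St", "M-0001",
                  date(2024, 5, 15), date(2024, 6, 14),
                  "PG&E", "E-19", "commercial", "C&I"),
    line_items=[
        LineItem("demand", "coincident_demand", None, Decimal("710"), "kW",
                 Decimal("20.00"), Decimal("14200.00"), Provenance(3, 12)),
        LineItem("fixed", "service_charge", None, Decimal("1"), "month",
                 Decimal("150.00"), Decimal("150.00"), Provenance(3, 4)),
    ],
    printed_total=Decimal("14350.00"),
)
```

Because every `LineItem` carries its own `Provenance`, the answer to "where did this number come from?" is a field lookup rather than a search through the PDF.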

#The three workflows this unblocks

Three different jobs, one shared bottleneck. In each case, the work that gets paid for — judgment about a deal, analysis for a client, a proposal that closes — sits downstream of an intake step nobody wants to do.

#Project finance: from bill to pro-forma

A C&I solar or storage pro-forma rests on a baseline electricity spend assumption. Get that wrong by 3% and the error carries into every year of a twenty-year savings projection, and from there into the IRR. In the deals we see, the baseline assumption is a larger source of pro-forma error than the production model, which is why building a defensible baseline is worth the rigor.

Line-item bill extraction feeds the pro-forma directly: historical kWh by TOU period, demand peaks by month, fixed-charge components, riders, taxes. The blended $/kWh number that usually drives the baseline becomes one input among many — and the assumptions built on top of it (escalators, sensitivities, component-level stress tests) stop being applied to a single opaque rate. The analyst can defend each line because each line came from a bill.
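A small sketch of what "one input among many" means in practice: with line items available, the blended rate is a derived figure and the component totals remain visible beside it. The tuple layout and the numbers are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

def summarize(line_items):
    """Summarize (charge_type, kwh, subtotal) tuples.

    Returns per-type cost totals and the blended $/kWh.
    Non-energy lines carry zero kWh. Assumes total kWh is non-zero.
    """
    by_type = defaultdict(Decimal)
    total_kwh = Decimal("0")
    for charge_type, kwh, subtotal in line_items:
        by_type[charge_type] += subtotal
        total_kwh += kwh
    total_cost = sum(by_type.values(), Decimal("0"))
    return dict(by_type), total_cost / total_kwh

# Invented one-month example.
items = [
    ("energy", Decimal("10000"), Decimal("2000.00")),  # on-peak
    ("energy", Decimal("30000"), Decimal("2400.00")),  # off-peak
    ("demand", Decimal("0"), Decimal("3000.00")),
    ("fixed", Decimal("0"), Decimal("600.00")),
]
by_type, blended = summarize(items)
```

In this invented month the blended rate is $0.20/kWh, but only $4,400 of the $8,000 scales with energy use. An escalator or sensitivity applied to the blended rate alone would treat the demand and fixed charges as if they moved with kWh.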

The PF analyst's job is not extraction. It is judgment about deals. Hand-keying bills is the tax she pays for not having a tool.

#Energy consulting: from bill to client deliverable

Consultants and brokers sell analysis. The intake step — getting client bills into a usable form — is overhead that doesn't bill. A consultant who can compress three days of intake into thirty minutes per client either takes more clients at the same headcount or upgrades to higher-value work.

The economics shift more than the math suggests. Engagements that weren't worth pitching because the intake cost was too high become viable. White-labeled bill summaries become a deliverable in their own right. Margins improve on every existing engagement.

#C&I solar sales: from bill to proposal

The acceptable turnaround for a proposal has collapsed. Customers who asked for numbers on Monday have moved on by Friday. Sales engineers who can turn a customer's bills into a proposal-ready savings model in an afternoon close deals that engineers who need a week do not.

The bottleneck is rarely the savings model itself — most teams have a working spreadsheet or proposal tool. The bottleneck is getting the customer's actual tariff and historical usage into that tool with enough fidelity to be defensible. Line-item extraction collapses that step.

#Auditability: the non-negotiable for finance and advisory work

Every deliverable that uses extracted bill data eventually faces a question that sounds like one of these:

"Where did the $14,200 demand charge in June come from?"

"How did you reconcile the May estimated read?"

"Why is the summer-on-peak rate in your model $0.231 when the customer's bill says $0.234?"

"The AI extracted it" is not an answer. "Here is the line on page 3 of the June 14th bill, and here is how it maps to Schedule E-19's coincident demand component" is an answer. The first answer ends the engagement. The second one closes it.

A general-purpose document AI can pull numbers off a PDF with 95% accuracy. For ad copy or contact extraction, 95% is fine. For a pro-forma the IC will review or a Scope 2 report the auditor will sign, 95% means one wrong value in every twenty on average, which on a thirty-to-fifty-line C&I bill is more than one wrong number per bill, and you don't know which one. The only acceptable workflow is one where every extracted value is traceable to its source line and every bill ties out to its printed total to the penny, or the discrepancy is flagged for human review.
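As a rough check on that arithmetic, the sketch below assumes the 95% figure is a per-field accuracy and that errors on different fields are independent. Both are simplifying assumptions, not measured properties of any tool. The field counts come from the thirty-to-fifty-line range described earlier.

```python
def expected_errors(fields, accuracy):
    """Average number of wrong fields on one bill."""
    return fields * (1 - accuracy)

def prob_any_error(fields, accuracy):
    """Chance that at least one field on the bill is wrong."""
    return 1 - accuracy ** fields

for n in (20, 30, 50):
    print(n, round(expected_errors(n, 0.95), 2), round(prob_any_error(n, 0.95), 2))
```

Under these assumptions a thirty-line bill has about a 0.79 chance of containing at least one wrong value, which is why a per-field accuracy figure says little about whether a given bill can be trusted.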

#What Tariform does

Extract is what this guide is about. Utility-bill PDFs go in; line-itemized, tariff-aware, source-traceable structured data comes out. It is built for the workflows above — project finance pro-formas, consulting deliverables, C&I solar proposals. Every reconstructed charge is auditable back to its source PDF line. Every bill is reconciled against its printed total.

If you're building a pro-forma, scoping a consulting engagement, or shortening proposal turnaround, Extract is the product. Book a demo — twenty minutes, a real bill from a real account, you see the output. Prefer to try it yourself? Start a free trial — upload a real bill and see the extraction in minutes.

#FAQ

How is this different from generic bill-parsing tools or a custom LLM script?

Two things. First, tariff-awareness: line items are mapped to their role in the source tariff, not just extracted as text. Second, reconciliation: every bill ties out to its printed total or is flagged. A generic tool can produce text. Extract produces data you can put in an audited deliverable.

Which utilities does Tariform cover?

Any US or Canadian utility. The extraction engine is utility-agnostic; the tariff catalog backing the platform covers the full US investor-owned, municipal, and cooperative landscape, plus the major Canadian utilities.

What bill formats are supported?

Any PDF, including scanned and image-only bills. Image bills run through OCR first, then the same line-item extraction and tariff-mapping pipeline as digital PDFs.

How accurate is the extraction?

Every bill is reconciled against its printed total. If line items sum to the total within rounding tolerance, the extraction is accepted. If not, the bill is flagged for review with the specific discrepancy surfaced. The accuracy guarantee is structural — not a percentage claim, a tie-out requirement.

I operate solar assets — is this the right product?

Extract is for turning bills into structured data for pro-formas, proposals, and advisory deliverables. If you own C&I solar or storage assets and want to measure how much they're actually saving on each utility bill, ask about Verify — the other product on the Tariform platform, built on the same extraction backbone.