Healthcare is undergoing a fundamental shift toward transparency. Under federal regulations, hospitals and insurers are required to publish detailed machine-readable files (MRFs) of negotiated rates for thousands of procedures, services, and providers.
On paper, this data promises a revolution in how patients, providers, payers, and self-funded employers understand healthcare costs. In practice, it’s overwhelming: terabytes of messy JSON files that are nearly impossible to analyze with traditional spreadsheets or databases.
So how do you make sense of it?
First, it's important to understand what's in these files (and perhaps more importantly, what isn't).
Under the Transparency in Coverage (TiC) rule, which applies to commercial insurers and group health plans, the required files typically include:

- In-network files listing the rates each plan has negotiated with providers
- Out-of-network files showing allowed amounts for out-of-network care
- Prescription drug pricing data

Additionally, the files typically include supporting details such as billing codes, provider identifiers (e.g., NPIs), and plan-level information.
These payer MRFs are updated monthly and aim to give consumers, regulators, and analysts insight into pricing across a wide range of services. The raw files are massive and often lack standardized formatting, making them difficult to work with out of the box.
One of the biggest pitfalls in working with transparency files is the presence of “zombie rates”: rates tied to inactive contracts, placeholder values, or outdated plan designs that still appear in the files. These ghost entries inflate the dataset and can skew benchmarking if they are not carefully identified and filtered out.
Under CMS rules, hospital MRFs must include five categories of standard charges for each item or service:

- Gross charge (the full chargemaster rate)
- Discounted cash price
- Payer-specific negotiated charge
- De-identified minimum negotiated charge
- De-identified maximum negotiated charge
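To make the five categories concrete, here is a hypothetical flattened record for a single service. The field names are illustrative rather than the official CMS schema:

```python
# Hypothetical flattened hospital MRF record; field names are illustrative,
# not the official CMS schema (which varies by version and file format).
line_item = {
    "description": "MRI brain without contrast",
    "billing_code": "70551",           # CPT code
    "billing_code_type": "CPT",
    "gross_charge": 3400.00,           # full chargemaster rate
    "discounted_cash_price": 1850.00,  # self-pay price
    "payer_negotiated_charges": {
        "Payer A PPO": 1210.00,        # payer-specific negotiated charges
        "Payer B HMO": 1095.00,
    },
    "min_negotiated_charge": 1095.00,  # de-identified minimum
    "max_negotiated_charge": 1210.00,  # de-identified maximum
}
```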
As of recent regulatory updates (August 2025), hospital MRFs must also include additional data elements beyond these five categories.
Key challenges with hospital data include format variance, inconsistent schemas, and frequent mismatches between hospital and payer data. In our opinion, hospital data, while useful at times, tends to be less reliable than payer data as a whole.
At Gigasheet we're big proponents of the proven Intelligence Cycle methodology used by national security intelligence agencies and experts around the world. The same process works well for gaining intelligence in any domain, and healthcare markets are no exception.
Below is an abbreviated explanation of the core elements of the cycle for analyzing price transparency data.
The first step is collecting the hospital and payer machine-readable files. Hospitals are required to publish files that include gross charges, discounted cash prices, and negotiated rates. Payers publish in-network negotiated rates, out-of-network allowed amounts, and drug pricing data. These files are updated regularly and can be massive, often hundreds of gigabytes. Most are delivered in nested JSON formats that require parsing before they can be queried. File structures vary by payer, which adds further complexity when trying to build a unified dataset.
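Because these files can reach hundreds of gigabytes, they generally have to be streamed rather than loaded into memory. Below is a minimal sketch using Python's ijson library to walk a payer in-network file incrementally; the JSON paths follow the published CMS in-network schema, but payers vary, so treat the exact field names as assumptions to verify per file.

```python
import ijson  # incremental JSON parser: pip install ijson

def stream_negotiated_rates(path):
    """Yield (billing_code, negotiated_rate) pairs from a payer in-network MRF
    without loading the entire file into memory."""
    with open(path, "rb") as f:
        # "in_network.item" iterates each entry of the top-level in_network array.
        # Paths follow the common CMS in-network schema; verify per payer.
        for item in ijson.items(f, "in_network.item"):
            code = item.get("billing_code")
            for rate_group in item.get("negotiated_rates", []):
                for price in rate_group.get("negotiated_prices", []):
                    yield code, price.get("negotiated_rate")

# Example: print the first few rates from a (hypothetical) payer file.
for i, (code, rate) in enumerate(stream_negotiated_rates("payer_in_network.json")):
    print(code, rate)
    if i >= 4:
        break
```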
Once collected, the raw files need significant cleaning. Provider names, NPIs, and payer identifiers are often inconsistent or duplicated. Zombie rates tied to inactive or placeholder contracts add noise and can distort analysis. Cleaning also involves standardizing billing codes and normalizing file formats so that services align correctly across sources. Without this step, comparisons between providers or across payers are unreliable.
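As one illustration, a simple cleaning pass might drop records tied to expired contracts and normalize NPIs before any comparison is attempted. The field names and placeholder checks below are assumptions for illustration; robust zombie-rate detection typically also requires contract metadata.

```python
from datetime import date

def is_zombie(record, today=None):
    """Flag records tied to expired contracts or obvious placeholder values.
    Field names are illustrative, not an official schema."""
    today = today or date.today()
    exp = record.get("expiration_date")  # e.g. "2023-12-31"
    if exp and date.fromisoformat(exp) < today:
        return True  # contract is no longer active
    if record.get("negotiated_rate") in (None, 0):
        return True  # placeholder or missing rate
    return False

def normalize_npi(npi):
    """Standardize NPIs to 10-digit strings; return None if malformed."""
    digits = "".join(ch for ch in str(npi) if ch.isdigit())
    return digits if len(digits) == 10 else None

# `records` stands in for the flattened rate dicts produced during collection.
records = [
    {"npi": "1234567893", "negotiated_rate": 1210.0, "expiration_date": "9999-12-31"},
    {"npi": "123-456", "negotiated_rate": 0, "expiration_date": "2022-01-01"},
]
clean = [
    {**r, "npi": normalize_npi(r.get("npi"))}
    for r in records
    if not is_zombie(r) and normalize_npi(r.get("npi"))
]
```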
Transparency data becomes much more valuable when enriched with external context. Medicare reimbursement benchmarks, provider quality ratings, geographic crosswalks, and network attributes all add critical perspective. Enrichment transforms raw pricing data into a framework where costs can be tied to quality, outcomes, and geography. It also enables analysts to connect the dots between price and value rather than just producing static rate comparisons.
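One common enrichment is expressing each negotiated rate as a percentage of Medicare. The sketch below assumes two hypothetical CSVs, rates.csv and medicare_benchmarks.csv, with the column names shown; both the file names and columns are illustrative.

```python
import pandas as pd

# Hypothetical inputs; column names are assumptions for illustration.
rates = pd.read_csv("rates.csv")
# columns: billing_code, npi, payer, place_of_service, negotiated_rate
benchmarks = pd.read_csv("medicare_benchmarks.csv")
# columns: billing_code, medicare_rate

enriched = rates.merge(benchmarks, on="billing_code", how="left")

# Express each negotiated rate as a percentage of Medicare so prices
# can be compared across payers, providers, and geographies.
enriched["pct_of_medicare"] = (
    100 * enriched["negotiated_rate"] / enriched["medicare_rate"]
)
print(enriched[["billing_code", "payer", "pct_of_medicare"]].head())
```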
The final step is turning the prepared data into actionable insights. Effective analysis depends on the objective. For example, comparing rates for outpatient versus inpatient services requires factoring in the place of service. Removing statistical outliers can prevent extreme but rare values from skewing results. Recognizing that a single provider may work across multiple organizations with different contracted rates is also critical. Analysts may also want to evaluate contract terms at the plan level rather than lumping all rates under one payer.
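As a minimal sketch of one such analysis, the snippet below trims statistical outliers with Tukey fences and then summarizes rates by payer and place of service, so inpatient and outpatient services are never averaged together. It continues from the enriched DataFrame above; the column names remain assumptions.

```python
def trim_outliers(df, col="negotiated_rate", k=1.5):
    """Drop rows outside the Tukey fences (k * IQR beyond the quartiles)."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[(df[col] >= q1 - k * iqr) & (df[col] <= q3 + k * iqr)]

trimmed = trim_outliers(enriched)

# Median rate and sample size per payer and place of service, so extreme
# but rare values and setting-of-care differences don't skew comparisons.
summary = (
    trimmed.groupby(["payer", "place_of_service"])["negotiated_rate"]
           .agg(["median", "count"])
)
print(summary)
```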
AI is rapidly lowering the barriers to this type of work. Machine learning models can flag zombie rates and outliers automatically, cluster providers with similar pricing patterns, and even suggest benchmarks by payer and geography. Natural language interfaces, like those we use at Gigasheet, are making it easier for business teams to query datasets without writing code or learning complex interfaces. Instead of only expert data engineers being able to navigate terabytes of nested JSON, AI-assisted tools now allow anyone to quickly surface the insights that matter for negotiations, network design, and cost management.
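As a small illustration of the clustering idea, providers with similar pricing profiles can be grouped with off-the-shelf tools. The sketch below uses scikit-learn's KMeans on a provider-by-procedure rate matrix built from the trimmed data above; the number of clusters and the pivot columns are assumptions, not a recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per provider (NPI), one column per procedure's median rate.
rate_matrix = trimmed.pivot_table(
    index="npi", columns="billing_code",
    values="negotiated_rate", aggfunc="median",
).fillna(0)

# Standardize so high-dollar procedures don't dominate the distance metric.
X = StandardScaler().fit_transform(rate_matrix)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
```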
Platforms like Gigasheet combine this analytical flexibility with AI-powered parsing and cleaning at big-data scale, making it possible to work with massive, messy files and still deliver clean, trustworthy insights.