How to Compare Two Datasets to Validate Accuracy | by Marvin Rubia

If you have been in the analytics industry for some time now, I am 99.9% certain that a stakeholder has asked at least once (your team or another) about the accuracy of a specific dataset. Data discrepancy is a common issue that needs an urgent fix. Mistaken data naturally leads to mistaken insight.

Discrepancies will likely occur if an organization has multiple sources for the same metric. Unintended transformation in data transfer may also cause a discrepancy between the source and the destination.

But how do we actually prove that a given dataset is not accurate? In this blog, let’s demonstrate how to do it in MS Excel. The process is simple and even simpler the next time, as we can automate our formulas for the same analysis.

NOTE: If you are not a Medium subscriber, you can still read my full article via this link for free. Sharing my blog with other like-minded people is appreciated.

The source of truth

Our source of truth is a complete list of identifiers (175 unique emails) with data about their performance. It is the raw data before any analyst in the organization processes it. The performance (KPI) is equal to the ratio of the numerator (frequency of activity) and the denominator (time in minutes).

The lower the KPI, the better. Numerator and denominator values are sourced from separate locations. The Ratio column is a calculated column. There are no blanks under each column.

Our source of truthThe dataset in question

The dataset in question can be an export from a dashboard or a database table. It only has 162 unique emails. It has blanks under its numerator and denominator columns. The Ratio in Question column is a calculated column.

Our dataset in question

NOTE: The two tables (separate sheets) I have shown so far are stored in the same Excel file. Also, it is better to format each table as a table by…

How to Compare Two Datasets to Validate Accuracy | by Marvin Rubia | Mar, 2025

Tags: