Reconciling differences across Data Pipeline and Analytics

What is Analytics?

These are products, such as the Dashboard and the API, that provide aggregated and cleansed data.

What is Data Pipeline?

This is the raw data output for all events. It is neither aggregated nor cleansed.

How can both systems return different data but both be right?

Analytics and Data Pipeline use the same underlying raw data, but Analytics joins the event data with content metadata in a way that is not simple to replicate from the raw Data Pipeline data.

For example, let’s take a piece of content that was published on August 1, 2020. It was tagged with Section A. Then on August 2, 2020 it was changed to a different section. Analytics would capture that change, because it happened within 5 days of the publish date, and would report the content under the new section. The Data Pipeline serves data “as-is” when the event occurs, so events recorded before the change still carry Section A. One way to handle this is to treat metadata as a Type I or Type II Slowly Changing Dimension (SCD). This solution is detailed further below.

As a second example, take an article that was published on June 1, 2020. It was tagged with Section A. Then on June 15, 2020 it was changed to a different section, and the article was recrawled at that point. Analytics would not update the data already reported for the preceding days; it would only start reporting the new section as of June 15. The Data Pipeline would do the same. But if the customer's Data Pipeline (DPL) query applies the most recent metadata to all of the post's traffic, the two systems will report different numbers.
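The June example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page-view counts, the `SECTION_HISTORY` structure, and the function names are hypothetical and are not part of the Data Pipeline schema. It compares tagging each day's views with the section in effect on that day against tagging all views with the latest section.

```python
from bisect import bisect_right
from collections import Counter
from datetime import date

# Hypothetical metadata history for one article: (effective_from, section).
# Mirrors the example: Section A from June 1, a new section from June 15.
SECTION_HISTORY = [
    (date(2020, 6, 1), "Section A"),
    (date(2020, 6, 15), "Section B"),
]

# Hypothetical daily page-view counts derived from raw events.
DAILY_VIEWS = {
    date(2020, 6, 10): 100,
    date(2020, 6, 14): 50,
    date(2020, 6, 15): 30,
    date(2020, 6, 20): 20,
}


def section_as_of(day):
    """Return the section that was in effect on the given day."""
    starts = [start for start, _ in SECTION_HISTORY]
    index = bisect_right(starts, day) - 1
    if index < 0:
        raise ValueError(f"no metadata recorded on or before {day}")
    return SECTION_HISTORY[index][1]


def views_by_section(use_latest):
    """Total views per section, using latest or event-time metadata."""
    latest_section = SECTION_HISTORY[-1][1]
    totals = Counter()
    for day, views in DAILY_VIEWS.items():
        section = latest_section if use_latest else section_as_of(day)
        totals[section] += views
    return dict(totals)


at_event_time = views_by_section(use_latest=False)
latest_only = views_by_section(use_latest=True)
print(at_event_time)  # {'Section A': 150, 'Section B': 50}
print(latest_only)    # {'Section B': 200}
```

Both totals add up to the same 200 views; only the attribution to sections differs, which is the kind of gap that shows up when a query uses the most recent metadata for all traffic.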

Both systems are right! The Data Pipeline gives you more control to decide which “right” is best for your company and analysts.

Things to consider in your queries so that they consistently define your version of the truth:

  • Create a metadata dimension as a Slowly Changing Dimension (SCD). As with all SCDs, there are multiple versions of the truth, and you need to decide which is best for the business case. This would allow you to report on views based on either:
    • the metadata at the time the event happened on the site
    • the metadata that is true now
    • the metadata that was true in-between the event happening and now
  • Time zone differences (although small) should always be accounted for. All timestamps in the Data Pipeline are reported in UTC, while Analytics reports in the customer's time zone.
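
The time zone point can be illustrated with a minimal Python sketch. The fixed UTC-4 offset and the function name are assumptions made for illustration; a real query would use the time zone configured for your account, and the standard-library `zoneinfo` module is the usual choice when daylight saving time matters.

```python
from datetime import datetime, timedelta, timezone

# Assumed reporting time zone: a fixed UTC-4 offset, for illustration only.
REPORTING_TZ = timezone(timedelta(hours=-4))


def reporting_date(utc_timestamp):
    """Return the calendar date of a UTC event in the reporting time zone."""
    if utc_timestamp.tzinfo is None:
        # Treat naive timestamps as UTC, as the raw data is reported in UTC.
        utc_timestamp = utc_timestamp.replace(tzinfo=timezone.utc)
    return utc_timestamp.astimezone(REPORTING_TZ).date()


# A hypothetical event recorded shortly after midnight UTC.
event = datetime(2020, 8, 2, 1, 30, tzinfo=timezone.utc)
utc_day = event.date()
local_day = reporting_date(event)
print(utc_day)    # 2020-08-02
print(local_day)  # 2020-08-01
```

The same event falls on August 2 when bucketed by UTC date and on August 1 when bucketed by the reporting time zone, so daily totals near midnight can shift between days unless the query converts timestamps first.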

Last updated: August 16, 2023