Skip to main content

I’m trying to model some data from the warehouse sync files (in BigQuery) and I’m facing some issues with duplicate rows when I join Deliveries and Metrics.

What’s the correct way to join these two tables - I’m using delivery_id which is resulting in a few duplicates, but event_id doesn’t seem to be the correct field as it returned incomplete data.

Hi there Ash,

I believe you’re referring to joining/merging the two Parquet files for Deliveries and Metrics into one table in BigQuery?

The reason you are running into an issue is because each table uses a different primary key (unique identifier for rows in that table). The deliveries table uses delivery_id, while the Metrics table uses event_id (as you’ve mentioned!). The reason each delivery could have multiple event ids in the Metrics table is that each message can generate multiple delivery events (delivered, opened, clicked, etc.).

With that said, there isn’t really a “correct” answer to your question. As how you model your data is up to you depending on your use case and company goal to achieve with the merged table. 

I hope the above helps to provide some background into what is happening and assist you in making a decision about what you would like to do with the data tables.

- Maddie


Reply