Exploring Compound Value Dimensions with Snowflake and Looker

Indiana Jones with the compound value Golden Idol

This relationship is intuitive and efficient to model in both SQL and BI tools. If a user wants to show a measure (# of impressions) by a dimension (video url), we only need to expose these as default objects, and the user will be able to natively select the set they want to use.

In a many-to-many relationship, each row could have many values for a field, and that value can occur in many rows. There are many approaches to modeling this type of data, but one we have become fond of is using compound values (ARRAY, JSON, and other semi-structured data types) within the row.

Example many-to-many data with compound values

This relationship is much more difficult to explore and report on, for two reasons: the many-to-many representation violates a lot of identity principles, and transforming the data requires more advanced SQL.

In our previous implementation, we persisted an exploded version of the data for each representation. That meant replicating each row n times (where n is the number of values in the compound dimension), storing the result in a table, and building a reporting layer in Looker for that representation.
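To make the "replicate each row n times" step concrete, here is a minimal Python sketch of what exploding a compound column does. The sample rows and the column names (`video_url`, `categories`, `impressions`) are assumptions chosen for illustration, not the actual schema.

```python
from typing import Any


def explode(rows: list[dict[str, Any]], column: str) -> list[dict[str, Any]]:
    """Replicate each row once per value in the compound (list) column."""
    exploded = []
    for row in rows:
        for value in row[column]:
            # Copy the row, replacing the list with a single value.
            exploded.append({**row, column: value})
    return exploded


# Hypothetical sample rows; names and numbers are illustrative only.
videos = [
    {"video_url": "a.example/1", "categories": ["food", "health"], "impressions": 100},
    {"video_url": "a.example/2", "categories": ["food"], "impressions": 40},
]

exploded_rows = explode(videos, "categories")
for row in exploded_rows:
    print(row)
```

Two source rows become three exploded rows here. Persisting one such table per compound dimension (and per combination of dimensions) is where the storage and maintenance cost of the old approach came from.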

Some of the major pain points of this model include:

In this demonstration, we will show the full path from modeling the data in Snowflake, querying the compound value dimensions as single value properties, formalizing the approach into a Looker explore, and answering some common reporting questions in Looker.

For this demo, we create a table containing some fake data with semi-structured data types such as variants and arrays.

Applying LATERAL FLATTEN on arrays or variants is pretty straightforward for Snowflake users. But for stakeholders who just want to focus on business insights, there are a lot of complexities involved.

The key principle of this implementation is treating the LATERAL FLATTEN statement as another view which can be lazily joined back to the original view containing unexploded columns.

In order to perform the LATERAL FLATTEN dynamically, we are going to use native derived SQL to define this explore. We do this so we can use the lazy join feature: the LATERAL FLATTEN runs only when a compound value dimension is actually referenced in the query.
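The lazy join idea can be sketched outside Looker as a small query builder that adds a LATERAL FLATTEN clause only for the compound dimensions a user selects. This is not the LookML from our model; the table name `videos`, the column names, and the `build_query` helper are all hypothetical, and the generated SQL simply follows Snowflake's standard `LATERAL FLATTEN(input => ...)` form.

```python
# Hypothetical mapping: exploded dimension name -> compound column on the base table.
COMPOUND_COLUMNS = {"category": "categories", "keyword": "keywords"}


def build_query(dimensions: list[str]) -> str:
    """Build a Snowflake query, flattening only the compound columns requested."""
    select_parts = []
    from_parts = ["videos v"]
    for dim in dimensions:
        if dim in COMPOUND_COLUMNS:
            alias = f"{dim}_flat"
            # The flatten is added lazily: only when this dimension is selected.
            from_parts.append(
                f"LATERAL FLATTEN(input => v.{COMPOUND_COLUMNS[dim]}) {alias}"
            )
            select_parts.append(f"{alias}.value::string AS {dim}")
        else:
            select_parts.append(f"v.{dim}")
    select_parts.append("SUM(v.impressions) AS impressions")

    sql = f"SELECT {', '.join(select_parts)} FROM {', '.join(from_parts)}"
    if dimensions:
        positions = ", ".join(str(i + 1) for i in range(len(dimensions)))
        sql += f" GROUP BY {positions}"
    return sql


print(build_query(["video_url"]))            # no flatten needed
print(build_query(["category"]))             # one flatten
print(build_query(["category", "keyword"]))  # two flattens
```

A query that touches only single-value fields never pays for the flatten, which is the property the lazy join gives us in the explore.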

When stakeholders choose an explodable compound value dimension, the LookML explodes this value into separate rows. This allows Looker to model the individual values in the compound value dimension as single-value dimensions for reporting purposes. Because this happens prior to any real aggregation in the SQL engine, measures can be aggregated as usual.

As you can see in the explore we can explode a compound value column into single value columns that can then be used in reporting.

With this explore, users can freely peruse exploded versions of the compound value dimensions, without having to pay any special attention to the fact that the original dataset is stored as a compound value dimension. Looker users can easily ask their questions without knowing any of the complexities underlying the data.

"Show me the total impression counts per video category." In this request, the explore explodes only categories and sums the counts by category.

"We need to see the total video count, broken out by content category and keyword." In this request, the explore explodes categories AND keywords and sums the counts by both fields.
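When two compound dimensions are exploded together, each row expands to the cross product of its values. The following Python sketch illustrates that behavior on made-up data; the field names and values are assumptions for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical sample rows with two compound (list) dimensions.
videos = [
    {"video_url": "a", "categories": ["food", "health"], "keywords": ["soymilk", "vegan"]},
    {"video_url": "b", "categories": ["food"], "keywords": ["soymilk"]},
]

# Count videos per (category, keyword) pair after exploding both dimensions.
video_count = Counter()
for video in videos:
    for category, keyword in product(video["categories"], video["keywords"]):
        video_count[(category, keyword)] += 1

for pair, count in sorted(video_count.items()):
    print(pair, count)
```

Video "a" contributes 2 x 2 = 4 exploded rows and video "b" contributes 1, so row counts grow multiplicatively with each additional exploded dimension.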

"What's the total video count for ads where 'soymilk' is in the keywords and positive sentiment is greater than 0.03, broken out by content category?" In this case, we are filtering on the unexploded keywords and then exploding the content categories.
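The order of operations in this last request (filter on the unexploded array first, then explode a different array) can be sketched in Python as follows. The rows, field names, and sentiment values are invented for illustration.

```python
from collections import Counter

# Hypothetical ad rows; keywords and categories are compound (list) dimensions.
ads = [
    {"keywords": ["soymilk", "vegan"], "categories": ["food", "health"], "positive_sentiment": 0.10},
    {"keywords": ["soymilk"], "categories": ["food"], "positive_sentiment": 0.01},
    {"keywords": ["coffee"], "categories": ["food"], "positive_sentiment": 0.50},
]

# Step 1: filter on the unexploded keywords array and the sentiment score.
matching = [
    ad for ad in ads
    if "soymilk" in ad["keywords"] and ad["positive_sentiment"] > 0.03
]

# Step 2: explode categories only, and count videos per category.
video_count_by_category = Counter(
    category for ad in matching for category in ad["categories"]
)

print(dict(video_count_by_category))
```

Because keywords are never exploded, the filter does not multiply rows; only the categories of the one matching ad are expanded.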

Overall, we are really excited about the end result of our effort. This is a really simple solution to a really common and complicated data modeling problem.

We can finally fully leverage compound value modeling. In the modern big-data stack, this is a powerful solution to reporting complex data relationships in real-time.

Our data product is more robust and flexible. We used to think of lateral flatten as a “cleaning transformation” that would be performed during ETL and curation. By pushing this transformation to the reporting layer, the user has dynamic control of this process. We do not have to rebuild and rerun processing pipelines to add new data and answer new user questions.

This simplifies our data ecosystem. We have stripped away a lot of the overhead and complexities, and are left with a straightforward approach to exploring data. In the described case, we would previously have used eight tables to represent all the combinations of dimensions; now we use only one.

Find the limits for this solution. While we are really happy with the performance improvements, delaying transform until reporting obviously has its limits. If users query this data frequently, we will likely want a hybrid approach where some data is made available as already exploded.

Design how to UX-gate users to avoid common pitfalls of compound value dimensions. One common issue with compound value dimensions is double counting. While the measure counts are fundamentally correct, interpreting the many-to-many relationship can be confusing for users. The simplest example: summing the measure column across exploded rows yields a larger total than was actually recorded, because each row contributes once per value in its compound dimension. We are exploring UI elements (warnings, labels, column names, etc.) and alternative measure formats (normalized percentages, fractional displays, etc.) to determine what works best.
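A small worked example of the double counting issue, in Python with invented numbers. The fractional weighting shown at the end is only one possible reading of a "fractional display", included as an assumption; it is not a format we have settled on.

```python
from collections import defaultdict

# Hypothetical rows: impressions are recorded once per video.
rows = [
    {"video_url": "a", "categories": ["food", "health"], "impressions": 100},
    {"video_url": "b", "categories": ["food"], "impressions": 40},
]

recorded_total = sum(row["impressions"] for row in rows)

by_category = defaultdict(int)      # per-category totals after exploding
fractional = defaultdict(float)     # assumed alternative: split evenly across values
for row in rows:
    if not row["categories"]:
        continue  # nothing to explode for this row
    share = row["impressions"] / len(row["categories"])
    for category in row["categories"]:
        by_category[category] += row["impressions"]
        fractional[category] += share

exploded_total = sum(by_category.values())
fractional_total = sum(fractional.values())

print("recorded:", recorded_total)
print("per category:", dict(by_category), "sum:", exploded_total)
print("fractional:", dict(fractional), "sum:", fractional_total)
```

Each per-category figure is correct on its own (food really did appear on videos with 140 impressions), but adding the categories together gives 240 against a recorded 140. The evenly split variant sums back to the recorded total at the cost of per-category numbers that no longer match what users see on an individual video.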
