Winning with Dimensional Models in the Age of Big Data
It’s a fair question: Is dimensional modeling still relevant in the age of big data?
Read enough about “big data”, and you may start to get the impression that modeling techniques are old-fashioned or irrelevant.
Data professionals in IT departments are used to wrangling data. It’s understood that there’s no “easy button” when it comes to integrating data and getting meaning out of it. So we muster the energy and do what it takes to get the data in order for whatever analysis the business is after.
But guess what: Business users don’t want to wrangle data.
Not even savvy business users want to go through the trouble of writing map-reduce jobs or doing the work necessary to produce predictive models.
What do business users want? Self-service analysis capability in a context they understand.
- They don’t want to write code or queries
- They don’t want to be left integrating data sets themselves
- They don’t want to understand infrastructure
They want the information that can help them make a business decision without lots of hoops to jump through (including work orders for reports to IT).
That’s why I believe there are at least five reasons why dimensional models can be a huge win for the business and for IT.
1. Data in context is the most powerful model for robust analysis
A dimensional model is a representation of data that portrays its measurements in context… that is, it takes the numbers and surrounds them with highly descriptive characteristics about the event that generated the measurements.
Dimensional models highlight the characteristics, qualities, features, and facets…the who, what, when, where, how…of the data collection situation.
Therefore, a dimensional model is ideal for understanding the patterns and information that the data contains in a way that is widely approachable by analysts of all kinds.
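To make “measurements in context” concrete, here is a minimal sketch of a star schema using Python’s built-in sqlite3 module. The table names (fact_sales, dim_date, dim_product), the columns, and the sample rows are all hypothetical, made up purely for illustration. The fact table holds the numbers; the dimension tables hold the who, what, and when.

```python
import sqlite3

# In-memory database; every table, column, and row below is a made-up example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date TEXT,
    month_name    TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
-- The fact table stores the measurements plus keys into each dimension.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    sales_amount REAL
);
INSERT INTO dim_date VALUES
    (20240105, '2024-01-05', 'January'),
    (20240210, '2024-02-10', 'February');
INSERT INTO dim_product VALUES
    (1, 'Trail Bike', 'Bikes'),
    (2, 'Helmet', 'Accessories');
INSERT INTO fact_sales VALUES
    (20240105, 1, 2, 1800.0),
    (20240105, 2, 5, 250.0),
    (20240210, 2, 3, 150.0);
""")

# "Sales by month and category": the dimensions supply the context,
# the fact table supplies the numbers being summed.
sales_by_category_month = conn.execute("""
SELECT d.month_name, p.category, SUM(f.sales_amount)
FROM fact_sales AS f
JOIN dim_date    AS d ON d.date_key = f.date_key
JOIN dim_product AS p ON p.product_key = f.product_key
GROUP BY d.month_name, p.category
ORDER BY MIN(d.date_key), p.category
""").fetchall()

for row in sales_by_category_month:
    print(row)
```

A business intelligence tool would typically generate a query like this behind the scenes, so the end user only picks “month”, “category”, and “sales amount”.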
2. Data cleansing will always have to be done
There’s no escaping data cleansing and transformation.
In “Stages to Machine Learning”, I discuss some always-required steps for getting from A to Z in machine learning endeavors. Two of those are data cleansing and data transformation.
If data professionals don’t do this, end-users must.
No matter what, someone has to put in the effort to ensure quality, consistent data that is standardized for analysis.
So if you’re going to go through the trouble anyway, it makes sense to put in the effort to model the data so that business users can focus on what matters most: the business.
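As a rough illustration of the kind of standardization someone has to do, here is a small sketch in plain Python. The field names, the STATE_ALIASES lookup, and the accepted date formats are assumptions invented for this example, not rules from any particular system.

```python
from datetime import datetime

# Hypothetical lookup for inconsistent state values; extend as needed.
STATE_ALIASES = {"texas": "TX", "tx": "TX", "calif.": "CA", "california": "CA"}

# Date formats we assume the source systems might use.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def clean_record(raw):
    """Return a standardized copy of one raw record (a dict of strings)."""
    # Collapse repeated whitespace and normalize capitalization.
    customer = " ".join(raw["customer"].split()).title()

    # Map known spellings to a two-letter code; otherwise just upper-case.
    state_text = raw["state"].strip()
    state = STATE_ALIASES.get(state_text.lower(), state_text.upper())

    # Try each known date format; leave the date empty if none match.
    order_date = None
    for fmt in DATE_FORMATS:
        try:
            order_date = datetime.strptime(raw["order_date"].strip(), fmt).date().isoformat()
            break
        except ValueError:
            continue

    # Strip currency formatting so the amount can be summed later.
    amount = float(raw["amount"].replace("$", "").replace(",", ""))

    return {"customer": customer, "state": state, "order_date": order_date, "amount": amount}


cleaned = clean_record(
    {"customer": "  acme   corp ", "state": "texas", "order_date": "03/07/2024", "amount": "$1,250.50"}
)
print(cleaned)
```

If IT does this once, up front, every report built on the model benefits. If not, each analyst ends up writing their own version of it.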
3. It’s just a matter of time until you need a schema
You’re not escaping schemas if you’re delivering the ability for end-users to interact with the business’s data sets with relative ease.
“Data lakes” offer the opportunity to reverse the order in which schemas are required for a particular analysis, shifting it from an up-front necessity to a “schema on read” / “just-in-time schema” approach. But “schema on read” isn’t the same as “no schema”.
So again, whether it’s schema on write, as in traditional Extract-Transform-Load processes, or schema on read, as a data lake allows, there’s likely going to be a schema in your future.
It makes sense, then, to design a schema (a model, if you will) that end-users can readily understand and work with using a business intelligence reporting and visualization tool of their choice.
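To show that “schema on read” still involves a schema, here is a minimal sketch in plain Python. The raw events, the READ_SCHEMA mapping, and the read_with_schema helper are all hypothetical names made up for illustration. The raw lines are stored untouched, and the field names and types are only imposed at the moment someone reads them.

```python
import json

# Raw events kept exactly as they arrived, as they might sit in a data lake.
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "ts": "2024-03-07T10:00:00"}',
    '{"user": "u2", "amount": 5, "extra": "ignored"}',
]

# The schema still exists; it is simply declared by the reader, not the writer.
READ_SCHEMA = {"user": str, "amount": float, "ts": str}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project them onto the given schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None; fields outside the schema are dropped.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, READ_SCHEMA)
for row in rows:
    print(row)
```

With schema on write, the same casting and field selection would happen once during loading, before the data lands in a table. Either way, somebody has to decide what the fields are and what they mean.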
4. Not all data problems are “big data” problems
Before your team runs off to implement a Hadoop cluster, ask, “Do we really have ‘big data’?”
If your data set can fit in memory on a single computer, your data is not “big data”. If your data is coming in at manageable intervals, i.e., if you’re able to keep up, you don’t have the velocity problem that big data presents.
If you don’t have a big data problem, don’t invent one for yourself!
Somewhere along the way, your business users just want to perform some basic descriptive analytics. Go there. Provide that value.
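The two rules of thumb above (does it fit in memory, and can you keep up) can be turned into a quick back-of-the-envelope check. This is only a sketch: the function name, its parameters, and especially the overhead_factor of 3 are assumptions chosen for illustration, not established thresholds.

```python
GIB = 1024 ** 3


def big_data_check(dataset_bytes, memory_bytes, arrival_rate, processing_rate,
                   overhead_factor=3.0):
    """Rough check for volume and velocity problems.

    overhead_factor is an assumed multiplier for how much larger the data
    gets once loaded into memory for analysis; tune it for your own tools.
    Rates can be in any unit (e.g. rows per second) as long as both match.
    """
    return {
        "volume_problem": dataset_bytes * overhead_factor > memory_bytes,
        "velocity_problem": arrival_rate > processing_rate,
    }


# Example: a 4 GiB data set on a 64 GiB machine, with 500 rows/s arriving
# and capacity to process 2000 rows/s.
check = big_data_check(
    dataset_bytes=4 * GIB,
    memory_bytes=64 * GIB,
    arrival_rate=500,
    processing_rate=2000,
)
print(check)
```

If both answers come back False, a single database server and a well-designed dimensional model are probably all you need.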
5. Dimensional models are low-hanging fruit
In terms of investment, dimensional models are relatively low-hanging fruit to implement.
Additionally, they provide real, tangible value to a business.
If you’re looking for a relatively “easy” win, building dimensional models that reflect the processes of the business in an integrated manner is a good way to get off the ground.
When the time comes, you’ll have done the data cleansing and transformation work to provide standard data sets that can be used for more advanced forms of analysis.