We have launched our next-gen data warehouse automation (DWA) platform, Astera DW Builder, which accelerates and simplifies data warehouse development. It is a unified, metadata-driven solution that allows organizations to design, develop, and deploy enterprise-grade data warehouses in days.
We recently hosted a webinar where we had the opportunity to connect with James Serra, a renowned thought leader in applying big data and advanced analytics. He currently works at EY as Data Platform Architecture Lead and previously spent seven years at Microsoft as a solution architect specializing in big data and data warehousing.
In our last discussion, we talked to James to gain some insider insights on modern data warehouse development. In this live Q&A session, we delved into the concept of automating data warehouse design.
What do you think of the concept of the enterprise data model? Do you think it’s time to move on from this idea of preparing all your requirements upfront before moving into a development cycle? If not, how would you shift from this approach toward a more iterative style of data warehousing?
I think most customers have moved on from that sort of waterfall approach where they spend a lot of time gathering requirements. They’ve switched to an agile type of development, and a lot of it has to do with the tools that have come out lately. If you look at business intelligence tools, I’ve found customers are now using them to define the business requirements, instead of somebody from IT going to a client saying, “Okay, what’s your requirement? Let’s take it down, let’s build something,” coming back to find out it’s wrong, and repeating that cycle. Now they say, “Hey, go use the prototype, and we’ll use that as the business requirements.”
Modern ETL and reporting tools allow you to prototype and create those requirements easily. And if not, it’s usually, “Hey, we need a quick win. Let’s start building something out, show the value of what we’ve built, and get end-users excited.” A lot of the time, that helps unlock budgets. You’re also bringing in those end-users early on, so they feel like they’re part of what you’re building, and they get something of value. So, you pick out something you can do in the short term that has a lot of value, and you come out with that.
Now, I always say you want to keep your end goal in mind. It’s sort of like building a city that you’re going to break up into towns. You may come up with a blueprint for the whole city with all the towns in it, and then you build one town at a time, one after the other. You don’t build the whole thing before you open it up.
So, you have to have that long-term vision of where you want to go, but you get those quick wins early on.
What are your thoughts on data vault? Do you think it’s going to overtake dimensional modeling as the preferred technique anytime soon, or does everything have its place?
I’ll say that there’s no way data vault will ever take over dimensional modeling. It’s been around for over 20 years, and it has a great use case. I find customers who need to do a lot of historical tracking and auditing using data vault, but there aren’t a lot of use cases that require something as sophisticated as data vault.
So, I think it has its place. But I see 99 percent of customers using some sort of star schema along with a third normal form, and a lot of it depends on whether you get the performance you need in third normal form. If I’m joining a dozen or so tables, I may not be able to get that performance, so I’ll go to a star schema. It’s an extra layer, but you get a lot of benefit by having that extra layer in there.
Also, it helps when you’re looking at self-service BI. If I don’t have a star schema, then I have to know the relationships between all those tables, which can be very difficult for an end-user. Usually, you have one person in the company who knows how to join everything, and you rely on that person.
Instead, you create that star schema, and then it’s self-service BI. I can just go to a reporting tool, drag fields onto the form, and build it out without having to understand how the relationships work, because the star schema takes care of that. So, there’s a great reason for the star schema, and I would say probably 75 percent of the companies I see will use a star schema on top of a third normal form.
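To make that concrete, here is a minimal sketch in Python with pandas, using made-up sales tables (the table and column names are our own illustration, not from the webinar), of how a star schema keeps self-service queries simple: every join is a plain fact-to-dimension lookup, so a reporting tool can assemble it without anyone needing to know a web of third-normal-form relationships.

```python
import pandas as pd

# Dimension tables: descriptive attributes, one row per member.
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "product_name": ["Widget", "Gadget"],
    "category": ["Hardware", "Hardware"],
})
dim_date = pd.DataFrame({
    "date_key": [20240101, 20240102],
    "month": ["2024-01", "2024-01"],
})

# Fact table: numeric measures plus a foreign key to each dimension.
fact_sales = pd.DataFrame({
    "date_key": [20240101, 20240101, 20240102],
    "product_key": [1, 2, 1],
    "sales_amount": [100.0, 250.0, 75.0],
})

# Self-service query: each join is a simple fact-to-dimension lookup,
# so a reporting tool can generate it without hand-written join logic.
report = (
    fact_sales
    .merge(dim_product, on="product_key")
    .merge(dim_date, on="date_key")
    .groupby(["month", "category"])["sales_amount"]
    .sum()
    .reset_index()
)
print(report)
```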
Talking a little bit about that data modeling and dimensional modeling perspective, what role do you think metadata has to play? Do you think a metadata-driven framework can benefit data warehouse design? And if so, how?
I have seen a lot of effort lately in creating data catalogs for customers. The idea behind that is they’re gathering so much data, and you want to avoid duplicating it.
We’re doing this at EY on a large scale because we’re bringing in all this third-party data. It could be literally tens of millions of dollars of third-party data, and you don’t want somebody going out and buying a license because they didn’t realize the data was already there.
So, let’s create a metadata catalog, and let’s create a data discovery tool in a marketplace where any end-user can go, “Hey, I need to create something using this particular type of data. I wonder if we have it. Let’s go to the catalog and see if it’s there.” If it is, we can instantly get access to that data and avoid duplication.
Also, a big thing lately with customers is lineage. “I got this value from this calculation. How did we get there? Where did it come from? I want to make sure it’s right.” So, they go back and trace it. Metadata lineage is important, and when you’re building out a schema, that schema is metadata too.
This is another thing I think you’ll start seeing: a push toward cataloging not just the data but the datasets built on it, integrated with the data catalog. “It may be nice that I can see we have customer data and product data, but maybe somebody’s already built this dataset. Maybe somebody’s already built a report and dashboard on it, and I can quickly use that instead of reinventing the wheel.”
So, metadata is so important now that we have this proliferation of data. In some cases, we have data in a data lake that doesn’t carry its own metadata, so we’ve got to create that metadata. You need a product, a solution, that is going to match all of that up and make it a lot easier for people to discover the data.
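As an illustration of that discovery idea (a toy sketch, not how EY’s catalog or any particular product works; the class and dataset names here are invented), a catalog boils down to registering datasets with descriptive metadata and lineage, then searching it before acquiring or rebuilding anything:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    description: str
    owner: str
    tags: set[str] = field(default_factory=set)
    lineage: list[str] = field(default_factory=list)  # upstream sources

class DataCatalog:
    """A toy in-memory catalog; real products add storage, access control, etc."""
    def __init__(self):
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[CatalogEntry]:
        kw = keyword.lower()
        return [e for e in self._entries.values()
                if kw in e.description.lower()
                or kw in {t.lower() for t in e.tags}]

catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="customer_360",
    description="Conformed customer dataset joined from CRM and billing",
    owner="data-platform",
    tags={"customer", "third-party"},
    lineage=["crm.accounts", "billing.invoices"],
))

# Before licensing new third-party data, check whether it already exists.
for hit in catalog.search("customer"):
    print(hit.name, "<-", hit.lineage)
```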
We’re obviously living in the era of big data. How do you think the explosion in volume, variety, and velocity has impacted data architecture?
Yeah. It’s more challenging because now you have to think about the size of the data. If I have to create a model, it may not perform well if I don’t do it correctly. So, it puts much more importance on making sure that the design is correct. And data governance is a huge part of it, too.
Suppose the data gets into a model. How can I make sure that it’s cleaned and that I’m joining data correctly? The biggest reason I see for the failure of big data warehouse projects is not enough time spent on data governance.
You come out with a dataset, and users say, “Great, I’m going to use this dataset,” and then go, “Wait a minute! This data is wrong.” If that’s their first impression, you’ve lost their confidence right upfront. They’re not going to trust anything you come out with, so you have to spend a lot of time beforehand validating that data and making sure the dataset is correct.
Bringing in the users early on, telling them you’re testing this stuff out, and making sure it’s valid is one way to do it. It’s hugely important to win their confidence when you’re developing these solutions, so make sure you have enough time for data governance.
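One lightweight way to buy that confidence, sketched below with invented checks and data rather than any specific governance framework, is to validate a dataset against basic rules before it is ever published to end-users:

```python
import pandas as pd

def validate_dataset(df: pd.DataFrame, key: str, required: list[str]) -> list[str]:
    """Return a list of data-quality problems; an empty list means the checks passed."""
    problems = []
    if df[key].duplicated().any():
        problems.append(f"duplicate values in key column '{key}'")
    for col in required:
        if df[col].isna().any():
            problems.append(f"nulls in required column '{col}'")
    return problems

orders = pd.DataFrame({
    "order_id": [1, 2, 2],           # duplicate key: should fail the check
    "customer_id": [10, None, 12],   # missing value: should fail the check
    "amount": [99.0, 45.5, 12.0],
})

issues = validate_dataset(orders, key="order_id", required=["customer_id", "amount"])
if issues:
    # Fail fast: don't publish a dataset that would erode users' trust.
    print("Blocked from publishing:", issues)
```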
What are your thoughts on the statement that a robust, verified data warehouse schema model equals an overall high-quality data warehouse architecture?
I would agree with that. Again, that assumes the data governance is in place, and this gets into more than just technology. I’ve had a lot of discussions with customers about how you can have the best technology in the world, but you really need to have the people and the processes in place.
We need that data governance: maybe a center of excellence, meetings to decide who owns the data, and a vetting process for cleaning data. If we clean the data in the data warehouse, do we also bring it back to the source system? All these things have to be discussed.
So, it’s so important to have the right people with the right skills, which is very challenging. We’ve seen a dearth of talent lately, and finding people with these skill sets is hard. But there are enough examples out there of how companies have set up their groups, people, and processes for these types of solutions, and you can try to follow those best practices.
Given MPP technology and columnar storage, do you see a trend of using the one big table (OBT) model for reporting and analytics on top of a dimensional model?
Well, MPP, for those who may not know, is massively parallel processing. The idea is that queries that could take hours to run on an SMP (symmetric multiprocessing) system, or a typical solution, will run anywhere from 20 to 100 times faster on an MPP system. It can do that with third normal form tables, and it can do it even better with a star schema; I’ve seen tremendous results on lots of data, even with many different joins.
So, the good thing about MPP technology is that you don’t have to go to extremes to denormalize your data. My conclusion is that you almost always want to go to a star schema for numerous reasons, and as for the idea that you need one big table: with MPP technology, in most cases, you don’t.
Some third-party products required you to do that (or, under the covers, were doing it) because of performance issues. But because the technology has advanced so much, it’s rare that I find a customer going to the length of creating one big table. Instead, they use the star schema.
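For contrast, here is a small pandas sketch (hypothetical tables again, of our own invention) of what the one-big-table approach amounts to: pre-joining the star schema into a single wide, denormalized table. As James notes, an MPP engine can usually perform the same join fast enough at query time that materializing the OBT is unnecessary.

```python
import pandas as pd

dim_customer = pd.DataFrame({
    "customer_key": [1, 2],
    "customer_name": ["Acme", "Globex"],
    "region": ["East", "West"],
})
fact_orders = pd.DataFrame({
    "customer_key": [1, 1, 2],
    "order_amount": [120.0, 80.0, 200.0],
})

# One big table (OBT): dimension attributes are copied onto every fact row,
# trading storage and duplication for join-free reads.
one_big_table = fact_orders.merge(dim_customer, on="customer_key")
print(one_big_table)

# The same report is available either way; on an MPP engine the star-schema
# join is typically cheap enough that materializing the OBT is optional.
print(one_big_table.groupby("region")["order_amount"].sum())
```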
What role does automation really have to play? And what kind of advantages do you think enterprises can see from using a purpose-built solution to drive their data modeling, verification, and other processes?
I’m all about shortcuts. So, when I talked to customers, it was, “Well, wait a minute here! For what you’re talking about, I think you could probably use some automation tools and third-party products. Yeah, there’s an extra expense, but the time savings and the accuracy you may get out of it could be well worth it.” So yeah, I am all for those third-party tools.
Of course, there are trade-offs. You have to get the skill set, so if you don’t know the tool, you’ve got to learn it. And is it an automation tool that you’ll need all the time going forward, or does it just generate code that you can pick up and use? Because if you always have to use that tool, there could be some challenges with that.
As the product the tool is building on top of updates its features, the data warehouse automation software has to update too, or maybe you can’t use all the features. So, you have to look at it pretty closely. But there are so many good tools out there now that shortcut the process and provide that automation.
In most cases, what customers are building is not that different from what thousands of other customers have built. So yeah, there’s an automation tool for that. It’s very rare that you’re building something so unique, or at such a scale, that a third-party tool wouldn’t help.
Even more so if you’re in a popular industry like healthcare, finance, banking, or retail: those all have very repeatable scenarios that you can use an automation tool for.
Are there any parting words you want to leave us with?
I always say, if you’ve never gone down this path before and you’re building a real solution, get some experts. Again, talent is hard to find right now, but find a consulting company or a third-party company that has been down this road before, and bring them in early in the process.
Because this is what I do in my role as architecture lead at EY, I know that you need to make some really key decisions early on, and if you make the wrong decisions, it can wind up costing you a lot of time and money down the road.
So, find people who can help you make the right decisions early on. In all the projects I’ve seen fail, whether because of the product, the people, or the process, it’s rare that the technology itself was bad. It’s that the wrong technology was chosen: you wanted a relational database and tried to make it work where, for example, a NoSQL solution would have been much better.
So, get a lot of help early on in making those big, key decisions, and understand what’s coming down the road, including the roadmaps of the products involved. If you’re using a specific product, make sure you know what is in development for the next three to six months.
These projects take a lot of time to build, and you don’t want to be a few months in when a vendor says, “We’ve got this new product and a new feature,” and you go, “Oh, should we have known about this?” I say this because a big part of my role has been making sure people were educated about what was coming down the road.
Data warehouse projects are long paths; you’re building for six months or a year, and it never really ends. But a lot of those big decisions are made early on, and you need to know that roadmap. Those would be my parting words.
Automating Data Warehouse Design Using Astera DW Builder
Astera DW Builder is an automated, end-to-end data warehousing solution that allows enterprises to go from source to insights at lightning speed. The agile, metadata-driven platform enables businesses to integrate dispersed data into an agile BI system and create a centralized gateway for analytics at enterprise scale.
Astera DW Builder offers a zero-code, integrated development environment to build your data warehouse at the logical level, automating all standard data modeling tasks all the way through to deployment. For a first-look trial of Astera DW Builder, click here.
Authors:
- Ammar Ali