Data Lakes vs Data Warehouses

Data Lakes versus Data Warehouses

What is a data lake?
A data lake is a central repository that allows you to collect and hold all of your unstructured and structure data at any scale. What that sentence means is that you do not have to transform your unstructured data to store and then run analytics on it. You can store all your data as-is.

Do I need a data lake?
Companies that have put data lakes in place outperformed like companies by 9% which was reported by an Aberdeen survey. What was found is that these companies were able to do new types of analytics (even create new products) from their sources of data like log files, social media, IOT, etc being stored inside of data lakes. Being able to parse and learn from this data enabled them to react faster to what their data was telling them. Looking at unstructured and structure data allowed these companies to attract and retain customers and make better informed decisions.

Data Lakes versus Data Warehouses
You have to approach this decision with facts and requirements. Depending on the needs to the business you might need a data warehouse, a data lake, or even both. Allow the needs of the business and what data the business collects drive the decision organically. Before we go any further lets define each of these.

Data Warehouse is a system that pulls together data from many different sources within the business usually transactional systems that is needed by the business to conduct day to day operations. This data is collected for reporting and analysis.

Data Lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. By allowing the data to remain in its native format, the size of the data can be greater and pulled in a more timelier manner which will give the business quicker insight what is going on.

Lets compare Data Warehouses to Data Lakes side by side to help see what is best for your business to use for its analytic needs.

Section Data Lake Data Warehouse
Schema Written at the time of analysis Architected prior to streaming the data in.
Data This is both non-relational and relational data from web sites, social media, IOT and business applications Transactional systems and operational databases
Performance and Price Low cost storage and queries are getting faster Higher cost storage and fast queries
Data Quality This is raw data. May or may not be curated This data is greatly curated and serves a Single Source of Truth
Users Data scientists, business analysts, data developers Business analysts
Analytics Predictive analytics, data discovery, and machine learning Business Intelligence and batch reporting

Looking at your data in different ways can help your business

You can look at your data many different ways and it will help your business to look at that data in as many different ways as possible and not just one. We will cover some of the basics here and hopefully it will get your mind thinking about what data you currently have and how you can look at it. By looking at your data in different ways or different perspectives you can obtain different business values.

Grouping

You can group your data. What I mean by grouping is when you have rows and rows of data and you want to bring that data together then grouping logical can help you. Take for instance you have eight to ten sales reps making calls to physicians daily trying to sell them their pharmaceutical drugs. You can group all that data by day for a giving week to show what days the sales people are making the most successful sales calls.

Granular

Looking at your data granular is the processing of driving further down into the details of the data. Let’s take the above example of showing successful sales people calls for a giving week. Well we know that Bob on the sales team has be crushing it since he joined, but looking at the data for daily sales calls he and the other sells team members are placing about the same amount of calls Monday through Friday. Let’s get a little more granular and drill down into each of the days calls and see what time he is making his calls. Now the data is showing us that Bob is making all his calls either early in the morning or late afternoon and all the other sales people are consistently calling in between Bob’s hours. Deriving this types of answers and drilling into the data will give you more answers that will help you guide others in the business to be better.

Visualizing

One of the most important ways to look at your data is seeing it in a visual concept or chart. I feel this is one of the most important ways for your data to tell it’s story. One of the biggest reasons visualizing your data is so important is because people are so visual and seeing data in row or tabular form to understand the data can be difficult and sometimes mind numbing. Showing a data story through a picture (chart and/or graph) allows people to understand the meaning quicker. If there was a significant increase or decrease in products sales or customer service calls seeing that spike in a graph will be quick and easy. If there is bad data in your warehouse a picture will definitely tell that story and you will be able to visually see that very quickly. Data visualization tools have greatly changed how businesses have worked with their data as well how businesses have dug deeper into their data gaining valuable insight. So make sure you ask the question “Can this data be visually representing?” and if the answer is yes build out the data visualization and share it with the business. 

Data Relationship

Can my data be related? I think this question not only does not get asked enough, but it is one of the most powerful ways to look at your data. If your data can be related to other data how strong is that relationship? Even though asking about data relationships is one of the most basic questions in looking at your data it is the most overlooked and the most informative. Let’s look at a simple example: You have a subscription form on your website that allows potential customers to fill out and subscribe to your newsletter. You have two sets of data, one set being the customers information and the other being website. If you look at each data separately you can know who the customer is, from the form information. Looking now at just the website data you can see how an unknown person entered your web site and walked it (What I mean by walking, is what web pages did they visit on your web site). Now let’s join the data (the relationship), Once we do that we know who the person is, how they got to the web site, what they did on the web site, what page they came from before filling out the form, and then what they did after filling out the web form. Now you have a complete picture and you can make some pretty good assumptions about the user. Powerful right. Relating your data will take time and thought but trust me it is definitely worth the work.

These are just some of the ways that you can look at your data and get answers to questions that have been asked and possibly questions that have not been asked yet. Never stopping asking questions and never stop looking at data in different ways.

Advantages of a data warehouse in the AWS cloud

There are several advantages of why I believe you should go with a data warehouse living in the cloud and not on premise. I know everyone has heard all the news and excitement around what AWS has been doing around their data warehouse offerings. AWS has been one of the leaders in the Cloud industry for a while and the way they are continually updating their existing products as well as releasing new cloud products is awesome. Lets now go over the advantages that I see a cloud warehouse has over on-premise.

Speed of implementation: With a cloud solution your server can be up and running within fifteen minutes. For the on-premise you will have to order a server, rack the server, install the software, apply patches, put the server on the network, and all that can take several weeks to months from ordering to your warehouse is live and ready.

Flexibility and scalability: Let’s say in a year or two that your company no longer wants a data warehouse, with the cloud solution you just terminate it and now you are no longer paying for those resources. With the on-premise you have to figure out on how to recoup the capital expense from those purchased server and/or servers. Scalability is what the cloud is made for. You want another server you spin it up. With the on-premise you have to plan accordingly and go through the entire process again order a server, racking the server, install the software, apply patches, putting the server on the network.

Ongoing cost: Hardware usually has a three year life span and then you have to replace the server. Along with replacing the server you will have to plan the migration of the warehouse and then migrate. All of that adds up the cost as well. In the cloud you are not responsible for the hardware so this is all handled without your involvement you can focus on the warehouse schema and the data which is the most important pieces of the business intelligence solution.

Security: All cloud based providers are constantly checking the security of their infrastructure with teams of people that try to compromise the system. To go long with that, infrastructure physical security is very tight around their data centers usually taking two to three methods of identification before being allowed to enter the facility.

I think it is safe to say that I am a little biased when it comes to cloud solutions and what they can offer to businesses. That is not to say that I do not rule out on-premise infrastructures. There are requirements and situations that the business really needs to look hard at and make the decision that on-premise is the best way to go for them.

 

Seeing the picture

Join Goggle and marketing data to get big picture of your prospects. The why and how

The Why:

When you look at each silo of data individually you do not get a complete picture of what a visitor or prospect is interested in, you only see what that one piece of the puzzle is showing you. You can get good information from your Google Analytics data (silo) when you look at it by itself and you can get good information from your marketing data (silo) when you look at that data by itself. What I am writing about in this blog is bringing those silos of data together to give you more insight into a potential customers journey, like when you put pieces of a puzzle together you see the picture clearer. Lets say that your business is utilizing Pardot as its marketing tool to not only manage email campaigns that are guiding potential customers to landing pages that are hosted on your web site but you are also gather customer information from forms that are being embedded within your business’s website. Wouldn’t it be great to see what content that prospect was looking at prior to and after they filled out a form on your website or where did they navigate to after getting to that landing page from one of your email campaigns? You have all the information in Pardot first name, last name, email address, etc and now you would be able to gather additional intelligence about them by what they are viewing within your business’s web site. You could even use that information to send out more direct campaigns based on the new knowledge that you have obtained from joining the data. For instance the prospect was on your website for 5 minutes and they visited the integration, business analytics, and warehouse pages. I know what I want to send to that prospect now.

 

The How: 

You have to join that data somewhere so lets bring that data into a warehouse. I used the CopperHill AIR Platform to run all the jobs that I created with Talend to move the data from Google and Pardot into the AIR Warehouse. Lets start with the steps I took with Google first. I opened Google Query Explorer which is a tool that allows you to play with the Core Reporting API by building queries to get data from your Google Analytics views. This tool allows you to set different metrics, dimensions, sorts, filters and segments so that you can see the end results before applying them into your data integration tool. Once I got the queries working just as I wanted them to and the data looked good, I move those exact settings inside my Talend tool. With Talend I had to install the Google Analytics Component as it is not part of the base install. The component is free and very easy to install. Once I setup the component and deployed it to the AIR Platform I was immediately streaming data. Below is just an example of one call to Google’s Core Reporting API within the job. I repeated the same steps for each call I wanted to make to Google Analytics API and all the data transfers were handled inside that one job. Channel Data

The next step was to get Pardot data flowing. Pardot does have a great API that you can access and read your marketing data from. Again I used the AIR Platform to run the job that I created in Talend to move this data out of Pardot and into the AIR warehouse. Let me say this, you do have to have knowledge of the Pardot API or creating this job can be a little tricky. What I did was setup a REST call to each of the Pardot objects that I wanted to receive data from and store within the AIR warehouse. I have attached a portion of the job I configured to retrieve data from the Pardot campaign object. As you can see the setup and configuration is a little more complex then the call to the Google API but the data is flowing just as fast.

Now that all the data from the two sources Google and Pardot are streaming into the AIR warehouse we can create the reports showing exactly what we stated in The Why section. This thought process and build out can be applied to really any marketing tool that a business is using and join that to your Google Analytics data. If you want to gather great actionable information around your prospects then joining google and marketing data is definitely a great idea for you and your business. Finish the data puzzle so you can see the big picture and just do not look at the pieces by themselves.

Data warehouse in the Cloud versus On-premise

There are a lot of misconceived notions that cloud solutions have speed constraints leading to latency and have huge security vulnerabilities. These thoughts and some others have many companies move the cloud solution option to the bottom of the pile and that is simple not a good option for the business. To address the first notion that cloud solutions have speed constraints leading to latency, I think is misleading when people state that. If your warehouse and your visualization tool are all in the same cloud infrastructure then I do not think latency will be an issue at all. It is when your warehouse is in the cloud and visualization tool is hosted on premise is when you can see a little latency if any at all. Both AWS and Azure offer tools to support high speed on-premise to cloud technologies. AWS calls its tool Direct Connect and Azure call theirs ExpressRoute. Direct Connect and ExpressRoute  are essentially private cloud connections that offer faster speeds, more reliability, lower latencies, and higher security than standard Internet connections. These tools essentially connect your network (private) directly to your cloud provider without crossing the public internet.  The second notion in reference to security vulnerabilities. Lets address physical security first. AWS and Microsoft have invested a lot of money into the physical security around protecting their infrastructure and it greatly exceeds any normal IT environment. Now for data security that is something that you have complete control of. Yes it is a little more complex than your traditional on-premise environment but if done properly your data is just as secure. Use all the security control tools that are available deterrent, preventative, detective and corrective and along with creating policies and using best practices you will put in place some outstanding data security measures that will allow everyone in the business to sleep good at night. There are many books out that review this topic in great detail, I do suggest investing in one or two and further educate yourself in this area.

In my opinion cloud based solutions, especially for small to midsize companies is an excellent option and can provide  enterprise like excellent performance and availability. I also believe that most companies can really reap the benefits from the cloud solution which typically can lower cost, scale easier, and be up and running very quickly. Two cloud warehouse solutions you should definitely look at are AWS Redshift and Google Big Query. These are the two leaders in the industry and they have proven themselves as reliable and scalable. If you do review them please reach out to me and let me know your thoughts and which one you decided to go with and why.

How much does it cost?

This question is usually the first asked question by the executive team. If your executive team does not ask this within the first five minutes of your conversation then you have a team that is investing in doing the right thing by looking at what technology is best for the business and not worried about cost. There will be a difference in cost on what solution you do pick whether it is the cloud solution or an on-prem. With an on-prem solution you will have bigger capital expenses as well as the cost of additional software licenses depending what database you choose for your data warehouse and what operating system that database needs to run on. These expenses include but are not limited to hardware (server(s), network switches, ups, etc), space, and power consumption. Do not forgot about the human cost on maintenance of the servers and the database. The cloud solution will already have server, power and space included in the price. The server patches as well as patches to the database will be taken care for you with the cloud solution so no worries about additional man hours around those areas as a monthly resource cost.

There are plenty of articles and tools out there they go over on how to calculate the cost of on-prem and cloud solutions and they are very informative. One of the tools I use often is the AWS total cost of ownership tool https://aws.amazon.com/tco-calculator/. This tool will assist you in seeing how AWS can reduce your capital expenses in a project such as implementing a business intelligence solution as well provide you some very detailed reports that you can utilize in your presentation to management. The one thing I will say if you are only implementing a one server solution the cloud solution might cost a little more but will give you several advantages. In my next blog I will talk about some of those advantages in detail.

The importance of a Data warehouse in a Business Intelligence Solution

I consider the data warehouse as the foundation of a business intelligence solution – your single source of truth. As with any good structure if the foundation is not good the structure will come crumbling down. Before we go any further about the importance of a data warehouse in a bi solution lets define it. A data warehouse is a large repository that contains data that is collected from a wide range of sources. These sources of data are usually the business’s transactional systems. The data warehouse is used to guide decision making at all levels and is optimized from read access. Taking the time here to build your warehouse by defining your schemas and tables  is crucial to the solution and the business. You do not want to rush through this part by any means. By not designing the data warehouse correctly the project’s cost can increase significantly, cause delays as well as potentially incorrect reporting of the data that is being stored in the warehouse incorrectly. Take your time here to ask the right questions, develop the right solution, review the solution with your team, and then implement and test. You will need to make sure the data warehouse is not only designed right but can also grow and scale as your business’s data needs grow. One thing you do not want to happen is in six months to a year  you are getting reports that the system is either slow and/or unresponsive. If one of my clients were asking me about this situation, my first thoughts would be around did the team calculate the right size warehouse or server, did they miss something within the database structure (wrong join, wrong columns were index, etc.). I can not say this enough, the data warehouse is the foundation of the business intelligence solution and it deserves the time to be designed, built, and tested correctly. Like I stated earlier if the foundation is bad then your business intelligence solution will have many issues, if you do not do your due diligence around the building of the data warehouse your solution will have a lot of issues in the future.

Why you need a Data warehouse

Let’s start off by defining Data Warehouse. A data warehouse is a central repository of information that can be analyzed to make better-informed decisions. It can also be further defined as a repository that stores large amounts of data that has been collected and integrated from multiple sources – such as a CRM, payroll or accounting software, or inventory and sales systems.

“Why do we need  a data warehouse that is separate from our business transactional systems?”  This is a question that we get asked frequently at CopperHill Consulting. We answer this with several statements:

  • Business transactional systems are built for tasks and very specific work flows.
  • Business transactional systems allows editing, while a data warehouse is read-only.
  • Business transactional systems should only hold current data. A data warehouse can hold historical and current data.

 

If you want to move forward with your BI strategy, you need a data warehouse.

The data warehouse is a core component of Business Intelligence. Here’s how a data warehouse makes an impact:

  • Maintains a copy of data from your transactional systems. This allows you to keep your transactional systems lean and processing quickly with only the most recent and relevant data visible. It also lets you keep a history of all past transactions for recordkeeping and analysis.
  • Improves the quality of the data. Identify duplicate entries and records. Find anomalies in your data. Build custom views. These are all ways that a data warehouse can help improve the quality of your data both in your transactional systems and in your reporting.
  • Restructures information for different users. Create different user roles to restrict permissions and set different views to make it easier for users to understand the story their data is telling them.
  • Integrates data from multiple transactional systems. This lets you see a bigger and clearer picture of your business across all departments and silos.
  • Delivers excellent query performance without compromising business transactional systems. No need to worry about your systems slowing down, timing out, or crashing.

Your data warehouse will change and evolve as your business gets larger and greater over time. As your company grows, your requirements shift.  Your data warehouse needs to be designed to be flexible and scalable so it can handle changing requirements. Automated integration solutions to move company data from your business transactional systems and flat files to the data warehouse is one way you can make sure your data warehouse can grow with your company. Automations keep costs low as well as lower the chance of errors.