What is a data lake?
A data lake is a centralized repository that lets you collect and store all of your structured and unstructured data at any scale. In other words, you do not have to transform unstructured data before you can store it and run analytics on it. You can store all of your data as-is.
Do I need a data lake?
Companies that have put data lakes in place outperformed similar companies by 9%, according to an Aberdeen survey. These companies were able to run new types of analytics, and even create new products, from data sources such as log files, social media, and IoT that were stored in their data lakes. Being able to parse and learn from this data let them react faster to what it was telling them. Looking at both structured and unstructured data helped these companies attract and retain customers and make better-informed decisions.
Data Lakes versus Data Warehouses
Approach this decision with facts and requirements. Depending on the needs of the business, you might need a data warehouse, a data lake, or even both. Let the needs of the business and the data it collects drive the decision. Before we go any further, let's define each of these.
A data warehouse is a system that pulls together data from many different sources within the business, usually the transactional systems that the business needs to conduct day-to-day operations. This data is collected for reporting and analysis.
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. Because the data stays in its native format, more of it can be stored and it can be pulled in sooner, which gives the business quicker insight into what is going on.
Let's compare data lakes and data warehouses side by side to help you see which is best for your business's analytic needs.
Characteristic | Data Lake | Data Warehouse |
Schema | Applied at the time of analysis (schema-on-read) | Designed before the data is loaded (schema-on-write) |
Data | Non-relational and relational data from websites, social media, IoT, and business applications | Data from transactional systems and operational databases |
Performance and Price | Low-cost storage; queries are getting faster | Higher-cost storage; fast queries |
Data Quality | Raw data that may or may not be curated | Highly curated data that serves as a single source of truth |
Users | Data scientists, business analysts, data developers | Business analysts |
Analytics | Predictive analytics, data discovery, and machine learning | Business Intelligence and batch reporting |
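
To make the Schema row of the table more concrete, here is a minimal Python sketch of the two approaches. It is only an illustration under stated assumptions: the sample events, the column list, and the function names (warehouse_load, lake_store, lake_total_purchases) are all hypothetical, and plain in-memory lists stand in for real storage systems. The warehouse-style loader checks every record against a fixed set of columns before storing it, while the lake-style store keeps every line unchanged and applies structure only when a question is asked.

```python
import json

# Hypothetical raw events, exactly as they might arrive from different sources.
RAW_EVENTS = [
    '{"user": "ann", "action": "login", "ts": "2024-01-01T10:00:00"}',
    '{"user": "bob", "action": "purchase", "amount": "19.99"}',
    'not json at all',
]

# Assumed warehouse table layout, decided before any data is loaded.
WAREHOUSE_COLUMNS = ("user", "action", "ts")


def warehouse_load(lines):
    """Warehouse style: the schema is enforced when data is written."""
    table, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        # Only records that match the predefined columns exactly are stored.
        if isinstance(record, dict) and set(record) == set(WAREHOUSE_COLUMNS):
            table.append(tuple(record[col] for col in WAREHOUSE_COLUMNS))
        else:
            rejected.append(line)
    return table, rejected


def lake_store(lines):
    """Lake style: every line is kept as-is, with no validation on write."""
    return list(lines)


def lake_total_purchases(lake):
    """Lake style: structure is applied at read time, for one specific question."""
    total = 0.0
    for line in lake:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # Unparseable lines stay in the lake but are skipped here.
        if isinstance(record, dict) and record.get("action") == "purchase":
            total += float(record.get("amount", 0))
    return total


table, rejected = warehouse_load(RAW_EVENTS)
lake = lake_store(RAW_EVENTS)
print(len(table), "rows loaded into the warehouse,", len(rejected), "rejected")
print(len(lake), "lines kept in the lake")
print("purchase total computed at read time:", lake_total_purchases(lake))
```

In this sketch the purchase event never reaches the warehouse table because it does not match the predefined columns, yet it is still available in the lake when someone later asks about purchases. The trade-off is that every reader of the lake has to deal with malformed or unexpected records on their own.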