Self Service BI needs Data Governance

Self Service BI, I believe, is very important for businesses to implement and can greatly increase the productivity of a business as a whole. Let me give you my definition of Self Service BI: it allows employees to conduct their daily analytics work with little to no IT intervention, which increases productivity and gets answers to the questions that matter to that department or set of employees.

When a business is evaluating Business Intelligence visualization tools to allow its different departments to conduct their own data analytics and data discovery, I would say Data Governance not only greatly helps but actually rescues the whole approach the business wants to put in place. Data Governance can help these self-service users navigate complex data models and ensures that all the data being analyzed is of the highest quality. Data Governance also makes sure that all the different departments in the business are looking at and talking about the data in the same way, so everyone is getting the same story from the data and moving in the same direction.

With all the above being stated, Self Service BI does come with obstacles, and each of the obstacles below needs to be addressed and can be solved with a Data Governance solution. You might encounter more obstacles as you pursue Self Service BI, but the ones below are the obstacles I have run into more than a few times and wanted to share.

Obstacle 1: Data quality. Many businesses I have worked with only realize they have data quality issues once we get into the project. Trying to convince the stakeholders that Data Governance is built for exactly this is not only cumbersome at times but outright difficult. Stakeholders often say the investment is too much in dollars; my response is that I do not think you can afford not to put Data Governance in place. Here are some examples of questions you can ask the stakeholders or executives: Is there duplication in the data? Can you see all the touch points of your customers? Is the data that your systems are collecting accurate and valid? Let the stakeholders know that implementing Data Governance does not have to be a huge investment in time or money if you take the approach I like to call the Mindful Data Governance Initiative.

Obstacle 2: Data growth. Many businesses are seeing exponential growth in the data they are collecting (in both variety and data sources), especially businesses with many departments. Implementing a Data Governance solution ensures data quality and that all the processes that collect data are repeatable and valid.

Obstacle 3: One size fits all. In Self Service BI there might be an attempt to make one tool fit everyone's needs, from a sales rep to a data analytics employee. A one-size-fits-all methodology can work for a Self Service BI tool, but it will not work for Data Governance, because the range of data analytics across a business is simply too wide. And self-service is of no benefit if the data being accessed is not valid and governed correctly. Data Governance provides a consistent and repeatable way to manage data collection across business units and to make sure that the information delivered is reliable and of the highest quality.

Obstacle 4: Freedom versus control. Although the biggest benefit of self-service computing is that it democratizes complex Data Management tasks in the daily life of a business user, you still have to ask: are there any drawbacks to having so much freedom over critical business data? If all types of business users have access to critical data, aren't Data Security and Data Privacy at high risk of loss or corruption? Unless Data Governance policies take these risks into full consideration through rules, procedures, and access controls, the whole purpose of self-service may be compromised.

As I stated earlier, you will definitely come across more obstacles when implementing Self Service BI in your business, and you will have to figure out solutions for them. It is my belief that if you do not have a formal Data Governance solution in place first, getting over all these obstacles will be extremely hard if not impossible.

Data Governance 4 Your Business

What is Data Governance?
Data governance refers to the overall management of the availability, usability, integrity, and security of the data deployed in your business. A healthy data governance program includes a governing group, a defined set of repeatable procedures, and a well-documented plan to execute those procedures.

I always think of the number four when I think of Data Governance. Here is why: some attributes of the number four are hard work, security, practicality, productivity, appreciation, tradition, solid foundations, security-consciousness, self-control, loyalty, conscientiousness, and high morals and ethics. The essence of the number four is security, diligent work, and strong foundations. Those attributes and that essence are exactly how your Data Governance should be implemented and treated. One of the most important attributes of a Data Governance program is a solid, strong foundation. Remember: DATA GOVERNANCE 4 YOUR BUSINESS.

Your Data Governance should consist of a four-way structure incorporating availability, usability, integrity, and security.

Why is Data Governance important?
Data Governance is important because it ensures that data assets are formally, proactively, properly, and efficiently managed throughout the organization, securing trust in and accountability for that data.

Data Governance comprises collecting data, revising and standardizing it, and making it ready for use. It makes the data consistent. Data Governance ensures that critical data is available at the right time to the right person, in a standardized and reliable form, which helps the business and its operations stay organized. Adopting and implementing Data Governance can improve the overall efficiency and productivity of an organization.

What are some of the methodologies of Data Governance?
Data is king and is vitally important to every business, no matter the size. You can implement Data Governance in phases, but the implementation must always span the entire business in order to be successful. It is also very important to remember that once a phase has been implemented, the governance body has to continue to monitor, maintain, and review those implemented processes and the data; this is critical. Other success factors that can help you implement a winning Data Governance program are:

• Look at and prioritize areas of improvement (a phased approach).
• Create roles, responsibilities, and rules from the processes people use in working with the data.
• Establish an accountability infrastructure.
• Move your business culture toward a master data management system.

The way to start this highly structured and monitored Data Management strategy is to standardize the use of terminology across business units and enforce consistency of use. The ultimate goal of Data Governance is to make it possible to consolidate your data and create a consistent view of that data across the business for advanced Business Intelligence activities. Literally, you are turning data into a "Single Source of Truth" that the entire business is looking at.

Business Intelligence without Data Governance
Can you implement Business Intelligence without Data Governance? Of course you can, but I believe the two must go hand in hand. Sound Data Governance can significantly increase the returns of a company-wide Business Intelligence investment. When starting a BI initiative with clients, I seldom hear them talking about Data Governance, but that does not take away from the reality that it has to exist. Without governing their data in this data-driven world, businesses will never realize the full potential of the data they are collecting. Data governance used to be a nice-to-have, but with the increasing focus on and importance of data and analytics, it is becoming a necessity that helps drive data management across the business.

For Example
Take a financial business I worked with that had very poor, inconsistent customer data. Customer first, middle, and last names appeared with multiple variations, and addresses were inconsistent. This kind of situation makes it very difficult to do any type of customer analytics, from identifying cross-sell opportunities to tracking and understanding customer experience. Data Governance can be a first step in identifying the issues, defining standards, and implementing changes in the business to align with those standards.

Remember DATA GOVERNANCE 4 YOUR BUSINESS

Inner and Outer Joins

Outer and Inner SQL Joins

SQL is such a wonderful language, and one question I get from people I talk with is: what is the difference between outer and inner joins? Before we talk about that, let's talk about what a join is. A join is used to compare, combine (literally join), and return specific rows of data from two or more tables in a database. It really is that simple. An inner join finds and returns matching data from tables, while an outer join finds and returns matching data plus some dissimilar data from tables.

Inner Join
An inner join focuses on the commonality between two tables. When using an inner join, there must be at least some matching data between the two (or more) tables being compared. An inner join searches the tables for matching or overlapping data and, upon finding it, combines and returns that information as one result set.
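
As a minimal sketch, assume a hypothetical customers table and an orders table related by a customer_id column (the tables and columns here are illustrative, not from any real schema):

-- Returns one row per matching customer/order pair; customers with
-- no orders (and orders with no customer) are left out entirely
SELECT c.name, o.order_id, o.total
FROM customers c
INNER JOIN orders o
  ON c.customer_id = o.customer_id;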

Outer Join
An outer join returns a set of records (or rows) that include what an inner join would return but also includes other rows for which no corresponding match is found in the other table.

There are three types of outer joins:

  • Left Outer Join (or Left Join) – returns every row from the left table, plus the matching rows from the right table; unmatched left rows are still returned, padded with nulls.
  • Right Outer Join (or Right Join) – returns every row from the right table, plus the matching rows from the left table.
  • Full Outer Join (or Full Join) – returns all rows from both tables, matched where possible.

Each of these outer joins refers to the part of the data that is being compared, combined, and returned. Nulls appear in the results wherever a row on one side has no match on the other.
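
Using the same hypothetical customers and orders tables from the inner join sketch above, the three outer joins look like this:

-- Left join: every customer, with nulls where no order exists
SELECT c.name, o.order_id
FROM customers c
LEFT OUTER JOIN orders o ON c.customer_id = o.customer_id;

-- Right join: every order, with nulls where no matching customer exists
SELECT c.name, o.order_id
FROM customers c
RIGHT OUTER JOIN orders o ON c.customer_id = o.customer_id;

-- Full join: all rows from both tables, matched where possible
SELECT c.name, o.order_id
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;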

To see more examples of LEFT OUTER and RIGHT OUTER JOINs, look back at the blog post RIGHT JOIN and LEFT JOIN.

Data Lakes vs Data Warehouses

Data Lakes versus Data Warehouses

What is a data lake?
A data lake is a central repository that allows you to collect and hold all of your unstructured and structured data at any scale. In other words, you do not have to transform your unstructured data in order to store it and then run analytics on it. You can store all your data as-is.

Do I need a data lake?
An Aberdeen survey reported that companies that have put data lakes in place outperformed similar companies by 9%. These companies were able to do new types of analytics (and even create new products) from data sources like log files, social media, and IoT devices stored inside their data lakes. Being able to parse and learn from this data enabled them to react faster to what their data was telling them. Looking at unstructured and structured data together allowed these companies to attract and retain customers and make better-informed decisions.

Data Lakes versus Data Warehouses
You have to approach this decision with facts and requirements. Depending on the needs of the business, you might need a data warehouse, a data lake, or even both. Let the needs of the business and the data the business collects drive the decision organically. Before we go any further, let's define each of these.

A data warehouse is a system that pulls together data from many different sources within the business, usually the transactional systems the business needs to conduct day-to-day operations. This data is collected for reporting and analysis.

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. Because the data remains in its native format, greater volumes can be stored and pulled in more quickly, giving the business faster insight into what is going on.

Let's compare data warehouses and data lakes side by side to help you see which is best for your business's analytic needs.

| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Schema | Written at the time of analysis | Architected prior to loading the data |
| Data | Non-relational and relational data from web sites, social media, IoT, and business applications | Transactional systems and operational databases |
| Performance and price | Low-cost storage; queries are getting faster | Higher-cost storage; fast queries |
| Data quality | Raw data; may or may not be curated | Heavily curated; serves as a Single Source of Truth |
| Users | Data scientists, business analysts, data developers | Business analysts |
| Analytics | Predictive analytics, data discovery, and machine learning | Business Intelligence and batch reporting |
Slow SQL

Speeding up Slow Queries inside of Postgres

If your tables are large and you are suffering from slow queries, the number one place to look is the indexes on those tables; they are usually the culprit. There are several tools that can help you investigate and dissect the problems you are having with queries in your Postgres database. Other databases have similar tools, but in this blog we will focus only on the Postgres components.

PG_STAT_STATEMENTS
The pg_stat_statements module is an awesome place to start: it not only gathers statistics on the SQL being executed against your database, it also makes it easy to identify the slowest-performing SQL statements.

Once you have this module installed, a system view named pg_stat_statements will be available with all kinds of great information. After it has had a chance to observe a good amount of activity, look for queries that have a relatively high total_time value.

Example: SELECT * FROM pg_stat_statements ORDER BY total_time DESC;
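
If the module is not already set up, enabling it looks roughly like this (a sketch: the shared_preload_libraries change goes in postgresql.conf and requires a server restart; note that on Postgres 13 and later the total_time column is named total_exec_time):

-- In postgresql.conf, then restart the server:
--   shared_preload_libraries = 'pg_stat_statements'

-- Then, in the database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- The ten most expensive statements by total execution time
SELECT query, calls, total_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;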

AUTO_EXPLAIN
auto_explain is very useful for finding slow queries and has its own advantages: it logs the actual execution plan and supports logging nested statements via the log_nested_statements option. Nested statements are statements executed inside a function; if your application uses many functions, auto_explain is invaluable for getting their detailed execution plans. The log_min_duration setting controls which query execution plans are logged, based on how long they run. For instance, if you set it to 1000, every SQL statement that runs longer than one second will be logged.
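
A sketch of turning it on for a single session (loading it for every session would instead go through session_preload_libraries or shared_preload_libraries in postgresql.conf):

-- Load the module for the current session (requires superuser)
LOAD 'auto_explain';
-- Log the plan of any statement running longer than one second
SET auto_explain.log_min_duration = 1000;
-- Also log plans for statements executed inside functions
SET auto_explain.log_nested_statements = on;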

Statistics Collector
The statistics collector is a great subsystem that collects all sorts of performance statistics, which are extremely helpful when analyzing your slow-performing SQL.

Turning the statistics collector on gives you tons of pg_stat_… views containing all kinds of handy information. In particular, I have found it extremely helpful for finding missing and unused indexes.
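
For example, a query like this against the standard pg_stat_user_indexes view (a sketch; interpret the counts against how long statistics have been accumulating) surfaces indexes that have never been scanned and may be candidates for removal:

-- Indexes with zero scans since statistics were last reset
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY schemaname, relname;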

EXPLAIN
The EXPLAIN command is a great tool when it comes to tuning queries. It tells you exactly what is really going on within your SQL. To execute it, simply put the EXPLAIN command in front of your statement (i.e., EXPLAIN SELECT …) and run it. PostgreSQL will show you the execution plan it would use.

When using EXPLAIN to improve the performance of your SQL, I suggest using the ANALYZE option (EXPLAIN ANALYZE), as it gives you more accurate results. The ANALYZE option actually executes the statement (rather than just estimating it) and then explains it. In a later blog I will go into greater depth on the output of EXPLAIN.
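
As a quick sketch, assuming a hypothetical orders table with a customer_id column:

-- Executes the statement and reports the actual plan,
-- row counts, and timings for each plan node
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;

Be careful running EXPLAIN ANALYZE on statements that modify data, since they really do execute; wrapping them in a transaction you roll back is a common safeguard.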

Benefits

Benefits of Business Intelligence Visualization Tools

There are so many business intelligence visualization tools available in this day and age. Tableau, Power BI, Qlik, and AIR Intel are just a few. These tools help users make better and more informed decisions about their business by depicting the data as graphs or charts. There are many benefits to using a business intelligence visualization tool, and we are going to touch on a few:

• Easier to understand and quicker to action: The human brain tends to process visual information far more easily than written information. Using a chart and/or graph to summarize complex data ensures faster comprehension of relationships than cluttered reports or spreadsheets.
• Interact with data: The greatest benefit of data visualization, in my opinion, is that it exposes changes in a timely manner. Unlike static charts, interactive data visualizations encourage users to explore and even manipulate the data to uncover other factors. Drilling into a chart to see the underlying data, which could be yet another chart or a table/grid of the raw data, helps the user see the data from the highest level to the lowest. For example, say we have a pie chart depicting counts of sales calls by region within a specific time frame. You can click on a region to see a bar chart in which each bar represents the number of calls each salesperson made in that region. That same user can then click on a specific salesperson's bar to see all the details behind the calls: who they called, when they called, how long each call was, comments from the call, and so on. This type of functionality lets the user see not only how the sales team is doing overall but also who the best salespeople are and why they are successful. Are all the successful calls made within a certain time frame? The visualization tool makes it much easier to convey a story.
• Creation of new discussions: Another advantage of data visualization is that it provides a ready means to tell stories from the data. Heat maps can show the development of product performance over time across multiple geographic areas, making it easier to see which products are performing very well and which are underperforming. With this functionality built into most visualization tools, users (executives, managers, and employees) can drill down into specific locations to see what is working and what is not, and pivot if needed.
• Communicate more effectively: Gone should be the days of reading an eight-to-ten-page document to decipher what occurred in your business by month and/or quarter. Now you can supply reports that distill complex data into simple outputs and have them automatically delivered to the people who should be reviewing the data. Not only do visualization tools allow you to communicate more effectively, I would also argue that the automatically delivered reports arrive in a more timely manner.
• Absorb more information easily: Data visualization enables users to view and understand vast amounts of information about operational and business conditions. It allows decision makers to see connections between multi-dimensional data sets and provides new ways to interpret data through heat maps, bar charts, line charts, bubble charts, and other rich graphical representations. Businesses that use visual data analytics are more likely to find the information they are looking for, and to find it sooner, than other companies.

Above are just a few of the benefits of data visualization tools, and I am sure you can think of several more if you have played around with or used visualization tools.

Reading CSV and inserting data with Python

One of the easiest things you can do in Python is read a CSV file and insert that data into a Postgres database. In this blog I will show you how to open and read a CSV file and then loop through it, building an insert statement for each row to move that data into the database.

Table structure in the Postgres database (note that user is a reserved word in Postgres, so the table is named test to match the insert statement below)
CREATE TABLE test
(
business CHARACTER VARYING(155),
firstname CHARACTER VARYING(75),
lastname CHARACTER VARYING(100),
title CHARACTER VARYING(50)
);

First we want to import the libraries we need, then create a connection and a cursor
import csv
import sys

import psycopg2

conn = psycopg2.connect(host="HOSTNAME", database="DATABASE NAME", user="USER", password="PASSWORD")
cur = conn.cursor()

Second, let's create an insert statement
insert_stmt = "INSERT INTO test (business, firstname, lastname, title) VALUES (%s, %s, %s, %s)"

Third, we open the CSV file, read it, build the insert statements, and insert. Here are some sample file lines:

"BUSINESS","FIRSTNAME","LASTNAME","TITLE"
"Jakes Landscape","Jake","Switch","President"
"Lamps etc","Robert","Jump","Sales"
"PMP Light Bulbs","David","Downes","Warehouse"

file_name = "PATH TO CSV FILE"
reader = csv.DictReader(open(file_name))

# Count how many records have been processed
recordcount = 0
for row in reader:
    try:
        cur.execute(insert_stmt, (row['BUSINESS'],
                                  row['FIRSTNAME'],
                                  row['LASTNAME'],
                                  row['TITLE']))
        conn.commit()
        recordcount = recordcount + 1
    except (Exception, psycopg2.DatabaseError) as error:
        # Roll back the failed insert so later rows can still proceed,
        # and report the error to stderr
        conn.rollback()
        print(error, file=sys.stderr)

cur.close()
conn.close()
print('Processed ' + str(recordcount) + ' rows from ' + file_name)

It is as simple as that to read a CSV file and insert its data into a database.