Mindful Data Governance Initiative – Identifying Data Domains

In my last blog we learned about some of the roles and responsibilities that belong within Mindful Data Governance; now we need to identify the data domains. Before we can start formalizing the informal data governance practices going on within your business today, we need to identify the different data domains. Simply put, all you are doing is identifying the (transactional) systems that are collecting data about the business and its consumers. For example, data domains could reference product, consumer, vendor, finance, HR, and logistics information.

Looking at the above example, each of these data domains can contain:

  • Data owners
  • Business glossaries
  • Data dictionaries
  • Business processes
  • Data catalogs
  • Reports catalogs
  • Data quality scorecards
  • Systems and applications
  • Policies and standards

The best way to start identifying data domains is simply to look at the business's needs or problems. For example, let's say a client needs to improve customer experience, to up-sell while a customer is buying a different product, and to gain control over validating customers' needs.

Requirements such as these were tied to the business problem our client was facing: they needed visibility into and understanding of their customers. Data was spread across multiple systems and applications with no defined ownership. We helped create ownership by identifying key stakeholders, business processes, and datasets related to the customer domain, and established control around its lifecycle. The idea is to have a clear understanding of where data comes from, who owns it, and, when changes are made, who should be involved.

Remember, when implementing the Mindful Data Governance Initiative you do not have to implement all the data domains at once; in fact, I highly suggest that you do not. Assign a number to each data domain in the order the business would like them implemented, and then follow that order. Mindful data governance is about implementing data governance in phases so that it does not create a disturbance in your business. Mindful Data Governance will increase your data quality, accessibility, usability, security, and availability.

Self Service BI needs Data Governance

Self Service BI, I think, is very important for businesses to implement and can greatly increase the productivity of a business as a whole. Let me give you my definition of Self Service BI: it allows employees to conduct their daily analytics work with little to no IT intervention, which increases productivity and gets answers to the questions that are important to that department or set of employees.

When businesses are looking into Business Intelligence visualization tools to allow their different departments to conduct their own data analytics and data discovery, I will say Data Governance not only greatly helps, it rescues the very idea the business wants to put into place. Data Governance can help these self-service users with complex data models and ensures that all the data being analyzed is of the highest quality. Data Governance also makes sure that all the different departments in the business are looking at and talking about the data in the same way, so everyone gets the same story from the data and moves in the same direction.

With all of the above stated, Self Service BI does come with obstacles, and each of the obstacles below needs to be addressed and can be solved with a Data Governance solution. You might encounter more obstacles as you pursue Self Service BI, but the ones below are those I have run into more than a few times and wish to share.

Obstacle 1: Data quality. A lot of the businesses I have worked with only realize that they have data quality issues once we get into the project. Trying to convince stakeholders that Data Governance is built for exactly this is not only cumbersome at times but outright difficult. Many times the stakeholders say the investment is too much in dollars; my response is that I do not think you can afford not to put Data Governance in place. Here are a few examples of questions you can ask the stakeholders or executives: Is there duplication in the data? Can you see all the touch points of your customers? Is the data your systems are collecting accurate and valid? Let the stakeholders know that implementing Data Governance does not have to be a huge investment in time or money if you take the approach I like to call the Mindful Data Governance Initiative.

Obstacle 2: Data growth. Many businesses are seeing exponential growth in the data they are collecting (in variety and data sources), especially businesses with many departments. Implementing a Data Governance solution ensures data quality and that all the processes that collect data are repeatable and valid.

Obstacle 3: One size fits all. In Self Service BI there may be an attempt to make one tool fit everyone's needs, from a sales rep to a data analytics employee. A one-size-fits-all approach may work for Self Service BI tools, but it will not work for Data Governance, because the range of data analytics needs is simply too wide. And self-service is of no benefit if the data being accessed is not valid and governed correctly. Data Governance provides a consistent and repeatable way to manage data collection across business units and to make sure the information delivered is reliable and of the highest quality.

Obstacle 4: Security and privacy. Although the biggest benefit of self-service computing is that it democratizes complex data management tasks in the daily life of a business user, you still have to ask: are there any downfalls to having so much freedom over critical business data? If all types of business users have access to critical data, aren't Data Security and Data Privacy at high risk of loss or corruption? Unless Data Governance policies take these risks into full consideration through rules, procedures, and access controls, the whole purpose of self-service may be compromised.

As I stated earlier, you will definitely come across more obstacles when implementing Self Service BI in your business, and you will have to figure out solutions for them. It is my belief that if you do not have a formal Data Governance solution in place first, getting over all these obstacles will be extremely hard if not impossible.

Data Governance 4 Your Business

What is Data Governance?
Data governance refers to the overall management of the availability, usability, integrity, and security of the data deployed in your business. A healthy data governance program includes a governing group, a defined set of repeatable procedures, and a well-designed, documented plan to execute those procedures.

I always think of the number four when I think of Data Governance. Here is why: some attributes of the number four are hard work, security, practicality, productivity, appreciation, tradition, solid foundations, security-consciousness, self-control, loyalty, conscientiousness, high morals, and ethics. The essence of the number four is security, diligent work, and strong foundations. All of those attributes, and the essence of the number four, are exactly how your data governance should be implemented and treated. One of the most important attributes of Data Governance is a solid, strong foundation. Remember: DATA GOVERNANCE 4 YOUR BUSINESS.

Your Data Governance should consist of a four-way structure incorporating availability, usability, integrity, and security.

Why is Data Governance important?
Data Governance is important because it ensures that data assets are formally, proactively, properly, and efficiently managed throughout the organization, so that they can be trusted and accounted for.

Data Governance comprises collecting data, revising and standardizing it, and making it ready for use. It makes the data consistent. Data Governance ensures that critical data is available at the right time to the right person, in a standardized and reliable form. This helps the business and its operations to be better organized. Adopting and implementing Data Governance can help improve the overall efficiency and productivity of an organization.

What are some of the methodologies of Data Governance?
Data is king and is very important to every business, no matter the size. You can implement Data Governance in phases, but the implementation must always span the entire business in order to be successful. It is also very important to remember that once a phase has been implemented, the governance body will have to continue to monitor, maintain, and review those implemented processes and the data; this is critical. Other success factors that can help you implement a winning Data Governance program are: identify and prioritize areas of improvement (a phased approach); create roles, responsibilities, and rules from the processes people use in working with the data; establish an accountability infrastructure; and shift your business culture toward master data management. The way to start this highly structured and monitored data management strategy is to standardize the use of terminology across business units and enforce consistency of use. The ultimate goal of Data Governance is to make it possible to consolidate your data and create a consistent view of that data across the business for advanced Business Intelligence activities. In effect, you are turning data into a "Single Source of Truth" that the entire business is looking at.

Business Intelligence without Data Governance
Can you implement Business Intelligence without Data Governance? Of course you can, but I believe the two must go hand in hand. A sound Data Governance program can significantly increase the returns of a company-wide Business Intelligence investment. When starting a BI initiative with clients I seldom hear them talking about Data Governance, but that does not take away the reality that it has to exist. Without governing your data in this data-driven world, businesses will never realize the full potential of the data they are collecting. Data governance used to be a nice-to-have, but due to the increasing focus on and importance of data and analytics, it is becoming a necessity that helps drive data management across the business.

For Example
Take a financial business I worked with that had very poor, inconsistent customer data. Customers' first, middle, and last names were recorded with multiple variations, and addresses were inconsistent. This type of situation makes it very difficult to do any type of customer analytics, from identifying cross-sell opportunities to tracking and understanding customer experience. Data Governance can be a first step in identifying the issues, defining standards, and implementing changes in the business to align with those standards.

Remember DATA GOVERNANCE 4 YOUR BUSINESS

Inner and Outer Joins

Outer and Inner SQL Joins

SQL is such a wonderful language, and one question I often get from people is: what is the difference between outer and inner joins? Before we talk about that, let's talk about what a join is. A join is used to compare, combine (literally join), and return specific rows of data from two or more tables in a database. It really is that simple. An inner join finds and returns only the matching data from the tables, while an outer join returns the matching data plus some of the non-matching data.

Inner Join
An inner join targets the commonality between two tables. When using an inner join, there must be at least some matching data between the two (or more) tables being compared. An inner join searches the tables for matching or overlapping data and, upon finding it, combines and returns that data as one result set.
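Here is a minimal sketch of an inner join, using two hypothetical tables, customers and orders, that I will reuse below:

-- Returns one row per matching customer/order pair; customers with no orders are left out.
SELECT c.name, o.order_date
FROM customers c
INNER JOIN orders o ON o.customer_id = c.id;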

Outer Join
An outer join returns a set of records (or rows) that includes what an inner join would return, but also includes rows for which no corresponding match is found in the other table.

There are three types of outer joins:

  • Left Outer Join (or Left Join) – returns all matched rows plus the unmatched rows from the left table.
  • Right Outer Join (or Right Join) – returns all matched rows plus the unmatched rows from the right table.
  • Full Outer Join (or Full Join) – returns all matched rows plus the unmatched rows from both tables.

Each of these outer joins refers to the side of the data whose rows are kept even when no match is found. NULLs will be produced in this process, because some data is shared while other data is not.
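Using the same hypothetical customers and orders tables from the inner join sketch above, the three outer joins look like this:

-- Left outer join: every customer, with NULL order columns where no order exists.
SELECT c.name, o.order_date
FROM customers c
LEFT OUTER JOIN orders o ON o.customer_id = c.id;

-- Right outer join: every order, with NULL customer columns where no customer matches.
SELECT c.name, o.order_date
FROM customers c
RIGHT OUTER JOIN orders o ON o.customer_id = c.id;

-- Full outer join: all rows from both tables, matched where possible, NULLs elsewhere.
SELECT c.name, o.order_date
FROM customers c
FULL OUTER JOIN orders o ON o.customer_id = c.id;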

To see more examples of LEFT OUTER and RIGHT OUTER joins, look back at my earlier blog, RIGHT JOIN and LEFT JOIN.

Data Lakes vs Data Warehouses

Data Lakes versus Data Warehouses

What is a data lake?
A data lake is a central repository that allows you to collect and hold all of your unstructured and structured data at any scale. What that sentence means is that you do not have to transform your unstructured data before storing it and running analytics on it. You can store all your data as-is.

Do I need a data lake?
An Aberdeen survey reported that companies that have put data lakes in place outperformed similar companies by 9%. The survey found that these companies were able to do new types of analytics (and even create new products) from data sources such as log files, social media, and IoT being stored inside their data lakes. Being able to parse and learn from this data enabled them to react faster to what their data was telling them. Looking at both unstructured and structured data allowed these companies to attract and retain customers and make better-informed decisions.

Data Lakes versus Data Warehouses
You have to approach this decision with facts and requirements. Depending on the needs of the business, you might need a data warehouse, a data lake, or even both. Let the needs of the business, and the data the business collects, drive the decision organically. Before we go any further, let's define each of these.

A data warehouse is a system that pulls together data from many different sources within the business, usually the transactional systems the business needs to conduct day-to-day operations. This data is collected for reporting and analysis.

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. By allowing the data to remain in its native format, more data can be stored and it can be ingested in a more timely manner, which gives the business quicker insight into what is going on.

Let's compare data warehouses to data lakes side by side to help you see which is best for your business's analytics needs.

  • Schema: Data Lake – written at the time of analysis. Data Warehouse – architected before the data is loaded.
  • Data: Data Lake – both non-relational and relational data from web sites, social media, IoT, and business applications. Data Warehouse – transactional systems and operational databases.
  • Performance and price: Data Lake – low-cost storage, and queries are getting faster. Data Warehouse – higher-cost storage and fast queries.
  • Data quality: Data Lake – raw data that may or may not be curated. Data Warehouse – highly curated data that serves as a Single Source of Truth.
  • Users: Data Lake – data scientists, business analysts, and data developers. Data Warehouse – business analysts.
  • Analytics: Data Lake – predictive analytics, data discovery, and machine learning. Data Warehouse – Business Intelligence and batch reporting.
Slow SQL

Speeding up Slow Queries inside of Postgres

If your tables are large and you are suffering from slow queries, the number one place to look is at the indexes on those tables; they are usually the culprit. There are several tools that can help you investigate and dissect the problems you are having with your queries in your Postgres database. Other databases have similar tools, but in this blog we will focus only on the Postgres components.

PG_STAT_STATEMENTS
The pg_stat_statements extension is an awesome place to start: it not only gathers statistics on the SQL being executed against your database, it also makes it easy to identify the slow-performing SQL statements.

Once you have this extension installed, a system view named pg_stat_statements will be available with all kinds of great information. Once it has had a chance to consume a good amount of data, look for queries that have a relatively high total_time value.

Example: SELECT * FROM pg_stat_statements ORDER BY total_time DESC.
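As a slightly fuller sketch (note that the column names total_time and mean_time apply to PostgreSQL 12 and earlier; newer versions rename them to total_exec_time and mean_exec_time), you might run:

-- Requires pg_stat_statements in shared_preload_libraries and the extension created in the database.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Show the ten statements consuming the most total execution time.
SELECT query, calls, total_time, mean_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;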

AUTO_EXPLAIN
auto_explain is very useful for finding slow queries, and it has other advantages: it logs the actual execution plan and supports logging nested statements via the log_nested_statements option. Nested statements are statements that are executed inside a function, so if your application uses many functions, auto_explain is invaluable for getting detailed execution plans. The log_min_duration setting controls which query execution plans are logged, based on how long they run. For instance, if you set it to 1000, all SQL statements that run longer than one second will be logged.
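A minimal sketch of enabling it for the current session (it can also be preloaded for the whole server via shared_preload_libraries in postgresql.conf):

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 1000;      -- log plans for statements running longer than one second
SET auto_explain.log_nested_statements = on;   -- also log statements executed inside functions
SET auto_explain.log_analyze = on;             -- include actual run times, not just estimates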

Statistics Collector
The statistics collector is a great subsystem that gathers all sorts of performance statistics that are extremely helpful in analyzing your slow-performing SQL.

Turning the statistics collector on gives you tons of pg_stat_... views, which contain all kinds of handy information. In particular, I have found them extremely helpful for finding missing and unused indexes.
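For example, here are two hedged starting points using the pg_stat views (an index with zero scans is a candidate for removal, while a table with many sequential scans may be missing an index):

-- Indexes that have never been used since statistics were last reset.
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;

-- Tables that are mostly read with sequential scans.
SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;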

EXPLAIN
The EXPLAIN command is a great tool when it comes to tuning queries. It tells you exactly what is really going on within your SQL. To execute it, simply put the EXPLAIN command in front of your statement (e.g., EXPLAIN SELECT ...) and run it. PostgreSQL will show you the execution plan it would use.

When using EXPLAIN to improve the performance of your SQL, I suggest using the ANALYZE option (EXPLAIN ANALYZE), as it gives you more accurate results. The ANALYZE option actually executes the statement (rather than just estimating it) and then explains it. In a later blog I will go into great depth on the output of EXPLAIN.
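For instance, against a hypothetical orders table:

-- EXPLAIN shows the estimated plan; EXPLAIN ANALYZE runs the query and reports actual times and row counts.
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE customer_id = 42;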

Benefits

Benefits of Business Intelligence Visualization Tools

There are so many business intelligence visualization tools available these days; Tableau, Power BI, Qlik, and AIR Intel are just a few. These tools help users make better and more informed decisions about their business by depicting data as graphs or charts. There are many benefits to using a business intelligence visualization tool, and we are going to touch on a few:

• Easier to understand and quicker to act on: The human brain tends to process visual information far more easily than written information. Using a chart and/or graph to summarize complex data ensures faster comprehension of relationships than cluttered reports or spreadsheets.
• Interact with data: The greatest benefit of data visualization, in my opinion, is that it exposes changes in a timely manner. Unlike static charts, interactive data visualizations encourage users to explore and even manipulate the data to uncover other factors. Drilling into a chart to see the underlying data, which could be yet another chart or a table/grid of the raw data, helps the user see the data from the highest level to the lowest. For example, say we have a pie chart depicting counts of sales calls by region within a specific time frame. You can click on a region and see a bar chart in which each bar represents the number of calls each salesperson made in that region. That same user can then click on a specific salesperson's bar to see all the details behind the calls: who they called, when they called, how long each call was, comments from the call, and so on. This type of functionality allows the user to see not only how the sales team is doing overall but also who the best salespeople are and why they are successful. Are all the successful calls made within a certain time frame? The visualization tool lets you convey a story more easily.
• Creation of new discussions: Another advantage of data visualization is that it provides a ready means to tell stories from the data. Heat maps can show the development of product performance over time in multiple geographic areas, making it easier to see which products are performing very well and which are underperforming. With this functionality built into most visualization tools, users (executives, managers, and employees) can drill down into specific locations to see what is working and what is not, and pivot if needed.
• Communicate more effectively: Gone should be the days when you read an eight-to-ten-page document to decipher what occurred in your business by month and/or quarter. Now you can supply reports that distill complex data into simple outputs and have them automatically delivered to the people who should be reviewing the data. Not only do visualization tools allow you to communicate more effectively, I would also argue that the reports are delivered in a more timely manner.
• Absorb more information easily: Data visualization enables users to view and understand vast amounts of information about operational and business conditions. It allows decision makers to see connections between multi-dimensional data sets and provides new ways to interpret data through heat maps, bar charts, line charts, bubble charts, and other rich graphical representations. Businesses that use visual data analytics are more likely to find the information they are looking for, and to find it sooner than other companies do.

Above are just a few of the benefits of data visualization tools, and I am sure you can think of several more if you have played around with or used visualization tools.

Reading CSV and inserting data with Python

One of the easiest things you can do in Python is read a CSV file and insert that data into a Postgres database. In this blog I will show you how to open and read a CSV file and then loop through it, creating an insert statement for each row to move the data into the database.

Table structure in Postgres database
CREATE TABLE users
(
business CHARACTER VARYING(155),
firstname CHARACTER VARYING(75),
lastname CHARACTER VARYING(100),
title CHARACTER VARYING(50)
);

First, we import the libraries we need, then create a connection and a cursor
import csv
import sys
import psycopg2

conn = psycopg2.connect(host="HOSTNAME", database="DATABASE NAME", user="USER", password="PASSWORD")
cur = conn.cursor()

Second, let's create an insert statement
insert_stmt = "INSERT INTO users (business, firstname, lastname, title) VALUES (%s, %s, %s, %s)"

Third, we open the CSV file, read it, build insert statements, and insert the rows. Sample file lines:

"BUSINESS","FIRSTNAME","LASTNAME","TITLE"
"Jakes Landscape","Jake","Switch","President"
"Lamps etc","Robert","Jump","Sales"
"PMP Light Bulbs","David","Downes","Warehouse"

file_name = "PATH TO CSV FILE"
reader = csv.DictReader(open(file_name))
# Count how many records have been processed
recordcount = 0
for row in reader:
    try:
        cur.execute(insert_stmt, (row['BUSINESS'],
                                  row['FIRSTNAME'],
                                  row['LASTNAME'],
                                  row['TITLE']))
        conn.commit()
        recordcount = recordcount + 1
    except (Exception, psycopg2.DatabaseError) as error:
        # Print the database error to stderr and keep processing the remaining rows
        print(error, file=sys.stderr)

cur.close()
conn.close()
print('Processed ' + str(recordcount) + ' rows from ' + file_name)

It is as simple as that to read a CSV file and insert its data into a database.

Analyzing unstructured data

Before we start to talk about unstructured data (or unstructured information), let's define it. Wikipedia defines it as information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents. I think that is pretty spot on.

Today more than ever, businesses are collecting more and more unstructured data, whether from the vast social media sites or the enormous volume of emails being sent and received every day. There is a lot of great data within social media sites and emails that is being collected and desperately needs to be analyzed. This data can be stored in a relational database as well as a NoSQL database, but my suggestion is to store it where you are storing all your other data sources: your warehouse. For example, if you are using Salesforce's Pardot as your marketing tool, you will have all your business's campaign data, visitor data, and prospect data streaming from Pardot to your warehouse. Now say you created a post on Facebook, set up a campaign in Pardot, and pushed an email out to all your prospects with a link to that post. You know who opened that email and who clicked on the link. How important would it be for your business to know the sentiment of all the comments that prospects have left on the Facebook post? If you have not thought about that, trust me, it is powerful and it will make your data actionable.

One of the best ways to analyze this unstructured data is with Natural Language Processing (NLP). NLP is a form of artificial intelligence that focuses on analyzing human language to draw insights, create advertisements, and more. NLP is being used more and more and is driving many forms of Artificial Intelligence (AI). Think about it: you can decipher the sentiment of all the comments left on a post, and based on that sentiment you either need to pivot because the post has a negative sentiment around it or, better yet, you do not have to do anything because it has a positive sentiment around it. Not only is that information extremely important to your marketing teams, you will also know how your product's message is being received by the public.

There are several important data points you can get from NLP, including sentiment analysis, keyword extraction, syntax analysis, entity recognition, and topic modeling, to name a few. We are going to touch a little bit on each of these to show you not only how important the information can be for you and your business but also to make sure you have a general understanding of each topic. I use Amazon Comprehend, a natural language processing (NLP) service that uses machine learning to discover insights from text; it provides all of the above functionality and returns results in JSON format that you can either store in a database or display in real time inside an application.

Developing a data collection process and documenting

When building out your business intelligence solution, developing data collection processes and documenting them is a step that is critical to your business and its success. Why develop a data collection process? Not only will it standardize the way data is collected across all the groups in your business, it will also keep the integrity of your data high. Processes ensure that your actions are repeatable, reproducible, accurate, and stable. Think about it: if an employee was collecting critical data on the business, the business had no idea how it was being collected, and that employee left, that would have a real impact on the business. Would the business be able to figure out how the employee was doing it, and after how long? At what cost? Could there be repercussions? Ensuring processes are in place for your data collection improves the likelihood that the data is secure, available, and clean, and that the measurements derived from it will help the business now and in the future.

There are many reasons why every business should document its data collection processes. If you do, your processes become transparent and your data remains comprehensible in the future, both for yourself and for others. Your documentation of each data collection process should include:

• A good description of the data that is being collected.
• All the answers: the who, what, why, where, and how of the data.
• Any conditions on the use of the data, as well as its confidentiality.
• Any history around the project for collecting this data.