We’re in a series on the “ Keys to Effective Data Science Projects ”. We’ve identified the question we want to solve , and made a preliminary pass at the data we need to answer that question . Next we brought in that data to a central location we can work with. We now want to explore that data.
This is a primary difference between Data Science and Business Intelligence (BI). In BI solutions we often find our source data, then alter it to fit into a desired aggregation, and then perform the aggregations to fit the type of queries the users will ask. This process is called “ETL” – for Extract, Transform and Load.
In Data Science, we *don’t* do ETL – we do ELT . We extract the data (the last step) load it, and then leave it alone (for now). We just extract and load it. Next it’s important to simply look at the data.
Now, if you’re a data professional, this will be one of the hardest steps you’ll ever do. We’re trained to look at some data, query it, and probably create some reports – we try to find meaning . And that’s not what we want to at this phase. What we need to do is to explore the data to simply find out more about what it is. This is a deceptively simple statement.
We’re not looking to find meaning in the data – we’re looking for the meaning of the data as a source. And we do that by opening the data, and documenting it. Just that.
Here are a few questions to get started that you can use to do that:
- What is the source of the data? Where did you get it?
- Why did you get it from there?
- How is it structured, or not structured?
- If it has “rows”, how many are there?
- If it has “columns”, how many are there?
- Are there any missing values? Where? How many? How many as a percentage of that column or those rows?
- What does each row represent, if you simply read it?
- What does each column represent, if you simply read it?
- Can you find out what it really means?How? Who would you check with?
- If it has numbers in it, are there any aggregations that seem to make sense? (sums, etc.)
- If it has numbers in it, are there any descriptive statistics that make sense? (average, standard deviation, minimum, maximum, etc.)
- If there are numbers, what are the distributions (quartiles, etc.)
- Are there more than one set of data?
- Are there any “natural” join methods between the data sets?
- Do we have all the data that covers all the data points needed for analysis?
We simply cannot move forward with any analysis until you understand this data. We’re going to be using this data in a statistical sense, so the reliability of this data, the spread, the centrality, and the sizes you’re dealing with are vitally important.
So now – how to do this? There are lots of mechanisms you can use, from R to Python , from Azure ML to Excel . The technology is not actually that important – it’s more that you answer the above questions (and many others).
You will, of course, need to document all this. Personally, I’m using the Azure Data Catalog , but the bigger point is that you do that. We’ll use this in the next series of steps.
So isn’t this just a standard part of the process? What makes this a “Key” thing to do? It’s because I find that this is often lacking in Data Science projects. When we test a Machine Learning model and it does not perform well, most of the time I simply go straight back to the source data process – and I often find the problem there. It’s Key because it’s vital.