One doesn’t need to be a data scientist to know that data/decision science projects are different from other digital initiatives. Their complexity cannot be quantified by the number of users, function points, or whether the implementation is bespoke or packaged. The challenges and opportunities of a data science project are purely a function of the use case at hand and the data to be dealt with. Manage these and you will have a successful data science project. Simple, isn’t it?
More often than not, the answer to this question is more nuanced than a ‘Yes’ or ‘No’. There are several things at play which need to be taken care of. Here is my list of 10 precepts for a data science project leader.
- Know your customer
- Data is a person not a tool
- Evaluation is a continuous process
- Bad results are good
- Business is more important than statistics
- Ensemble is better than the star
- Model is only as good as its implementation
- Don’t educate, communicate
- In this mine, dirt is valuable
- The end is not the end
1. Know your customer
Ok, I can already see a lot of eyes rolling; humor me for a minute. Obviously, there is no point in starting a data science project without knowing about the organization and the business function which is expected to use the generated insights. The ‘Customer’ in this precept is the person who will make or save money because of the insight.
Although this concept sounds very simple, it is deceptively complicated. At the start of the project several people will tell you things like ‘this insight will change my life’ or ‘this is going to bring so much predictability into our business’. However, ‘change in life’ and ‘predictability’ are subjective terms. As someone who is obsessed with numbers, a data scientist should strive to attach a money or time value to their insights. For example, before starting work on the project you should know how much money the logistics function will save from a 5% improvement in demand forecast accuracy. This not only makes the insight more attractive for business users but also gives the data scientist some handy pointers during EDA. Above all, it will make adoption of the insights, and the overall shift towards a data-driven organization, smoother.
2. Data is a person not a tool
Yes, you read that right: I believe that treating data like a block of words and numbers is not helpful. Data is not a tool with clearly delimited limitations and strong points; it’s a person. Its limitations can be improved or made irrelevant, and its strengths can be maximized. Just as you will never find a person who is 100% perfect for a job, you will never get data which is 100% perfect for a use case. Look at traits like reliability, consistency and volume; if they are at an acceptable level, you can move forward, and wherever possible try to improve its weaknesses. For example, market basket analysis at SKU level might give you useless results, like someone buying the same product in three different colors. In that case, move up to style level or collection level and perform the analysis there. Trying to force an insight at a specific level is pointless; sometimes a more useful insight is just one aggregation level away.
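The SKU-to-style example above can be sketched in a few lines. This is a hedged illustration: the `SKU_TO_STYLE` mapping and the basket contents are invented, and a real pipeline would pull the hierarchy from a product master.

```python
# Hedged sketch: re-keying market-basket transactions from SKU level to
# style level before mining associations. The mapping and baskets are
# made-up illustrations of the idea, not real data.

SKU_TO_STYLE = {
    "TSHIRT-RED-M": "TSHIRT", "TSHIRT-BLUE-M": "TSHIRT",
    "TSHIRT-GREEN-L": "TSHIRT", "JEANS-SLIM-32": "JEANS",
}

def to_style_level(basket):
    """Collapse a SKU-level basket into its unique styles."""
    return sorted({SKU_TO_STYLE[sku] for sku in basket})

# At SKU level this basket looks like three distinct items; at style level
# it is just one, which changes what co-occurrence analysis will see.
basket = ["TSHIRT-RED-M", "TSHIRT-BLUE-M", "TSHIRT-GREEN-L"]
print(to_style_level(basket))  # → ['TSHIRT']
```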
3. Evaluation is a continuous process
This is pretty much a known fact; however, I have often seen people focusing too much on end-state evaluation as compared to intermediate evaluation. For example, when calculating a forecast with cross-sectional parameters, too much emphasis is placed on adding more parameters to improve forecast accuracy. In many cases such exercises lead to overfitting. Make sure that cross-sectional parameters have a significant impact on the output: evaluate the strength of the correlation and keep an eye on it through multiple iterations. Again, as stated in the earlier precept, a better parameter might be just an aggregation level away.
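One way to make this intermediate evaluation concrete is to screen candidate parameters by correlation strength before they enter the model. A minimal sketch follows; the data, the candidate parameter names, and the 0.5 cut-off are all assumptions to be tuned per use case (and correlation is only one screen among several).

```python
# Hedged sketch: screening cross-sectional parameters by Pearson
# correlation with the target before modelling. Data are synthetic.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sales = [10, 12, 15, 18, 22]
candidates = {
    "promo_spend": [1, 2, 3, 4, 5],   # strongly related to sales
    "day_of_week": [3, 1, 4, 1, 5],   # essentially noise
}
THRESHOLD = 0.5  # assumed cut-off; tune for your use case

kept = {}
for name, values in candidates.items():
    r = pearson(values, sales)
    if abs(r) >= THRESHOLD:
        kept[name] = round(r, 2)
print(kept)  # only the strongly correlated parameter survives
```

Re-running this screen on each iteration keeps weak parameters from quietly accumulating and overfitting the model.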
4. Bad results are good
There is a tendency to discard results which did not provide the best outcome. Specifically, while performing hyperparameter tuning in AI algorithms, we tend to discard all results except the top one. However, it is always a good idea to check at least the top 10 results to see whether a significant change in more than one parameter has produced a good, albeit not the best, outcome. Such analysis helps you determine optimum hyperparameter ranges. Even in traditional ML algorithms, plotting results against parameter values yields useful insights.
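The top-N inspection described above can be sketched independently of any particular tuner: most tools can export their trials as a list of parameter/score records. The `results` list and its values below are invented for illustration.

```python
# Hedged sketch: instead of keeping only the best hyperparameter set,
# inspect the top few trials to spot ranges that work. The results list
# mimics what a tuner might return; all values are made up.

results = [
    {"lr": 0.10, "depth": 4, "score": 0.81},
    {"lr": 0.05, "depth": 6, "score": 0.84},
    {"lr": 0.05, "depth": 4, "score": 0.83},
    {"lr": 0.01, "depth": 8, "score": 0.78},
    {"lr": 0.05, "depth": 5, "score": 0.82},
]

# Sort trials by score and look at the range each parameter spans
# among the best performers, not just the single winner.
top = sorted(results, key=lambda r: r["score"], reverse=True)[:3]
lr_range = (min(r["lr"] for r in top), max(r["lr"] for r in top))
depth_range = (min(r["depth"] for r in top), max(r["depth"] for r in top))
print("Top-3 lr range:", lr_range)       # a narrow band suggests a sweet spot
print("Top-3 depth range:", depth_range)  # a wide band suggests insensitivity
```

Here the learning rate pins itself to one value across the top trials while depth barely matters, which is exactly the kind of range information a single "best result" hides.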
5. Business is more important than statistics
One of the biggest reasons why data science doesn’t become a strategic organization-wide initiative is that it often generates obscure results which business users find inexplicable. As a data scientist, statistical rigor is important, and all insights generated by statistical analysis need to be carefully analyzed. However, we should try to provide a causal explanation for the results of statistical analysis. Only then will users be willing to accept our insights.
This doesn’t mean we should not present results which do not make direct business sense. It just means we should dig deeper into the data for such results and check whether they can be linked to some causal factor or are just statistical anomalies.
6. Ensemble is better than the star
This one is more for the young ones. In their zeal to use cutting-edge algorithms and provide sophisticated solutions, young data scientists tend to focus too much on a single algorithm because it is used or recommended by someone important. But remember: your data, use case and application might be completely different. Try out even the simplest of algorithms. Implementing a multi-layer perceptron will look great on a resume, but sometimes the best results are generated by a combination (better known as an ensemble) of simpler algorithms like SVM and good old logistic regression.
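The mechanics of a soft-voting ensemble are simple enough to sketch without any library. In the hedged illustration below, the two "models" are hard-coded stand-in probability functions (placeholders for, say, a fitted SVM and a logistic regression); in practice you would average the real predicted class probabilities of fitted models.

```python
# Hedged sketch: soft voting. model_a and model_b are stand-ins that
# return an assumed probability of class 1; real ensemble members would
# be fitted classifiers producing calibrated probabilities.

def model_a(x):  # stand-in for, e.g., an SVM's P(class=1)
    return 0.9 if x > 0.5 else 0.2

def model_b(x):  # stand-in for, e.g., a logistic regression's P(class=1)
    return 0.7 if x > 0.4 else 0.3

def soft_vote(x, threshold=0.5):
    """Average the members' class-1 probabilities, then threshold."""
    p = (model_a(x) + model_b(x)) / 2
    return 1 if p >= threshold else 0

print([soft_vote(x) for x in (0.1, 0.45, 0.9)])  # → [0, 0, 1]
```

Note how the ensemble can disagree with either member alone: at x = 0.45 the two models split, and the averaged probability settles the call.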
7. Model is only as good as its implementation
Data science projects can be of two types: one-time insights, or steady-state insight generation. However, most projects are handled like one-time insights, and not much attention is given to generating these insights on a periodic basis. The initial euphoria of implementing a DS project quickly fizzles out when new insights are not reliably available. Adequate attention must be paid to ‘productionizing’ the algorithm along with building it; that goes a long way in building confidence in the robustness of the generated insights.
Adequate care should also be taken to ensure that the model does not become stale. Periodic tuning of hyperparameters in AI algorithms should be automated, and for ML algorithms a suitable mechanism should be put in place to keep the model up to date.
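A staleness check like the one described can be as simple as comparing recent performance against the accuracy measured at deployment. This is a minimal sketch; the metric, window, and 5-point tolerance are all assumptions, and production systems usually add proper drift detection on the inputs as well.

```python
# Hedged sketch: flag a model for retraining when its recent average
# accuracy falls more than `tolerance` below the deployment baseline.
# All numbers are illustrative.

def needs_retraining(baseline_acc, recent_accs, tolerance=0.05):
    """True when the recent average accuracy has slipped beyond tolerance."""
    recent_avg = sum(recent_accs) / len(recent_accs)
    return recent_avg < baseline_acc - tolerance

print(needs_retraining(0.90, [0.89, 0.88, 0.90]))  # → False: still healthy
print(needs_retraining(0.90, [0.83, 0.82, 0.84]))  # → True: model is stale
```

Wiring a check like this into the scheduled scoring job is one cheap way to make "keep the model up to date" an automated property rather than a manual chore.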
8. Don’t educate, communicate
As any seasoned data scientist will tell you, storyboarding is an essential skill. Communicating statistical insights in a manner which appeals to business stakeholders is an art. It is critical that we are not bogged down by an overload of statistical jargon and complex insights. Stay focused on the story and the business. As discussed in the first precept, distil the insights into business terms (like $ value, transactions, time saved, etc.). State business benefits upfront rather than spending time explaining methodologies and approach; that discussion can come later. Most importantly, as said in an earlier precept, business is more important than statistics.
Remember, once business users are convinced of the value your insights will add, they will be encouraged to know more. Only then does it make sense to provide technical and statistical details.
9. In this mine, dirt is valuable
We have already established that exploratory data analysis and intermediate evaluation of all variables are important. It is like sifting through data until we find something of value. Along the way there are often several intermediate results and exploratory insights which are quite important for business. For example, while analyzing cross-sectional parameters for sales forecasting, you will find what impact those parameters have on sales. Such demand elasticity graphs can be quite valuable to business leaders. It is always a good idea to present such findings during storyboarding; they add further value to your final insights and give the audience more confidence in the capabilities of data science.
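The demand-elasticity by-product mentioned above has a standard formula behind it. Below is a minimal sketch using the midpoint (arc) elasticity formula; the price and sales figures are invented to illustrate the calculation.

```python
# Hedged sketch: arc (midpoint) price elasticity of demand from two
# observations — the kind of intermediate finding that surfaces while
# analysing cross-sectional parameters. Data points are invented.

def arc_elasticity(p1, q1, p2, q2):
    """Midpoint formula: % change in quantity / % change in price,
    with each % change taken relative to the midpoint of the two values."""
    pct_dq = (q2 - q1) / ((q1 + q2) / 2)
    pct_dp = (p2 - p1) / ((p1 + p2) / 2)
    return pct_dq / pct_dp

# Price rises from 10 to 12 while weekly sales fall from 100 to 80 units.
e = arc_elasticity(10, 100, 12, 80)
print(f"elasticity ≈ {e:.2f}")  # → elasticity ≈ -1.22 (demand is elastic)
```

A magnitude above 1, as here, tells the business leader that demand reacts more than proportionally to price, which is exactly the kind of side finding worth a slide in the storyboard.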
10. The end is not the end
Lastly, remember to provide a roadmap for future data science projects. Although everyone does this for obvious reasons, the roadmap should be articulated with specific business benefits. It can be about improving the quality of the current insight, or about a different use case altogether. Provide pointers on why you think that use case is worth exploring, and use your knowledge of the customer’s business and their data to build a compelling case. And remember, if that use case does make it to the anvil, start from precept 1.
These are my 10 precepts, what are yours?