Interview with a Head of AI: Pawel Godula

Written by neptuneAI_jakub | Published 2020/06/06
Tech Story Tags: artificial-intelligence | machine-learning | data-science | ai-trends | pawel-godula | ai | interview | mba

TLDR: An interview with Pawel Godula, director of customer analytics at deepsense.ai, conducted by Jakub Czakon, a senior data scientist building experiment tracking tools for ML projects at https://neptune.ai. Pawel argues that the most efficient size for a machine learning or artificial intelligence team is 5 to 10 people per project. His current project needs to deliver 5 million point forecasts per second at 100 milliseconds latency, and a couple of unintuitive moves improved performance a lot, such as going from one big LightGBM model to 20 small LightGBM models.

A few weeks ago I had a chance to interview an amazing person and total rockstar when it comes to modeling and understanding customer data: Pawel Godula, director of customer analytics at deepsense.ai.
We’ll talk about:
  • How to build models that are robust to change
  • How to become a leader in a technical organization
  • How to focus on the “right” questions
  • Why model ensembling can be more important in real-life than in competitions
  • and way more!
Check out the full video of our conversation:
What follows is not a 1–1 transcript but rather a cleaned-up, structured, and rephrased version of it.
You can watch the video to get the raw content.
What is your role right now?
I am a director of customer analytics at deepsense.ai and we do projects related to forecasting in the area of customer behavior.
For example, we do things like:
  • whether a customer will repay their credit or not
  • what kind of product we should recommend to this client
  • what the customer lifetime value is, to see if it makes sense to reach out to them or not
Things that are mostly related to real customer data.
The data that we are working with typically has the shape of events and it is collected online. Think of data from e-commerce websites: you have events, out of these events you do feature engineering, and then you build some models.
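To make that event-to-features step concrete, here is a minimal sketch in pandas. The event log, its column names (`user_id`, `event_type`, `timestamp`), and the aggregations are illustrative assumptions, not the actual pipeline used at deepsense.ai.

```python
import pandas as pd

# Hypothetical clickstream event log; the column names and values are
# illustrative assumptions, not the real data discussed in the interview.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "event_type": ["view", "add_to_cart", "purchase", "view", "view", "purchase"],
    "timestamp": pd.to_datetime([
        "2020-05-01 10:00", "2020-05-01 10:05", "2020-05-02 09:00",
        "2020-05-03 12:00", "2020-05-04 12:30", "2020-05-05 08:00",
    ]),
})

# Roll raw events up into one row of behavioral features per customer.
features = events.groupby("user_id").agg(
    n_events=("event_type", "size"),
    n_purchases=("event_type", lambda s: (s == "purchase").sum()),
    active_days=("timestamp", lambda t: t.dt.normalize().nunique()),
    last_seen=("timestamp", "max"),
)
features["recency_days"] = (events["timestamp"].max() - features["last_seen"]).dt.days
print(features)
```

The idea is simply that raw events get aggregated into per-customer behavioral features, which then feed a classification or forecasting model.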

How is your team structured?

I think that typically the most efficient size for a machine learning or artificial intelligence team is between 5 to 10 people per project. With 5 to 10 people, you can easily tackle most of the important aspects of an AI project. With more people, it becomes difficult to manage. There is a lot of overhead in the communication cost.
Since we actually have many projects going on, we need to scale that team size. That is also true for larger companies where ML modeling is at the core of their business. From my experience, their (core) modeling team is around 5 to 10 people. So 5 to 10 people for a single task should always be enough to solve it.
5 to 10 people for a single task should always be enough to solve it
What is your current project?
Right now we are working on projects for a global ad network. This project is very difficult in terms of deployment and production needs because we need to deliver 5 million point forecasts per second, at 100 milliseconds latency.
It's a very difficult project because for every customer and every product that they want to advertise via our clients' network we need to create those forecasts. Multiply the number of customers by the number of products, and the number of predictions our model needs to generate is just huge. Forecasts need to be delivered every second, online, so yes, it is a tough one.
This project has actually taught me a lot in terms of production in a large scale forecasting system. We made a couple of unintuitive moves that improved the performance a lot. For example:
  • we went from one big LightGBM model to 20 small LightGBM models,
  • we used large (500 trees predicting 20 classes) and small (50 trees predicting 2 classes) models.
Those models differ basically in the data that was used to train them or in the random seed.
The problem that we are facing requires us to stabilize the results and averaging predictions across 20 models is much better than taking one prediction from one big model.
Because when you are averaging predictions from 20 models, what you can also do is look at the standard deviation of those predictions. If all models vote positively for a given product, it means you should show it to the customer. If, say, five models vote positively and 15 vote negatively, then you have a problem. The good thing is that we can take advantage of model ensembling.
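Below is a minimal sketch of that averaging-plus-spread idea using the lightgbm and scikit-learn Python packages. The toy data, the 20-model split, the tree counts, and the decision thresholds are all illustrative assumptions; the real system described here is far larger and latency-constrained.

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification

# Toy stand-in for the real ad-network data; sizes, parameters, and thresholds
# below are illustrative assumptions only.
X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, y_train, X_new = X[:4000], y[:4000], X[4000:]

# 20 small models that differ by random seed and by the subsample of rows they see.
rng = np.random.default_rng(0)
models = []
for seed in range(20):
    idx = rng.choice(len(X_train), size=int(0.8 * len(X_train)), replace=False)
    model = lgb.LGBMClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train[idx], y_train[idx])
    models.append(model)

# Averaging stabilizes the point forecast; the spread is a cheap disagreement signal.
preds = np.stack([m.predict_proba(X_new)[:, 1] for m in models])  # shape: (20, n_new)
mean_pred = preds.mean(axis=0)
std_pred = preds.std(axis=0)

# Example decision rule: only show the product when the ensemble agrees.
confident_positive = (mean_pred > 0.5) & (std_pred < 0.1)
```

The point is that the ensemble gives you both a stabilized point forecast (the mean) and a disagreement signal (the standard deviation) that a single large model does not.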
I know that forecasting on large datasets where you need low latency doesn’t seem like a typical scenario for model ensembling but that’s why I’m saying it is not intuitive. In such a production scenario, you would typically expect the model to be as simple as possible.
I think that in the production setting, model ensembling is much more important than, for example, in competitions. Typically people claim that you only use those monster ensembles to win competitions but will never use them in real life. I say that in real life it is much more important than in competitions for a very simple reason: you get a lot of non-stationarity.
In real life, model ensembling is much more important than in competitions for a very simple reason: you get a lot of non-stationarity.
Non-stationarity is common in real life, and it is something you don't experience in competitions. In non-stationary problems, having a diverse group of models helps a lot.
We were actually spending a couple of hundred thousand dollars on Amazon every month, but we decided to move from one model to 20 models because the performance improvement was so big.

Speaking of competitions you are a very active Kaggler. Just recently you had some nice success right?

Yeah, I’ve just won my first solo gold medal and it’s a dream come true. Thank you so much. Yeah, so, actually, I came to data science and to machine learning and artificial intelligence from competitions.
I started competing and then I realized it is such a great area to be in and an area that can have a tremendous impact on our lives. So I decided to basically change my life entirely and start coding, doing data science and machine learning.
I started as a business consultant; I worked at the Boston Consulting Group. It is a great company and I worked there for six years or so.
Then they were generous enough to send me for an MBA program. I chose INSEAD in Singapore and that was one of the best things that happened to me in my entire life. This one year, when you come back to university, you have some time to think about what you actually would like to do with your life.
I'm a huge evangelist of MBA and having that gap year to reflect. It was during my MBA program that I discovered machine learning and wrote my first line of code. This is how my data science career started.
I’m a huge evangelist of MBA and having that gap year to reflect.
After the MBA I went back to BCG, BCG Gamma to be exact. BCG Gamma is a sister company of BCG focused purely on AI. All big consulting companies created those. It was a great company led by fantastic people, with an amazing guy, Sylvain Duranton, running it. So they opened an office in Warsaw, they invited me, and eventually I was leading this office. I think consulting, especially there at BCG, was a great place to be.
But after some time, I just wanted to try something different and I went to Netsprint.
It is one of the largest data ecosystems in Poland and also a great company with a fantastic management and leadership team. I went there to be a chief data scientist.
Some of the major use cases that we developed were around predicting the behavior of internet users based on their past history that was stored in cookies. That was prior to GDPR in Europe, which by the way would change the situation completely.
For example, we developed a website embedding system. Sites visited by similar people were close to each other in the embedding space, and based on this information our system would predict whether the websites visited by you were male- or female-biased. For advertisers this is actually a very valuable piece of information.
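One common way to build that kind of embedding is to treat each cookie's browsing history as a "sentence" of sites and train a word2vec-style model on those sequences. The sketch below uses gensim with made-up session data; it illustrates the general technique, not the actual system built at Netsprint.

```python
from gensim.models import Word2Vec

# Made-up browsing sessions: each "sentence" is the ordered list of sites one cookie visited.
sessions = [
    ["sport-news.pl", "football-scores.com", "car-parts.pl"],
    ["fashion-blog.pl", "makeup-shop.com", "parenting-forum.pl"],
    ["sport-news.pl", "car-parts.pl", "tech-reviews.com"],
    ["makeup-shop.com", "fashion-blog.pl", "recipe-site.pl"],
]

# Sites visited by similar users end up close to each other in the embedding space.
model = Word2Vec(sentences=sessions, vector_size=32, window=5, min_count=1, sg=1, epochs=50)

# Nearest neighbours of a site hint at its audience profile (e.g. male- or female-biased).
print(model.wv.most_similar("sport-news.pl", topn=3))
```

Nearest neighbours in the resulting space group sites with similar audiences, which is what makes audience attributes like gender bias learnable from co-visitation alone.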
Then GDPR happened and the situation totally changed. I moved to deepsense.ai where I am right now. I’m a director of customer analytics.

What does your typical day look like?

One big challenge that every technical manager has to deal with is balancing the amount of time spent on coding against the amount of time spent on managing people.
It is very easy to fall into the role of organizer/facilitator because you have your team to do your work for you. At the same time, it is highly unsatisfying to do just emailing, stand-ups, and status meetings.
What drives me most at my work is actually problem solving and it is most attractive when I can also participate in coding. That is a big challenge that I see not only in my case but for many people who grow in organizations as good programmers and developers or scientists and they start leading teams.
I was fortunate enough because at deepsense.ai we have really a great team of people who require very little hand-holding and we can limit meetings to the ones where we discuss objectives and coordinate between team members.
Once the objective is set they use all the means necessary to realize that objective. It is one of the great things about working at deepsense.ai: the management is very, very efficient.
When it comes to the percentage split, I'd say 20 to 30% of my time goes to formulating the problem with the client and getting client feedback, another 30% to working with the team, and another 40% to my own coding and my own work.
I'd say 20 to 30% goes to formulating the problem with the client and getting client feedback, another 30% to working with the team, and another 40% to my own coding and my own work.
It takes conscious effort to get to 40% there as it can easily drop to 5 or 10%.
When that happens you become unhappy as a coder and nothing good comes from being unhappy. It requires some discipline to organize your meetings, your stand-ups, and the way you talk to your clients but it is doable.

What surprised you about the role of Director of customer analytics and other “Head of AI”-type roles you had?

I would say it has always been surprising to me how much can go wrong due to a lack of coordination between various teams working on the same project.
For example, we were doing a demand forecasting model for an e-commerce website. Our features were based on customer reactions to the product on the website. Then the design of the website was changed but we didn’t know about it. Our features started to mean something entirely different and our predictions went wild for a very short period of time until we realized what happened. It always amazes me how little things and even a tiny lack of coordination can lead to shocking results.
It always amazes me how little things and even a tiny lack of coordination can lead to shocking results. To deal with those problems I always try to have an entire team in one room.
For example, the problem formulation is one of the key things that a machine learning team leader or chief data scientist should know how to do. It requires a lot of experience because it means that you should be able to know intuitively, what kind of technology to use for a given task.
Let's say we have a project where we want to personalize an e-commerce website with ML systems. One approach could be to use traditional classification models and the other could use contextual bandits (a reinforcement learning method).
The technology choice at the very beginning has a huge impact on what kind of issues you’re going to be dealing with later on.
The chief data scientist should really know very well or at least know intuitively which approaches to try. I would say that is easily one of the most important skills.
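As a toy illustration of why the framing matters, here is a minimal epsilon-greedy contextual bandit in plain NumPy. Everything here (the number of items, the context features, the reward simulation, the learning rate) is a made-up assumption for illustration; it is not how either approach would be implemented in a real personalization project.

```python
import numpy as np

rng = np.random.default_rng(42)
n_items = 3        # candidate products/layouts to personalize with (assumption)
context_dim = 4    # toy user-context features (assumption)
epsilon = 0.1      # exploration rate

# One linear reward model per item, updated online from observed feedback.
weights = np.zeros((n_items, context_dim))

def choose_item(context):
    """Epsilon-greedy: mostly exploit current estimates, sometimes explore."""
    if rng.random() < epsilon:
        return int(rng.integers(n_items))
    return int(np.argmax(weights @ context))

def update(item, context, reward, lr=0.1):
    """Stochastic-gradient update for the chosen item's reward model only."""
    error = reward - weights[item] @ context
    weights[item] += lr * error * context

# Simulated interaction loop: we only observe feedback for the item we showed,
# which is the key difference from training a supervised classifier on full labels.
for _ in range(1000):
    context = rng.normal(size=context_dim)
    item = choose_item(context)
    true_click_prob = 1.0 / (1.0 + np.exp(-context[item]))  # toy ground truth
    reward = float(rng.random() < true_click_prob)
    update(item, context, reward)

print(weights)
```

Unlike a supervised classifier trained on a fixed labeled dataset, the bandit only sees feedback for the item it chose to show, so exploration, logging, and evaluation all have to be designed in from the start. That is exactly the kind of consequence the initial technology choice commits you to.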
Then, of course, in any machine learning project you need to choose where you and your team should devote your time:
  • finding the best algorithm,
  • looking for much more data,
  • feature engineering.
So, again, it comes from experience, but almost always new, good-quality data has more impact on your results than using a different algorithm.
I remember a couple of years back, there was this bank in Poland that started using the transactional data from the largest e-commerce website in Poland to give a credit score to young entrepreneurs. This transactional data is typically a very valuable addition and they had huge success with it.
Why? Because by adding a new data source that their competition was not using they were able to price the risk segments that no other bank was able to price. They were the only ones willing to extend loans to these guys, and they were precise in risk estimation.
by adding a new data source that their competition was not using they were able to price the risk segments that no other bank was able to price.
Adding new data, looking for sources of new data, and even talking to potential partners that could give you the data is also within the role of a well-performing data scientist, chief data scientist in particular.
Operationally I tend to focus a lot on feature engineering. Extracting information from data is more important than choosing algorithms or tuning hyperparameters. Of course, model selection or hyperparameter tuning is also important, but rarely have I seen a big impact from these two sources.
The last point, which I think the data science community underestimates a lot but the people who actually apply models in real life appreciate, is the aspect of non-stationarity.
Non-stationarity is when something fundamental changes in the process of data generation. You need to make sure that your models are resilient to that.
I will give you an example. Once I had a very good discussion with a person leading consumer lending modeling for one of the largest Polish banks. He explained to me why he’s using linear regression instead of say lightGBM for example.
First let’s understand the position this guy is in:
  • he's training his model,
  • he’s putting this model into production,
  • his model decides who gets a loan and who does not
  • his models are making multi-million dollar decisions and he will know the results only after 6–12 months when people start having problems repaying the loans.
So he gets very light feedback (from the real world) and needs to make decisions now.
Under such circumstances what is very desirable for a machine learning model is responsiveness to sudden changes in the economy. He told me that if something sudden or something big changes in the economy, he pretty much feels how the linear regression model will behave but he does not have the same feeling about the lightGBM model.
I did not appreciate his comment until I ran into the same situation. Today we are doing a lot of modeling in a non-stationary data environment and we are using all those tricks to stabilize the outcomes. Using a 20-model ensemble is one way; using simpler models like linear regression is another.
I would say that this aspect of feeling how your solution will behave if something changed in the data generation process is really important in the context of the socio-economic domain. I feel that it is this kind of knowledge that you get after at least a couple of years of doing real-life modeling and experiencing some failures along the way.
Anyway, the example that I mentioned before with the change in the website design is a really good one as well. Suddenly your features mean something entirely different and the models output completely wrong predictions.
This is one of the curses of machine learning that our models tend to fail silently.
Another example of non-stationarity that I always give is hyperinflation. Imagine doing credit risk modeling in a hyperinflation environment. Typically, one of the most important features in a credit risk model is personal income, which is heavily affected by hyperinflation. If you trained your model two months ago and you use it today, it will be completely wrong.
To deal with non-stationarity you can use retraining, online learning, ensembling methods, or other things. All of those make things better, but they don't necessarily make things good either.
To deal with non-stationarity you can use retraining, online learning, ensembling methods, or other things. All of those make things better, but they don't necessarily make things good either.
They mitigate the effects of non stationarity but they don’t eliminate it.
Say you retrain your model every day, which is a good practice, and then something really big changes. It is only one day out of, say, 180 days in your typical training sample, and it will not have enough impact to make the right adjustment for the business context.
What I am saying is that it never hurts to retrain more frequently, but it is not everything.
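A minimal sketch of the rolling-window retraining idea, with recent samples up-weighted, is shown below. The window length, the half-life, and the LightGBM parameters are illustrative assumptions, and as the discussion above makes clear, this only mitigates non-stationarity rather than eliminating it.

```python
import numpy as np
import lightgbm as lgb

def retrain_on_window(X, y, timestamps, window_days=180, half_life_days=30):
    """Retrain on the last `window_days` of data, up-weighting recent samples.

    `timestamps` is assumed to be a numpy datetime64 array aligned with X and y;
    the window length and half-life are illustrative choices, not recommendations.
    """
    latest = timestamps.max()
    age_days = (latest - timestamps) / np.timedelta64(1, "D")
    mask = age_days <= window_days

    # Exponential decay: a sample `half_life_days` old counts half as much as today's.
    sample_weight = 0.5 ** (age_days[mask] / half_life_days)

    model = lgb.LGBMClassifier(n_estimators=100)
    model.fit(X[mask], y[mask], sample_weight=sample_weight)
    return model
```

Scheduling something like this daily gives you frequent retraining, and the exponential weights let a recent regime change influence the model more than its one-day share of the window would otherwise allow.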

How do you learn new things?

There are two sources of knowledge that I love.
The first one is competitions, where I get to learn new techniques. I do it every day. I try to find 15–30 minutes just to think about the problem, just to think about how I would approach it, and maybe just write one line of code. It never stops with one line of code, of course :).
The second one is good industry blogs. One blog that I could particularly recommend is the fast.ai blog run by Jeremy Howard.
In terms of books, I have one that I highly recommend.
It is really a very good book.
Other than that, I’d say the best experience comes from real-life projects. That is actually one area where I think there are still not enough opportunities that are exploited by the data science community.
I’d say the best experience comes from real-life projects.
For example, the amount of information that is out there on the internet, which you can scrape, analyze and gain insights from is huge. There is a big branch of companies and startups that are doing this exactly but there is still a lot of room there.
Recently, BlueDot issued a warning about the Wuhan virus on the 10th of December 2019, much earlier than everyone else.
So this is the area where you need to connect your model to data around you and by automating the analysis you can derive really interesting insights.

How do you focus your work on things that matter?

Asking good questions.
… and understanding that this is the reason why it is actually possible to come from an outside industry, learn coding at the age of 29, and then become the head of a technical team.
I admit my code is probably very clumsy, but I try to always focus on the question that I'm answering, the thing that brings value. I spend zero time on irrelevant issues. I think it is something that can give you a huge advantage as a data scientist or machine learning engineer.
I try to always focus on the question that I’m answering, the thing that brings value.

Not so much the technical excellence of your code but actually knowing where the value is that you bring.
So before coding, spend time thinking about:
  • what technology to apply,
  • what data you should have,
  • how to approach data engineering,
  • how to make your models resilient to most issues.
If you do that, you always answer the right question and even if your code is clumsy, or not technically great, there is still enormous value coming from your work.
I think this is actually a big hope for people coming from an outside industry into machine learning because this is the skill that is very natural to people who are used to solving business problems.

How important is a good data science process to you?

I think that having a good process of data analysis and modeling is actually very important.
Machine learning and AI is a very young field and there are not many established best practices
Machine learning and AI is a very young field and there are not many established best practices that are taught at university or in any book so I designed my process and I try to follow it strictly.
Of course, I deviate from this process all the time, but at least by following something I am making sure that all the important elements are covered.
I think it is equally important in real-life projects and competitions but let me give you a competition example.
I believe that when you want to win a Kaggle competition you should at least touch all sources of potential value:
  • Data exploration
  • Feature engineering
  • Model selection
  • Hyperparameter optimization
  • Ensembling
  • Thinking carefully about validation
If you follow all the steps, and you tap, let's say, 80% of the value coming from all sources, then you have a really robust solution.
I enjoy feature engineering a lot
I tend to focus on things that I like, but I think this is very natural for all of us. So I enjoy feature engineering a lot which is fine but forgetting about other parts of the process cost me dearly in the past.
I think you remember the Home Credit default competition from 2018. We lost first place because we put almost all our focus on feature engineering. We had the best features and great models but we completely forgot about ensembling, which is a huge source of value.
We were first on the public leaderboard but our models were not robust enough and we dropped from 1st to 5th place on the private leaderboard. That is why now I try to follow a very strict process to make sure that I exploit all the possible sources of value.
Also, it’s important to mention that my best ideas come when I am having a walk in the park with my wife but don’t tell her that. Ok, she already knows that I tend to think about machine learning when we spend time together.
my best ideas come when I am having a walk in the park with my wife
Sometimes you just need to take a step back and do something completely different: go to the park, go have a beer, talk to your friends and then the best ideas come.

What to do to get good at machine learning?

My running coach tells me that if you want to get in shape you need to be patient, your skill level comes with patience. I think it is the same in data science.
You need to spend a certain amount of time coding, experience failures and learn from them.
If you spend enough time learning and improving, the results will come.
When beginner data scientists ask me about Kaggle, I always tell them: forget about the results, focus on consistency and learning. If you spend enough time learning and improving, the results will come.

What are you learning right now?

My research right now is focused on two domains.
The first one is reinforcement learning. This is a technology that I believe has a huge potential going forward. I don't think I have explored it enough in the past.
reinforcement learning is a technology that I believe has a huge potential going forward.
The other one is actually the research I’m doing at INSEAD about unconscious biases, and how these biases impact our decisions.
It is very interesting to me because I’m both in the field of machine learning and human learning. Analyzing and seeing what are the similarities in the process of learning between humans and machines is fascinating and some of the conclusions that we have here are really interesting.
Analyzing and seeing what are the similarities in the process of learning between humans and machines is fascinating
For example, I think, the big topic now is diversity and inclusion in organizations. I think machines are much more advanced in diversity and inclusion than humans are. Why is that?
If you recall, in Kaggle competitions and sometimes in production situations, it is not a single model that wins but rather a team of diverse models. So for any machine learning algorithm, it would be obvious that you should have diverse models in your modeling stack, but for humans it is not.
Also, if you think about it, we do not analyze how teams perform together but rather look at individuals. We give individual performance bonuses, we hire based on individual performance, but actually having the best team of four is different from having the best four individuals.
having the best team of four is different from having the best four individuals.
It’s a completely different matter and it is because we cannot measure the team effects as such, and we end up in an environment where we measure what we can, individual performance. In recruitment, we consider this particular person, typically not in the context of a broader team, but rather as an individual.
For example, say you have the board of a very large company: four white males between 40 and 50 with Harvard degrees. You want to add a new member to the board and your recruitment process finds another person similar to the ones already on the board. He may be better than the other candidates as an individual, but if you think about the team, there could be much better candidates with entirely different backgrounds.
Diversity means derisking your “models” which is very obvious to everyone in the machine learning industry but is very difficult to grasp for humans just because we cannot measure it.
Diversity means derisking your “models” which is very obvious to everyone in the machine learning industry but is very difficult to grasp for humans just because we cannot measure it.
Sadly we don’t have a good solution for it right now. But I think it is a good question to ask, it is a good issue to spend some time on.

How can one become a head of AI?

I feel that the most important thing is to always do things that you like because it is very easy to be successful if you are doing stuff that you like, period.
it is very easy to be successful if you are doing stuff that you like
I strongly believe that some senior and very technical guys that love coding should just do coding and they will lead the team naturally with their coding skills.
The guys who like to talk to other people and can organize getting diverse sources of data, they should do that because they will bring a lot of value to the team and the team will follow.
That said, I think that career progression for technical people is an area where companies don't really know what the best approach is. Reconciling the need for very good programming skills with the need for managing people is just difficult.
I always advise, do what you like, do it very, very well, and your leadership will come from this.
Anyhow, I always advise, do what you like, do it very, very well, and your leadership will come from this.
All the training on management and dealing with people are secondary. Your presence as a chief data scientist or a team leader should naturally come from your true passion. People always follow people who are true in what they do.
Of course, there are some particular areas of expertise that you need which we already talked about like:
  • problem framing
  • data acquisition
  • dealing with non-stationarity
  • management and mentoring your younger colleagues
If you learn those on top of your true passion you will become a leader, naturally.
I need to mention that developers, including myself, tend to focus on advancing technical coding skills only. The higher the technical skills the better, of course, but improving your skills in other areas when you are already savvy technically can have a way bigger impact on your career.
improving your skills in other areas when you are already savvy technically can have a way bigger impact on your career.
Do you have some final thoughts?
I think AI so far, and this is a very broad reflection on AI, has had a tremendous impact on technology (computer vision, natural language processing), but the applications where it has a strong impact on real life are few and far between.
AI has had a tremendous impact on technology but the applications where it has a strong impact on real life are few and far between
I strongly encourage such applications and I believe education and healthcare are two sectors where the impacts could be the largest.
I encourage people to work there and think about how AI and machine learning could change the situation, because I like to think that machine learning and AI is the next big thing that will happen to society.
That said, I have not seen it so far. I have seen it work well in technology, but building tech for tech is not interesting, it should change our lives.
This article was originally posted on the Neptune blog. If you liked it, you may like it there :)
You can also find me tweeting @Neptune_ai or posting on LinkedIn about ML and Data Science stuff.
