Incorrectly stated machine learning problem

When you are solving the wrong problem

I decided to check out some Kaggle competitions, and downloaded the data for the Shelter Animal Outcomes competition. The training data has this format:

The test set lacks the OutcomeType and OutcomeSubtype columns; these are the two columns that the trained model has to predict. When I looked at the unique values of these columns, I noticed something interesting:

I found that many outcomes (e.g. Aggressive, Behavior) could also be the reasons the animal ended up in the shelter in the first place. Following J. Pearl’s The Book of Why [1], we can draw a causal diagram that captures the causal dependencies:

By animal attributes I mean name, animal type, sex, breed, and color. I will not discuss the interdependencies between these attributes. Think of them as a cuteness factor: how likely humans are to be attracted to the animal. I included the name because it can also influence the adoption decision. This attractiveness plausibly influences both the shelter decision and the adoption decision.

Animal behavior also influences both the shelter decision and the adoption decision. Behavior in turn depends on the animal attributes, especially the breed, hence the arrow from the animal attributes to the behavior. Unfortunately, we cannot reason about this dependency from the data we have. The reason is simple: “Shelter” is a collider, and the data is implicitly conditioned on it. We have no data about animals in general, only about the ones that ended up in the shelter. Conditioning on a collider distorts the correlation between its causes, either amplifying or reducing it (an effect also known as “explaining away”), so whatever correlation we observe between Behavior and Animal Attributes in these data does not reflect the true dependency. We could reason about this dependency if we had a representative sample of all pets, whether they are in a shelter or not.
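To make the collider effect concrete, here is a minimal simulation sketch. It is not based on the Kaggle data: the variables `attractiveness` and `behavior_problems`, the selection rule, and the threshold are all made-up assumptions. The two traits are generated independently, yet a clear correlation appears once we look only at the animals that "ended up in the shelter".

```python
import numpy as np

def simulate_collider(n=100_000, seed=0):
    """Return (correlation among all pets, correlation among shelter pets)."""
    rng = np.random.default_rng(seed)

    # Hypothetical standardized traits, independent by construction.
    attractiveness = rng.normal(size=n)
    behavior_problems = rng.normal(size=n)

    # Assumed selection rule: less attractive or worse-behaved animals
    # are more likely to end up in the shelter (plus unrelated noise).
    shelter_score = -attractiveness + behavior_problems + rng.normal(size=n)
    in_shelter = shelter_score > 1.0

    corr_all = np.corrcoef(attractiveness, behavior_problems)[0, 1]
    corr_shelter = np.corrcoef(
        attractiveness[in_shelter], behavior_problems[in_shelter]
    )[0, 1]
    return corr_all, corr_shelter

corr_all, corr_shelter = simulate_collider()
print(f"all pets:     {corr_all:+.3f}")
print(f"shelter only: {corr_shelter:+.3f}")
```

In the full population the correlation is close to zero, while in the shelter-only subset it is clearly positive, even though neither trait influences the other in this toy setup. The same mechanism can just as well hide or shrink a dependency that really exists.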

Finally, it would be really interesting to have data on how long each animal spent in the shelter and to use survival analysis to predict it. The data has the animal's age upon outcome, which is an upper bound on the value we are looking for, but not a good proxy for it. Also, it makes me happy to see that most of the animals were either adopted or returned to their owners.
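As a rough illustration of the survival-analysis idea, here is a small Kaplan-Meier estimator written with NumPy only. The competition data does not contain time spent in the shelter, so the `days_in_shelter` and `left_shelter` values below are purely hypothetical; animals still in the shelter when the data was collected are treated as censored.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the probability of still being in the shelter.

    durations: time in the shelter for each animal
    observed:  True if the animal left the shelter, False if censored
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)

    # The estimate only changes at times where an outcome was observed.
    times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in times:
        at_risk = np.sum(durations >= t)                 # still in shelter just before t
        events = np.sum((durations == t) & observed)     # outcomes exactly at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return times, np.array(survival)

# Hypothetical toy data, not taken from the Kaggle dataset.
days_in_shelter = [1, 2, 2, 3, 4]
left_shelter = [True, True, False, True, False]

times, survival = kaplan_meier(days_in_shelter, left_shelter)
for t, s in zip(times, survival):
    print(f"day {t:.0f}: P(still in shelter) = {s:.2f}")
```

With real intake and outcome dates, a dedicated survival-analysis library would be a more robust choice than this hand-rolled version.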

References

[1] Judea Pearl and Dana Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, New York, NY, USA, 2018. ISBN 046509760X, 9780465097609.
