First, congratulations. This is impressive, you should be very proud of yourself, and I hope this is the beginning of a long and fruitful data science career (or avocation) for you.
What is going on here?
I think the simplest explanation is that your model fit better because you trained on more data. You write that your best score was obtained by applying XGBoost to the entire feature matrix, without splitting it into train/test sets. So, assuming the other teams did things the standard way, you were fitting the model with 25%-40% more data. In many settings, particularly with tree-based methods like XGBoost, fitting without a held-out validation set is a recipe for overfitting. In this setting, however, the structure of the public test data was probably very close to the structure of the private test data, so the lack of validation on the public dataset paid off for you.
I think one interpretation of this is that you got lucky in that way. But I don’t think that’s the right takeaway. The right takeaway is that you kept your eye on the ball and chose a strategy that worked, based on your understanding of the data structure and the available methods, and you should be very satisfied.
Are you sure that this is the standard way in competitions? It is absolutely correct that, before the final submission, one would find the best model by fitting it on a train set and evaluating it on a test set. However, once you have found the best-performing model that way, there is no reason not to retrain it with the best parameters on the combined train+test data and submit that one. (A submission consists of the model's predictions on the competition's held-out set, not the model's parameters.) After all, more data generally means better performance.
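To make that workflow concrete, here is a minimal sketch in Python. The data, the candidate hyperparameter grid, and the variable names are all hypothetical stand-ins; only the scikit-learn and XGBoost calls are standard. The idea is simply: tune on an internal split, then refit with the chosen settings on all labelled data before predicting on the competition set.

```python
# Sketch of "tune on a split, then refit on everything" (hypothetical data).
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # stand-in for the full labelled feature matrix
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)
X_submit = rng.normal(size=(200, 20))      # stand-in for the competition's held-out set

# Step 1: pick hyperparameters using an internal train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
best_depth, best_acc = None, -np.inf
for depth in [3, 5, 7]:                    # hypothetical candidate values
    model = xgb.XGBClassifier(max_depth=depth, n_estimators=200, learning_rate=0.1)
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Step 2: refit with the chosen parameters on *all* labelled data,
# then submit the predictions on the competition's held-out set.
final_model = xgb.XGBClassifier(max_depth=best_depth, n_estimators=200, learning_rate=0.1)
final_model.fit(X, y)
submission = final_model.predict(X_submit)
```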