Self supervised learning tabular data

11/7/2023

AutoInt ( Song et al., 2019) transforms high-dimensional data into low-dimensional space by using an embedding layer to reduce data sparsity. Grownet ( Badirli et al., 2020) uses a structure of gradient boosting by connecting multiple shallow trees for neural networks. NODE ( Popov et al., 2019) applied the ensemble method of oblivious decision tree to the neural network using differentiable trees with Entmax function ( Peters et al., 2019). It uses a masking structure within the encoder to increase the influence of meaningful variables and reduce the influence of variables without a significant effect on learning. TabNet ( Arik and Pfister, 2020) is a deep neural network model that reflects the feature selection characteristics of decision trees in a neural network. To compete with machine learning algorithms, many deep learning methods for tabular data are continuously proposed these days. Also, XGBoost has been discussed a lot more on Kaggle compared to deep learning ( Bansal, 2018). According to the XGBoost and LightGBM official GitHub page ( Chen and Guestrin, 2016 Microsoft, 2020), XGBoost and LightGBM took a top tier in several Kaggle competitions.

In terms of predicting tabular data, these GBDT-based models generally outperform deep learning methods. Also, tree-based ensemble models give feature importance value so we can identify which variables are important in prediction. These models give good performance in both regression and classification problems. However, there is no deep learning method known to have a structure that can capture the characteristics of the tabular data.Ĭurrently, the State-of-the-art model in predicting tabular data is often the ensemble model based on the gradient-boosted decision tree (GBDT) ( Friedman, 2001) such as XGBoost ( Chen and Guestrin, 2016), CatBoost ( Prokhorenkova et al., 2017), LightGBM ( Ke et al., 2017). Each row corresponds to each observation and each column corresponds to a variable or feature. Meanwhile, tabular data has a structure in the form of a table with rows and columns. So RNN is mainly used for sequential data such as text or audio data. In RNN, the output of the previous step is used as the input of the current step. CNN identifies the characteristics of image data through a convolution layer that extracts local features of the image and a pooling layer that reduces the dimension. However, in the case of predicting tabular data, deep learning is not yet performing as well as unstructured data.Īlthough it is not clear why deep learning methods perform not as well as the latest boosting methods, we believe that convolutional neural network (CNN) ( Krizhevsky et al., 2012) and recurrent neural network (RNN) ( Sherstinsky, 2021) perform well for some specific type of data because of the following reasons. A great advance in deep learning has been successfully made with good performance in problems dealing with unstructured data such as text, image, and audio data.

0 Comments

Author

Archives

Categories

Self supervised learning tabular data

Leave a Reply.