RedPill with TabReD
So essentially,
Yandex's TabReD established a new benchmark in tabular data analysis
Paper:
TabReD: A Benchmark of Tabular Machine Learning in-the-Wild (31 Pages)
Github:
https://github.com/yandex-research/tabred
Researchers from HSE University and Yandex introduce a new benchmark for tabular machine learning called TabReD (Tabular benchmark with Real-world industrial Datasets). It addresses a shortcoming of existing academic benchmarks: they do not reflect real-world industrial settings well.
Hmm.. What's the background?
The researchers argue that existing academic benchmarks underrepresent two key characteristics of industrial tabular data:
Temporally-evolving data: In real-world applications, tabular data often changes over time, impacting model performance.
Feature-rich datasets: Industrial applications frequently involve extensive data acquisition and feature engineering pipelines, resulting in datasets with a large number of predictive, uninformative, and correlated features.
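To make the first point concrete, here is a minimal sketch of a time-based split, where the model trains on older rows and is evaluated on newer ones instead of on a random shuffle. This is only an illustration and not code from the TabReD repository: the helper `time_based_split`, the `date` and `target` fields, and the 70/15/15 proportions are all assumptions made up for this example.

```python
from datetime import date, timedelta


def time_based_split(rows, time_key, train_frac=0.7, val_frac=0.15):
    """Split rows chronologically: oldest -> train, middle -> val, newest -> test.

    Hypothetical helper for illustration; the fractions are arbitrary defaults.
    """
    # Sort by timestamp so that every later partition is newer than the previous one.
    ordered = sorted(rows, key=lambda r: r[time_key])
    n = len(ordered)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


# Toy data (made up): one row per day, with a target that slowly drifts upward.
start = date(2023, 1, 1)
rows = [
    {"date": start + timedelta(days=i), "target": 10.0 + 0.05 * i}
    for i in range(100)
]

train, val, test = time_based_split(rows, time_key="date")

# Because of the drift, the newest rows look different from the training rows.
train_mean = sum(r["target"] for r in train) / len(train)
test_mean = sum(r["target"] for r in test) / len(test)
print(f"train: {len(train)} rows, mean target {train_mean:.2f}")
print(f"test:  {len(test)} rows, mean target {test_mean:.2f}")
```

With a random split, the train and test target means would be roughly equal, which hides the drift. The chronological split exposes the gap a deployed model would actually face.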
To bridge this gap, TabReD offers eight industry-grade tabular datasets encompassing a wide range of domains, from finance to food delivery.
OK, so what is proposed in the research paper?
TabReD comprises eight industry-grade tabular datasets from diverse domains. These datasets are sourced from Kaggle competitions and real-world production ML systems. The datasets cover a variety of tasks including:
Home Insurance: Predicting customer purchase of home insurance policies.
Ecommerce: Modeling customer loyalty by predicting redemption of discount offers in e-commerce.
Home Credit Default: Predicting loan default risk based on client financial data.
Sberbank Housing: Predicting property sale prices in the Moscow housing market.
Cooking Time: Predicting order preparation time for a food delivery service.
Delivery ETA: Estimating arrival time for online grocery orders.
Maps Routing: Predicting travel time in a car navigation system.
Weather: Forecasting weather temperature.
TabReD is a step towards more realistic, task-relevant benchmarks in tabular machine learning, encouraging the field to address real-world challenges using robust evaluation protocols.
What's next?
The researchers acknowledge limitations of TabReD, such as its bias towards specific types of industrial applications and the lack of detailed feature information. Future research directions include expanding TabReD to cover a wider range of domains and exploring other relevant research questions, such as continual learning and missing data imputation.
So essentially,
Yandex's TabReD established a new benchmark in tabular data analysis