DSBench: AI Data Science Benchmark Verified By Actual Data Scientists
So essentially,
DSBench is a new, more realistic benchmark for evaluating data science agents
Paper: DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? (57 Pages)
GitHub: https://github.com/LiqiangJing/DSBench
Researchers from UT Dallas, Tencent (Seattle) and USC introduce a new benchmark designed to evaluate the performance of data science agents, which are artificial intelligence systems designed to assist with data analysis and modeling.
Hmm..What’s the background?
Current data science benchmarks have limitations such as:
Instructions are brief and single-modal, while real-world tasks involve lengthy instructions and multiple modalities
Evaluations are incomplete, focusing on code completion or infilling capabilities rather than end-to-end performance
Evaluations are biased toward specific environments, while real-world tasks are tool-agnostic and data-centric
Ok, So what is proposed in the research paper?
DSBench addresses these limitations by including 466 data analysis tasks and 74 data modeling tasks, sourced from ModelOff and Kaggle competitions.
It provides a more realistic setting by including long contexts, multimodal task backgrounds, reasoning over large data files and multi-table structures, and end-to-end data modeling tasks.
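To make the "end-to-end" part concrete, here is a minimal sketch of what a Kaggle-style data modeling task looks like from the agent's point of view: go from raw files to a scored submission with no human in the loop. The file names, the `target` column, and the AUC metric are illustrative assumptions, not DSBench's actual harness.

```python
# Minimal sketch of an end-to-end data modeling task, in the spirit of DSBench's
# Kaggle-style setup. File names, the target column, and the metric are
# illustrative assumptions, not the benchmark's actual evaluation harness.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

train = pd.read_csv("train.csv")      # training data supplied with the task
test = pd.read_csv("test.csv")        # unlabeled test data
answers = pd.read_csv("answers.csv")  # held-out labels used only for scoring

features = [c for c in train.columns if c not in ("id", "target")]

# The agent is expected to go from raw files to a submission without human help.
model = GradientBoostingClassifier().fit(train[features], train["target"])
submission = pd.DataFrame({
    "id": test["id"],
    "target": model.predict_proba(test[features])[:, 1],
})
submission.to_csv("submission.csv", index=False)

# Score the submission against the held-out answers (AUC here, as an example).
print("AUC:", roc_auc_score(answers["target"], submission["target"]))
```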
Some notable findings from the evaluation of DSBench are:
Advanced language models perform better: For instance, GPT-4o demonstrates strong performance in general language tasks and achieves the highest accuracy among all vanilla model-only baselines on DSBench
Agent systems outperform vanilla models: Agent systems like AutoGen, which incorporate tools and interaction environments, generally outperform vanilla language models on DSBench's data analysis tasks
GPT-4o exhibits high accuracy, but at a cost: While GPT-4o achieves the best performance among the evaluated models, it incurs higher costs and longer inference times compared to models like GPT-3.5
Significant performance gap between AI and humans: Even the most advanced agent system evaluated on DSBench shows a considerable performance gap compared to human data scientists, highlighting the need for further advancements in data science agent technology
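One intuitive way to read that gap is to normalize an agent's score on a modeling task between a naive baseline and the best human entry. The paper defines its own relative performance metric; the helper below is only a hedged illustration of the idea, not the paper's formula.

```python
# Illustrative only: a simple relative-performance normalization for comparing
# an agent's score on a modeling task against a naive baseline and the best
# human entry. DSBench defines its own metric; this is not that exact formula.
def relative_performance(agent_score: float,
                         baseline_score: float,
                         best_human_score: float) -> float:
    """Return 0.0 when the agent only matches the naive baseline and 1.0 when
    it matches the best human submission (clamped to [0, 1])."""
    span = best_human_score - baseline_score
    if span <= 0:
        return 0.0
    ratio = (agent_score - baseline_score) / span
    return max(0.0, min(1.0, ratio))

# Example: an agent scoring 0.78 AUC where the baseline is 0.50 and the top
# human entry is 0.92 closes about two thirds of the gap.
print(relative_performance(0.78, 0.50, 0.92))  # ~0.667
```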
What’s next?
Future work could involve:
Expanding data sources beyond Kaggle
Exploring more sophisticated methods, such as agent-based approaches and tool-augmented language models
Developing techniques and mechanisms that mitigate common failure modes and improve the accuracy and reliability of data science agents
So essentially,
DSBench is a more realistic, end-to-end benchmark for data science tasks, and even the strongest agents evaluated on it still fall well short of human data scientists
Learned something new? Consider sharing with your friends!