
Dataset Split Calculator

The 80/10/10 split is standard, but it is wrong for small datasets — under 10,000 samples, go 70/15/15 so your validation set actually means something. For 1M+ samples, 98/1/1 works because even 1% gives you 10,000 test examples. This calculator gives exact sample counts for any ratio and dataset size, so you stop eyeballing and start getting reproducible splits.


By SplitGenius Team · Updated February 2026

To split a dataset for machine learning, divide total samples into train, validation, and test sets. The standard split is 80/10/10. For a 50,000-sample dataset: 40,000 train, 5,000 validation, 5,000 test. Smaller datasets should use 70/15/15 for more reliable evaluation. Enter your dataset size below.
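The arithmetic above can be sketched in a few lines of Python. This is a minimal helper, not any particular library's API; the convention of giving leftover samples to the training set is an assumption for illustration:

```python
def split_counts(n_samples, train=0.8, val=0.1, test=0.1):
    """Return exact (train, val, test) sample counts for a dataset.

    Validation and test counts are rounded to whole samples; any
    remainder goes to the training set so the three counts always
    sum to n_samples (an assumed convention, not a standard).
    """
    assert abs(train + val + test - 1.0) < 1e-9, "ratios must sum to 1"
    n_val = round(n_samples * val)
    n_test = round(n_samples * test)
    n_train = n_samples - n_val - n_test
    return n_train, n_val, n_test

print(split_counts(50_000))                   # (40000, 5000, 5000)
print(split_counts(10_000, 0.7, 0.15, 0.15))  # (7000, 1500, 1500)
```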


Dataset Split by Size — Sample Counts

Exact sample counts for common dataset sizes using the standard 80/10/10 train/validation/test split.

| Total Samples | Train (80%) | Validation (10%) | Test (10%) |
|---------------|-------------|------------------|------------|
| 1,000 | 800 | 100 | 100 |
| 5,000 | 4,000 | 500 | 500 |
| 10,000 | 8,000 | 1,000 | 1,000 |
| 50,000 | 40,000 | 5,000 | 5,000 |
| 100,000 | 80,000 | 10,000 | 10,000 |
| 1,000,000 | 800,000 | 100,000 | 100,000 |

How This Calculator Works

1

Enter Your Dataset

Provide the total sample count, split ratios (or pick a preset), and an optional random seed. Takes under 30 seconds.

2

Get Exact Counts

See an instant breakdown of train, validation, and test sample counts for your ratios.

3

Share & Reproduce

Copy a shareable link so collaborators can recreate the exact same split configuration.


How to Split a Dataset for Machine Learning

Every ML model needs three data splits: training data to learn patterns, validation data to tune hyperparameters, and test data to evaluate final performance. Getting the ratios wrong leads to overfitting, unreliable metrics, or wasted data.
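A three-way split can be sketched in plain Python. The function name and the default seed of 42 are illustrative; in practice libraries such as scikit-learn's `train_test_split` (called twice) do the same job:

```python
import random

def train_val_test_split(samples, train=0.8, val=0.1, seed=42):
    """Shuffle a list and cut it into train/validation/test parts.

    The test set takes whatever remains after train and val, so the
    three parts cover every sample exactly once. A fixed seed makes
    the split reproducible across runs.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

tr, va, te = train_val_test_split(range(1000))
print(len(tr), len(va), len(te))  # 800 100 100
```

Fixing the seed is what makes the split reproducible: rerunning with the same seed returns the same partition, while a different seed yields a different, equally valid one.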

Recommended Split Ratios by Dataset Size

| Dataset Size | Recommended Split | Why |
|--------------|-------------------|-----|
| < 1,000 | Use cross-validation | Too small for a fixed split — k-fold gives better estimates |
| 1,000 – 10,000 | 70 / 15 / 15 | Larger val/test sets for reliable evaluation |
| 10,000 – 100,000 | 80 / 10 / 10 | Standard split — enough data for all three sets |
| 100,000 – 1M | 90 / 5 / 5 | Even 5% gives 5K+ samples for evaluation |
| > 1M | 98 / 1 / 1 | 1% still gives 10K+ samples — maximize training data |
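The size bands in the table can be encoded as a small lookup helper (the function name and exact boundary handling are assumptions for illustration):

```python
def recommended_split(n_samples):
    """Suggest a (train, val, test) ratio for a dataset size,
    following the size bands in the table above. Below 1,000
    samples a fixed split is too noisy, so k-fold cross-validation
    is suggested instead.
    """
    if n_samples < 1_000:
        return "use k-fold cross-validation"
    if n_samples < 10_000:
        return (0.70, 0.15, 0.15)
    if n_samples < 100_000:
        return (0.80, 0.10, 0.10)
    if n_samples < 1_000_000:
        return (0.90, 0.05, 0.05)
    return (0.98, 0.01, 0.01)

print(recommended_split(50_000))     # (0.8, 0.1, 0.1)
print(recommended_split(5_000_000))  # (0.98, 0.01, 0.01)
```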

Stratified vs Random Splitting

Random splitting shuffles all samples and assigns them to splits randomly. Works well when your data is balanced and IID (independent and identically distributed).

Stratified splitting preserves the class distribution in each split. If your dataset is 90% class A and 10% class B, each split will maintain that 90/10 ratio. Always use stratified splits for imbalanced classification problems.
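A stratified split can be sketched by grouping samples per class and carving the test fraction off each group separately. The helper below is illustrative; in practice scikit-learn's `train_test_split(..., stratify=labels)` handles this:

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, test_frac=0.1, seed=42):
    """Split so every class appears in the test set at its overall
    frequency: shuffle within each class, then take test_frac of
    each class for the test set.
    """
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append((sample, label))
    rng = random.Random(seed)
    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# 90% class A / 10% class B: the imbalance is preserved in the test set
labels = ["A"] * 900 + ["B"] * 100
train, test = stratified_split(range(1000), labels)
print(len(test), sum(1 for _, y in test if y == "B"))  # 100 10
```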

Common Mistakes

  • Data leakage: Never normalize or preprocess data before splitting. Fit scalers on training data only, then apply to val/test.
  • Temporal leakage: For time-series data, split chronologically — don't shuffle. Future data must never appear in training.
  • Peeking at test data: Only evaluate on the test set once, at the very end. Use the validation set for all tuning decisions.
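The first bullet can be made concrete. Below, standardization statistics come from the training set only and are then applied unchanged to the test set; the `standardize` helper is a sketch, not a library function:

```python
def standardize(train, test):
    """Fit mean/std on the training set only, then apply the same
    transform to the test set: the ordering that prevents test
    statistics from leaking into preprocessing.
    """
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    std = var ** 0.5 or 1.0  # guard against a zero-variance feature

    def scale(xs):
        return [(x - mean) / std for x in xs]

    return scale(train), scale(test)

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0]  # an outlier that exists only in the test set
train_s, test_s = standardize(train, test)
print(round(test_s[0], 2))  # 6.71
```

Computing the mean and std over train and test combined would shift both statistics toward the outlier, quietly leaking test information into the model's inputs.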

For splitting amounts by ratio or percentage, use our ratio calculator or percentage split calculator.