# 🎬 Movie Revenue Prediction – Full ML Pipeline
This project builds a complete machine learning workflow using real movie metadata.
It includes data cleaning, exploratory data analysis (EDA), feature engineering, clustering, visualization, regression models, classification models, and full performance evaluation.
## 🧪 Part 0 – Initial Research Questions (EDA)
Before any modeling, I asked a few basic questions about the dataset:
### 1️⃣ What is the relationship between budget and revenue?
- Hypothesis: higher budget → higher revenue.
- Result: A clear positive trend, but with many outliers. Big-budget movies tend to earn more, but not always.
### 2️⃣ Is there a strong relationship between runtime and revenue?
- Hypothesis: Longer movies might earn more.
- Result: No strong pattern. Most successful movies fall in a "normal" runtime range (around 90–150 minutes), but runtime alone does not explain revenue.
### 3️⃣ What are the most common original languages in the dataset?
- Result: English dominates by far as the main `original_language`, with a long tail of other languages (French, Spanish, Hindi, etc.).
These EDA steps helped build intuition before moving into modeling.
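A minimal sketch of how these checks can be reproduced with pandas and seaborn (assumes the Kaggle `movies_metadata.csv` layout; the exact plotting code in the project may differ):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("movies_metadata.csv", low_memory=False)
for col in ["budget", "revenue", "runtime"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Budget vs revenue: positive trend with many outliers
sns.scatterplot(data=df, x="budget", y="revenue", alpha=0.3)
plt.show()

# Runtime vs revenue: no strong pattern
sns.scatterplot(data=df, x="runtime", y="revenue", alpha=0.3)
plt.show()

# Most common original languages
print(df["original_language"].value_counts().head(10))
```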
## 🧪 Main ML Research Questions
### 1️⃣ Can we accurately predict a movie's revenue using metadata alone?
We test multiple regression models (Linear, Random Forest, Gradient Boosting) and evaluate how well different features explain revenue.
### 2️⃣ Which features have the strongest impact on movie revenue?
We explore the importance of:
- budget
- vote counts & vote average
- popularity
- profit & profit ratio
- release year & decade
- cluster-based features (cluster_group, distance_to_centroid)
### 3️⃣ Can we classify movies into "high revenue" vs. "low revenue" groups effectively?
We convert revenue into a balanced binary target and apply classification models.
### 4️⃣ Do clustering and unsupervised learning reveal meaningful structure in the dataset?
We use K-Means + PCA to explore hidden groups, outliers, and natural segmentation of movies.
## 🧱 Part 1 – Dataset & Basic Cleaning (Before Any Regression)
### 🔹 1. Loading the Data
- Dataset: `movies_metadata.csv` (from Kaggle)
- Target variable: `revenue` (continuous)
### 🔹 2. Basic Cleaning
- Converted string columns like `budget`, `revenue`, `runtime`, `popularity` to numeric.
- Parsed `release_date` as a datetime.
- Removed clearly invalid rows, such as:
  - `budget == 0`
  - `revenue == 0`
  - `runtime == 0`
This produced a smaller but more reliable dataset.
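A minimal sketch of this cleaning step, assuming the Kaggle CSV layout (the project's notebook may handle edge cases slightly differently):

```python
import pandas as pd

df = pd.read_csv("movies_metadata.csv", low_memory=False)

# Coerce string columns to numeric; unparseable values become NaN
for col in ["budget", "revenue", "runtime", "popularity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Parse release_date as a datetime
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")

# Drop rows with missing or clearly invalid values
df = df.dropna(subset=["budget", "revenue", "runtime"])
df = df[(df["budget"] > 0) & (df["revenue"] > 0) & (df["runtime"] > 0)]
```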
## 📊 Part 2 – Initial EDA (Before Any Model)
Key insights:
- Budget vs Revenue
- Runtime vs Revenue
- Original Language Distribution
These findings motivated the next steps: building a simple baseline model and then adding smarter features.
## 🧪 Part 3 – Baseline Regression (Before Feature Engineering)
### 🎯 Goal
Build a simple baseline model that predicts movie revenue using only a few basic features:
- `budget`
- `runtime`
- `vote_average`
- `vote_count`
### ⚙️ Model
- Linear Regression on the 4 basic features.
- Train/Test split: 80% train / 20% test.
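A minimal sketch of this baseline (assumes the cleaned DataFrame `df` from Part 1 with numeric `vote_average` / `vote_count`; `random_state=42` is an illustrative choice, not necessarily the one used):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

X = df[["budget", "runtime", "vote_average", "vote_count"]]
y = df["revenue"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R²  :", r2_score(y_test, pred))
```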
### 📊 Baseline Regression Results
Using only the basic features:
- MAE ≈ 45,652,741
- RMSE ≈ 79,524,121
- R² ≈ 0.715
📌 Interpretation:
- The model explains about 71.5% of the variance in revenue, which is quite strong for a first, simple model.
- However, the errors (tens of millions) show there is still a lot of noise and missing information, which is expected in movie revenue prediction.
This baseline serves as a reference point before introducing engineered features.
## 🧱 Part 4 – Feature Engineering (Upgrading the Dataset)
To improve model performance, several new features were engineered:
### 🔹 New Numeric Features
- `profit = revenue - budget`
- `profit_ratio = profit / budget`
- `overview_length` = length of the movie overview text
- `release_year` = year extracted from `release_date`
- `decade` = release year grouped by decade (e.g., 1980, 1990, 2000)
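A minimal sketch of these derived columns (assumes the cleaned `df`; missing overview text is treated as an empty string):

```python
# Profit and profit ratio (note: both are derived from revenue, the target)
df["profit"] = df["revenue"] - df["budget"]
df["profit_ratio"] = df["profit"] / df["budget"]

# Length of the overview text
df["overview_length"] = df["overview"].fillna("").str.len()

# Release year and decade
df["release_year"] = df["release_date"].dt.year
df["decade"] = (df["release_year"] // 10) * 10
```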
### 🔹 Categorical Encoding
- `adult` converted from `"True"`/`"False"` to `1`/`0`.
- `original_language` and `status` encoded using One-Hot Encoding (with `drop_first=True` to avoid the dummy variable trap).
### 🔹 Scaling Numerical Features
Used `StandardScaler` to standardize the numeric columns:
`budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`, `profit_ratio`, `overview_length`
Each feature was transformed to have:
- mean ≈ 0
- standard deviation ≈ 1
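A minimal sketch of the encoding and scaling steps (column names as listed above; the notebook's exact code may differ):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Binary encoding of the adult flag
df["adult"] = (df["adult"].astype(str) == "True").astype(int)

# One-Hot Encoding with drop_first=True to avoid the dummy variable trap
df = pd.get_dummies(df, columns=["original_language", "status"], drop_first=True)

# Standardize numeric columns to mean ~0 and std ~1
num_cols = ["budget", "runtime", "vote_average", "vote_count",
            "popularity", "profit", "profit_ratio", "overview_length"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```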
## 🧩 Part 5 – Clustering & PCA (Unsupervised Learning)
### 🔹 K-Means Clustering
- Features used: `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`
- Algorithm: K-Means with `n_clusters=4`.
- New feature: `cluster_group` – each movie assigned to one of 4 clusters.
Rough interpretation of clusters:
- Cluster 0 → low-budget, low-revenue films
- Cluster 1 → mid-range films
- Cluster 2 → big-budget / blockbuster-style movies
- Cluster 3 → more unusual / outlier-like cases
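A minimal sketch of the clustering step (assumes the scaled features from Part 4; `random_state=42` is an illustrative choice):

```python
from sklearn.cluster import KMeans

cluster_features = ["budget", "runtime", "vote_average",
                    "vote_count", "popularity", "profit"]

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df["cluster_group"] = kmeans.fit_predict(df[cluster_features])

print(df["cluster_group"].value_counts())
```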
### 🔹 PCA for Visualization
- Applied PCA (`n_components=2`) on `cluster_features` to reduce dimensionality.
- Created `pca1` and `pca2` for each movie.
- Plotted the movies in 2D using PCA, colored by `cluster_group`.
This allowed visual inspection of how well the clusters separate and which movies sit far from the rest.
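A minimal sketch of the 2D projection and plot (assumes `df`, `cluster_features`, and `cluster_group` from the K-Means step above):

```python
from sklearn.decomposition import PCA
import seaborn as sns
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
coords = pca.fit_transform(df[cluster_features])
df["pca1"], df["pca2"] = coords[:, 0], coords[:, 1]

sns.scatterplot(data=df, x="pca1", y="pca2",
                hue="cluster_group", palette="tab10", alpha=0.5)
plt.title("Movies in PCA space, colored by cluster")
plt.show()
```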
### 🔹 Distance to Centroid (Outlier Feature)
Computed `distance_to_centroid` for each movie: the Euclidean distance between the movie and its cluster center.
Interpretation:
- Small distance → the movie is "typical" for its cluster.
- Large distance → the movie is an outlier within its cluster.
This feature was later used as an additional signal for modeling.
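A minimal sketch of this outlier feature (assumes the fitted `kmeans` object and `cluster_features` from above):

```python
import numpy as np

# Center of the cluster each movie was assigned to
centers = kmeans.cluster_centers_[df["cluster_group"].to_numpy()]

# Euclidean distance from each movie to its own cluster center
df["distance_to_centroid"] = np.linalg.norm(
    df[cluster_features].to_numpy() - centers, axis=1
)
```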
## 🧱 Part 6 – Advanced Regression (With Engineered Features)
### 🎯 Goal
Use the engineered features + clustering-based features to improve regression performance.
### 🔹 Final Feature Set
Included:
- Base numeric: `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`
- Engineered: `profit`, `profit_ratio`, `overview_length`, `release_year`, `decade`
- Clustering: `cluster_group`, `distance_to_centroid`
- One-Hot columns: all `original_language_...` and `status_...` columns
### 🔹 Models Trained
- Linear Regression (on the enriched feature set)
- Random Forest Regressor
- Gradient Boosting Regressor
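A minimal sketch of training and comparing the three regressors (assumes the enriched `df` from the previous parts; hyperparameters are defaults, not necessarily those used in the notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

feature_cols = (
    ["budget", "runtime", "vote_average", "vote_count", "popularity",
     "profit", "profit_ratio", "overview_length", "release_year", "decade",
     "cluster_group", "distance_to_centroid"]
    + [c for c in df.columns if c.startswith(("original_language_", "status_"))]
)
X, y = df[feature_cols], df["revenue"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: "
          f"MAE={mean_absolute_error(y_test, pred):,.0f}  "
          f"RMSE={np.sqrt(mean_squared_error(y_test, pred)):,.0f}  "
          f"R²={r2_score(y_test, pred):.4f}")
```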
### 📊 Regression Results (With Engineered Features)
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Linear Regression | ~0 (leakage) | ~0 | 1.00 |
| Random Forest | 1,964,109 | 7,414,303 | 0.9975 |
| Gradient Boosting | 2,255,268 | 5,199,504 | 0.9988 |
📌 Note:
- The Linear Regression result is unrealistically perfect due to data leakage (features like `profit` are directly derived from `revenue`).
- The real, meaningful comparison is between Random Forest and Gradient Boosting.
## 🏆 Regression Winner
🥇 **Gradient Boosting Regressor**
- Highest R²
- Lowest RMSE
- Best at capturing non-linear relationships
## 🧱 Part 7 – Turning Regression into Classification
Instead of predicting the exact revenue, we converted the problem to a binary classification task:
- Class 0: revenue < median(revenue)
- Class 1: revenue ≥ median(revenue)
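A minimal sketch of the median split (the target column name `high_revenue` is illustrative):

```python
# 1 = high revenue (at or above the median), 0 = low revenue
median_revenue = df["revenue"].median()
df["high_revenue"] = (df["revenue"] >= median_revenue).astype(int)

print(df["high_revenue"].value_counts())
```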
### 📊 Class Balance
- Class 1 (high revenue): 2687
- Class 0 (low revenue): 2682
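A minimal sketch of the three classifiers and their metrics (reuses `feature_cols` from the Part 6 sketch and the `high_revenue` target above; hyperparameters are defaults, not necessarily those used in the notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

X, y = df[feature_cols], df["high_revenue"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(f"{name}: "
          f"acc={accuracy_score(y_test, pred):.3f}  "
          f"prec={precision_score(y_test, pred):.3f}  "
          f"rec={recall_score(y_test, pred):.3f}  "
          f"F1={f1_score(y_test, pred):.3f}")
```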
### 📊 Classification Results
#### Logistic Regression
- Accuracy: **0.977**
- Precision: **0.984**
- Recall: **0.968**
- F1: **0.976**
#### Random Forest
- Accuracy: **0.986**
- Precision: **0.988**
- Recall: **0.982**
- F1: **0.985**
#### Gradient Boosting Classifier
- Accuracy: **0.990**
- Precision: **0.990**
- Recall: **0.990**
- F1: **0.990**
---
## 🏆 Classification Winner
🥇 **Gradient Boosting Classifier**
- Highest accuracy
- Balanced precision & recall
- Best overall performance
---
## 🛠️ Tools Used
- Python
- pandas / numpy
- scikit-learn
- seaborn / matplotlib
- Google Colab
---
## 🎯 Final Summary
This project demonstrates a complete machine learning workflow:
- Data preprocessing
- Feature engineering
- K-Means clustering
- PCA visualization
- Regression models
- Classification models
- Full evaluation and comparison
The strongest model in both the regression and classification tasks was **Gradient Boosting**, which delivered the best results of all the models tested here.
---
🎥 Watch the full project here: