# 🎬 Movie Revenue Prediction - Full ML Pipeline

This project builds a complete machine learning workflow using real movie metadata.
It includes data cleaning, exploratory data analysis (EDA), feature engineering, clustering, visualization, regression models, classification models, and full performance evaluation.


## 🧪 Part 0 - Initial Research Questions (EDA)

Before any modeling, I asked a few basic questions about the dataset:

### 1️⃣ What is the relationship between budget and revenue?

- Hypothesis: Higher budget → higher revenue.
- Result: A clear positive trend, but with many outliers. Big-budget movies tend to earn more, but not always.

### 2️⃣ Is there a strong relationship between runtime and revenue?

- Hypothesis: Longer movies might earn more.
- Result: No strong pattern. Most successful movies fall in a "normal" runtime range (around 90–150 minutes), but runtime alone does not explain revenue.

### 3️⃣ What are the most common original languages in the dataset?

- Result: English dominates by far as the main original_language, with a long tail of other languages (French, Spanish, Hindi, etc.).

These EDA steps helped build intuition before moving into modeling.


## 🧪 Main ML Research Questions

### 1️⃣ Can we accurately predict a movie's revenue using metadata alone?

We test multiple regression models (Linear, Random Forest, Gradient Boosting) and evaluate how well different features explain revenue.

### 2️⃣ Which features have the strongest impact on movie revenue?

We explore the importance of:

- budget
- vote counts & vote average
- popularity
- profit & profit ratio
- release year & decade
- cluster-based features (cluster_group, distance_to_centroid)

### 3️⃣ Can we classify movies into "high revenue" vs. "low revenue" groups effectively?

We convert revenue into a balanced binary target and apply classification models.

### 4️⃣ Do clustering and unsupervised learning reveal meaningful structure in the dataset?

We use K-Means + PCA to explore hidden groups, outliers, and natural segmentation of movies.


## 🧱 Part 1 - Dataset & Basic Cleaning (Before Any Regression)

### 🔹 1. Loading the Data

- Dataset: movies_metadata.csv (from Kaggle)
- Target variable: revenue (continuous)

### 🔹 2. Basic Cleaning

- Converted string columns like budget, revenue, runtime, and popularity to numeric.
- Parsed release_date as a datetime.
- Removed clearly invalid rows, such as:
  - budget == 0
  - revenue == 0
  - runtime == 0

This produced a smaller but more reliable dataset.
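
A minimal sketch of this cleaning step in pandas, assuming the standard Kaggle column names:

```python
import pandas as pd

# Load the raw metadata (low_memory=False avoids mixed-dtype warnings).
df = pd.read_csv("movies_metadata.csv", low_memory=False)

# Coerce string columns to numeric; invalid entries become NaN.
for col in ["budget", "revenue", "runtime", "popularity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Parse release_date; unparseable dates become NaT.
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")

# Drop rows with missing key values, then the clearly invalid zero-valued rows.
df = df.dropna(subset=["budget", "revenue", "runtime", "release_date"])
df = df[(df["budget"] > 0) & (df["revenue"] > 0) & (df["runtime"] > 0)]
```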


## 📊 Part 2 - Initial EDA (Before Any Model)

Key insights:

- Budget vs Revenue: positive trend; higher budgets tend to lead to higher revenue, but with big variability and outliers.
- Runtime vs Revenue: no strong linear correlation; being "very long" or "very short" does not guarantee success.
- Original Language Distribution: English is by far the most common language; most of the dataset is dominated by English-language films.

These findings motivated the next steps: building a simple baseline model and then adding smarter features.
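
A minimal sketch of these EDA plots (seaborn/matplotlib, using the cleaned `df` from Part 1):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Budget vs revenue: positive trend with heavy outliers.
sns.scatterplot(data=df, x="budget", y="revenue", alpha=0.4)
plt.title("Budget vs Revenue")
plt.show()

# Runtime vs revenue: no strong linear pattern.
sns.scatterplot(data=df, x="runtime", y="revenue", alpha=0.4)
plt.title("Runtime vs Revenue")
plt.show()

# Original-language counts: English dominates, with a long tail.
df["original_language"].value_counts().head(10).plot(kind="bar")
plt.title("Top 10 Original Languages")
plt.show()
```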


## 🧪 Part 3 - Baseline Regression (Before Feature Engineering)

### 🎯 Goal

Build a simple baseline model that predicts movie revenue using only a few basic features:

- budget
- runtime
- vote_average
- vote_count

βš™οΈ Model

- Linear Regression on the 4 basic features.
- Train/Test split: 80% train / 20% test.
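
A minimal sketch of this baseline, assuming the cleaned `df` from Part 1 (the split seed is illustrative, so exact metrics will vary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = df[["budget", "runtime", "vote_average", "vote_count"]]
y = df["revenue"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)

print(f"MAE  = {mean_absolute_error(y_test, pred):,.0f}")
print(f"RMSE = {np.sqrt(mean_squared_error(y_test, pred)):,.0f}")
print(f"R^2  = {r2_score(y_test, pred):.3f}")
```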

### 📊 Baseline Regression Results

Using only the basic features:

- MAE ≈ 45,652,741
- RMSE ≈ 79,524,121
- R² ≈ 0.715

📌 Interpretation:

- The model explains about 71.5% of the variance in revenue, which is quite strong for a first, simple model.
- However, the errors (tens of millions of dollars) show there is still a lot of noise and missing information, which is expected in movie revenue prediction.

This baseline serves as a reference point before introducing engineered features.


## 🧱 Part 4 - Feature Engineering (Upgrading the Dataset)

To improve model performance, several new features were engineered:

### 🔹 New Numeric Features

- profit = revenue - budget
- profit_ratio = profit / budget
- overview_length = length of the movie overview text
- release_year = year extracted from release_date
- decade = release year grouped by decade (e.g., 1980, 1990, 2000)

### 🔹 Categorical Encoding

- adult converted from "True"/"False" to 1/0.
- original_language and status encoded using One-Hot Encoding (with drop_first=True to avoid the dummy variable trap).
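
A minimal sketch of these transformations (the exact column handling is an assumption about the raw CSV):

```python
import pandas as pd

# New numeric features.
df["profit"] = df["revenue"] - df["budget"]
df["profit_ratio"] = df["profit"] / df["budget"]
df["overview_length"] = df["overview"].fillna("").astype(str).str.len()
df["release_year"] = df["release_date"].dt.year
df["decade"] = (df["release_year"] // 10) * 10

# adult arrives as the strings "True"/"False" in the raw CSV.
df["adult"] = (df["adult"] == "True").astype(int)

# One-hot encode the categoricals; drop_first avoids the dummy variable trap.
df = pd.get_dummies(df, columns=["original_language", "status"], drop_first=True)
```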

### 🔹 Scaling Numerical Features

Used StandardScaler to standardize numeric columns:

- budget, runtime, vote_average, vote_count, popularity, profit, profit_ratio, overview_length

Each feature was transformed to have:

- mean ≈ 0
- standard deviation ≈ 1
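
A minimal sketch of the scaling step. Note that fitting the scaler on the full dataset before the train/test split is convenient, but strictly speaking it lets test-set statistics leak into training; a stricter setup would fit the scaler on the training split only.

```python
from sklearn.preprocessing import StandardScaler

numeric_cols = [
    "budget", "runtime", "vote_average", "vote_count",
    "popularity", "profit", "profit_ratio", "overview_length",
]

# Standardize each column to mean ~0 and standard deviation ~1.
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
```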

## 🧩 Part 5 - Clustering & PCA (Unsupervised Learning)

### 🔹 K-Means Clustering

- Features used: budget, runtime, vote_average, vote_count, popularity, profit
- Algorithm: K-Means with n_clusters=4.
- New feature: cluster_group, assigning each movie to one of 4 clusters.
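
A minimal sketch, assuming the scaled features from Part 4 (`n_init` and `random_state` are illustrative):

```python
from sklearn.cluster import KMeans

cluster_features = ["budget", "runtime", "vote_average",
                    "vote_count", "popularity", "profit"]

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
df["cluster_group"] = kmeans.fit_predict(df[cluster_features])
```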

Rough interpretation of clusters:

- Cluster 0: low-budget, low-revenue films
- Cluster 1: mid-range films
- Cluster 2: big-budget / blockbuster-style movies
- Cluster 3: more unusual / outlier-like cases

### 🔹 PCA for Visualization

- Applied PCA (n_components=2) on cluster_features to reduce dimensionality.
- Created pca1 and pca2 for each movie.
- Plotted the movies in 2D using PCA, colored by cluster_group.

This allowed visual inspection of:

- Cluster separation
- Overlaps
- Global structure in the data
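
A minimal sketch of the projection and plot, continuing from the K-Means step:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

# Project the clustering features onto 2 principal components.
pca = PCA(n_components=2)
coords = pca.fit_transform(df[cluster_features])
df["pca1"], df["pca2"] = coords[:, 0], coords[:, 1]

sns.scatterplot(data=df, x="pca1", y="pca2",
                hue="cluster_group", palette="tab10", alpha=0.5)
plt.title("K-Means Clusters Projected onto 2 Principal Components")
plt.show()
```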

### 🔹 Distance to Centroid (Outlier Feature)

Computed:

- distance_to_centroid for each movie = Euclidean distance between the movie and its cluster center.

Interpretation:

- Small distance → the movie is "typical" for its cluster.
- Large distance → the movie is an outlier within its cluster.

This feature was later used as an additional signal for modeling.
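
A minimal sketch using the fitted `kmeans` from above:

```python
import numpy as np

# Look up each movie's cluster center, then take the Euclidean distance
# between the movie's feature vector and that center.
centers = kmeans.cluster_centers_[df["cluster_group"]]
df["distance_to_centroid"] = np.linalg.norm(
    df[cluster_features].to_numpy() - centers, axis=1
)
```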


## 🧱 Part 6 - Advanced Regression (With Engineered Features)

### 🎯 Goal

Use the engineered features + clustering-based features to improve regression performance.

### 🔹 Final Feature Set

Included:

- Base numeric: budget, runtime, vote_average, vote_count, popularity
- Engineered: profit, profit_ratio, overview_length, release_year, decade
- Clustering: cluster_group, distance_to_centroid
- One-hot columns: all original_language_... and status_... columns

### 🔹 Models Trained

- Linear Regression (on the enriched feature set)
- Random Forest Regressor
- Gradient Boosting Regressor
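
A minimal sketch of the comparison (hyperparameters are left at scikit-learn defaults, which is an assumption):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Enriched feature set: base + engineered + clustering + one-hot columns.
feature_cols = (
    ["budget", "runtime", "vote_average", "vote_count", "popularity",
     "profit", "profit_ratio", "overview_length", "release_year", "decade",
     "cluster_group", "distance_to_centroid"]
    + [c for c in df.columns if c.startswith(("original_language_", "status_"))]
)

X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["revenue"], test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    print(f"{name}: "
          f"MAE={mean_absolute_error(y_test, pred):,.0f}  "
          f"RMSE={np.sqrt(mean_squared_error(y_test, pred)):,.0f}  "
          f"R^2={r2_score(y_test, pred):.4f}")
```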

### 📊 Regression Results (With Engineered Features)

| Model | MAE | RMSE | R² |
| --- | --- | --- | --- |
| Linear Regression | ~0 (leakage) | ~0 | 1.00 |
| Random Forest | 1,964,109 | 7,414,303 | 0.9975 |
| Gradient Boosting | 2,255,268 | 5,199,504 | 0.9988 |

📌 Note:

- The Linear Regression result is unrealistically perfect due to data leakage (features like profit are derived directly from revenue).
- The real, meaningful comparison is between Random Forest and Gradient Boosting.

πŸ† Regression Winner

🔥 **Gradient Boosting Regressor**

- Highest R²
- Lowest RMSE
- Best at capturing non-linear relationships

## 🧱 Part 7 - Turning Regression into Classification

Instead of predicting the exact revenue, we converted the problem to a binary classification task:

- Class 0: revenue < median(revenue)
- Class 1: revenue ≥ median(revenue)

### 📊 Class Balance

- Class 1 (high revenue): 2687
- Class 0 (low revenue): 2682
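
A minimal sketch of the target construction (the column name `high_revenue` is illustrative):

```python
# A median split gives a near-perfectly balanced binary target.
df["high_revenue"] = (df["revenue"] >= df["revenue"].median()).astype(int)
print(df["high_revenue"].value_counts())
```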


### 📊 Classification Results

#### Logistic Regression
- Accuracy: **0.977**
- Precision: **0.984**
- Recall: **0.968**
- F1: **0.976**

#### Random Forest
- Accuracy: **0.986**
- Precision: **0.988**
- Recall: **0.982**
- F1: **0.985**

#### Gradient Boosting Classifier
- Accuracy: **0.990**
- Precision: **0.990**
- Recall: **0.990**
- F1: **0.990**
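
A minimal sketch of the classification comparison, reusing `feature_cols` from Part 6 (defaults and `random_state` are illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["high_revenue"],
    test_size=0.2, random_state=42, stratify=df["high_revenue"]
)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, clf in classifiers.items():
    pred = clf.fit(X_train, y_train).predict(X_test)
    print(f"{name}: "
          f"acc={accuracy_score(y_test, pred):.3f}  "
          f"prec={precision_score(y_test, pred):.3f}  "
          f"rec={recall_score(y_test, pred):.3f}  "
          f"f1={f1_score(y_test, pred):.3f}")
```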

---

## πŸ† Classification Winner  
πŸ”₯ **Gradient Boosting Classifier**  
- Highest accuracy  
- Balanced precision & recall  
- Best overall performance  

---

## 📌 Tools Used
- Python  
- pandas / numpy  
- scikit-learn  
- seaborn / matplotlib  
- Google Colab  

---

## 🎯 Final Summary
This project demonstrates a complete machine learning workflow:
- Data preprocessing  
- Feature engineering  
- K-Means clustering  
- PCA visualization  
- Regression models  
- Classification models  
- Full evaluation and comparison  

The strongest model in both the regression and classification tasks was **Gradient Boosting**, delivering the best performance of all models tested.

---

🎥 Watch the full project here:

https://www.loom.com/share/303dfe317514455db992438357cf8cb4
