# 🎬 Movie Revenue Prediction – Full ML Pipeline
This project builds a complete machine learning workflow using real movie metadata.
It includes data cleaning, exploratory data analysis (EDA), feature engineering, clustering, visualization, regression models, classification models, and full performance evaluation.
## 🧪 Part 0 – Initial Research Questions (EDA)
Before any modeling, I asked a few basic questions about the dataset:
### 1️⃣ What is the relationship between budget and revenue?
- Hypothesis: higher budget → higher revenue.
- Result: A clear positive trend, but with many outliers. Big-budget movies tend to earn more, but not always.
### 2️⃣ Is there a strong relationship between runtime and revenue?
- Hypothesis: Longer movies might earn more.
- Result: No strong pattern. Most successful movies fall in a "normal" runtime range (around 90–150 minutes), but runtime alone does not explain revenue.
### 3️⃣ What are the most common original languages in the dataset?
- Result: English dominates by far as the main `original_language`, with a long tail of other languages (French, Spanish, Hindi, etc.).
These EDA steps helped build intuition before moving into modeling.
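A minimal sketch of how these checks can be reproduced with pandas and seaborn (assumes the Kaggle `movies_metadata.csv` layout; the exact plotting code in the project may differ):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("movies_metadata.csv", low_memory=False)
for col in ["budget", "revenue", "runtime"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Budget vs revenue: positive trend with many outliers
sns.scatterplot(data=df, x="budget", y="revenue", alpha=0.3)
plt.show()

# Runtime vs revenue: no strong pattern
sns.scatterplot(data=df, x="runtime", y="revenue", alpha=0.3)
plt.show()

# Most common original languages
print(df["original_language"].value_counts().head(10))
```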
## 🧪 Main ML Research Questions
### 1️⃣ Can we accurately predict a movie's revenue using metadata alone?
We test multiple regression models (Linear, Random Forest, Gradient Boosting) and evaluate how well different features explain revenue.
### 2️⃣ Which features have the strongest impact on movie revenue?
We explore the importance of:
- budget
- vote counts & vote average
- popularity
- profit & profit ratio
- release year & decade
- cluster-based features (cluster_group, distance_to_centroid)
### 3️⃣ Can we classify movies into "high revenue" vs. "low revenue" groups effectively?
We convert revenue into a balanced binary target and apply classification models.
### 4️⃣ Do clustering and unsupervised learning reveal meaningful structure in the dataset?
We use K-Means + PCA to explore hidden groups, outliers, and natural segmentation of movies.
## 🧱 Part 1 – Dataset & Basic Cleaning (Before Any Regression)
### 🔹 1. Loading the Data
- Dataset: `movies_metadata.csv` (from Kaggle)
- Target variable: `revenue` (continuous)
### 🔹 2. Basic Cleaning
- Converted string columns like `budget`, `revenue`, `runtime`, `popularity` to numeric.
- Parsed `release_date` as a datetime.
- Removed clearly invalid rows, such as:
  - `budget == 0`
  - `revenue == 0`
  - `runtime == 0`
This produced a smaller but more reliable dataset.
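A minimal sketch of this cleaning step, assuming the Kaggle CSV layout (the project's notebook may handle edge cases slightly differently):

```python
import pandas as pd

df = pd.read_csv("movies_metadata.csv", low_memory=False)

# Coerce string columns to numeric; unparseable values become NaN
for col in ["budget", "revenue", "runtime", "popularity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Parse release_date as a datetime
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")

# Drop rows with missing or clearly invalid values
df = df.dropna(subset=["budget", "revenue", "runtime"])
df = df[(df["budget"] > 0) & (df["revenue"] > 0) & (df["runtime"] > 0)]
```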
## 📊 Part 2 – Initial EDA (Before Any Model)
Key insights:
- Budget vs Revenue
- Runtime vs Revenue
- Original Language Distribution
These findings motivated the next steps: building a simple baseline model and then adding smarter features.
## 🧪 Part 3 – Baseline Regression (Before Feature Engineering)
### 🎯 Goal
Build a simple baseline model that predicts movie revenue using only a few basic features:
- `budget`
- `runtime`
- `vote_average`
- `vote_count`
### ⚙️ Model
- Linear Regression on the 4 basic features.
- Train/Test split: 80% train / 20% test.
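A minimal sketch of this baseline (assumes the cleaned DataFrame `df` from Part 1 with numeric `vote_average` / `vote_count`; `random_state=42` is an illustrative choice, not necessarily the one used):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

X = df[["budget", "runtime", "vote_average", "vote_count"]]
y = df["revenue"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R²  :", r2_score(y_test, pred))
```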
### 📊 Baseline Regression Results
Using only the basic features:
- MAE ≈ 45,652,741
- RMSE ≈ 79,524,121
- R² ≈ 0.715
📌 Interpretation:
- The model explains about 71.5% of the variance in revenue, which is quite strong for a first, simple model.
- However, the errors (tens of millions) show there is still a lot of noise and missing information, which is expected in movie revenue prediction.
This baseline serves as a reference point before introducing engineered features.
## 🧱 Part 4 – Feature Engineering (Upgrading the Dataset)
To improve model performance, several new features were engineered:
### 🔹 New Numeric Features
- `profit = revenue - budget`
- `profit_ratio = profit / budget`
- `overview_length` = length of the movie overview text
- `release_year` = year extracted from `release_date`
- `decade` = release year grouped by decade (e.g., 1980, 1990, 2000)
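A minimal sketch of these derived columns (assumes the cleaned `df`; missing overview text is treated as an empty string):

```python
# Profit and profit ratio (note: both are derived from revenue, the target)
df["profit"] = df["revenue"] - df["budget"]
df["profit_ratio"] = df["profit"] / df["budget"]

# Length of the overview text
df["overview_length"] = df["overview"].fillna("").str.len()

# Release year and decade
df["release_year"] = df["release_date"].dt.year
df["decade"] = (df["release_year"] // 10) * 10
```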
### 🔹 Categorical Encoding
- `adult` converted from `"True"`/`"False"` to `1`/`0`.
- `original_language` and `status` encoded using One-Hot Encoding (with `drop_first=True` to avoid the dummy variable trap).
### 🔹 Scaling Numerical Features
Used `StandardScaler` to standardize the numeric columns:
`budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`, `profit_ratio`, `overview_length`
Each feature was transformed to have:
- mean ≈ 0
- standard deviation ≈ 1
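A minimal sketch of the encoding and scaling steps (column names as listed above; the notebook's exact code may differ):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Binary encoding of the adult flag
df["adult"] = (df["adult"].astype(str) == "True").astype(int)

# One-Hot Encoding with drop_first=True to avoid the dummy variable trap
df = pd.get_dummies(df, columns=["original_language", "status"], drop_first=True)

# Standardize numeric columns to mean ~0 and std ~1
num_cols = ["budget", "runtime", "vote_average", "vote_count",
            "popularity", "profit", "profit_ratio", "overview_length"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```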
## 🧩 Part 5 – Clustering & PCA (Unsupervised Learning)
### 🔹 K-Means Clustering
- Features used: `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`
- Algorithm: K-Means with `n_clusters=4`.
- New feature: `cluster_group` – each movie assigned to one of 4 clusters.
Rough interpretation of clusters:
- Cluster 0 → low-budget, low-revenue films
- Cluster 1 → mid-range films
- Cluster 2 → big-budget / blockbuster-style movies
- Cluster 3 → more unusual / outlier-like cases
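A minimal sketch of the clustering step (assumes the scaled features from Part 4; `random_state=42` is an illustrative choice):

```python
from sklearn.cluster import KMeans

cluster_features = ["budget", "runtime", "vote_average",
                    "vote_count", "popularity", "profit"]

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df["cluster_group"] = kmeans.fit_predict(df[cluster_features])

print(df["cluster_group"].value_counts())
```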
### 🔹 PCA for Visualization
- Applied PCA (`n_components=2`) on `cluster_features` to reduce dimensionality.
- Created `pca1` and `pca2` for each movie.
- Plotted the movies in 2D using PCA, colored by `cluster_group`.
This allowed visual inspection of how well the clusters separate and which movies sit far from the rest.
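A minimal sketch of the 2D projection and plot (assumes `df`, `cluster_features`, and `cluster_group` from the K-Means step above):

```python
from sklearn.decomposition import PCA
import seaborn as sns
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
coords = pca.fit_transform(df[cluster_features])
df["pca1"], df["pca2"] = coords[:, 0], coords[:, 1]

sns.scatterplot(data=df, x="pca1", y="pca2",
                hue="cluster_group", palette="tab10", alpha=0.5)
plt.title("Movies in PCA space, colored by cluster")
plt.show()
```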
### 🔹 Distance to Centroid (Outlier Feature)
Computed `distance_to_centroid` for each movie: the Euclidean distance between the movie and its cluster center.
Interpretation:
- Small distance → the movie is "typical" for its cluster.
- Large distance → the movie is an outlier within its cluster.
This feature was later used as an additional signal for modeling.
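A minimal sketch of this outlier feature (assumes the fitted `kmeans` object and `cluster_features` from above):

```python
import numpy as np

# Center of the cluster each movie was assigned to
centers = kmeans.cluster_centers_[df["cluster_group"].to_numpy()]

# Euclidean distance from each movie to its own cluster center
df["distance_to_centroid"] = np.linalg.norm(
    df[cluster_features].to_numpy() - centers, axis=1
)
```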
## 🧱 Part 6 – Advanced Regression (With Engineered Features)
### 🎯 Goal
Use the engineered features + clustering-based features to improve regression performance.
### 🔹 Final Feature Set
Included:
- Base numeric: `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`
- Engineered: `profit`, `profit_ratio`, `overview_length`, `release_year`, `decade`
- Clustering: `cluster_group`, `distance_to_centroid`
- One-Hot columns: all `original_language_...` and `status_...` columns
### 🔹 Models Trained
- Linear Regression (on the enriched feature set)
- Random Forest Regressor
- Gradient Boosting Regressor
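A minimal sketch of training and comparing the three regressors (assumes the enriched `df` from the previous parts; hyperparameters are defaults, not necessarily those used in the notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

feature_cols = (
    ["budget", "runtime", "vote_average", "vote_count", "popularity",
     "profit", "profit_ratio", "overview_length", "release_year", "decade",
     "cluster_group", "distance_to_centroid"]
    + [c for c in df.columns if c.startswith(("original_language_", "status_"))]
)
X, y = df[feature_cols], df["revenue"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: "
          f"MAE={mean_absolute_error(y_test, pred):,.0f}  "
          f"RMSE={np.sqrt(mean_squared_error(y_test, pred)):,.0f}  "
          f"R²={r2_score(y_test, pred):.4f}")
```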
### 📊 Regression Results (With Engineered Features)
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Linear Regression | ~0 (leakage) | ~0 | 1.00 |
| Random Forest | 1,964,109 | 7,414,303 | 0.9975 |
| Gradient Boosting | 2,255,268 | 5,199,504 | 0.9988 |
📌 Note:
- The Linear Regression result is unrealistically perfect due to data leakage (features like `profit` are directly derived from `revenue`).
- The real, meaningful comparison is between Random Forest and Gradient Boosting.
## 🏆 Regression Winner
🥇 **Gradient Boosting Regressor**
- Highest R²
- Lowest RMSE
- Best at capturing non-linear relationships
## 🧱 Part 7 – Turning Regression into Classification
Instead of predicting the exact revenue, we converted the problem to a binary classification task:
- Class 0: revenue < median(revenue)
- Class 1: revenue ≥ median(revenue)
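A minimal sketch of the median split (the target column name `high_revenue` is illustrative):

```python
# 1 = high revenue (at or above the median), 0 = low revenue
median_revenue = df["revenue"].median()
df["high_revenue"] = (df["revenue"] >= median_revenue).astype(int)

print(df["high_revenue"].value_counts())
```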
### 📊 Class Balance
- Class 1 (high revenue): 2687
- Class 0 (low revenue): 2682
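A minimal sketch of the three classifiers and their metrics (reuses `feature_cols` from the Part 6 sketch and the `high_revenue` target above; hyperparameters are defaults, not necessarily those used in the notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

X, y = df[feature_cols], df["high_revenue"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(f"{name}: "
          f"acc={accuracy_score(y_test, pred):.3f}  "
          f"prec={precision_score(y_test, pred):.3f}  "
          f"rec={recall_score(y_test, pred):.3f}  "
          f"F1={f1_score(y_test, pred):.3f}")
```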
### 📊 Classification Results
#### Logistic Regression
- Accuracy: **0.977**
- Precision: **0.984**
- Recall: **0.968**
- F1: **0.976**
#### Random Forest
- Accuracy: **0.986**
- Precision: **0.988**
- Recall: **0.982**
- F1: **0.985**
#### Gradient Boosting Classifier
- Accuracy: **0.990**
- Precision: **0.990**
- Recall: **0.990**
- F1: **0.990**
---
## 🏆 Classification Winner
🥇 **Gradient Boosting Classifier**
- Highest accuracy
- Balanced precision & recall
- Best overall performance
---
## 🛠️ Tools Used
- Python
- pandas / numpy
- scikit-learn
- seaborn / matplotlib
- Google Colab
---
## 🎯 Final Summary
This project demonstrates a complete machine learning workflow:
- Data preprocessing
- Feature engineering
- K-Means clustering
- PCA visualization
- Regression models
- Classification models
- Full evaluation and comparison
The strongest model in both the regression and classification tasks was **Gradient Boosting**, which delivered the best results of all the models tested here.
---
🎥 Watch the full project here: