From Data to Deployment: Building a Crop Yield Prediction App
Python · Pandas · Scikit-learn · Streamlit · ML Deployment
Predicting crop yield can help farmers, agri-startups, and policy makers make better decisions about resources, pricing, and risk. In this post, I share how I’m building a complete end-to-end Crop Yield Prediction project—from data collection & cleaning to modeling and deploying an interactive app with Streamlit.
1) Problem & Goal
Crop yields fluctuate due to rainfall, temperature, soil quality, and farming practices. The goal is to build a model that predicts yield for a given crop and location using historical climate and production data—then expose it via a simple web interface so anyone can try it.
Figure: Yield depends on climate patterns—our model learns those relationships.
2) Data: Sources & Cleaning
I started with open datasets that include year, state/district, crop, production (yield), rainfall, min/max temperature. After merging sources, I cleaned the data: fixed column names, handled missing values, converted types, and created features.
Sample Cleaning Code
# cleaning.py
import pandas as pd

df_yield = pd.read_csv("data/yield.csv")
df_weather = pd.read_csv("data/weather.csv")

# Standardize column names: strip whitespace, lowercase, snake_case
df_yield.columns = df_yield.columns.str.strip().str.lower().str.replace(" ", "_")
df_weather.columns = df_weather.columns.str.strip().str.lower().str.replace(" ", "_")

# Basic checks
print(df_yield.head())
print(df_weather.head())

# Merge on the shared keys (here: year and state)
df = df_yield.merge(df_weather, on=["year", "state"], how="left")

# Handle missing values: interpolate, then fall back to the median
for col in ["rainfall_mm", "avg_temp_c"]:
    df[col] = df[col].interpolate().fillna(df[col].median())

# Sort so rolling windows follow chronological order within each state
df = df.sort_values(["state", "year"]).reset_index(drop=True)

# Feature engineering: 3-year rolling climate averages per state
df["rainfall_rolling"] = df.groupby("state")["rainfall_mm"].transform(
    lambda s: s.rolling(3, min_periods=1).mean())
df["temp_rolling"] = df.groupby("state")["avg_temp_c"].transform(
    lambda s: s.rolling(3, min_periods=1).mean())

df.to_csv("data/cleaned.csv", index=False)
print("Saved → data/cleaned.csv")
3) Exploratory Data Analysis (EDA)
Before modeling, I explored relationships: rainfall vs. yield, temperature vs. yield, and recent trends. Below is a quick example using Matplotlib.
# eda.py
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/cleaned.csv")

# Average rice yield and rainfall per year
subset = (df[df["crop"] == "Rice"]
          .groupby("year", as_index=False)
          .agg({"yield_t_per_ha": "mean", "rainfall_mm": "mean"}))

plt.figure()
plt.plot(subset["year"], subset["yield_t_per_ha"], label="Yield (t/ha)")
plt.plot(subset["year"], subset["rainfall_mm"] / 1000, label="Rainfall (scaled)")
plt.title("Rice: Yield vs Rainfall (avg by year)")
plt.xlabel("Year")
plt.legend()
plt.tight_layout()
plt.savefig("images/eda_rice_trend.png")
print("Saved → images/eda_rice_trend.png")
Figure: Example trend visualization (placeholder—replace with your plot).
4) Modeling: Baseline → Better
I began with a simple LinearRegression baseline, then tried RandomForestRegressor and XGBRegressor to capture non-linear relationships and interactions.
# model.py
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
# from xgboost import XGBRegressor  # optional, if installed

df = pd.read_csv("data/cleaned.csv")
features = ["rainfall_mm", "avg_temp_c", "rainfall_rolling", "temp_rolling"]
X = df[features]
y = df["yield_t_per_ha"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The linear baseline benefits from scaling; trees do not need it
pipe_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipe_rf = Pipeline([
    ("model", RandomForestRegressor(n_estimators=300, random_state=42)),
])

for name, model in [("Linear", pipe_lr), ("RandomForest", pipe_rf)]:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    print(f"{name} → MAE: {mae:.3f} | R²: {r2:.3f}")

# Save the stronger model (here the random-forest pipeline) for the app
joblib.dump(pipe_rf, "models/best_model.joblib")
print("Saved → models/best_model.joblib")
5) Evaluation & Interpretation
I tracked MAE and R². To understand the model, I inspected feature importance (for tree models) and plan to add SHAP later for deeper explanations.
# importance.py
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
# (Assume X_train, y_train from previous code)
rf = RandomForestRegressor(n_estimators=300, random_state=42).fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances)
Result: In my initial runs, rolling rainfall and average temperature were strong predictors—intuitively sensible for yield.
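As a preview of that SHAP step, here is a minimal sketch of what it could look like. This assumes the shap package is installed (pip install shap) and reuses the fitted rf and the X_test split from the snippets above; it is a starting point, not the final explainability setup.
# shap_explain.py (sketch: assumes shap is installed and rf, X_test exist from above)
import shap
import matplotlib.pyplot as plt

# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# Summary plot of per-feature impact on the predicted yield
shap.summary_plot(shap_values, X_test, show=False)
plt.tight_layout()
plt.savefig("images/shap_summary.png")
print("Saved → images/shap_summary.png")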
6) Deploying with Streamlit
Streamlit makes it easy to turn the model into an app. Users choose crop/location, enter rainfall/temperature, and get a predicted yield.
# app.py
import streamlit as st
import pandas as pd
import joblib

st.set_page_config(page_title="Crop Yield Prediction", page_icon="🌾")
st.title("🌾 Crop Yield Prediction")
st.write("Enter climate features and get an estimated yield (t/ha).")

model = joblib.load("models/best_model.joblib")

rain = st.number_input("Rainfall (mm)", min_value=0.0, value=850.0, step=10.0)
temp = st.number_input("Avg Temp (°C)", min_value=-5.0, value=27.0, step=0.1)
rain_roll = st.number_input("Rolling Rainfall (mm)", min_value=0.0, value=800.0, step=10.0)
temp_roll = st.number_input("Rolling Temp (°C)", min_value=-5.0, value=26.5, step=0.1)

if st.button("Predict"):
    X = pd.DataFrame([{
        "rainfall_mm": rain,
        "avg_temp_c": temp,
        "rainfall_rolling": rain_roll,
        "temp_rolling": temp_roll,
    }])
    pred = model.predict(X)[0]
    st.success(f"Estimated Yield: **{pred:.2f} t/ha**")
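To try the app locally before deploying, run streamlit run app.py from the project root and open the URL Streamlit prints.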
Deploy Steps (Streamlit Cloud)
- Push your code to GitHub (include requirements.txt and models/best_model.joblib).
- Go to share.streamlit.io (Streamlit Community Cloud) and connect your repo.
- Select app.py, set the Python version, and deploy.
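For reference, a minimal requirements.txt for this app could look like the sketch below; the exact list depends on your code, and pinning the versions from your local environment is safer than leaving them open.
# requirements.txt (minimal sketch; pin versions from your environment)
streamlit
pandas
scikit-learn
joblib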
7) What’s Next
- Add more features (soil type, NDVI/satellite indices).
- Do hyperparameter tuning (GridSearchCV/Optuna); see the sketch after this list.
- Explainability with SHAP plots for feature impact.
- Location dropdowns and crop-specific models in the app.
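For the tuning item above, a minimal GridSearchCV sketch over the random forest could look like this. The parameter grid is illustrative rather than tuned, and X_train/y_train are assumed to come from model.py.
# tuning.py (sketch: illustrative grid, reuses X_train, y_train from model.py)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [200, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # matches the MAE metric used above
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV MAE:", -search.best_score_)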
8) Links
• Live App (Streamlit): [add when deployed]
• Portfolio: rethickcb.netlify.app

