From Data to Deployment: Building a Crop Yield Prediction App

By Rethick CB · 3rd-sem CSE (Data Science)

Python Pandas Scikit-learn Streamlit ML Deployment

Predicting crop yield can help farmers, agri-startups, and policy makers make better decisions about resources, pricing, and risk. In this post, I share how I’m building a complete end-to-end Crop Yield Prediction project—from data collection & cleaning to modeling and deploying an interactive app with Streamlit.

1) Problem & Goal

Crop yields fluctuate due to rainfall, temperature, soil quality, and farming practices. The goal is to build a model that predicts yield for a given crop and location using historical climate and production data—then expose it via a simple web interface so anyone can try it.

Figure: Crop fields with weather overlays. Yield depends on climate patterns, and the model learns those relationships.

2) Data: Sources & Cleaning

I started with open datasets that include year, state/district, crop, production (yield), rainfall, min/max temperature. After merging sources, I cleaned the data: fixed column names, handled missing values, converted types, and created features.

Sample Cleaning Code

# cleaning.py
import pandas as pd
import numpy as np

df_yield = pd.read_csv("data/yield.csv")
df_weather = pd.read_csv("data/weather.csv")

# Standardize columns
df_yield.columns = df_yield.columns.str.strip().str.lower().str.replace(" ", "_")
df_weather.columns = df_weather.columns.str.strip().str.lower().str.replace(" ", "_")

# Basic checks
print(df_yield.head())
print(df_weather.head())

# Merge on the keys both tables share (here: year, state); weather rows repeat across crops
df = df_yield.merge(df_weather, on=["year", "state"], how="left")

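# Sort by state and year so interpolation and rolling windows follow time order
df = df.sort_values(["state", "year"]).reset_index(drop=True)

# Handle missing values: interpolate within each state, then fall back to the column median
for col in ["rainfall_mm", "avg_temp_c"]:
    df[col] = df.groupby("state")[col].transform(lambda s: s.interpolate())
    df[col] = df[col].fillna(df[col].median())

# Feature engineering: 3-row rolling means per state
# (a window covers 3 rows, which is 3 years only if each state has one row per year)
df["rainfall_rolling"] = df.groupby("state")["rainfall_mm"].transform(lambda s: s.rolling(3, min_periods=1).mean())
df["temp_rolling"] = df.groupby("state")["avg_temp_c"].transform(lambda s: s.rolling(3, min_periods=1).mean())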

df.to_csv("data/cleaned.csv", index=False)
print("Saved → data/cleaned.csv")
Tip: Keep raw, interim, and processed data in separate folders to avoid accidental overwrites.
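
A left merge like the one above can silently duplicate rows if the weather table has more than one row per (year, state), and it leaves gaps wherever a key has no match. Here is a minimal sketch of a sanity check using the `validate` and `indicator` options of pandas `merge`. The rows below are made-up toy values, not data from this project.

```python
import pandas as pd

# Toy stand-ins for the yield and weather tables (hypothetical values)
df_yield = pd.DataFrame({
    "year": [2019, 2019, 2020, 2021],
    "state": ["Kerala", "Punjab", "Kerala", "Goa"],
    "crop": ["Rice", "Wheat", "Rice", "Rice"],
    "yield_t_per_ha": [2.9, 4.8, 3.1, 2.5],
})
df_weather = pd.DataFrame({
    "year": [2019, 2019, 2020],
    "state": ["Kerala", "Punjab", "Kerala"],
    "rainfall_mm": [2900.0, 650.0, 3100.0],
})

# validate="many_to_one" raises MergeError if weather has duplicate (year, state) keys.
# indicator=True adds a "_merge" column that records whether each row found a match.
merged = df_yield.merge(
    df_weather,
    on=["year", "state"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows tagged "left_only" have no weather record and will need imputation
unmatched = merged[merged["_merge"] == "left_only"]
print(f"rows: {len(merged)}, without weather data: {len(unmatched)}")
```

Running a check like this before imputing makes it easier to see how much of the rainfall and temperature data is actually observed versus filled in.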

3) Exploratory Data Analysis (EDA)

Before modeling, I explored relationships: rainfall vs. yield, temperature vs. yield, and recent trends. Below is a quick example using Matplotlib.

# eda.py
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/cleaned.csv")

subset = df[df["crop"]=="Rice"].groupby("year", as_index=False).agg({"yield_t_per_ha":"mean","rainfall_mm":"mean"})
plt.figure()
plt.plot(subset["year"], subset["yield_t_per_ha"], label="Yield")
plt.plot(subset["year"], subset["rainfall_mm"]/1000, label="Rainfall (scaled)")
plt.title("Rice: Yield vs Rainfall (avg by year)")
plt.xlabel("Year")
plt.legend()
plt.tight_layout()
plt.savefig("images/eda_rice_trend.png")
print("Saved → images/eda_rice_trend.png")

Figure: Line chart of rice yield vs. rainfall over time (example trend visualization; placeholder, replace with your plot).
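
Alongside plots, a quick numeric check of the rainfall vs. yield and temperature vs. yield relationships can be useful. The sketch below uses the column names from this post on a few made-up rows, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical sample rows using the post's column names
df = pd.DataFrame({
    "yield_t_per_ha": [2.1, 2.4, 2.8, 3.0, 2.6, 3.3],
    "rainfall_mm": [700, 820, 950, 1010, 880, 1100],
    "avg_temp_c": [29.5, 28.8, 27.9, 27.5, 28.4, 27.0],
})

# Pearson correlation of each climate feature with yield
corr = df.corr()["yield_t_per_ha"].drop("yield_t_per_ha")
print(corr.sort_values(ascending=False))
```

Pearson correlation only captures linear relationships, so a weak value here does not rule out a useful non-linear signal for the tree models.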

4) Modeling: Baseline → Better

I began with a simple LinearRegression baseline, then tried RandomForestRegressor and XGBRegressor to capture non-linear relationships and interactions.

# model.py
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
# from xgboost import XGBRegressor  # optional if installed

df = pd.read_csv("data/cleaned.csv")

features = ["rainfall_mm","avg_temp_c","rainfall_rolling","temp_rolling"]
X = df[features]
y = df["yield_t_per_ha"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe_lr = Pipeline([
  ("scaler", StandardScaler()),
  ("model", LinearRegression())
])

pipe_rf = Pipeline([
  ("model", RandomForestRegressor(n_estimators=300, random_state=42))
])

for name, model in [("Linear", pipe_lr), ("RandomForest", pipe_rf)]:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    print(f"{name} → MAE: {mae:.3f} | R²: {r2:.3f}")
Evaluate on a held-out test set and keep a simple, readable pipeline. Save the best model with joblib.
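
The listing above prints the metrics but does not show the save step. Here is a minimal sketch of picking the candidate with the lowest MAE and persisting it with joblib. It runs on synthetic data and writes to a temporary folder, both of which are assumptions for illustration; in the project, the path would be `models/best_model.joblib`, which is what the app loads.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the cleaned dataset (4 features, like the post)
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 4))
y = 3.0 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "Linear": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit every candidate and record its MAE on the held-out split
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

# Lower MAE is better
best_name = min(scores, key=scores.get)
best_model = candidates[best_name]

# Temporary folder for this sketch; the app expects models/best_model.joblib
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "best_model.joblib")
joblib.dump(best_model, model_path)

# Reload to confirm the saved file round-trips
reloaded = joblib.load(model_path)
print(best_name, scores)
```

Choosing the winner on the same split used for reporting makes the reported score slightly optimistic; a separate validation split would be a cleaner setup once the project grows.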

5) Evaluation & Interpretation

I tracked MAE and R². To understand the model, I inspected feature importance (for tree models) and plan to add SHAP later for deeper explanations.

# importance.py
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# (Assume X_train, y_train from previous code)
rf = RandomForestRegressor(n_estimators=300, random_state=42).fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances)
Result: In my initial runs, rolling rainfall and average temperature were strong predictors—intuitively sensible for yield.
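
For readers new to the two metrics tracked in this section, here is a small sketch that computes them by hand in plain Python. The function names and the numbers are hypothetical and exist only to make the arithmetic visible; in the project, scikit-learn's `mean_absolute_error` and `r2_score` do this work.

```python
def mae_by_hand(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up yields in t/ha
actual = [2.0, 3.0, 4.0, 5.0]
predicted = [2.5, 3.0, 3.5, 5.0]

mae = mae_by_hand(actual, predicted)
r2 = r2_by_hand(actual, predicted)
print(f"MAE: {mae:.3f} | R2: {r2:.3f}")  # MAE: 0.250 | R2: 0.900
```

MAE is in the same unit as the target (t/ha), which makes it easy to explain to non-technical users. R² says how much of the variance in yield the model accounts for, relative to always predicting the mean.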

6) Deploying with Streamlit

Streamlit makes it easy to turn the model into an app. Users choose crop/location, enter rainfall/temperature, and get a predicted yield.

# app.py
import streamlit as st
import pandas as pd
import joblib

st.set_page_config(page_title="Crop Yield Prediction", page_icon="🌾")

st.title("🌾 Crop Yield Prediction")
st.write("Enter climate features and get an estimated yield (t/ha).")

model = joblib.load("models/best_model.joblib")

rain = st.number_input("Rainfall (mm)", min_value=0.0, value=850.0, step=10.0)
temp  = st.number_input("Avg Temp (°C)", min_value=-5.0, value=27.0, step=0.1)
rain_roll = st.number_input("Rolling Rainfall (mm)", min_value=0.0, value=800.0, step=10.0)
temp_roll  = st.number_input("Rolling Temp (°C)", min_value=-5.0, value=26.5, step=0.1)

if st.button("Predict"):
    X = pd.DataFrame([{
      "rainfall_mm": rain,
      "avg_temp_c": temp,
      "rainfall_rolling": rain_roll,
      "temp_rolling": temp_roll
    }])
    pred = model.predict(X)[0]
    st.success(f"Estimated Yield: **{pred:.2f} t/ha**")

Deploy Steps (Streamlit Cloud)

  1. Push your code to GitHub (include requirements.txt and models/best_model.joblib).
  2. Go to share.streamlit.io (or Streamlit Community Cloud) and connect your repo.
  3. Select app.py, set Python version, and deploy.
Keep large datasets out of the repo; store only the cleaned sample or the trained model.

7) What’s Next

  • Add more features (soil type, NDVI/satellite indices).
  • Do hyperparameter tuning (GridSearchCV/Optuna).
  • Explainability with SHAP plots for feature impact.
  • Location dropdowns and crop-specific models in the app.

• GitHub: github.com/cbrethick/crop-yield-prediction
• Live App (Streamlit): [add when deployed]
• Portfolio: rethickcb.netlify.app
