A Coding Guide to Survey Bias Correction Using Facebook Research Balance with IPW, CBPS, Raking, and Post-Stratification Methods
In this tutorial, we walk through a complete, end-to-end workflow for correcting bias in survey data using the balance library. We simulate a realistic population, deliberately introduce sampling bias, and then apply several re-weighting methods to recover unbiased estimates. We focus on four widely used techniques: Inverse Probability Weighting (IPW), Covariate Balancing Propensity Scores (CBPS), raking, and post-stratification, and evaluate how effectively each method restores balance between the sample and the target population. Throughout the process, we examine diagnostics such as ASMD, outcome estimates, and design effects to build a strong intuitive and practical understanding of survey weighting.
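Before diving in, it helps to know what ASMD measures: how far a sample covariate's mean sits from the target's, in standard-deviation units. A minimal one-variable sketch (illustrative only — balance's own implementation also handles categorical covariates and may differ in scaling details; one common convention divides by the target's standard deviation):

```python
import numpy as np

def asmd(sample_col, target_col):
    """Absolute standardized mean difference, scaled by the
    target's standard deviation (one common convention)."""
    return abs(sample_col.mean() - target_col.mean()) / target_col.std()

rng = np.random.default_rng(0)
target_age = rng.normal(45, 17, 10_000)  # population ages
sample_age = rng.normal(38, 15, 1_000)   # biased sample skews younger
print(f"ASMD(age) = {asmd(sample_age, target_age):.3f}")  # well above 0.10
```

Values above the conventional 0.10 threshold flag a covariate as meaningfully imbalanced.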
import subprocess, sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "balance"])
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from balance import Sample
np.random.seed(2024)
sns.set_theme(style="whitegrid", context="notebook")
We begin by installing the balance package and importing all the required libraries for data manipulation and visualization. We set a random seed to ensure reproducibility and configure plotting aesthetics for clearer diagnostics. This setup prepares a clean, consistent environment for running the full reweighting workflow.
def simulate_population(n=50_000):
    age = np.clip(np.random.normal(45, 17, n), 18, 90).astype(int)
    gender = np.random.choice(["M", "F"], size=n, p=[0.49, 0.51])
    education = np.random.choice(
        ["HS", "SomeCollege", "Bachelor", "Graduate"],
        size=n, p=[0.35, 0.25, 0.25, 0.15],
    )
    income = np.exp(np.random.normal(10.5, 0.5, n))
    region = np.random.choice(
        ["Urban", "Suburban", "Rural"], size=n, p=[0.40, 0.35, 0.25]
    )
    happiness = (
        50
        + 0.20 * (age - 45)
        + (education == "Graduate") * 8
        + (education == "Bachelor") * 4
        + (region == "Urban") * 3
        + np.log(income) * 2
        + np.random.normal(0, 5, n)
    )
    return pd.DataFrame({
        "id": np.arange(n).astype(str),
        "age": age,
        "gender": gender,
        "education": education,
        "income": income.round(2),
        "region": region,
        "happiness": happiness.round(2),
    })
def biased_sample(pop, n=2_000):
    score = (
        -0.04 * (pop["age"] - 30)
        + (pop["education"] == "Graduate") * 1.0
        + (pop["education"] == "Bachelor") * 0.6
        + (pop["region"] == "Urban") * 0.7
        - (pop["region"] == "Rural") * 0.5
    )
    p = 1 / (1 + np.exp(-score))
    p = p / p.sum()
    idx = np.random.choice(pop.index, size=n, replace=False, p=p)
    return pop.loc[idx].reset_index(drop=True)
target_df = simulate_population(50_000)
sample_df = biased_sample(target_df, 2_000)
target_for_balance = target_df.drop(columns=["happiness"])
print(f"Sample size : {len(sample_df):,}")
print(f"Target size : {len(target_for_balance):,}")
print(f"\nTRUE population mean happiness : {target_df['happiness'].mean():.2f}")
print(f"Naive sample mean happiness : {sample_df['happiness'].mean():.2f} <-- biased!")
We simulate a realistic population dataset with demographic and socioeconomic features along with an outcome variable. We then introduce sampling bias by preferentially selecting younger, more educated, and urban individuals to mimic real-world survey bias. Finally, we compare the naive sample mean to the true population mean to highlight the bias.
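The selection mechanism here is a logistic inclusion model, which is also why inverse-probability weighting can undo it. A self-contained toy (independent of the tutorial's data; the names and numbers below are illustrative) shows that weighting each selected unit by the inverse of its inclusion probability recovers the population mean — the catch being that this toy knows the true probabilities, whereas in practice they must be estimated, which is the job of the IPW adjustment:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
age = rng.normal(45, 17, n)
happiness = 50 + 0.2 * (age - 45) + rng.normal(0, 5, n)

# Logistic inclusion model: younger people are more likely to respond.
p = 1 / (1 + np.exp(0.04 * (age - 30)))
idx = rng.choice(n, size=2_000, replace=False, p=p / p.sum())

naive = happiness[idx].mean()                         # biased downward
ipw = np.average(happiness[idx], weights=1 / p[idx])  # inverse-probability weighted
truth = happiness.mean()
print(f"truth={truth:.2f}  naive={naive:.2f}  ipw={ipw:.2f}")
```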
sample = Sample.from_frame(
    sample_df, id_column="id", outcome_columns=["happiness"]
)
target = Sample.from_frame(target_for_balance, id_column="id")
sample_with_target = sample.set_target(target)
print("\n--- Sample object ---")
print(sample_with_target)
print("\n" + "=" * 60)
print(" PRE-ADJUSTMENT DIAGNOSTICS")
print("=" * 60)
asmd_before = sample_with_target.covars().asmd()
print("\nASMD (Absolute Standardized Mean Difference) — lower = better balance")
print("Rule of thumb: |ASMD| > 0.10 indicates meaningful imbalance.")
print(asmd_before.T.round(3))
print("\nMean of covariates (sample vs target):")
print(sample_with_target.covars().mean().T.round(3))
We convert both the biased sample and the target population into structured Sample objects for processing. We compute pre-adjustment diagnostics, such as ASMD and covariate means, to quantify the imbalance between the sample and the target. This step shows clearly how far the sample deviates before we apply any correction.
print("\n" + "=" * 60)
print(" FITTING WEIGHTS — 4 METHODS")
print("=" * 60)
print("\n>>> [1/4] IPW with LASSO logistic regression")
adjusted_ipw = sample_with_target.adjust(method="ipw")
print(adjusted_ipw.summary())
print("\n>>> [2/4] CBPS — Covariate Balancing Propensity Score")
try:
    adjusted_cbps = sample_with_target.adjust(method="cbps")
    print(adjusted_cbps.summary())
except Exception as e:
    print("CBPS failed (skipping):", e)
    adjusted_cbps = None
print("\n>>> [3/4] Raking (iterative proportional fitting)")
adjusted_rake = sample_with_target.adjust(method="rake")
print(adjusted_rake.summary())
print("\n>>> [4/4] Post-stratification (categoricals only)")
cat_cols = ["id", "gender", "education", "region"]
sample_cat = Sample.from_frame(
    sample_df[cat_cols + ["happiness"]],
    id_column="id", outcome_columns=["happiness"],
)
target_cat = Sample.from_frame(target_for_balance[cat_cols], id_column="id")
adjusted_post = sample_cat.set_target(target_cat).adjust(method="poststratify")
print(adjusted_post.summary())
print("\n" + "=" * 60)
print(" METHOD COMPARISON")
print("=" * 60)
methods = {
    "IPW": adjusted_ipw,
    "CBPS": adjusted_cbps,
    "Rake": adjusted_rake,
    "PostStrat": adjusted_post,
}
def safe_mean_asmd(asmd_df, prefer="self"):
    """Mean ASMD across covariates from a balance asmd DataFrame."""
    row = prefer if prefer in asmd_df.index else asmd_df.index[0]
    if "mean(asmd)" in asmd_df.columns:
        return float(asmd_df.loc[row, "mean(asmd)"])
    return float(asmd_df.loc[row].mean())
asmd_means = {"Unadjusted": safe_mean_asmd(asmd_before)}
outcome_means = {"Naive sample": float(sample_df["happiness"].mean())}
deff_vals = {}
for name, m in methods.items():
    if m is None:
        continue
    asmd_means[name] = safe_mean_asmd(m.covars().asmd(), prefer="self")
    outcome_means[name] = float(m.outcomes().mean()["happiness"].iloc[0])
    w = m.to_df()["weight"].values
    deff_vals[name] = (w.sum() ** 2) / (len(w) * np.sum(w ** 2))
outcome_means["TRUE pop"] = float(target_df["happiness"].mean())
print("\nMean ASMD across covariates (lower = better balance):")
for k, v in asmd_means.items():
    print(f" {k:14s}: {v:.4f}")
print("\nWeighted estimate of mean happiness:")
for k, v in outcome_means.items():
    print(f" {k:14s}: {v:.3f}")
print("\nKish's effective sample-size ratio (1.0 = no information loss):")
for k, v in deff_vals.items():
    print(f" {k:14s}: {v:.3f} (n_eff ≈ {int(v * len(sample_df))})")
We apply four different weighting techniques (IPW, CBPS, raking, and post-stratification) to adjust the biased sample. We evaluate each method using balance metrics, outcome estimates, and effective-sample-size calculations. This comparison lets us see how the different techniques trade off bias reduction against variance.
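The Kish ratio used in the comparison loop depends only on the weights themselves. As a standalone check of the formula, n_eff / n = (Σw)² / (n · Σw²):

```python
import numpy as np

def kish_ratio(w):
    """Kish's effective-sample-size ratio: n_eff / n = (sum w)^2 / (n * sum w^2)."""
    w = np.asarray(w, dtype=float)
    return (w.sum() ** 2) / (len(w) * np.sum(w ** 2))

print(kish_ratio([1, 1, 1, 1]))            # equal weights: 1.0, no information loss
print(round(kish_ratio([1, 1, 1, 5]), 3))  # one large weight shrinks n_eff
```

Equal weights give a ratio of 1.0; the more the weights spread out, the smaller the effective sample behind the weighted estimate.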
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
colors_a = ["gray", "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"][: len(asmd_means)]
axes[0, 0].bar(list(asmd_means.keys()), list(asmd_means.values()), color=colors_a)
axes[0, 0].axhline(0.1, ls="--", color="red", label="0.10 imbalance threshold")
axes[0, 0].set_title("Mean ASMD across covariates")
axes[0, 0].set_ylabel("Mean ASMD"); axes[0, 0].legend()
axes[0, 0].tick_params(axis="x", rotation=20)
truth = target_df["happiness"].mean()
colors_b = ["#888"] + ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"][: len(methods)] + ["black"]
axes[0, 1].bar(list(outcome_means.keys()), list(outcome_means.values()),
               color=colors_b[: len(outcome_means)])
axes[0, 1].axhline(truth, ls="--", color="black", label=f"truth = {truth:.2f}")
axes[0, 1].set_title("Estimated mean happiness vs ground truth")
axes[0, 1].set_ylabel("Mean happiness"); axes[0, 1].legend()
axes[0, 1].tick_params(axis="x", rotation=20)
w_ipw = adjusted_ipw.to_df()["weight"].values
axes[1, 0].hist(w_ipw, bins=40, color="steelblue", edgecolor="white")
axes[1, 0].set_title(
    f"IPW weight distribution\n"
    f"min={w_ipw.min():.2f} median={np.median(w_ipw):.2f} max={w_ipw.max():.2f}"
)
axes[1, 0].set_xlabel("weight"); axes[1, 0].set_ylabel("count")
ages = sample_df["age"].values
bins = np.linspace(18, 90, 31)
axes[1, 1].hist(target_df["age"], bins=bins, density=True, alpha=0.45,
                color="green", label="Target (truth)")
axes[1, 1].hist(ages, bins=bins, density=True, alpha=0.45,
                color="red", label="Sample (biased)")
axes[1, 1].hist(ages, bins=bins, density=True, alpha=0.45,
                color="blue", weights=w_ipw, label="Sample (IPW-weighted)")
axes[1, 1].set_title("Age distribution: bias correction by IPW")
axes[1, 1].set_xlabel("Age"); axes[1, 1].set_ylabel("density"); axes[1, 1].legend()
plt.tight_layout()
plt.savefig("balance_diagnostics.png", dpi=110, bbox_inches="tight")
plt.show()
print("\n" + "=" * 60)
print(" ADVANCED — controlling variance with max_de")
print("=" * 60)
print("max_de=1.5 trims extreme weights so the design effect stays ≤ 1.5,")
print("trading a little bias for tighter confidence intervals.\n")
adjusted_trim = sample_with_target.adjust(method="ipw", max_de=1.5)
print(adjusted_trim.summary())
out = adjusted_ipw.to_df()
out.to_csv("balance_weighted_sample.csv", index=False)
print("\nSaved weighted sample → balance_weighted_sample.csv")
print("Saved diagnostics plot → balance_diagnostics.png")
print("\nFirst 5 rows of weighted output:")
print(out.head())
err_naive = abs(sample_df["happiness"].mean() - truth)
err_ipw = abs(outcome_means["IPW"] - truth)
print("\n" + "=" * 60)
print(" BIAS REDUCTION SUMMARY")
print("=" * 60)
print(f"Naive estimator error : {err_naive:.3f}")
print(f"IPW estimator error : {err_ipw:.3f}")
print(f"Bias reduction : {(1 - err_ipw / max(err_naive, 1e-9)) * 100:.1f}%")
We visualize the results with plots of ASMD, outcome estimates, weight distributions, and feature alignment. We also explore variance control with trimmed weights and save the final weighted dataset for downstream use. Finally, we compute bias-reduction metrics to confirm how much the adjustment improves estimation accuracy.
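To build intuition for what a design-effect cap like max_de=1.5 is doing, here is a rough sketch of one way to enforce it (this is an illustration of the idea, not balance's actual algorithm): binary-search a cap on the weights until Kish's design effect, n · Σw² / (Σw)², drops to the target.

```python
import numpy as np

def trim_weights(w, max_de=1.5, iters=60):
    """Cap the largest weights so the Kish design effect is at most max_de.
    Illustrative only; balance's max_de option uses its own procedure."""
    w = np.asarray(w, dtype=float)
    def deff(x):
        return len(x) * np.sum(x ** 2) / x.sum() ** 2
    if deff(w) <= max_de:
        return w
    # deff is monotone in the cap: lower cap -> more equal weights -> lower deff.
    lo, hi = w.min(), w.max()
    for _ in range(iters):
        cap = (lo + hi) / 2
        if deff(np.minimum(w, cap)) > max_de:
            hi = cap
        else:
            lo = cap
    return np.minimum(w, lo)

rng = np.random.default_rng(1)
w = np.exp(rng.normal(0, 1, 2000))  # heavy-tailed, lognormal-like weights
wt = trim_weights(w, max_de=1.5)
print(f"design effect before: {len(w) * (w**2).sum() / w.sum()**2:.2f}")
print(f"design effect after : {len(wt) * (wt**2).sum() / wt.sum()**2:.2f}")
```

Capping the tail pulls the largest weights down, which biases the estimate slightly but shrinks its variance — the same trade-off the tutorial's max_de run makes.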
In conclusion, we saw how re-weighting methods can significantly reduce bias and bring sample estimates much closer to the true population values. We compared multiple adjustment techniques and examined the trade-offs between bias reduction and variance, particularly when handling extreme weights. Using the balance framework, we built a reproducible pipeline that not only corrects for selection bias but also provides clear diagnostics and interpretability. This workflow equips us with practical tools to handle real-world biased datasets, enabling more reliable inference and decision-making in survey analysis and observational studies.
The post A Coding Guide to Survey Bias Correction Using Facebook Research Balance with IPW, CBPS, Raking, and Post-Stratification Methods appeared first on MarkTechPost.
