Intuition Behind the Doubly Robust Estimator 🌱

The Inverse Probability Weighted Estimator

  • Rebalancing the data with inverse propensity weights is necessary so that the estimate isn't dominated by outcomes from whichever arm (treated or untreated) is overrepresented, which would otherwise inflate or deflate the estimated treatment effect:

$$\hat{\Delta}^{IPW} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{A_i Y_i}{\pi(X_i; \hat{\gamma})} - \frac{(1-A_i)Y_i}{1-\pi(X_i; \hat{\gamma})}\right]$$
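As a sanity check, here is a minimal sketch of the IPW estimator on synthetic data. The data-generating process is made up for illustration, and the true propensities are used directly; in practice $\pi(X_i; \hat{\gamma})$ would be fit, e.g. with logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: one confounder X drives both treatment assignment A
# and outcome Y; the true average treatment effect is 2.0.
X = rng.normal(size=n)
pi_true = 1 / (1 + np.exp(-X))               # true propensity pi(X)
A = rng.binomial(1, pi_true)
Y = 2.0 * A + 3.0 * X + rng.normal(size=n)

# Naive difference in means is confounded by X.
naive = Y[A == 1].mean() - Y[A == 0].mean()

# IPW estimate: each unit is upweighted by the inverse probability of
# receiving the treatment arm it actually got.
ipw = np.mean(A * Y / pi_true - (1 - A) * Y / (1 - pi_true))
print(naive, ipw)  # naive is biased well above 2.0; ipw lands near 2.0
```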

Doubly Robust Estimator

$$\hat{\Delta}^{DR} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{A_i Y_i}{\pi(X_i; \hat{\gamma})} - \frac{(1-A_i)Y_i}{1-\pi(X_i; \hat{\gamma})} - \frac{A_i - \pi(X_i; \hat{\gamma})}{\pi(X_i; \hat{\gamma})}\,\mu(1, X_i; \hat{\alpha}) - \frac{A_i - \pi(X_i; \hat{\gamma})}{1-\pi(X_i; \hat{\gamma})}\,\mu(0, X_i; \hat{\alpha})\right]$$
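A sketch of this formula on the same kind of synthetic setup as above, also illustrating why it is called "doubly" robust: with correct outcome models, even a grossly wrong propensity still recovers the true effect (the data-generating process is hypothetical, and the outcome models are set to the true conditional means for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical setup: confounder X drives treatment and outcome; true ATE = 2.0.
X = rng.normal(size=n)
pi = 1 / (1 + np.exp(-X))                    # true propensity pi(X; gamma)
A = rng.binomial(1, pi)
Y = 2.0 * A + 3.0 * X + rng.normal(size=n)

mu1 = 2.0 + 3.0 * X                          # outcome model mu(1, X; alpha), correct here
mu0 = 3.0 * X                                # outcome model mu(0, X; alpha), correct here

def dr_estimate(pi_hat):
    """The DR formula above, with a given propensity estimate."""
    return np.mean(
        A * Y / pi_hat
        - (1 - A) * Y / (1 - pi_hat)
        - (A - pi_hat) / pi_hat * mu1
        - (A - pi_hat) / (1 - pi_hat) * mu0
    )

dr = dr_estimate(pi)                         # correct propensity
dr_wrong_pi = dr_estimate(np.full(n, 0.5))   # wrong propensity, correct outcome models
print(dr, dr_wrong_pi)                       # both close to 2.0
```

The symmetric case also holds: a correct propensity model rescues misspecified outcome models. Only one of the two needs to be right.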

When Treatment = 1

$$\tau_i^{DR} = \left[\frac{Y_i^1}{e(X_i, \hat{\beta})} - \frac{1-e(X_i, \hat{\beta})}{e(X_i, \hat{\beta})}\,m_1(X_i, \hat{\alpha}_1)\right] - m_0(X_i, \hat{\alpha}_0)$$

$$= \left[m_1(X_i, \hat{\alpha}_1) - m_0(X_i, \hat{\alpha}_0)\right] + \frac{Y_i^1 - m_1(X_i, \hat{\alpha}_1)}{e(X_i, \hat{\beta})}$$
  • The left part of the equation, $m_1 - m_0$, is the treatment effect implied by the outcome models' expected values
  • The right part weights in the observed outcome $Y_i^1$: if $Y_i^1$ is lower than the expected outcome $m_1$, the estimate is adjusted downward (and if $Y_i^1$ is higher than expected, the estimated effect is adjusted upward)
    • This adjustment is scaled by the propensity score. If datapoints with one specific configuration of covariates are hardly ever treated (i.e. low propensity score), then $\frac{Y_i^1 - m_1(X_i, \hat{\alpha}_1)}{e(X_i, \hat{\beta})}$ will be larger, so the $m_1$ model will have less weight and the actual outcome $Y_i^1$ will have more impact on the estimated treatment effect for datapoints with those covariates.
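A quick numeric check of the two forms above for a single treated datapoint. All the numbers ($Y_i^1$, $m_1$, $m_0$, $e$) are made up for illustration.

```python
# Hypothetical numbers for one treated datapoint (A_i = 1).
Y1 = 7.0            # observed outcome under treatment
m1, m0 = 9.0, 4.0   # outcome-model predictions m_1(X_i), m_0(X_i)
e = 0.25            # low propensity: points like this are rarely treated

# First form: reweighted observed outcome, minus reweighted m1, minus m0.
tau_a = (Y1 / e - (1 - e) / e * m1) - m0

# Second form: model-based effect plus the propensity-weighted residual.
tau_b = (m1 - m0) + (Y1 - m1) / e

print(tau_a, tau_b)  # both -3.0
```

Note how the low propensity amplifies the residual: $Y_i^1$ falls 2 below $m_1$, and dividing by $e = 0.25$ turns that into a correction of $-8$, flipping the model-based effect of $+5$ to $-3$.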

When Treatment = 0

$$\tau_i^{DR} = m_1(X_i, \hat{\alpha}_1) - \frac{Y_i^0 + m_0(X_i, \hat{\alpha}_0)\left(1-e(X_i, \hat{\beta})\right) - m_0(X_i, \hat{\alpha}_0)}{1-e(X_i, \hat{\beta})}$$

$$= \left[m_1(X_i, \hat{\alpha}_1) - m_0(X_i, \hat{\alpha}_0)\right] - \frac{Y_i^0 - m_0(X_i, \hat{\alpha}_0)}{1-e(X_i, \hat{\beta})}$$
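And the same kind of check for a single untreated datapoint, again with made-up numbers; here a high propensity (small $1 - e$) is what amplifies the residual.

```python
# Hypothetical numbers for one untreated datapoint (A_i = 0).
Y0 = 5.0            # observed outcome under no treatment
m1, m0 = 9.0, 4.0   # outcome-model predictions m_1(X_i), m_0(X_i)
e = 0.75            # high propensity: points like this are usually treated

# First form: m1 minus the augmented estimate of the untreated mean.
tau_a = m1 - (Y0 + m0 * (1 - e) - m0) / (1 - e)

# Second form: model-based effect minus the propensity-weighted residual.
tau_b = (m1 - m0) - (Y0 - m0) / (1 - e)

print(tau_a, tau_b)  # both 1.0
```

$Y_i^0$ comes in 1 above $m_0$, and dividing by $1 - e = 0.25$ turns that into a correction of $-4$, shrinking the model-based effect of $+5$ down to $+1$.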


Imagine that we have a feature with the following properties:

  • high values of the covariate are almost always treated
  • datapoints with high values of the covariate always (for both treated and non-treated groups) have high values in the outcome variable.
  • the covariate has no effect on the treatment effect

  • If we were to use just the $m_1$ and $m_0$ models (i.e. the t-learner) for estimating the ITEs, this feature would likely be picked out as a useful predictor of high values in the $m_1$ model, whereas the $m_0$ model probably wouldn't use it at all, since we observe very few untreated datapoints with high values of it. The t-learner would therefore deem this feature to have a significant influence on the treatment effect, even though it has none.
  • If we use the doubly robust $\tau^{DR}$ estimator, the influence of the outcome models is weighted by the propensity score. Suppose we have an untreated datapoint with a high value of this feature. Its propensity $e(X_i, \hat{\beta})$ is high, so the weight $\frac{1}{1-e(X_i, \hat{\beta})}$ on the residual $Y_i^0 - m_0(X_i, \hat{\alpha}_0)$ is large: the observed outcome $Y_i^0$ dominates the poorly fit $m_0$ model. Since $Y_i^0$ is high (the feature raises the outcome regardless of treatment), the correction pulls the estimate down, and we correctly conclude that the treatment effect for this datapoint isn't actually that high. This demonstrates how the doubly robust estimator can handle measured confounders like the feature in this example.
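The story in these bullets can be simulated. This is a sketch with a made-up data-generating process (true effect of 1.0 for everyone) and a deliberately misspecified $m_0$ that ignores the feature, comparing the t-learner against the per-unit DR scores averaged over the high-feature units.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical data matching the three bullets: X=1 ("high value") is almost
# always treated, X raises the outcome in both arms, and the true treatment
# effect is 1.0 for everyone (X has no effect on the effect itself).
X = rng.binomial(1, 0.5, size=n).astype(float)
e = np.where(X == 1, 0.95, 0.5)              # propensity: high X almost always treated
A = rng.binomial(1, e)
Y = 1.0 * A + 5.0 * X + rng.normal(size=n)

# m1 fit on treated units (correct); m0 deliberately ignores X, mimicking a
# t-learner that never picks up the feature from the few untreated high-X points.
m1 = np.where(X == 1, Y[(A == 1) & (X == 1)].mean(), Y[(A == 1) & (X == 0)].mean())
m0 = np.full(n, Y[A == 0].mean())

# Per-unit DR scores (the general formula above), then average over high-X units.
tau_dr_i = A * Y / e - (1 - A) * Y / (1 - e) - (A - e) / e * m1 - (A - e) / (1 - e) * m0
hi = X == 1
tau_t = (m1 - m0)[hi].mean()      # t-learner: inflated by the confounded m0
tau_dr = tau_dr_i[hi].mean()      # DR: propensity-weighted residuals fix it
print(tau_t, tau_dr)              # tau_t far above 1.0; tau_dr near 1.0
```

One caveat: the individual DR scores are very noisy (an untreated high-X unit gets a large negative correction). It is the average over all high-X units, treated and untreated together, that lands near the truth, not any single datapoint's score.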

# IMPORTANT TAKEAWAY

  • Actually, if we use this estimator, the treated datapoints with the feature would get high ITE estimates, but the untreated datapoints would get low ITE estimates
    • In that case it would be more useful to do matching between nontreated and treated units
    • Maybe we should only consider the feature important where both nontreated and treated datapoints with it have high ITE estimates

Interesting alternative perspective on propensity and regression

http://www.amitsharma.in/post/doubly-robust-estimation-a-simple-guide/

With messy data from the real world, it is anybody's guess whether the data is missing at random, or what the correct probabilities of omission are.
