Revisiting the instrumental variables strategy for testing AGP -> GD causation

Autogynephilia correlates with cross-gender ideation, gender dysphoria, and other gender issues. Usually Blanchardians attribute this to autogynephilia causing gender issues, but critics point out that correlation!=causation, and often argue that it is instead gender issues that cause autogynephilia, because someone who wants to be a woman would also want to engage in sexual activities as a woman and such.

A while ago, I had the idea that we could test the causal relationship between AGP and GD by looking at people who are more or less kinky. Specifically, the idea was that while some could imagine that wanting to be a woman would cause autogynephilia, it wouldn’t make much sense for it to cause kinkiness in general. Therefore, if we observe a correlation between kinkiness and gender issues, it would make most sense for this to be due to a kinkiness -> AGP -> GD effect, and therefore it would support an AGP -> GD causality. I found such an association, and therefore concluded that there was support for the AGP -> GD effect.

Shortly after I wrote the post, Michael Bailey sent me an email criticizing it by pointing out that applying instrumental variables in this way can be problematic, linking to a paper where he made the critique in detail. Which in retrospect is pretty obvious; I even emphasized these sorts of problems in my blog post, but perhaps I didn’t take them seriously enough, considering that I did still attempt to do this.

I think I’ve come up with a way to fix the method, and test the AGP -> GD effect in a much more solid way. This blog post intends to give an introduction to this concept; I still need more data before I can definitely test it, but I can use the previous data as an illustration.

Empirical causal inference in science 101

The main point of doing research is to uncover causal relationships. A common problem in science is that you’ve got two variables X and Y (in this case, AGP and GD), and you want to figure out the causal effect of X on Y. To solve this problem, a broad range of methods have been developed. Enumerating them all can be daunting, but luckily they mostly tend to follow a pretty consistent formula: To identify the effect of X on Y, you isolate some cause of X and look at how Y varies as this cause varies. So for instance, when you do a randomized controlled experiment, the cause of X that you isolate is your experiment, and then you look at how Y varies from your control group to your experimental group.
Most forms of quantitative causal inference between variables X and Y involve finding some cause Xc of X that doesn’t suffer from problems due to confounding or reverse causation. See the blog post for details.

The core assumption that this method makes is that the cause you isolate is not correlated with the outcome of interest, other than via its effect on X. Putting the case of autogynephilia and gender dysphoria into this framework, my strategy was to isolate general kinkiness as a cause of autogynephilia, and then look at how gender dysphoria varies between non-kinky and highly kinky people. But one could easily question whether the assumption holds here; for instance, you might suspect that people who are more sexually open-minded are both more kinky and more likely to want to be the opposite sex. Or really, lots of other things.

In particular, part of the problem is that “kinkiness” is a particularly difficult sort of variable to use for this approach. If I take the average interest across a wide range of sexual interests, then the variable I am measuring is “whatever things contribute to a wide range of sexual interests”. This is a pretty unbounded category of causes; while I have trouble thinking of any one single thing that would go into it (libido maybe?), it also seems unlikely that this is definitely going to be unconfounded. My plan after writing the blog post was to start investigating these sorts of hypotheses, searching for confounders and adjusting for them. But ultimately the problem is that you only need a very tiny violation of the assumptions to get wrong results, and therefore this is not a viable strategy.
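To make the "tiny violation" point concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: a toy linear model where Z stands in for kinkiness, X for AGP and Y for GD, an instrument strength of 0.4, and a hypothetical `leak` parameter for a direct Z -> Y path that bypasses X. The true effect of X on Y is set to zero, so anything the estimate reports is bias.

```python
import random

def cov(xs, ys):
    """Population covariance of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def iv_estimate(leak, strength=0.4, true_effect=0.0, n=100_000, seed=0):
    """Instrumental-variable estimate cov(Z, Y) / cov(Z, X) in a toy model.

    `leak` is a hypothetical direct Z -> Y path that violates the
    core assumption; `strength` is the Z -> X effect.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [strength * zi + rng.gauss(0, 1) for zi in z]
    y = [true_effect * xi + leak * zi + rng.gauss(0, 1)
         for xi, zi in zip(x, z)]
    return cov(z, y) / cov(z, x)

no_leak = iv_estimate(leak=0.0)
small_leak = iv_estimate(leak=0.05)
print(f"assumption holds: estimated effect {no_leak:+.3f}")
print(f"leak of 0.05:     estimated effect {small_leak:+.3f}")
print("theoretical bias = leak / strength =", 0.05 / 0.4)
```

In this toy setup the bias is the leak divided by the instrument strength, so the weaker the link between the isolated cause and X, the more a small hidden path gets amplified.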

This is a general problem with figuring out the AGP <-> GD causality

I investigated the causal direction using general kinkiness as a root cause, but there are other attempts to figure out AGP <-> GD causality that fall into the same general category, and which encounter the same problems.

Consider for instance time as a cause of autogynephilia. Kids are, for complicated evolutionary reasons, not very sexual, with libido instead firing up at puberty. As such, Blanchardians might want to use the contrast between childhood gender issues and adulthood gender issues as a measure of the contribution of autogynephilia.1 This can be critiqued in a lot of ways, but perhaps the best critique is to point out that it’s far from obvious that this is unconfounded. Puberty is also a time where a lot of sexual differentiation happens, and where gender-related topics become relevant in new and different ways, so it’s very far from obvious that this is an unconfounded measure of the effect of AGP.

Another example involves relationship status. An AGP researcher I’ve talked to argued that you could use the differences in autogynephilia and gender issues between times where an autogynephile is single and times where the autogynephile is in a romantic relationship to estimate the effect of autogynephilia on gender issues.2 The idea is that some autogynephiles feel that they are more autogynephilic when they don’t have a girlfriend. Leaving aside the issue that I am kinda skeptical of the effect of relationship status on autogynephilia, it seems far from obvious to me that relationship status doesn’t influence gender issues through other means. It seems to definitely influence the pros and cons of transitioning, and it seems like someone who has more opportunity to transition would also have a greater interest in doing so. Which makes relationship status an invalid variable to use to estimate these things.

I think the problem pops up all the time in these debates: HRT, random variation in GD over time, shifts in GD when seeing or thinking about sexy women, etc. Almost all the back-and-forth arguing about the validity of AGP models comes down to the issue that we’re trying to parse out causality from a bunch of proxy related variables, without having a definite idea of how these variables function.

It is worth saying that the problem is not that we know some specific confounding variable that makes the tests invalid. Rather, the bigger problem is that we have no idea how the variables are related, so there could easily be tons of confounders and unintended mediators that we don’t understand. These sorts of methods shouldn’t be taken lightly, with all the arguments mindlessly thrown at the wall to see what sticks. Rather, we need to take a step back and identify some more well-justified method for studying this.

May be an image of outdoors and text that says "ENDOGENEITY Me adding more controls to my regression"

Recently, I decided that this whole class of methods was inherently flawed for investigating things, and looked into alternate methods of causal inference, most notably analogy-based reasoning. For instance, one such argument would be “we know autogynephilia is a sexual interest, and sexual interests cause desires, rather than being caused by the desires”. But these alternate methods have their own new and exciting difficulties to struggle with, so I haven’t been able to do anything definite with them. But as I mentioned in the beginning of the post, I’ve come up with a way to fix the standard approach for causal inference, so let’s get around to this.

Maybe we should just investigate how our causes work

So back to the matter at hand. We want to know the effect of autogynephilia on gender dysphoria. So to do this, we look at the causes of autogynephilia, and identify general paraphilic tendencies as a cause. But the problem is, we don’t know how general paraphilic tendencies work, so maybe they have some hidden correlation with gender dysphoria (e.g. via sexual openmindedness) that makes our tests invalid.

The problem illustrated diagrammatically. Each node represents a variable, and the arrows represent causal effects while the lines represent unknown effects. GFP refers to the general kinkiness variable that we estimate by asking people about a bunch of unrelated paraphilias. ??? refers to hidden confounders that may make our analysis invalid.

In fact, if we knew the strength of the hidden correlation, we could just subtract it off in order to make our tests valid again. There are some asterisks here that should be taken into account, but I think it’s at least a promising path forward. But that raises the question, how do we figure out the hidden correlation between general paraphilia and gender issues?

The obvious way to figure out whether this is the case would be to correlate general paraphilic tendencies with gender issues. If there is some sort of connection between them, then that connection should show up as a correlation between the two. But of course the problem is, the connections between paraphilias and GD also include the kinkiness -> AGP -> GD connection, which is precisely the connection that we want to estimate. We would end up subtracting the correlation from itself, yielding zero no matter what.

So is there some way that we can figure out the kinkiness <-> GD correlation, minus the kinkiness -> AGP -> GD path? Here’s my idea: Just look at the correlation between kinkiness and GD in non-AGP men. If the men aren’t AGP, then the kinkiness -> AGP -> GD path cannot be in play. Next, subtract this off from the correlation between kinkiness and GD overall, and you get your causal estimate.

Rather than investigate the associations among all men, we can simply investigate the associations among non-AGP men. This doesn’t include the kink -> AGP -> GD path, allowing us to investigate potential confounders.


I figured this method out a while ago now, and I had actually intended to do a separate survey to collect new data to test it. But then I started getting distracted, and I figured, hey, I’ve got the previous porn survey that I originally tested this method in, I might as well try it again on that data. Later we will discuss some reasons why this survey isn’t ideal, but it seems like a reasonable starting point.

So a bit of background, the dataset I’m going to analyze comes from a survey I posted to /r/SampleSize, titled “[Casual] Can you look at some porn For Science? Survey #5 (18+) NSFW”. In the survey, I showed people various erotic images containing men and women doing various erotic things. In addition to this, I also asked a number of questions, including questions about sexual interests and gender issues. I got about 1000 male responses, making it quite a large sample size. Which is good, because this method is incredibly data-intensive.

To measure general paraphilia, I had some items measuring sexual interests by asking about arousal on a rating scale from “Not at all” to “Very”. I took the average response to how aroused the participants said they would get by the following themes (alpha=0.52):

  • Being tied up by your partner
  • Exposing my genitals to an unsuspecting stranger
  • Watching a video of yourself masturbating
  • Having an older sexual partner take on a dominant parent-like role in the relationship
  • Imagining having sex with an anthropomorphic animal (furry)
  • Caressing your partner’s feet

To measure autogynephilia, I took the average response to how aroused the participants said they would get by the following themes (alpha=0.81):

  • Imagining being the opposite sex
  • Wearing clothes typically associated with the opposite sex (crossdressing)
  • Picturing a beautiful woman and imagining being her
  • Wearing sexy panties and bras
  • Imagining being hyperfeminized, i.e. turned into a sexy woman with exaggeratedly large breasts and wide hips

Those who answered “Not at all” to all of the above were categorized as non-AGP (n=316), while the remainder were classified as AGP (n=828).
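As a sketch of the scoring just described: each scale score is the mean of its items, and a respondent counts as non-AGP only if every AGP item was answered “Not at all”. The numeric coding (0 for “Not at all” up to 4 for “Very”) and the function names are assumptions for illustration; the survey's actual coding isn't stated here.

```python
from statistics import mean

def scale_score(item_responses):
    """Average of one respondent's item ratings (0 = "Not at all")."""
    return mean(item_responses)

def is_non_agp(agp_items):
    """True if every AGP item was answered "Not at all" (coded 0 here)."""
    return all(r == 0 for r in agp_items)

# Two made-up respondents, five AGP items each.
respondent_a = [0, 0, 0, 0, 0]
respondent_b = [0, 1, 0, 3, 0]

for name, items in [("a", respondent_a), ("b", respondent_b)]:
    group = "non-AGP" if is_non_agp(items) else "AGP"
    print(name, scale_score(items), group)
```

Note that under this rule a single slightly-above-zero answer on any item is enough to land in the AGP group, which is part of why the AGP group ends up so much larger than the non-AGP group.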

To measure gender dysphoria, I had some items that asked about how masculine/feminine the participants were, with a rating scale going from “Disagree Strongly” to “Agree Strongly”. Among those, I used the following two to assess gender issues (alpha=0.61):

  • As a child I wanted to be the opposite sex
  • I feel I would be better off if I was the opposite sex
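The alpha values quoted for the three scales are presumably Cronbach's alpha (that is an assumption; the text only says “alpha”). A minimal sketch of how it is computed from a respondents-by-items table, using two tiny made-up datasets rather than the survey data:

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for `rows`: one list of k item scores per respondent."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Made-up data: two items that always agree, and two unrelated items.
identical_items = [[1, 1], [2, 2], [3, 3], [5, 5]]
unrelated_items = [[0, 0], [0, 1], [1, 0], [1, 1]]

alpha_identical = cronbach_alpha(identical_items)
alpha_unrelated = cronbach_alpha(unrelated_items)
print(alpha_identical, alpha_unrelated)
```

Alpha rises with both the number of items and how strongly they correlate, so a two-item scale or a deliberately heterogeneous one (like the paraphilia list above) will tend to score low.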

Among non-AGPs, the correlation between GFP and GD was 0.02 (with a standard error of 0.06 according to bootstrap). This could be taken to indicate that there was no confounding between GFP and GD at all, though make sure to read the rest of the blog post to see an asterisk with this interpretation. Among AGPs, the correlation between GFP and GD was 0.15 (SE 0.03). Therefore, subtracting them yielded a correlation of 0.13 (SE 0.07).
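For readers who want to see what “standard error according to bootstrap” involves, here is a rough sketch. The data are simulated placeholders (316 made-up respondents with no true association), not the survey responses, and the number of resamples is an arbitrary choice. The last lines combine the two reported standard errors under the assumption that the two subgroup estimates are independent.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_se(xs, ys, n_boot=500, seed=0):
    """Standard deviation of the correlation over resampled respondents."""
    rng = random.Random(seed)
    n = len(xs)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    m = sum(draws) / n_boot
    return math.sqrt(sum((d - m) ** 2 for d in draws) / (n_boot - 1))

# Placeholder data standing in for GFP and GD scores of 316 respondents.
rng = random.Random(1)
gfp = [rng.gauss(0, 1) for _ in range(316)]
gd = [rng.gauss(0, 1) for _ in range(316)]
se = bootstrap_se(gfp, gd)
print(f"r = {pearson(gfp, gd):+.2f}, bootstrap SE = {se:.2f}")

# SE of the difference of two independent estimates, from the reported SEs.
se_diff = math.sqrt(0.06 ** 2 + 0.03 ** 2)
print(f"SE of the difference = {se_diff:.2f}")
```

For a correlation near zero the standard error is roughly 1/sqrt(n), which is why the smaller non-AGP group contributes most of the uncertainty in the difference.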

This 0.13 number is pretty low, but it is the value for the GFP -> AGP -> GD path, not for the AGP -> GD step of it. To get the latter, I divide out by the GFP <-> AGP correlation among AGP men. This is a correlation of 0.37 (SE 0.03), yielding a value of 0.35 (SE 0.2) as the causal effect of autogynephilia on gender issues among autogynephiles.

This effect is technically not an effect for the whole sample, but instead only among the subset that are autogynephilic. I can assume it simply linearly extrapolates to the entire sample, in which case I get a total effect of 0.36 (SE 0.2). If I subtract this off from the original correlation of 0.45 (SE 0.03) between autogynephilia and gender issues, that leaves an effect of 0.1 (SE 0.2) that isn’t explained by the AGP -> GD effect. So this examination indicates that 80% of the correlation between autogynephilia and gender issues is causal AGP -> GD.
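The chain of arithmetic in the last few paragraphs can be retraced step by step. The inputs below are the correlations reported in the text; the standard error line is a crude assumption of mine that ignores the uncertainty in the denominator.

```python
# Correlations reported in the text.
r_gfp_gd_agp = 0.15      # GFP-GD correlation among AGP men
r_gfp_gd_non_agp = 0.02  # GFP-GD correlation among non-AGP men
r_gfp_agp = 0.37         # GFP-AGP correlation among AGP men
r_agp_gd_total = 0.45    # raw AGP-GD correlation in the whole sample
extrapolated = 0.36      # reported effect after extrapolating to everyone

# Step 1: the GFP -> AGP -> GD path is the difference between the groups.
path = r_gfp_gd_agp - r_gfp_gd_non_agp

# Step 2: divide out the GFP -> AGP step to get the AGP -> GD step.
effect_among_agp = path / r_gfp_agp
se_effect = 0.07 / r_gfp_agp  # crude: treats the denominator as exact

# Step 3: compare with the raw AGP-GD correlation.
unexplained = r_agp_gd_total - extrapolated
share_causal = extrapolated / r_agp_gd_total

print(f"path = {path:.2f}")
print(f"effect among AGP men = {effect_among_agp:.2f} (SE about {se_effect:.2f})")
print(f"unexplained = {unexplained:.2f}, causal share = {share_causal:.0%}")
```

Dividing by 0.37 inflates the standard error by the same factor as the estimate, which is where most of the imprecision in the final number comes from.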


And here’s the bad news: the effect of 0.36 is not statistically significant. That’s not to say that it’s too “small” to be important or something like that. Rather, statistical significance is a technical term used to describe when the sample size is big enough that it would be hard for the result to have been achieved by chance, just from randomly picking people who happen to align with the theory. In order for a result to be statistically significant, it must be the case that if there were no effect, you’d only get results as extreme as that result 5% of the time. But that would require our effect to be greater than 0.4, which it is not.
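The threshold can be checked with a few lines, assuming a two-sided test at the 5% level and a normal approximation for the estimate (both are my assumptions about how the significance was judged):

```python
from statistics import NormalDist

def two_sided_p(estimate, se):
    """Two-sided p-value for `estimate` under a normal approximation."""
    z = abs(estimate) / se
    return 2 * (1 - NormalDist().cdf(z))

# Smallest effect that would reach p < 0.05 with a standard error of 0.2.
critical = NormalDist().inv_cdf(0.975) * 0.2
p_effect = two_sided_p(0.36, 0.2)

print(f"critical value = {critical:.2f}")
print(f"p-value for 0.36 (SE 0.2) = {p_effect:.3f}")
```

The same function works for any other estimate and standard error quoted in the post.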

The good news is that the remaining correlation of 0.1 also wasn’t statistically significant. It would have to exceed 0.38 to be significant, which it very much did not.

What this lack of significance means is that this survey isn’t the final step in the story. We need to collect more, bigger data. Compared to just going with the direct correlation, this method needs very large sample sizes. I would estimate that this method requires about 15x as many participants as the more straightforward methods, though it depends very much on the details.

We also need better data. The paraphilia and gender issues measures used in this survey were very low-quality. I’ve been working on better measures, but I could still use improvements. The autogynephilia measure is also kind of ad-hoc, and could benefit from more coherence and thought.

It may also help to get more controls. If we can better account for other factors that influence gender dysphoria, then that can let us estimate the effects more precisely for autogynephilia. It may also be that we can somehow combine this with my analogy-based methods to improve things.

It should also be noted that this method can be used for other things than autogynephilia theory too. For instance, it could likely be used to test the “autoandrophobia” theory that is often brought up by critics of autogynephilia. This theory is rarely explicated, but I did once talk with a trans woman who gave me her idea of it. In that variant, people end up with certain random things that they are disgusted by, similar to how people end up with certain random things that they find erotic; and if one then ends up finding having male traits to be disgusting, then that would cause gender dysphoria. This theory could be tested by replacing the general factor of paraphilia with a general factor of disgust sensitivity, and replacing autogynephilia with autoandrophobia.

Finally, let’s discuss the potential problems and assumptions with this method. This is going to get technical, so I guess be warned about that. After the discussion of problems.

Conditioning is not a counterfactual

This first point is kind of abstract, so let’s instead discuss my favorite statistical paradox, Berkson’s paradox. I like the examples given in this Twitter thread: Why are handsome men jerks? Why don’t standardized test scores predict university performance well? Why are movies based on good books usually bad? Why are smart students less athletic? Why do taller NBA players not perform better at basketball?
Stolen slide illustrating Berkson’s paradox. By selecting a subset of the population, you introduce a negative correlation between the variables you select on.

If we filter our sample on the basis of some set of variables, then that filtering introduces a ton of spurious correlations between all of the variables that are upstream of our filtering. The usual pattern will be negative correlations between the causes, but we might have other things going on, depending on the specific details.
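A small simulation sketch of the effect, with made-up ingredients: two independent standard-normal traits, and a filter that keeps only cases whose sum clears an arbitrary threshold of 1.0.

```python
import math
import random

def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)
# Two traits that are independent in the full population.
population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20_000)]
# Filter on a combination of both traits (hypothetical selection rule).
selected = [(a, b) for a, b in population if a + b > 1.0]

r_all = pearson(population)
r_selected = pearson(selected)
print(f"whole population: r = {r_all:+.2f}")
print(f"after filtering:  r = {r_selected:+.2f} (n = {len(selected)})")
```

Nothing causal connects the two traits here; the negative correlation among the selected cases comes entirely from the filter.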

So when we compute things for the non-AGP and AGP men separately, we may very well introduce some additional correlations that don’t correspond to anything real. How big of a problem is this? Lemme give you my threat model, to evaluate what happens.

Threat model: AGP merely reflects a kinky way to express gender feelings. The association between the GFP and GD is not due to GFP -> AGP -> GD, but instead due to some underlying common cause, e.g. sexual open-mindedness or something abstract like that.

The most common critique of the AGP->GD hypothesis is to claim that it makes more sense for there to be a GD->AGP effect. If we then filter for those who are not AGP, then that seems like it should lead to exactly the sorts of classical Berkson’s paradox effect that I’ve brought up here: You would only be included in the sample if you are not AGP, which you would be unlikely to be if you were both kinky and GD, so you’d have to either be neither kinky nor GD, be only kinky, or be only GD. Further, if you were only GD, then you would probably need to be less kinky than average to cancel it out, while if you were only kinky, then you would probably need to be less GD than average to cancel it out. So this could explain why we got a correlation of 0.02 between kinkiness and gender issues among non-AGPs; maybe the “true” correlation was higher, but it was masked by the filter effect.

So that seems like a problem. But this isn’t the only filtering we did. We also looked at the correlation between GFP and GD among AGP men, and subtracted the two correlations from each other. Thus, if the Berkson’s paradox effect is equally big for both of them, it should cancel out. Could that be the case? And if it isn’t the case, could we estimate the discrepancy and adjust for it?

Here’s one condition where it would be the case: All of the variables are normally distributed and linearly related, and when we filter for non-AGP men, we take the men who have below-average amounts of AGP, while when we filter for AGP men, we take the men who have above-average amounts of AGP. Because we’d then be filtering equally strongly when we took the below-average and above-average AGPs, it would exactly cancel out, and there would be nothing to be concerned about.
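Here is a simulation sketch of that condition under the threat model. All of it is a toy assumption: kinkiness and GD are independent standard normals, AGP liability is their sum plus noise (so GD causes AGP, and AGP causes nothing), and the sample is split at a cutoff on the liability. With a cutoff at the average, the spurious correlation is the same in both halves and cancels in the subtraction; with an uneven cutoff it does not.

```python
import math
import random

def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def split_correlations(cutoff, n=40_000, seed=0):
    """Kink-GD correlation below and above a cutoff on AGP liability."""
    rng = random.Random(seed)
    low, high = [], []
    for _ in range(n):
        kink, gd = rng.gauss(0, 1), rng.gauss(0, 1)
        liability = kink + gd + rng.gauss(0, 1)  # toy model, no AGP -> GD
        (high if liability > cutoff else low).append((kink, gd))
    return pearson(low), pearson(high)

even_low, even_high = split_correlations(cutoff=0.0)
uneven_low, uneven_high = split_correlations(cutoff=-1.0)
print(f"even split:   low {even_low:+.2f}, high {even_high:+.2f}, "
      f"difference {even_high - even_low:+.2f}")
print(f"uneven split: low {uneven_low:+.2f}, high {uneven_high:+.2f}, "
      f"difference {uneven_high - uneven_low:+.2f}")
```

A cutoff of -1.0 puts a bit under 30% of the simulated cases in the low group. In that case the leftover difference is pure selection artifact, yet the subtraction method would read it as a causal path.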

The problem with this condition is that it’s obviously wrong. For instance, the distribution of AGP looks like this:

That looks extremely non-normal to me.

But there are many ways that it could be rescued. Suppose, for instance, that you believe the participants see being a woman as having some degree of eroticism, which may be negative or positive, and suppose that a man ends up AGP if he sees being a woman as having a positive degree of eroticism. In that case, you’d expect to see some sort of distribution similar to the above, where there’s a large spike around 0, and a distribution above this. Further, if you believe that there are many factors that influence the latent eroticism (and you almost must, considering that we can’t find any factors that predict AGP), then it seems reasonable to suppose that this is normally distributed, as tends to happen in polyfactorial cases due to the central limit theorem. So in this model you would have AGP expressed as follows:

AGP = max(0, kinkiness + gender issues + ..?other factors?..)

An alternative would be a conjunctive model. The previous model assumes that if there is some factor that influences the latent eroticism of being a woman strongly enough, then that factor alone can cause AGP, by overpowering the other factors. But what if instead you think that factors need to interact to cause AGP? A simplistic example might be that you are AGP if you are kinky and open to being a woman; but other more nuanced models are possible. Here you would express AGP as a product:

AGP = kinkiness * gender issues * ..?other factors?..

(Here, all of the factors would need to be positive; otherwise you get bizarre inversions where if a factor gets negative then all of the other factors end up having the opposite effect.)

It turns out that these models are approximately isomorphic! Specifically, first notice that the maximum function and the exponential function have approximately the same shape for small input values:

Shapes of the maximum function and the exponential function.

Therefore, we can approximately replace the first model with the following:

AGP = exp(kinkiness + gender issues + ..?other factors?..) = exp(kinkiness) * exp(gender issues) * …
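A quick numeric look at the two steps of this argument. The first loop tabulates max(0, x) next to exp(x) so you can judge for yourself how rough the approximation is at different inputs; the second part checks that the exponential of a sum equals the product of exponentials, which is what turns the additive model into a conjunctive one. The factor values are arbitrary made-up numbers.

```python
import math

# Step 1: compare the two curves at a few input values.
for x in [-3.0, -2.0, -1.0, 0.0, 1.0]:
    print(f"x = {x:+.1f}   max(0, x) = {max(0.0, x):.3f}   exp(x) = {math.exp(x):.3f}")

# Step 2: an additive model inside exp() is a product of positive factors.
kinkiness, gender_issues = 0.4, -1.2  # arbitrary illustrative values
additive_form = math.exp(kinkiness + gender_issues)
conjunctive_form = math.exp(kinkiness) * math.exp(gender_issues)
print(additive_form, conjunctive_form)
```

Each factor stays strictly positive after exponentiation even when the underlying value is negative, so the sign inversions that a plain product would suffer from cannot occur.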

Applying the exponential function to the other factors is exactly what is necessary to turn them strictly positive, as is expected by the conjunctive model. Overall I’ve spent a lot of time thinking of different models for how things could interact, and most of them seem like they end up approximately isomorphic to this model (though I’m open to hearing counterexamples if you have any), so I think it’s probably okay to use.3

So to recap, what this implies is that the Berkson’s paradox effect will be equally big if we filter equally hard on the AGP category and on the non-AGP category, which will happen if we have equally many in each of the categories.

And that’s actually part of the problem with the porn survey. I had 316 men in the non-AGP category, and 828 men in the AGP category, so that means only 28% of the respondents were non-AGP. Meanwhile, in the general population, about 3%-15% of men are AGP and the rest are non-AGP. So in neither case would I end up with an even split. However, on reddit, the proportion of AGPs is actually often quite close to 50%, so it might be doable there. (I’m not sure what happened in the porn survey – I suspect it’s just that AGPs are horny.) Otherwise, it might also be interesting to look into whether there are any mathematical ways to adjust for the asymmetry.

Nonlinearities kill

Part of the assumption made in this method is that whichever confounders there may be between kinkiness and gender issues work the same way in AGPs and non-AGPs. If this is true, then I think the approach is in pretty good standing. However, what if they don’t? Suppose for instance that we have some sort of situation like this:

That is, suppose gender dysphoria is caused by some sort of neurological feminization (it’s not particularly important that this is so, but I had to pick some concrete variable), and suppose that gender issues arise from this. But suppose further that sexual openmindedness (or whatever, the particular variable isn’t very important) moderates this effect, such that the effect of ladybrains on gender issues is stronger for those who are sexually openminded (maybe the others repress, or are unwilling to admit their gender issues, or whatever).

In that case, AGPs would be more likely than non-AGPs to have ladybrains, and therefore the confounding between GFP and GD would be stronger for them. Which would lead to my method concluding that AGP causes GD, even though in this case it doesn’t.

It would probably be a good idea to evaluate how sensitive this method is to nonlinearities. In addition, ways of making it more robust should be evaluated. Further, in the context of nonlinearities, it should be noted that the method sort of relies on something nonlinear-like going on. I split on the basis of AGP vs non-AGP, with the logic being that the GFP can’t influence AGP among non-AGPs. But for there to be some context where the GFP can’t influence AGP, there must be a nonlinear relationship between the GFP and AGP.

Estimation shenanigans

When I computed the effects, I did all sorts of subtractions of correlations and such from each other. This isn’t strictly valid; the correct way to adjust for the confounding between GFP and GD depends on the nature of how the confounding works, leading to a spectrum of possible adjustments. Furthermore, if variances differ (for instance, there’s more variance in AGP among AGPs than among non-AGPs, as non-AGPs have 0 variance in AGP), then using correlations rather than regression coefficients is invalid.
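To illustrate the point about variances, here is a toy simulation in which the effect of x on y is identical in two groups (a slope of 0.5, an arbitrary choice) but x varies much less in one of them. The regression slope comes out about the same in both groups, while the correlation does not.

```python
import math
import random

def slope_and_r(x_sd, n=20_000, seed=0):
    """Regression slope and correlation when y = 0.5 * x + noise."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, x_sd) for _ in range(n)]
    ys = [0.5 * x + rng.gauss(0, 1) for x in xs]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

slope_wide, r_wide = slope_and_r(x_sd=1.0)      # lots of variance in x
slope_narrow, r_narrow = slope_and_r(x_sd=0.3)  # little variance in x
print(f"wide x:   slope {slope_wide:.2f}, r {r_wide:.2f}")
print(f"narrow x: slope {slope_narrow:.2f}, r {r_narrow:.2f}")
```

Subtracting or dividing correlations across groups with different variances therefore mixes real differences in effect with differences in spread, which is the reason to prefer regression coefficients here.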

In fact, if I take this last point into account and reevaluate the coefficient from the data, then I get an effect size of 0.38 (SE 0.17), which just barely manages to be statistically significant. But this isn’t the only estimation shenanigan I did, and in order for the results to be believable, it would be good to go through and see if the estimation can be made more accurate. In cases where we don’t have sufficient information to make it more accurate, we should try varying the assumptions to see how sensitive it is to them.

Overall, due to all of these complications, this should merely be seen as a proof of concept, and not necessarily as a finished, definite solution. But I think the trick I presented in this post, of comparing the effect in AGPs and non-AGPs, makes me more open to the possibility that this class of methods for causal inference may be workable for deciding the validity of AGP->GD causality.

1. From the perspective of Blanchardian theory, what would be most convenient would be if AGPs didn’t have any childhood gender issues at all, because this would seriously cast doubt on the possibility of GD -> AGP. However, pursuing this argument is not very viable, because when pressed, Blanchardians admit that often times, AGPs do have some gender issues in childhood.

Blanchardians argue that this may be analogous to how children sometimes end up with childhood crushes, with the childhood gender issues corresponding to a sort of romantic ideation. Which, sure, whatever, seems like a fair enough possibility. But it complicates the idea of using time as a cause of autogynephilia for causal inference, and Blanchardians should stop making this argument.

2. The idea behind this proposal is that autogynephilia and gynephilia “compete”; at times where someone is more sexually engaged with women, they don’t have enough “left-over attraction” to be attracted to being women. I have not seen much convincing theory or hard data supporting this; as far as I can tell, it’s solely based on some clinical anecdotes. I don’t really buy it, which makes me extra critical about using it to estimate these things.

3. One interesting thought that comes up here is the question of, if there’s a continuous liability of eroticizing being female, is it really only the positive part that affects things? For instance, you could imagine that the negative part represents finding AGP themes to be a direct “turnoff”. But the estimation method I came up with ends up assuming that there is no effect in the negative part of the spectrum, and attributing any effect that is found there to confounding. From a theory point of view, if there is such a thing as “negative AGP”, then that would obviously disprove Blanchardianism.
