
MBTI Methodology Guide

MBTI Test-Retest Reliability: What 0.5 Actually Means (And Why Your Type Might Change On Retest)

If you've taken the MBTI multiple times and gotten different four-letter type codes, you're not alone — roughly half of all test-takers see their type change on retest. This is not a glitch or a malfunction; it is a documented property of the assessment that has been measured and reviewed in the academic literature for decades. The technical name for this property is test-retest reliability, and MBTI's reliability is approximately 0.5-0.6 per dimension. By comparison, Big Five trait measures typically show 0.7-0.9 per dimension. This guide unpacks what the 0.5-0.6 number actually means in non-statistical terms, why MBTI reliability is lower than Big Five reliability, why most type "changes" are single-dimension flips at the midpoint rather than personality reorganizations, and what to do practically when your dimension scores sit near the cutoff. Primary sources: Pittenger 2005 (the canonical psychometric review), Capraro & Capraro 2002 (the meta-analysis aggregating reliability studies), Druckman & Bjork 1991 (the National Research Council's review), and McCrae & Costa 1989 (the Big Five comparison).

Short answer

Test-retest reliability of approximately 0.5-0.6 means that across the four MBTI dichotomies, the per-dimension scores reproduce themselves at about the level you'd expect if half the score is signal and half is noise. The categorical type code (a binary cutoff on each dimension) flips on retest in approximately 50% of cases within five weeks. Most flips are single-dimension shifts at the midpoint — not personality reorganization. The practical implication: dimension scores near 50% indicate weak preferences in those areas, while dimension scores far from 50% indicate stable preferences. Read your type's strong dimensions as reliable signal and weak dimensions as approximate.

Last reviewed: 2026-04-28

Key takeaways

Six things to know before reading further:

  • MBTI per-dimension test-retest reliability is approximately 0.5-0.6 (Pittenger 2005, DOI 10.1037/1065-9293.57.3.210; Capraro & Capraro 2002 meta-analysis, DOI 10.1177/0013164402062004004). Big Five per-dimension reliability is typically 0.7-0.9.
  • Approximately 50% of MBTI test-takers receive a different four-letter type code on retest within five weeks. This is the most-quoted MBTI psychometric statistic and a direct consequence of categorical scoring at modest reliability.
  • Most type "changes" are single-dimension flips at the cutoff — typically the J/P dimension if your score there is near 50%. Multi-dimension flips are much rarer and often indicate test-condition differences (mood, fatigue, framing) rather than personality change.
  • Reliability of 0.5-0.6 per dimension is not zero — it carries real signal. The signal is just modest enough that categorical type assignment near the cutoff is unstable. Strong dimension scores (far from 50%) reproduce reliably; weak dimension scores (near 50%) flip.
  • Druckman & Bjork's 1991 National Research Council review ("In the Mind's Eye") was the most prominent independent assessment of MBTI's psychometric properties at the time. The review found reliability and validity evidence below the level required for selection contexts (consistent with Pittenger's later 2005 review).
  • Practical move: when reading your MBTI result, look at the dimension-level percentages. The dimensions far from 50% are the ones reliably capturing your preferences. The dimensions near 50% are weak preferences that flip on retest — that is itself useful self-knowledge, not a measurement error.

What test-retest reliability actually measures

Test-retest reliability is a psychometric statistic that captures how consistent a measurement instrument is when applied to the same people across multiple sessions. Mathematically, it is the correlation between scores from session 1 and scores from session 2 (typically with a 1-5 week interval between sessions). The correlation is bounded between 0.0 (scores are random across sessions) and 1.0 (scores reproduce perfectly).

The interpretation: if you take a personality test twice within a few weeks, and the test has reliability 0.9, your two scores will be very close to each other. If the test has reliability 0.5, your two scores will share about half their variance — meaning they will correlate but with substantial drift. The drift is what produces the famous "50% of people get a different MBTI type on retest" finding: when categorical cutoffs are applied to dimension scores with reliability 0.5-0.6, the binary letter assignments flip easily for anyone whose dimension score is near the midpoint.
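To make the signal-and-noise reading concrete, here is a minimal simulation (illustrative numbers only, not MBTI data): when an observed score is half stable preference and half session noise, the session-to-session correlation lands near 0.5.

```python
# Sketch: what "reliability ~0.5" looks like as a session-to-session
# correlation. Equal signal and noise variance -> correlation near 0.5.
import random
import math

random.seed(42)
n = 10_000

# Each person's observed score = true preference + independent session noise.
true = [random.gauss(0, 1) for _ in range(n)]
session1 = [t + random.gauss(0, 1) for t in true]
session2 = [t + random.gauss(0, 1) for t in true]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(session1, session2)
print(f"test-retest correlation: {r:.2f}")  # lands close to 0.5
```

Halving the noise variance in the sketch pushes the correlation toward the 0.7-0.9 range the text attributes to Big Five measures, which is one way to see how much quieter a high-reliability instrument is.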

Reliability is independent of validity. A test can be highly reliable (reproduces itself) without being valid (measuring what it claims to measure); a test can be valid (measuring something real) without being highly reliable (the measurement is noisy). The two properties are separately important. MBTI's reliability problem is specifically about consistency across sessions — what your type code is depends partly on which session you happened to take the test in. The validity question (whether MBTI is actually measuring what it claims) is a separate axis.

Per psychometric convention, reliability above 0.7 is considered acceptable for research purposes; above 0.8 for clinical or selection contexts. MBTI's 0.5-0.6 falls below both thresholds. Big Five trait measures typically clear the 0.7 bar and often clear the 0.8 bar. The reliability gap between the two frameworks is the technical foundation of why one is preferred for measurement contexts.

What 0.5-0.6 reliability means for individual users in plain language

If you took the MBTI yesterday and got INFP, and you take it again next week, what should you expect? At reliability 0.5-0.6, here is the rough breakdown:

**Most likely outcome (about 50% probability)**: you get INFP again. Your dimension scores will not be identical between sessions, but they will land on the same side of the midpoint cutoff for all four dichotomies. This is the modal case.

**Second most likely outcome (about 35% probability)**: you get a single-letter different type, typically with the flipped letter being on the dimension where you originally scored closest to 50%. If your J/P score on session 1 was 52% J / 48% P, the J/P letter is the most likely to flip on retest.

**Third most likely outcome (about 12% probability)**: you get a two-letter different type. This is rarer and more often indicates that the testing conditions varied (one session in a focused mood, another in a tired or distracted mood) than that your underlying personality changed.

**Rarest outcome (about 3% probability)**: you get a three- or four-letter different type. When this happens, it usually means either the test was rushed in one of the sessions, or the question framing was interpreted differently across sessions, or there was a substantial life-circumstance change between sessions that shifted self-perception.

These probability estimates are approximate (derived from Pittenger 2005's review of multiple studies; the exact numbers vary across study samples). The pattern is the structurally important point: most type changes are single-dimension flips at the cutoff, not personality reorganization. If your type changed from INFP to INFJ, you have not become a different person; you've answered slightly differently on items that map to the J/P dimension.

Why most flips happen on a specific dimension (J/P, then E/I)

Across the four MBTI dichotomies, the reliability is not uniform. Capraro and Capraro's 2002 meta-analysis (DOI 10.1177/0013164402062004004) aggregated reliability studies and found per-dimension test-retest reliability roughly as follows: E/I ≈ 0.66, S/N ≈ 0.65, T/F ≈ 0.59, J/P ≈ 0.61. T/F and J/P show the lowest reliabilities, while E/I and S/N are slightly more stable.

This is consistent with empirical reports of which letter changes most often: in retest studies, J/P is the most-frequently-flipped letter, followed by T/F. E/I and S/N change less often. The reason is partly methodological — the J/P items measure preferences that are more context-dependent (e.g., "Do you prefer planning or going with the flow?" can be answered differently depending on what's currently going on in your life). The reason is partly conceptual — J/P combines structure-preference with closure-seeking, two correlated but distinct constructs that the dichotomy treats as one.

The practical implication: if you took MBTI multiple times and your type changed, check which letter flipped. If it's J/P, you are in the most common pattern, and the appropriate read is that you have a near-midpoint J/P preference rather than that your personality changed. If it's S/N or E/I, the change is rarer and worth more reflection — those letters tend to be more stable, so a flip in them is more likely to indicate a meaningful self-perception shift or a testing-condition difference.

None of this is a critique of MBTI's design — it is descriptive of how the framework actually behaves under measurement. The framework's creators (Isabel Myers and Katharine Briggs) did not have the modern psychometric tools to optimize for reliability when they built the assessment in the 1940s, and the assessment has retained its original four-dichotomy structure for backward-compatibility reasons across decades of practice. Modern Big Five assessments were designed with reliability as a primary engineering target and consequently achieve higher reliability scores.

Why MBTI reliability is lower than Big Five reliability — the architectural reason

MBTI's modest reliability is not primarily a function of how the items are written or how the test is administered — it is primarily a function of the architectural choice to convert continuous dimensional scores into categorical type codes via binary cutoffs. The categorical conversion is the source of most of the reliability problem.

Consider what happens in scoring. A test-taker answers 93 items (the Form M MBTI). The items map onto four dimensions, producing four continuous scores: E/I score, S/N score, T/F score, J/P score. Each continuous score is then converted to a binary letter (E or I, S or N, T or F, J or P) by comparing it against a cutoff (usually 50%). Two test-takers with continuous scores of 51% J and 49% J get assigned to different MBTI types (one J, one P) despite being functionally identical on that dimension. The categorical conversion throws away information.

Big Five test scoring keeps the continuous scores. A Big Five report says "73rd percentile Conscientiousness, 41st percentile Openness, ..." rather than "Conscientious type, Open type, ...." When the scores are continuous, retest drift produces small numerical changes ("73rd percentile session 1, 71st percentile session 2") that don't trigger categorical re-classification. When the scores are categorical, the same retest drift can produce a different letter ("J session 1, P session 2") that looks like a major change.

This is the architectural reason MBTI's apparent reliability looks worse than its underlying dimensional reliability. The dimensional reliability (the consistency of the continuous score before binary cutoff) is closer to 0.6-0.7 — still below Big Five's 0.7-0.9 but not as bad as the categorical reliability suggests. The categorical conversion artificially inflates the appearance of instability for any test-taker whose dimension score is near the midpoint.
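A small simulation makes the architectural point visible. The noise level below is an assumption tuned so the continuous score reproduces at roughly the 0.65-0.7 level the text describes; binarizing the same score at the cutoff drops the measured reliability substantially.

```python
# Sketch: the categorical cutoff degrades reliability. One dimension with
# decent continuous test-retest correlation, compared against the
# correlation of the binary letter derived from it. Illustrative numbers.
import random
import math

random.seed(1)
n = 20_000
true = [random.gauss(0, 1) for _ in range(n)]
s1 = [t + random.gauss(0, 0.7) for t in true]   # session 1
s2 = [t + random.gauss(0, 0.7) for t in true]   # session 2

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

cont_r = pearson(s1, s2)
# Binary letter: 1 if above the cutoff (here 0), else 0 — the "J or P" step.
letters1 = [1.0 if x > 0 else 0.0 for x in s1]
letters2 = [1.0 if x > 0 else 0.0 for x in s2]
cat_r = pearson(letters1, letters2)

print(f"continuous-score reliability: {cont_r:.2f}")
print(f"binary-letter reliability:    {cat_r:.2f}")  # noticeably lower
```

Same underlying measurement, same people, same sessions — the only difference is whether the score is reported as a number or as a letter.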

Pittenger's 2005 review (DOI 10.1037/1065-9293.57.3.210) discusses this explicitly: most of MBTI's famous "50% retest type change" finding is single-dimension flips at the cutoff, not deep personality reorganization. The categorical framing magnifies what would be a small numerical drift into what appears to be a major identity shift. This is the strongest single technical reason researchers prefer Big Five for measurement contexts.

What Druckman & Bjork's 1991 NRC review found

In 1991, the U.S. National Research Council convened a panel to review the scientific evidence behind various human-performance enhancement techniques being marketed to the U.S. military and federal agencies. The resulting volume, edited by Daniel Druckman and Robert Bjork ("In the Mind's Eye: Enhancing Human Performance," National Academy Press, 1991), included a chapter on personality assessment that directly addressed MBTI.

The NRC panel's conclusion: "At this time, there is not sufficient, well-designed research to justify the use of the MBTI in career counseling programs." The panel reviewed the available reliability and validity evidence and found that the assessment did not meet the standards required for selection or career-direction use cases. The reliability evidence reviewed was consistent with the 0.5-0.6 range later confirmed by Capraro & Capraro 2002 and Pittenger 2005.

The NRC review is historically significant because it represented an independent, government-commissioned evaluation by a panel of senior psychometricians. The panel had no commercial relationship with the MBTI publisher and no incentive to either support or undermine the framework. Its conclusion aligned with the Pittenger 2005 academic-research consensus and with the Myers-Briggs Foundation's own "not for selection" Ethical Use Guidelines.

The NRC review is sometimes cited in popular discussions of MBTI as the canonical "NRC concluded MBTI is invalid" finding. That gloss is too strong — the NRC panel did not conclude MBTI is invalid in all uses; it concluded the evidence was insufficient to justify selection or career-counseling use. This is a narrower technical claim than the popular gloss and is consistent with the publisher's own narrowing of legitimate use to development contexts.

What to do practically: read dimension scores not type letters

If you have your MBTI dimension scores (the percentages or numerical values for each of the four dichotomies), here is the practical reading framework that takes the reliability findings seriously.

  • **Strong dimension (score > 70% on one side)**: this dimension is reliably capturing a real preference of yours. The corresponding letter is stable across retests and carries genuine signal about your behavior.
  • **Moderate dimension (score 60-70% on one side)**: this is a real preference but not strongly. The letter is more often than not stable on retest, but flips occasionally when testing conditions or framing shift.
  • **Weak dimension (score 55-60% on one side)**: this is a borderline preference. The letter flips on retest about half the time. Don't read either type description (yours or the opposite) as fully descriptive of your behavior on this axis — you genuinely have mixed preferences here.
  • **Near-midpoint dimension (score 50-55% on one side)**: this is essentially a coin-flip outcome. The MBTI cutoff happened to land on one side; on retest it often lands on the other. Read both type descriptions for the relevant axis and notice that you behave more like one in some contexts and like the other in different contexts — that is the actual signal, and the binary letter is misrepresenting it.
  • **The rule**: trust the dimension where your score is far from 50%. Treat the dimension near 50% as approximate. Your "true type" code is the combination of strong preferences plus best-guess on weak preferences. The strong preferences are the reliable signal.
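The rubric above can be sketched as a tiny helper function. The band boundaries are a heuristic reading of the rubric, not official MBTI scoring guidance.

```python
# Sketch: classify a dimension score into the reading-framework bands.
# `score` is the percentage on the assigned side of a dichotomy (50-100).
# Boundaries are heuristic, matching the rubric in this article.
def preference_band(score: float) -> str:
    if not 50 <= score <= 100:
        raise ValueError("score is the percentage on the assigned side: 50-100")
    if score > 70:
        return "strong: stable on retest, treat as reliable signal"
    if score > 60:
        return "moderate: usually stable, occasional flips"
    if score > 55:
        return "weak: flips on retest about half the time"
    return "near-midpoint: effectively a coin flip at the cutoff"

for s in (88, 64, 57, 51):
    print(f"{s}% -> {preference_band(s)}")
```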

If your type changed on retest — how to interpret it

A common practical question: I tested as INFP last year and as INFJ this year — which is my real type? The reliability framework gives a structured answer.

**Step 1: Identify which letter flipped.** Compare the two type codes. If only one letter is different, you flipped on that single dimension. If two or more letters differ, the testing-condition or framing differences between sessions are likely playing a larger role and the answer is harder to read.

**Step 2: Look at the dimension score for the flipped letter (if you have it).** If you have access to the dimension percentages from at least one of the test sessions, check whether your score on the flipped dimension was near 50%. If yes, the flip is a near-midpoint flip and means you have a weak preference on that axis — neither type is fully wrong; you're just close to the cutoff.

**Step 3: Don't try to pick "the real type" between the two.** If the flipped letter is on a near-midpoint dimension, neither type code is more correct than the other. Instead, read both type descriptions, identify the dimensions where they agree (those are your reliable letters), and treat the dimensions where they disagree as your near-midpoint axes.

**Step 4: Behavioral test for the flipped axis.** If the flipped letter is J/P, ask yourself behaviorally: do you actually plan ahead and prefer closure, or do you actually keep options open and decide near the deadline? If your real-life behavior pattern is consistent across contexts, that is more reliable signal than either of the two test sessions. If your behavior is genuinely mixed (J in some contexts, P in others), then your preference is mixed and the type code is hiding that nuance.
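Steps 1-3 can be sketched as a small helper that compares two type codes and reports the flipped dimensions (standard E/I, S/N, T/F, J/P letter order assumed).

```python
# Sketch: Step 1 as code — which dimensions differ between two type codes?
DIMENSIONS = ("E/I", "S/N", "T/F", "J/P")

def flipped_dimensions(code_a: str, code_b: str) -> list[str]:
    code_a, code_b = code_a.upper(), code_b.upper()
    if len(code_a) != 4 or len(code_b) != 4:
        raise ValueError("expected four-letter type codes, e.g. 'INFP'")
    return [dim for dim, a, b in zip(DIMENSIONS, code_a, code_b) if a != b]

print(flipped_dimensions("INFP", "INFJ"))  # ['J/P'] — single-dimension flip
print(flipped_dimensions("INFP", "ENFJ"))  # ['E/I', 'J/P'] — check test conditions
```

A single-element result is the common near-midpoint case; two or more elements suggests testing-condition differences per Step 1.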

If you've worked through these four steps and still feel split between two types, the honest read is that MBTI's categorical framework does not carve nature at the joints for you on the contested axis. The Big Five continuous-dimension framework (which doesn't force a binary cutoff) is more honest about borderline cases. For more on this trade-off, see /blog/mbti-vs-big-five.

What this all means for using MBTI

Three honest takeaways from the reliability evidence.

**Takeaway 1: MBTI carries real signal at the dimension-score level.** Reliability of 0.5-0.6 is modest but not zero. Your specific I/E, S/N, T/F, J/P scores are produced by a deterministic algorithm from your specific answers, and they carry information about your underlying preferences. The signal is just not strong enough to make near-midpoint dimensions stable across retests.

**Takeaway 2: The categorical type code is less reliable than the dimensional scores underneath.** When you read "I'm INFP" or "I'm ENTJ," treat that as compressed information about a four-dimensional measurement. The compression is convenient (the letters travel in conversation) but loses the percentile information that distinguishes a strong INFP (90% I, 80% N, 75% F, 70% P) from a borderline INFP (52% I, 53% N, 51% F, 51% P). The borderline cases are where retest flips happen.

**Takeaway 3: For measurement contexts (research, hiring, longitudinal self-tracking), Big Five is the better instrument because of higher reliability.** For vocabulary contexts (team-building, self-reflection, conversational shorthand), MBTI is fine because the categorical labels work as memorable anchors. The two frameworks serve different purposes and are not in zero-sum competition.

For the long-form treatment of MBTI's measurement properties beyond reliability, see /blog/mbti-common-misconceptions-and-data. For the framework comparison with Big Five, see /blog/mbti-vs-big-five. For the specific case of why MBTI is not a fit for hiring, see /blog/mbti-for-hiring.

Caveats — what 0.5-0.6 reliability does and doesn't establish

Three caveats to keep the reliability findings calibrated.

**Caveat 1: Reliability of 0.5-0.6 is modest, not catastrophic.** The number sometimes gets quoted with the implication that MBTI is essentially random — "50% of people get a different type on retest, so it's a coin flip." That overshoots. Most retest changes are single-dimension flips at the cutoff; multi-dimension flips are much rarer. The framework is producing real measurement, just with substantial noise on the categorical letters.

**Caveat 2: Reliability and validity are independent.** A test can be reliable (reproduces itself) without being valid (measuring what it claims). A test can be valid (measuring something real) without being highly reliable (the measurement is noisy). The reliability evidence reviewed here does not, by itself, settle the validity question. The validity question is covered separately in Pittenger 2005 and in /blog/mbti-common-misconceptions-and-data.

**Caveat 3: The reliability findings do not invalidate all uses of MBTI.** The Myers-Briggs Foundation explicitly endorses development, team-building, and individual-coaching uses where the categorical labels work as vocabulary. Those uses do not require high test-retest reliability the way selection contexts do. The reliability evidence is most consequential for uses where consistency matters (hiring, research, performance evaluation), and least consequential for uses where the labels function as conversational anchors.



FAQ

Common follow-up questions

What does MBTI test-retest reliability of 0.5 actually mean?

Test-retest reliability is the correlation between scores from two test sessions taken by the same people a few weeks apart. Reliability of 0.5 means the two sessions share about half their variance — there is real signal, but with substantial noise. The practical consequence: when categorical cutoffs are applied to dimension scores with reliability 0.5-0.6, approximately 50% of test-takers receive a different four-letter type code on retest. Most of those changes are single-dimension flips at the cutoff (typically J/P), not personality reorganization.

Is reliability of 0.5 considered acceptable in psychometrics?

Below standard thresholds. Per psychometric convention, reliability above 0.7 is considered acceptable for research purposes; above 0.8 for clinical or selection contexts. MBTI's 0.5-0.6 per dimension falls below both thresholds. By comparison, Big Five trait measures typically reach 0.7-0.9 per dimension. This reliability gap is the technical foundation of why researchers prefer Big Five for measurement contexts where consistency matters.

Why does my MBTI type change every time I take the test?

Most likely you have one or two dimension scores near the 50% cutoff, and the categorical letter assignment flips between sessions on those dimensions. The most common flip is on J/P (per Capraro & Capraro 2002, J/P and T/F show the lowest per-dimension reliabilities of the four MBTI dichotomies). The flip is not a sign that your personality is unstable — it is a sign that your preference on the flipping axis is weak (close to 50/50). Strong preferences (dimension scores far from 50%) reproduce reliably across retests.

Which MBTI dimension flips most often on retest?

J/P is the most-flipped dimension across retest studies, followed by T/F. E/I and S/N are slightly more stable. Per Capraro & Capraro 2002, the per-dimension reliabilities are roughly E/I 0.66, S/N 0.65, T/F 0.59, J/P 0.61. The J/P items measure preferences that are more context-dependent (e.g., 'Do you prefer planning or going with the flow?' can be answered differently in different life circumstances), which contributes to the lower reliability there.

Did the National Research Council really say MBTI is invalid?

The 1991 NRC report (Druckman & Bjork, 'In the Mind's Eye: Enhancing Human Performance,' National Academy Press) concluded that 'there is not sufficient, well-designed research to justify the use of the MBTI in career counseling programs.' That is a narrower technical claim than 'MBTI is invalid' — the NRC panel said the available evidence was insufficient for selection and career-counseling use cases, not that the framework was wholesale invalid. The conclusion aligns with the publisher's own 'not for selection' Ethical Use Guidelines.

How is Big Five reliability higher than MBTI reliability?

Two reasons. First, Big Five tests preserve continuous scores rather than converting to binary categories — categorical conversion at modest reliability artificially inflates apparent instability for borderline cases. Second, Big Five assessments were designed with reliability as a primary engineering target using modern psychometric tools, whereas MBTI was designed in the 1940s before those tools existed and has retained its four-dichotomy structure for backward-compatibility reasons. The combination produces 0.7-0.9 per dimension reliability for Big Five vs 0.5-0.6 for MBTI.

Should I retest until I get a stable type?

No — that approach selects for whichever session happened to land on a particular side of the cutoff for your near-midpoint dimensions. The honest answer is that if your type changes between sessions, the dimension(s) where it changes are weak preferences for you, and neither type code is fully correct. Read both type descriptions for the contested dimensions; the agreement parts are your reliable letters and the disagreement parts are areas of mixed preference. That is itself useful self-knowledge.

Does low MBTI reliability mean the framework is useless?

No — low reliability means the framework is not a strong measurement instrument, which matters for measurement use cases (research, hiring, performance evaluation) but doesn't undermine vocabulary use cases (team-building, self-reflection, conversational shorthand). MBTI is most useful as a vocabulary for working-style differences, supplemented by dimension-level scores when measurement claims need to be made, with awareness that the categorical type code can flip on retest near the cutoff. For the framework comparison with Big Five for measurement vs vocabulary use cases, see /blog/mbti-vs-big-five.
