Diversity and Overlap estimates in immune repertoires: the good, the bad and the ugly
Document Type
Master Thesis
License
CC-BY-NC-ND
Abstract
This literature review examines how diversity and overlap in T cell receptor (TCR) repertoires are measured and why these estimates are often difficult to compare across studies. A TCR repertoire is the collection of distinct receptors carried by an individual's T cells, and repertoire diversity and overlap are widely used to study immune responses in contexts such as infection, cancer, and autoimmune disease.
The review identifies three main challenges. First, methodological variation strongly influences results. Different sequencing approaches (bulk vs. single-cell) and library preparation methods introduce biases in which clonotypes are detected and how abundances are measured. In addition, the definition of a “clonotype” varies between studies, ranging from simple sequence matching to more complex criteria including gene segments or paired chains. These differences directly affect calculated diversity and overlap, making cross-study comparisons difficult.
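To make the effect of the clonotype definition concrete, the following minimal sketch counts clonotypes in the same toy dataset under two definitions: matching on the CDR3 amino acid sequence alone, and matching on V gene, CDR3, and J gene together. The records, gene names, and the helper `count_clonotypes` are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
from collections import Counter

# Hypothetical sequencing records: (V gene, CDR3 amino acid sequence, J gene).
records = [
    ("TRBV7-9", "CASSLGQGNTEAFF", "TRBJ1-1"),
    ("TRBV7-9", "CASSLGQGNTEAFF", "TRBJ1-1"),
    ("TRBV5-1", "CASSLGQGNTEAFF", "TRBJ1-1"),  # same CDR3, different V gene
    ("TRBV5-1", "CASSPDRGYEQYF", "TRBJ2-7"),
    ("TRBV20-1", "CSARDRTGNGYTF", "TRBJ1-2"),
]


def count_clonotypes(records, key):
    """Group records into clonotypes using `key` as the clonotype definition."""
    return Counter(key(r) for r in records)


# Definition 1: a clonotype is a unique CDR3 sequence.
by_cdr3 = count_clonotypes(records, key=lambda r: r[1])
# Definition 2: a clonotype is a unique (V gene, CDR3, J gene) combination.
by_v_cdr3_j = count_clonotypes(records, key=lambda r: r)

print("clonotypes by CDR3 only:  ", len(by_cdr3))
print("clonotypes by V+CDR3+J:   ", len(by_v_cdr3_j))
```

In this toy example the stricter definition splits one CDR3-defined clonotype in two, so both the richness and the abundance distribution change even though the underlying reads are identical.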
Second, incomplete sampling is a fundamental issue. TCR repertoires are highly skewed, with a few abundant clonotypes and many rare ones. Because sequencing captures only a fraction of the repertoire, rare clonotypes are often missed, so observed diversity and overlap tend to underestimate the true values. The review highlights methods from ecology, such as Hill numbers, rarefaction, extrapolation, and coverage-based standardization, as tools to correct for incomplete sampling and enable fairer comparisons between samples with different sequencing depths.
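The ecological tools named above can be sketched in a few lines. The example below is an illustrative implementation under stated assumptions, not the procedure used in the thesis: `hill_number` computes the Hill number of order q from observed counts, `good_coverage` uses the simple Good-Turing form of sample coverage (one minus the fraction of observations that are singletons), and `rarefied_richness` gives the expected number of clonotypes in a random subsample of m observations drawn without replacement. The function names and the count vector are hypothetical.

```python
import math


def hill_number(counts, q):
    """Hill number of order q (effective number of clonotypes)."""
    n = sum(counts)
    p = [c / n for c in counts if c > 0]
    if q == 1:
        # Limit q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1 / (1 - q))


def good_coverage(counts):
    """Simple Good-Turing coverage estimate: 1 - (singletons / sample size)."""
    n = sum(counts)
    f1 = sum(1 for c in counts if c == 1)
    return 1 - f1 / n


def rarefied_richness(counts, m):
    """Expected richness in a subsample of size m (without replacement)."""
    n = sum(counts)
    if not 0 <= m <= n:
        raise ValueError("m must be between 0 and the sample size")
    total = math.comb(n, m)
    # A clonotype with count c is absent with probability C(n - c, m) / C(n, m).
    return sum(1 - math.comb(n - c, m) / total for c in counts if c > 0)


# Hypothetical skewed sample: a few large clones and several singletons.
counts = [50, 20, 10, 5, 5, 3, 2, 1, 1, 1, 1, 1]

for q in (0, 1, 2):
    print(f"Hill number, q={q}: {hill_number(counts, q):.2f}")
print(f"Estimated coverage: {good_coverage(counts):.2f}")
print(f"Expected richness at m=30: {rarefied_richness(counts, 30):.2f}")
```

Order q = 0 counts every clonotype equally (richness), while higher orders increasingly emphasize abundant clonotypes. Comparing samples at equal coverage rather than equal size is the idea behind coverage-based standardization.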
Third, the choice of statistical measure affects interpretation. Diversity indices differ in how they weight rare versus common clonotypes, while overlap metrics differ in whether they consider only shared presence or also abundance. As a result, different indices can lead to different conclusions from the same data. Coverage-adjusted estimators can reduce the bias caused by under-sampling but are not yet widely used.
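The difference between presence-based and abundance-based overlap can be illustrated with two common indices. The sketch below assumes the Jaccard index as the presence-based measure and the Morisita-Horn index as the abundance-based one; the abstract does not say which specific metrics the thesis evaluates, and the two toy repertoires are hypothetical.

```python
def jaccard(a, b):
    """Presence-based overlap: shared clonotypes / all observed clonotypes."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def morisita_horn(a, b):
    """Abundance-based overlap computed from relative frequencies."""
    na, nb = sum(a.values()), sum(b.values())
    keys = set(a) | set(b)
    pa = {k: a.get(k, 0) / na for k in keys}
    pb = {k: b.get(k, 0) / nb for k in keys}
    numerator = 2 * sum(pa[k] * pb[k] for k in keys)
    denominator = sum(v * v for v in pa.values()) + sum(v * v for v in pb.values())
    return numerator / denominator


# Hypothetical repertoires: same three clonotypes, opposite dominance.
sample_a = {"CASSLGQGNTEAFF": 90, "CASSPDRGYEQYF": 5, "CSARDRTGNGYTF": 5}
sample_b = {"CASSLGQGNTEAFF": 5, "CASSPDRGYEQYF": 5, "CSARDRTGNGYTF": 90}

print(f"Jaccard:       {jaccard(sample_a, sample_b):.3f}")
print(f"Morisita-Horn: {morisita_horn(sample_a, sample_b):.3f}")
```

Both samples contain exactly the same clonotypes, so the presence-based index reports complete overlap, while the abundance-based index is low because a different clone dominates each sample.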
A key complication arises in workflows using unique molecular identifiers (UMIs). UMIs count molecules rather than cells, meaning that diversity estimates reflect molecule counts per clonotype rather than true cell-level diversity. This can cause larger samples to appear more diverse even after normalization, complicating standard approaches like rarefaction.
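The distinction between molecule counts and cell counts can be shown with a toy calculation. The sketch below assumes, purely for illustration, that three clonotypes are represented by the same number of cells but contribute different numbers of TCR transcripts per cell; the per-cell molecule numbers are hypothetical and not taken from the thesis. It compares the inverse Simpson index (the Hill number of order 2) computed from cell counts with the same index computed from UMI counts.

```python
def inverse_simpson(counts):
    """Inverse Simpson index (Hill number of order 2) from raw counts."""
    n = sum(counts)
    return 1 / sum((c / n) ** 2 for c in counts if c > 0)


# Hypothetical cell-level composition: three equally sized clones.
cells_per_clonotype = {"clone_1": 10, "clone_2": 10, "clone_3": 10}

# Assumed (illustrative) number of captured TCR molecules per cell.
molecules_per_cell = {"clone_1": 1, "clone_2": 5, "clone_3": 20}

# UMI counts reflect molecules, so each clone's count is cells x molecules per cell.
umis_per_clonotype = {
    clone: cells_per_clonotype[clone] * molecules_per_cell[clone]
    for clone in cells_per_clonotype
}

cell_diversity = inverse_simpson(list(cells_per_clonotype.values()))
umi_diversity = inverse_simpson(list(umis_per_clonotype.values()))

print(f"Inverse Simpson from cell counts: {cell_diversity:.2f}")
print(f"Inverse Simpson from UMI counts:  {umi_diversity:.2f}")
```

Under these assumptions the cell-level repertoire is perfectly even, yet the UMI-based estimate suggests a repertoire dominated by one clone, which is why molecule-level counts cannot be read directly as cell-level diversity.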
The review concludes that many reported differences in TCR diversity and overlap may reflect technical and analytical choices rather than true biological variation. To improve comparability and reliability, it recommends clearly reporting clonotype definitions, quantifying sample completeness (e.g., using coverage), using diversity estimators that account for incomplete sampling, and applying overlap measures that adjust for unequal sampling depth. Together, these practices help distinguish biological signals from methodological artifacts.