A total of 9,715 samples from 31 human metagenomic datasets (total: 5.17 × 1011 reads, average: 5.32 × 107 reads per sample) with available metadata to enable assessment of microbiome transmission between healthy mothers and offspring, households, twin pairs, villages and populations (that is, cohabitation information) were selected for inclusion in this study (Supplementary Tables 1 and 2). We also included publicly available stool shotgun metagenomic datasets with samples from at least 15 healthy individuals to whom no intervention (such as antibiotic or drug treatment, or specific diet) was performed, with at least 2 of the samples taken less than 6 months apart to assess within-subject strain retention and set species-specific operational definitions of strain identity 25 datasets were publicly available, three of which were expanded in this study with 14 (FerrettiP_20189), 32 (Ghana dataset34) and 61 (Tanzania dataset34) samples. Newly included samples were collected and processed following the protocols described in the original publications. In addition, eight datasets (total: 2,800 samples) were newly collected and sequenced in the context of this study as described below, using similar methods (although differences in sample processing, DNA extraction and sequencing library preparation do not directly affect the phylogenetic distances that we use to infer strain sharing).
Consistent metadata collection and organization
We retrieved the metadata on sample and subject identifiers, time points, participant’s age, gender, mode of delivery (vaginal or caesarian section), family identifiers, family relationships, twin zygosity and age at which twins moved apart, village, and country from curatedMetagenomicData 3.0.0 (ref. 61) when included in the resource, and from the publications’ supplementary materials or specified repository otherwise. Metadata of all metagenomes, including newly sequenced samples, were curated and organized in the curatedMetagenomicData format and are available in Supplementary Table 2. Partners were defined as couples that share a household. Populations were classified on the basis of their westernization status (westernized or non-westernized), considered as the adoption of a westernized lifestyle and not in geographical terms, and defined as intake of diets typically rich in highly processed foods (with high fat content, low in complex carbohydrates and rich in refined sugars and salt), access to healthcare and pharmaceutical products, hygiene and sanitation conditions, reduced exposure to livestock, and increased population density. The classification was based on the information available on how populations included in the study differ on the above criteria and how the samples were reported in the original publications. While we acknowledge that this binary classification has evident limitations62, it enables insight into the association of person-to-person microbiome transmission with host lifestyle.
Newly sequenced metagenomic datasets
A total of 14 mothers (16–37 years old) and 13 of their infants below 1 year of age in rural areas in Argentina (villages of Villa Minetti, Esteban Rams, Pozo Borrado, Las Arenas, Cuatro Bocas, Logroño, Montefiore and Belgrano; Santa Fe province; Supplementary Table 2)—considered here as a non-westernized population—were enroled in the study. DNA was extracted from faecal samples using the QIAamp DNA stool kit (Qiagen) following the manufacturer’s instructions. Sequencing libraries were prepared using the Nextera DNA Flex Library Preparation Kit (Illumina), following the manufacturer’s guidelines. Sequencing was performed on the Illumina NovaSeq 6000 platform following manufacturer’s protocols.
A total of 12 mothers (15–40 years old) and 12 of their infants below 6 months of age from communities of the Wayúu ethnic group from the Caribbean Region in Colombia (communities of Etkishimana, Koustshachon, Paraiso, Invasión, Tocomana, Warruptamana and Wayawikat; Supplementary Table 2)—considered here as a non-westernized population—were enroled in the study. DNA from stool samples was extracted using the Master-Pure DNA extraction Kit (Epicentre) following the manufacturer’s instructions with the following modifications: samples were treated with lysozyme (20 mg ml−1) and mutanolysin (5 U ml−1) for 60 min at 37 °C and a preliminary step of cell disruption with 3-μm diameter glass beads during 1 min at 6 m s−1 by a bead beater FastPrep 24-5G Homogenizer (MP Biomedicals). Purification of the DNA was performed using DNA Purification Kit (Macherey–Nagel) according to manufacturer’s instructions. DNA concentration was measured using Qubit 2.0 Fluorometer (Life Technologies) for further analysis. Sequencing libraries were prepared using the Nextera DNA Flex Library Preparation Kit (Illumina), following the manufacturer’s guidelines. Sequencing was performed on the Illumina NovaSeq 6000 platform following manufacturer’s protocols.
A total of 116 nonagenarians and centenarians (97 female, 19 male, 94–105 years old) and 231 of their offspring (79 female, 152 male, 50–85 years old) in the city of Qidong (Jiangsu province, China) were enroled (considered here as a westernized population)63. All participants were free of major illnesses at the time of inclusion. Fresh stool samples were collected at the Shanghai Tenth Hospital, and stored at −20 °C upon collection. DNA was extracted using the EZNA Stool DNA Kit (Omega Bio-tek) following manufacturer’s instructions. DNA integrity and size were evaluated by 1% agarose gel electrophoresis, and DNA concentrations determined with NanoDrop (Thermo Fisher Scientific). DNA libraries were constructed according to the TruSeq DNA Sample Prep v2 Guide (Illumina), with 2 μg of genomic DNA and an average insert size of 500 bp. Library quality was evaluated with a DNA LabChip 1000 Kit (Agilent Technologies). Sequencing was conducted on an Illumina HiSeq 4000 platform with a 150 bp paired-end read length.
A total of 8 mothers and 19 infants below 1 year of age in a rural population in China (Bin county, Shaanxi province, northwest China) were enroled as part of a larger study (ClinicalTrials.gov NCT02537392); they were considered here as a non-westernized population. DNA was extracted with the QIAamp Fast DNA Stool Mini Kit (Qiagen), and precipitated with ethanol. Sequencing libraries were prepared using the Nextera DNA Flex Library Preparation Kit (Illumina), following the manufacturer’s guidelines. Sequencing was performed on the Illumina NovaSeq 6000 platform following manufacturer’s protocols.
Samples from 342 volunteers (0–85 years old) in 74 households in the island of Bubaque (Bijagos Archipelago, Guinea-Bissau)—considered here as a non-westernized population—were collected and DNA extracted as part of a previous study64. In brief, samples were frozen at −20 °C at a reference laboratory. After homogenization and washing, DNA was extracted using the DNeasy PowerSoil PRO kit (Qiagen) with custom modifications64. Sequencing libraries were prepared using the Nextera DNA Flex Library Preparation Kit (Illumina), following the manufacturer’s guidelines. Sequencing was performed on the Illumina NovaSeq 6000 platform following manufacturer’s protocols.
A total of 4 mothers (37–46 years old) and their 8 children (0–2 years old) were enroled at the Santa Chiara Hospital in Trento, Italy; they were considered here as a westernized population. Mother stool samples were collected during or shortly after the delivery by the hospital staff, using faecal material collection tubes (Sarstedt). Infant stool samples were collected by the mothers, frozen at −20 °C upon collection and moved to a −80 °C facility within a week. 48 samples were collected in total (Supplementary Table 2). DNA was extracted using the PowerSoil DNA Isolation Kit (MoBio Laboratories), as described in the HMP protocol (Human Microbiome Project Consortium)65, with addition of a preliminary heating step (65 °C for 10 min, 95 °C for 10 min). DNA was recovered in 10 mM Tris pH 7.4 and quantified using the Qubit 2.0 (Thermo Fisher Scientific) fluorometer per the manufacturer’s instructions. Sequencing libraries were prepared using the NexteraXT DNA Library Preparation Kit (Illumina), following the manufacturer’s guidelines. Sequencing was performed on the Illumina HiSeq 2500 platform.
A total of 19 mothers (30–47 years old) and 37 healthy children (0–11 years old) were enroled at the IRCCS Istituto Giannina Gaslini in Genoa, Italy as part of a larger study, considered here as a westernized population. Stool samples were collected in DNA/RNA shield faecal collection tubes (Zymoresearch) and stored at −80 °C until DNA extraction. DNA extraction was performed with the DNeasy PowerSoil Pro Kit (Qiagen) according to the manufacturer’s procedures. DNA concentration was measured using the NanoDrop spectrophotometer (Thermo Fisher scientific) and stored at −20 °C. Sequencing libraries were prepared using the NexteraXT DNA Library Preparation Kit (Illumina), following the manufacturer’s guidelines. Sequencing was performed on the Illumina NovaSeq 6000 platform following manufacturer’s protocols.
A total of 1,929 saliva samples from 646 families in the NY Genome Center Cohort of the SPARK collection (Western IRB (https://www.wcgirb.com/), protocol tracking number: WIRB20151664, considered here as a westernized population) were included in the analysis, consisting of 640 mother samples (22–55 years old), 631 father samples (23–67 years old), and 658 samples from normally developing offspring (0–18 years old). Saliva was collected using the OGD-500 kit (DNA Genotek), and DNA was extracted using a Chemomagic MSM1/360 DNA extraction instrument and eluted into 110ul of TE buffer at PreventionGenetics (Marshfield). Sequencing libraries were prepared with the Illumina DNA PCR-Free Library Prep kit (Illumina), following the manufacturer’s guidelines. Sequencing was performed on the Illumina NovaSeq 6000 platform using S2/S4 flow cells and following manufacturer protocols.
Metagenome pre-processing and quality control
Newly sequenced stool samples were pre-processed using the pipeline described at https://github.com/SegataLab/preprocessing. Shortly, metagenomic reads were quality-controlled and reads of low quality (quality score <Q20), fragmented short reads (<75 bp), and reads with >2 ambiguous nucleotides were removed with Trim Galore (v0.6.6). Contaminant and host DNA was identified with Bowtie2 (v184.108.40.206)66 using the -sensitive-local parameter, allowing confident removal of the phiX 174 Illumina spike-in and human-associated reads (hg19 human genome release). Remaining high-quality reads were sorted and split to create standard forward, reverse and unpaired reads output files for each metagenome.
Newly sequenced saliva samples were pre-processed using a custom version of the pipeline described in https://github.com/SegataLab/preprocessing. Shortly, metagenomic reads were quality-controlled, removing reads of low quality (quality score <Q20), fragmented short reads (<75 bp), and reads with >2 ambiguous nucleotides. Contaminant and host DNA was identified with Bowtie2 (v220.127.116.11)66 in ‘end-to-end’ global mode, allowing confident removal of human-associated reads (hg19). Remaining high-quality reads were sorted and split to create standard forward, reverse and unpaired reads output files for each metagenome.
Read statistics of stool and saliva samples (number of reads, number of bases, minimum and median read length per sample) are detailed in Supplementary Table 2. Metagenomes with ≥3 million reads were included in the analysis (n = 7,646 stool, n = 2,069 oral), while metagenomes with insufficient sequencing depth were excluded (n = 97 stool, n = 0 oral).
Expanded SGB database
A custom database containing 160,267 MAGs and 75,446 isolate sequencing genomes was retrieved from ref. 30, and expanded with 184 MAGs from the Italian mother–infant dataset9 expanded in the current study, 1,439 MAGs from Italian centenarians67, 3,584 MAGs obtained from stool samples of individuals in non-westernized populations34, 2,985 MAGs from stool samples of non-human primates68, 20,404 MAGs from cow rumen69, 14,097 MAGs from mouse samples70,71,72,73,74,75,76,77,78,79,80,81,82,83, 1,235 MAGs from termites (PRJNA365052, PRJNA365053, PRJNA365054, PRJNA365049, PRJNA365050, PRJNA365051, PRJNA405700, PRJNA405701, PRJNA405702, PRJNA405782, PRJNA405783, PRJNA366373, PRJNA366374, PRJNA366375, PRJNA366251, PRJNA405703, PRJNA366252, PRJNA366766, PRJNA366357, PRJNA366358, PRJNA366361, PRJNA366362, PRJNA366363, PRJNA366255, PRJNA366256, PRJNA366257, PRJNA366253, PRJNA405704, PRJNA366254 and PRJNA405781), 7,760 MAGs available from a previous catalogue84, 2,137 MAGs from NCBI GenBank, and 63,142 reference genomes from NCBI GenBank (see https://github.com/SegataLab/MetaRefSGB for details). MAGs from the Italian mother–infant dataset, and those of non-human hosts were assembled using MEGAHIT85, while those of the Italian centenarian dataset and non-westernized populations were assembled with metaSPAdes86, using default parameters in both cases.
For the newly added MAGs we employed the following protocol on the metagenomic assemblies. Assembled contigs longer than 1,500 nucleotides were binned into MAGs using MetaBAT287. Quality control of all genomes was performed with CheckM version 1.1.3 (ref. 88), and only medium- and high-quality genomes (completeness ≥50% and contamination ≤5%) were included in the database. Prokka version 1.12 and 1.13 (ref. 89) were used to annotate open reading frames of the genomes. Coding sequences were then assigned to a UniRef90 cluster90 by performing a Diamond search (version 0.9.24)91 of the coding sequences against the UniRef90 database (version 201906) and assigning a UniRef90 ID if the mean sequence identity to the centroid sequence was above 90% and covered more than 80% of the centroid sequence. Protein sequences that could not be assigned to any UniRef90 cluster were de novo clustered using MMseqs292 within SGBs following the Uniclust90 criteria93.
Genomes were clustered into species-level genome bins (SGBs) spanning ≤5% genetic diversity, and those to genus-level genome bins (GGBs, 15% distance) and family-level genome bins (FGBs, 30% distance), as described in ref. 30. MAGs were assigned to SGBs by applying ‘phylophlan_metagenomic’, a subroutine of PhyloPhlAn 3 (ref. 94), which uses Mash95 to compute the whole-genome average nucleotide identity among genomes. When no SGB was below 5% genetic distance to a genome, new SGBs were defined, based on the average linkage assignment and hierarchical clustering (allowing a 5% genetic distance among genomes in the dendrogram). The same procedure was followed to assign SGBs to novel GGBs and FGBs when those were not yet defined.
Taxonomic assignment of SGBs and definition of kSGBs and uSGBs
SGBs containing at least one reference genome (kSGBs) were assigned the taxonomy of the reference genomes following a majority rule, up to the species level. SGBs with no reference genomes (uSGBs) were assigned the taxonomy of its corresponding GGB (up to the genus level) if this contained reference genomes, and of its corresponding FGB (up to the family level) if the latter contained reference genomes. If no reference genomes were present in the FGB, a phylum was assigned based on the majority rule applied on up to 100 closest reference genomes to the MAGs in the SGB as provided by ‘phylophlan_metagenomic’. Taxonomic assignment of SGBs profiled at strain level in this study can be found in Supplementary Tables 3 and 4.
Species-level profiling of metagenomic samples
Species-level profiling was performed on all the 9,715 samples with MetaPhlAn 4 (refs. 38,39) with default parameters and the custom SGB database. uSGBs with less than 5 MAGs were discarded as potential assembly artefacts or chimeric sequences and unlikely to reach the prevalence thresholds in the profiling. SGB core genes were defined as open reading frames in an existing UniRef90 or in a de novo clustered gene family (following the Uniclust90 clustering procedure93) present in at least half of the genomes (that is, ‘coreness’ 50%) of the SGB. Core genes were further optimized by selecting the highest coreness threshold that allowed retrieval of at least 800 core genes. Core genes of each SGBs were then screened to identify marker genes by checking their presence in other SGBs. This was done by a procedure that first divided core genes into fragments of 150 nt and then aligned the fragments against the genomes of all SGBs using Bowtie2 (version 18.104.22.168; -sensitive option)66. Marker genes were defined as core genes with no fragments found in at least 99% of the genomes of any other SGB. For SGBs with less than 10 marker genes, conflicts were defined as occurrences of more than 200 core genes of an SGB in more than 1% of genomes of another SGB, and conflict graphs were generated by retrieving all conflicts for that SGB. Each conflict graph was processed iteratively, retrieving all the possible merging scenarios, in order to get the optimal merges for the conflict that both minimize the number of merged SGBs and maximize the number of markers retrieved. Finally, for each SGB, a maximum of 200 marker genes were selected based first on their uniqueness and then on their size (bigger first), and SGBs still with less than 10 markers were discarded. Merged gut and oral SGBs (SGB_group) can be found in Supplementary Tables 3 and 4, respectively. The resulting 3.3M marker genes (189 ± 34marker genes per SGB(mean ± s.d.)) were used as a new reference database for MetaPhlAn and StrainPhlAn profiling.
Strain-level profiling of metagenomic samples
Strain profiling was performed with StrainPhlAn438,39 using the custom SGB marker database, with parameters “marker_in_n_samples 1 -sample_with_n_markers 10 –phylophlan_mode accurate -mutation_rates”. To reduce noise, only SGBs detected in ≥20 samples and at least 10% of samples in a dataset with ≥10 markers (-print_clades_only argument in StrainPhlAn) were selected for strain-level profiling (n = 646 and n = 252 SGBs in stool and oral samples respectively). The total of 200 marker genes was available for the majority of SGBs (n = 481/646 gut SGBs and n = 148/252 oral SGBs). The average coverage across SGBs was 1.3×. For the SGBs potentially derived from fermented foods, sequences of MAGs assembled in ref. 40 were added using parameter “-r”. Compared to an assembly based approach (high-quality MAGs defined as >90% completeness and <5% contamination; assembly method reported in the section “Expanded SGB database” above), strain-level profiling with StrainPhlAn allowed strain-sharing assessment among species in many more samples (median of 355 strain-level profiles per SGB and interquartile range (IQR) = [185, 806] versus median of 69 high-quality MAGs per SGB and IQR = [7, 60]).
Detection of strain-sharing events
To detect strain-sharing events, we first set SGB-specific normalized phylogenetic distance (nGD) thresholds that optimally separated same-individual longitudinal strain retention (same strain) from unrelated-individual (different strain) nGD distributions in five published stool metagenomic datasets from four different countries (Germany, Kazakhstan, Spain and United States) on three continents20,22,27,28,31. nGDs were calculated as leaf-to-leaf branch lengths normalized by total tree branch length in phylogenetic trees produced by StrainPhlAn, which are built on marker gene alignments on positions with at least 1% variability. For SGBs detected in at least 50 pairs of same-individual stool samples obtained no more than 6 months apart (n = 145 SGBs; the two samples for a certain individual in which the species could be profiled at the strain level and that were closest in time were selected), nGD thresholds were defined based on maximizing Youden’s index, and limiting at 5% the fraction of unrelated individuals to share the same strain as a bound on a false discovery rate (Extended Data Fig. 3). The assumption of frequent strain persistence in an individual for at least 6 months is supported by the distribution of phylogenetic distances in the longitudinal sets: for all species this has a peak at nGD approaching 0 (Extended Data Fig. 3), notably higher than that observed for inter-individual sample comparisons. For SGBs detected in less than 50 same-individual close pairs (n = 501) and in oral samples (n = 252), for which species-specific nGD cannot be reliably estimated, the nGD corresponding to the 3rd percentile of the unrelated individual nGD distribution was used. This value is the median percentile of the inter-individual nGD distribution corresponding to the nGD maximizing the Youden’s index of SGBs with at least 50 same-individual comparisons. The three sets of thresholds are thus three technical definitions of the same principle—that is, the individual specificity and the persistence of strains in the gut microbiome, and did not lead to significant differences in nGD values (Kruskal–Wallis test, χ2 = 2.34, P = 0.31; Extended Data Fig. 10a). nGD thresholds also did not significantly differ by phylum (Extended Data Fig. 10b), and those set in stool and oral samples were similar (median nGD difference = 0.006). If not limiting at 5% the fraction of unrelated individuals to share the same strain as a bound on a false discovery rate, the resulting percentile would only be of a median of 8.2% (range = [5.2–22.3%]) on these 38 SGBs (Supplementary Table 4). When using single metagenomic datasets instead of the five datasets we included to set the strain identity thresholds, often not enough longitudinal samples were available (<50 same-individual pairs) and some variation was observed (Extended Data Fig. 10c), which supports the use of the largest set of samples available.
Overall, the median SNV rate nGD thresholds corresponded to is 0.005, below the estimated >0.1% sequencing error rate by Illumina HiSeq and NovaSeq platforms96 (Supplementary Table 4). The nGD thresholds correspond to a SNV rate of 0 for some SGBs (n = 16 out of 646—that is, 2.5%), mostly those encompassing very low genetic variation (for example, B. animalis SGB17278). In SGB trees containing MAGs of microorganisms obtained from fermented foods, we identified and discarded any strains with high similarity (≤0.0015 SNV rate as determined by PhyloPhlAn 3 (https://github.com/biobakery/phylophlan/wiki#mutation-rates-table)—that is, the number of positions that have nucleotide differences divided by the length of the alignment) to food MAGs (Supplementary Table 6). For B. animalis (SGB17278), 62 strains profiled in 7 public mouse metagenome datasets73,75,97,98,99,100,101 were added to better assess its phylogenetic diversity. The trees produced by StrainPhlAn together with the SGB-specific nGD thresholds were used in StrainPhlAn4’s strain_transmission.py script (-threshold argument) (https://github.com/biobakery/MetaPhlAn/blob/master/metaphlan/utils/strain_transmission.py). Pairs of strains with pairwise nGD below the strain identity threshold were defined as strain-sharing events. Centred nGD is defined as the nGD divided by the median nGD in the phylogenetic tree. We opted for strain identity thresholds based on phylogenetic distances in contrast to SNV rates due to (1) the rather low coverage that we obtain for species in metagenomic samples even after passing our sequencing depth threshold (mean coverage = 7.2×, median = 0.69 and IQR = [0.14, 3.09]) that would add noise especially to SNV rate estimations; (2) the limited length of the marker gene alignment of some SGBs (mean trimmed alignment length = 74,348 nt, median = 70,879 and IQR = [42,513, 104,347]) that would make SNV rates rather unreliable; and (3) the valuable information on evolutionary models (for example, distinguishing synonymous from non-synonymous nucleotide changes) that is provided by phylogenetic trees.
We compared the new species-specific strain identity thresholds with the nGD = 0.1 threshold (that is, considering the lowest 10% phylogenetic distances to be between the same strains) used in some previous publications and StrainPhlAn versions prior to version 4 (refs. 9,32,102). We found that while the previous threshold would produce a median 44% mother–infant strain-sharing rate—in contrast to the 50% strain-sharing rate we obtain here—the novel method yields a lower strain-sharing rate between infants and unrelated mothers, which are likely to be false positives: 3.5% versus 4%. This supports the better performance of the species-specific strain identity thresholds as they detect—at the same time—more strain-sharing events between matched mothers and infants and fewer strain-sharing events between unrelated mother–infant pairs.
To assess the reproducibility of the species-specific strain identity thresholds on additional unrelated data, we used independent datasets of patients undergoing faecal microbiome transplantation (FMT). As we used the publicly available metagenomic cohorts with no intervention and longitudinal sampling20,22,27,28,31 to set the species-specific thresholds, we used for validation the completely independent FMT datasets as a distinct setting in which strain transmission can be expected. In FMT, part of the strains from a healthy donor are successfully transferred to a patient, while some strains from the donor’s original sample remain after the intervention. We included 1,371 samples from 25 different cohorts of patients undergoing FMT103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123 that were analysed as part of a meta-analysis124. In this evaluation, similar to what we did in the set of longitudinal samples, we assessed the separation between the distribution of the nGD distances of strains from the same SGB in the two following situations: (1) the strains are from samples of the same individual or from a FMT donor and their recipient after the FMT, and (2) the strains are from samples belonging to different FMT triads (defined by the samples from the donor, those of the patient before FMT, and those of the patients after FMT). We performed this analysis for each of the 95 SGBs of our set that were also profiled in the Ianiro et al study. We considered as true positives pairwise phylogenetic distance (nGD) values between samples in (1) that were below the species-specific strain identity threshold (defined on the independent longitudinal datasets), false positives as those from (2) that were below the threshold, true negatives as those from (2) above the threshold, and false negatives as those from (1) above the threshold. We found that StrainPhlAn4 with the species-specific strain identity thresholds defined here performed very well in distinguishing strains in the same individual or FMT triad from different strains in different FMT triads: median recall = 0.97 and IQR = [0.95,0.99], precision = 0.72 [0.67,0.82], F-score = 0.97 [0.96,0.98] (Supplementary Table 35).
Assessment of person–person strain-sharing rates and SGB transmissibility
Person-to-person strain-sharing rates were calculated as the number of strains shared between two individuals divided by the number of shared SGBs profiled by StrainPhlAn (number of shared strains/number of shared SGBs). When multiple samples were available for an individual, detection of strain or SGB sharing at any time point was considered as the strain or SGB was shared. For a robust calculation, person-to-person strain-sharing rates were only assessed when at least ten SGBs were shared between two individuals. The same calculation was used to assess same-individual strain retention between two time points in longitudinal datasets. Strain acquisition rates by the offspring (Extended Data Fig. 6a) were defined as the proportion of strains profiled in the offspring that were shared with the mother, thus putatively originating from her. For a robust calculation, strain acquisition rates by the offspring were only assessed when at least ten SGBs were shared between the mother and the offspring. As StrainPhlAn36,38,39 profiles the dominant strain for each species, the total number of strains shared between two samples ranges between 0 and the total number of shared profiled SGBs, whereas strain-sharing rates and strain acquisition rates by the offspring are bound between 0 and 1.
SGB transmissibility was defined as the number of strain-sharing events detected for an SGB divided by the total potential number of strain-sharing events based on the presence of a strain-level profile by StrainPhlAn4. When multiple samples were available for an individual, detection of strain sharing at any time point was considered as the strain was shared. For a robust calculation, SGB transmissibility was only assessed on SGBs with at least ten potential strain-sharing events in multiple datasets, and with at least three potential strain-sharing events for single dataset calculations. To assess concordance of SGB transmissibility among datasets, Spearman’s correlations (cor.test function in R (https://www.R-project.org/)) were performed between datasets with at least ten SGBs with assessed transmissibility. Highly transmitted SGBs were defined as those with SGB transmissibility >0.5 and significantly higher within-group than among-group transmissibility (Chi-squared tests, Padj < 0.05). We found no significant association between SGB transmissibility and the length of the trimmed alignment (Spearman’s test, ρ = 0.06, P = 0.13).
We assessed strain sharing across three main transmission modes: mother–infant (defined between mother and their offspring up to one year of age), household (defined as between cohabiting individuals), and intra-population (defined as that between non-cohabiting individuals in a population with no evidence of kinship).
Species-level beta diversity and ordination
For the appropriate analysis of microbiome compositional data, species-level abundance matrices obtained by MetaPhlAn were centred log ratio-transformed using the codaSeq.clr function in the CoDaSeq R package (v0.99.6)125, using the minimum proportional abundance detected for each taxon for the imputation of zeros. A principal component analysis plot on Aitchison distance was produced with the ordinate and plot_ordination function in phyloseq (v1.28.0)126, using one randomly selected sample per individual (n = 4,840 gut samples, n = 2,069 oral samples). To compare species-level similarity to strain-sharing rates, beta diversity metrics (Aitchison distance, Bray–Curtis dissimilarity, and Jaccard binary distance) computed with the vegan R package (v2.5–7) were converted to similarity indices (1 − (distance or dissimilarity)).
Unsupervised networks based on shared strains and species were visualized with R packages ggraph (v2.0.5), igraph (v1.2.6)127, and tidygraph (v1.2.0) with stress layout, showing connections with ≥5 shared strains or ≥50 shared species (edges) among individuals (nodes).
Annotation of species phenotypic traits
Experimentally determined bacterial phenotypes were fetched from the Microbe Directory v2.0 (ref. 128), and matched to kSGBs by NCBI taxonomic identifiers. Phenotypic traits that have previously been hypothesized to be linked with species transmissibility3 were predicted for all SGBs using Traitar (version 1.1.12)60 on the 50% core genes (genes present in 50% of genomes available in the expanded SGB database). Only annotations for which the phypat and the phypat + PGL classifiers (the second including additionally evolutionary information on phenotype gains and losses) annotations matched were kept. Associations between SGB transmissibility and microorganism phenotypes were assessed with Wilcoxon rank-sum tests on the 25% most transmissible SGBs as compared to the 25% least transmissible ones.
Statistical analyses and graphical representations were performed in R using packages vegan (version 2.5–7), phyloseq (v1.28.0)126, QuantPsyc (v1.5), ggplot2 (v3.3.3), ggpubr (v0.4.0) and corrplot (v0.84). Correction for multiple testing (Benjamini–Hochberg procedure, Padj) was applied when appropriate and significance was defined at Padj < 0.05. All tests were two-sided except where specified otherwise. The association between metadata variables and distance matrices was assessed by PERMANOVA with the adonis function in vegan. Differences between two groups were assessed with Wilcoxon rank-sum tests. For more than two groups, the Kruskal–Wallis test with post hoc Dunn tests was used. Correlations were assessed with Spearman’s tests. To assess correlations between variables while partialling out potential confounders, GLMs were fitted with the glm R function (Gaussian, link = identity). Standardized GLM regression coefficients were calculated using the lm.beta R function (QuantPsyc R package). The significance was assessed by performing log likelihood (Chi-squared) tests on nested GLMs.
All study procedures are compliant with all relevant ethical regulations. The procedures were performed in compliance with the Declaration of Helsinki. Ethical approval of the Argentina cohort was granted by the Ethics and Safety committee (CEySTE), CCT Santa Fe, Argentina (29112019). The Colombia cohort was approved by the Research Bioethics committee, Universidad Metropolitana, Colombia (NIT 890105361-5). The China_1 dataset research protocol was approved by the Ethics Committee of Shanghai Tenth Hospital, Tongji University School of Medicine (SHSY-IEC-pap-18-1), and China_2 was approved by the Ethics committee of the Health Science Center, Xi’an Jiaotong University, China (2016-114). The Guinea-Bissau study was approved by the Health Ethics National Committee (Comitê Nacional da Ética na Saude), Ministry of Public Health, Guinea-Bissau (076/CNES/INASA/2017) and by the London School of Hygiene and Tropical Medicine Ethics Committee (reference number 22898). The Italy_1 dataset research protocol was approved by the Ethics Committee of Santa Chiara Hospital, Trento, Italy (51082283, 30 July 2014) and the Ethics Committee of the University of Trento, Italy, and Italy_2 by the Liguria Regional Ethics Committee, Italy (006/2019). Ethical approval for the USA dataset was granted by Western IRB (https://www.wcgirb.com/), with protocol tracking number WIRB20151664. Written informed consent was obtained from all adult participants and from parents of non-adult participants.
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.