Molecular Ecology (2011) 20, 2693–2708 doi: 10.1111/j.1365-294X.2011.05130.x
The Bantu expansion revisited: a new analysis of Y chromosome variation in Central Western Africa
VALERIA MONTANO,*† GIANMARCO FERRI,‡ VERONICA MARCARI,* CHIARA BATINI,§
OKORIE ANYAELE,– GIOVANNI DESTRO-BISOL** and DAVID COMAS†
*Dipartimento di Biologia Ambientale, Sapienza Universita di Roma, P.le Aldo Moro 5, 00185 Rome, Italy, †Institut de Biologia
Evolutiva (CSIC-UPF), CEXS-UPF-PRBB, Doctor Aiguader 88, 08003 Barcelona 08003, Spain, ‡Department of Diagnostic and
Laboratory Service and Legal Medicine, Section of Legal Medicine, University of Modena and Reggio Emilia, Italy, §Department
of Genetics, University of Leicester, Leicester LE1 7RH, UK, –Department of Zoology, University of Ibadan, Ibadan, Oyo State,
Nigeria, **Istituto Italiano di Antropologia, P.le Aldo Moro 5, 00185 Rome, Italy
Correspondence: David Comas, Fax: +34 93 3160901;
E-mail: david
© 2011 Blackwell Publishing Ltd
Abstract
The current distribution of Bantu languages is commonly considered to be a
consequence of a relatively recent population expansion (3–5 kya) in Central Western
Africa. While there is a substantial consensus regarding the centre of origin of Bantu
languages (the Benue River Valley, between South East Nigeria and Western
Cameroon), the identification of the area from where the population expansion actually
started, the relation between the processes leading to the spread of languages and
peoples and the relevance of local migratory events remain controversial. In order to
shed new light on these aspects, we studied Y chromosome variation in a broad dataset
of populations encompassing Nigeria, Cameroon, Gabon and Congo. Our results
evidence an evolutionary scenario which is more complex than had been previously
thought, pointing to a marked differentiation of Cameroonian populations from the rest
of the dataset. In fact, in contrast with the current view of Bantu speakers as a
homogeneous group of populations, we observed an unexpectedly high level of
interpopulation genetic heterogeneity and highlighted previously undetected diversity
for lineages associated with the diffusion of Bantu languages (E1b1a (M2) sub-
branches). We also detected substantial differences in local demographic histories,
which concord with the hypotheses regarding an early diffusion of Bantu languages
into the forest area and a subsequent demographic expansion and migration towards
eastern and western Africa.
Keywords: Bantu languages, Central Africa, demographic expansion, Y chromosome
Received 23 September 2010; revision received 30 March 2011; accepted 12 April 2011
Introduction
The term Bantu refers to a family of languages which is
widespread in most of the sub-Saharan continent and is
currently spoken by almost 220 million people (Marten
2006). Despite their adoption by populations which are
settled in a very wide territory encompassing a large
portion of the continent from the equatorial belt to
Southern Africa, Bantu languages are characterized by a
high degree of similarity even among the most geo-
graphically distant communities (Greenberg 1955, 1972;
Oliver 1966a). As a result of almost a century of linguis-
tic and archaeological studies, the distribution of Bantu
languages is thought to be the effect of a population
expansion (commonly referred to as the Bantu expan-
sion) which started from the Benue River Valley,
between South East Nigeria and Western Cameroon
(Johnston 1919; Bakel 1981; Vansina 1984, 1995). This is
mainly supported by the fact that Bantoid languages,
regarded as being ancestral to the Bantu ones, are pres-
ently spoken in this area (Greenberg 1949; Guthrie 1962;
Oliver 1966a,b; Lwanga-Lunyiigo 1976). A relatively
recent population growth and colonization (~3–5 kya)
of new territories is still accepted today by most schol-
ars as the most reasonable explanation for the geo-
graphical dispersal and relative homogeneity of Bantu
languages (Schoenbrun 2001). It has also been proposed
that the first steps of migration could have followed
two main routes which have been defined as the ‘Wes-
tern’ and ‘Eastern’ streams (Vansina 1984, 1995; Scho-
enbrun 2001). An alternative scenario was proposed by
Guthrie (1962). While agreeing with Greenberg and oth-
ers about the centre of origin of Bantu languages, he
proposed the Katanga region, in the South of the Demo-
cratic Republic of Congo, in the middle of the equato-
rial forest, as the area from where Bantu-speaking
populations spread towards Western and Eastern
Africa. However, some authors have highlighted the
reductionism of these hypotheses based on a single
huge population migration linked to the spread of lan-
guages, and have underlined the relevance of local
migration processes (Lwanga-Lunyiigo 1976; Ehret 2001;
Schoenbrun 2001).
Population genetic studies may clarify the dynamics
underlying the present distribution of Bantu-speaking
populations at both regional and sub-continental levels
(Mitchell 2010; Scheinfeldt et al. 2010). Unilinearly trans-
mitted polymorphisms of the Y chromosome are partic-
ularly useful for this purpose, since they may be used
either to draw phylogeographic inferences or to detect
signatures of male driven demographic processes. As
an additional advantage, the widespread practice of pa-
trilocality among Bantu-speaking populations makes
the distribution of paternal lineages less prone to the
confounding effect of recent gene flow than maternal
lineages (Hammer et al. 2001; Destro-Bisol et al. 2004;
Wilder et al. 2004a,b; Berniell-Lee et al. 2009; Coia et al.
2009).
Despite its potential, Y chromosome variation has
been scantily explored in sub-Saharan Africa and has
been studied even less than the other unilinearly trans-
mitted marker, mitochondrial DNA (mtDNA) (Salas
et al. 2002; Pakendorf & Stoneking 2005; Destro-Bisol
et al. 2010). In fact, previous Y-chromosomal studies
have been carried out on a local geographic scale or
have investigated a limited number of geographically
dispersed Bantu-speaking populations (Beleza et al.
2005; Coia et al. 2005; Berniell-Lee et al. 2009), and
even the broadest datasets do not contain certain areas
of primary importance to test the hypotheses concern-
ing the Bantu expansion (Hammer et al. 2001; Under-
hill et al. 2001; Wood et al. 2005; De Filippo et al.
2011). Nonetheless, there is a substantial convergence
concerning the hypothesis that specific paternal lin-
eages, defined using single nucleotide polymorphisms
(SNPs) and short tandem repeats (STRs), could be a
genetic legacy of the Bantu expansion. This fact is sup-
ported by their distribution and prevalence among
Bantu speakers (Thomas et al. 2000; Underhill et al.
2000, 2001; Cruciani et al. 2002; Pereira et al. 2002;
Beleza et al. 2005; Wood et al. 2005; Berniell-Lee et al.
2009) and by estimates of time of expansion, as in the
case of haplogroups E1b1a7 (defined by M191) and
E1b1a (defined by M2), which have been dated back to
between 3.4 and 5.2 kya (Zhivotovsky et al. 2006;
Berniell-Lee et al. 2009). In general, previous studies
regarded the low level of variation occurring at the
Y chromosome and other genetic systems among
Bantu-speaking populations as a signature of a recent
population expansion (Bandelt et al. 1995; Alves-Silva
et al. 2000; Jobling et al. 2004; Plaza et al. 2004; Berni-
ell-Lee et al. 2009; Tishkoff et al. 2009). On the whole,
previous genetic investigations have highlighted the
agreement between the genetic structure of Bantu-
speaking populations and some generic predictions of
linguistic theories (Jobling et al. 2004; Zhivotovsky
et al. 2004; Berniell-Lee et al. 2009). However, genetic
studies should be more fruitfully considered as an
independent tool to clarify anthropological issues and
explore their complexity, since they may provide infor-
mation that can be compared and, eventually, inte-
grated with data and inferences from other disciplines.
Here, we present a study of Y chromosome variation
in a broad dataset encompassing Nigeria, Cameroon,
Gabon and Congo, focusing on the haplogroup E1b1a
(M2) and its sub-branches, which are the most frequent
lineages in sub-Saharan Africa. The populations sampled
are native groups settled either in the area where the Bantu
expansion is thought to have originated or in the
regions located at the putative origin of the Western
stream. The analysis of paternal lineages was based on
recently discovered SNPs (Wilder et al. 2004a; Sims
et al. 2007; Karafet et al. 2008), which make our level of
resolution higher than in previous studies on the
genetic legacy of Bantu expansion (Jobling et al. 2004;
Beleza et al. 2005; Wood et al. 2005; Berniell-Lee et al.
2009).
The availability of a large and tailored population
dataset together with a more in-depth dissection of
genetic variation made it possible to perform an analy-
sis of the relationships between genetic variation, geo-
graphical, and linguistic factors in the Bantu area.
Furthermore, we used genetic data to draw demo-
graphic inferences on the peopling processes in Central
Western Africa. Both these approaches disclose greater
complexity than highlighted by previous research on
the genetic legacy of the Bantu expansion, showing dif-
ferent genetic patterns among the populations under
study and signatures of ancient demographic events.
Materials and methods
Population sampling and Y chromosome genotyping
The dataset consists of a total of 505 unrelated male
individuals from 17 sub-Saharan African populations,
including both unstudied populations from Nigeria and
previously partially investigated groups from Camer-
oon, Gabon and Congo (Berniell-Lee et al. 2009; Coia
et al. 2009; Fig. 1a; Table S1). Pairwise distances
between sample collection sites (measured as air dis-
tances) range from 69.4 km for the nearest villages
(Gran Zambe and Kouambo, Cameroon) to 1320 km
for the furthest villages (Idah in Nigeria and Ollebi in
Congo).
An appropriate consent form was signed by each
DNA donor. DNA extraction from a cheek swab or
from blood was performed with a phenol/chloroform
standard protocol (Gill et al. 1985) and extraction prod-
ucts were quantified with the Quantifiler® Human
DNA Quantification Kit (Applied Biosystems).
Fig. 1 (a) Map of sampling locations with pie charts of haplogroup frequencies
for each population. (b) Maps of the distributions of the main haplogroups of
the Y chromosome. Circles represent the geographic position of the populations.
The intensity of shades is proportional to the values of interpolated haplogroup
frequencies.
Twenty Y chromosome SNPs were genotyped in a
hierarchical manner with two different methods: a
probe hybridization approach with TaqMan® SNP
Genotyping Assays (Applied Biosystems) and a multi-
ple single-base extension reaction approach with the
SNaPshot® Multiplex Kit (Applied Biosystems). Loci
analysed with TaqMan probes include M96(E),
M2(E1b1a), M191(E1b1a7), M207(R), M17(R1a1), P116
(E1b1a7a3), 50f2P(B2b). The real time PCRs were per-
formed with a 7900HT Fast Real-Time PCR System
(Applied Biosystems) using program default conditions
and adapting the number of cycles to probe perfor-
mances, from a minimum of 40 to a maximum of 60.
The rest of the SNPs were typed using the SNaPshot
technique performing three multiplexes [first: M91(A),
M60(B), M150(B2a); second: M75(E2), P2(E1b1),
M215(E1b1b), M154(E1b1a4); third: U175(E1b1a8), U174
(E1b1a7a), U209 (E1b1a8a), P9.2(E1b1a7a1), P115
(E1b1a7a2), U290(E1b1a8a1)]. The first PCR step was
performed with a QIAGEN Multiplex PCR kit.
Seventeen Y chromosome STRs were typed using the
AmpFlSTR® Yfiler® PCR Amplification Kit (Applied
Biosystems) designed for loci: DYS456, DYS389I,
DYS390, DYS389II, DYS458, DYS19, DYS385a/b,
DYS393, DYS391, DYS439, DYS635, DYS392, Y GATA
H4, DYS437, DYS438, DYS448.
SNaPshot and Yfiler products were run in a 3130xl
Genetic Analyser (Applied Biosystems) and analysed
with GeneMapper software (Applied Biosystems) to
assign individual genotypes. Derived alleles for loci
M154, M215, P9.2 and M17 were not observed. Y chro-
mosome haplogroup classification was based on Karafet
et al. (2008).
Statistical analyses
A graphical representation of the geographical distribu-
tion of Y chromosome haplogroup frequencies was
drawn with Surfer 8.0 software (Golden Software Prod-
ucts).
A phylogenetic reconstruction of haplotype relation-
ships was inferred using the reduced median network
algorithm, whose output was used to calculate the med-
ian joining vectors, using the Network Software 4.5 (Ban-
delt et al. 1995, 1999). The network was built
introducing both SNP and STR data for each individual.
A weight of 99 was assigned to all the SNPs, while the
weight of each STR locus (ranging from 5 for DYS635 to
57 for DYS392) was calculated on the basis of its sample
variance according to Meyer et al. (1999). Sixteen indi-
viduals with missing loci were excluded. Loci DYS385a/b
and DYS389II were excluded due to their phyloge-
netic uncertainty. In fact, loci DYS385a/b are amplified
together in the same fragment and cannot be electro-
phoretically separated, which makes a correct alignment
impossible. Similarly, locus DYS389II is co-amplified
with DYS389I and its size could be calculated only indi-
rectly, by subtracting the DYS389I allele (which is also
amplified separately) from the total fragment (see Gus-
mao et al. 2006 for details on forensic applications).
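The indirect calculation described above for DYS389II can be illustrated with a minimal sketch. The helper name and the allele values (13 and 30 repeats) are illustrative assumptions and are not taken from the dataset.

```python
def dys389ii_specific_repeats(dys389i: int, dys389ii_total: int) -> int:
    """Repeats specific to the DYS389II segment (hypothetical helper).

    The DYS389II fragment contains the DYS389I repeats, so the
    DYS389I allele is subtracted from the total fragment.
    """
    if dys389ii_total <= dys389i:
        raise ValueError("the DYS389II total must exceed the DYS389I allele")
    return dys389ii_total - dys389i


# Illustrative alleles only (assumed values, not observed data)
specific = dys389ii_specific_repeats(13, 30)
```
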
In order to describe intrapopulation diversity, we cal-
culated haplotype and haplogroup frequencies, haplo-
group and haplotype diversity, and mean number of
pairwise differences (MNPD), using Arlequin 3.11 (Ex-
coffier et al. 2005). The weighted intralineage mean
pairwise (WIMP) values were obtained by subdividing the
dataset of each region into lineages and calculating the
MNPD within each lineage. In this way, it is possible to
estimate the weighted average of the mean pairwise dif-
ferences among lineages, using the formula of weighted
mean based on variance (Sokal & Rohlf 1995). The
weighted interpopulation mean pairwise (WPMP) is the
weighted average of the mean pairwise differences
among populations of the same region, calculated on
the basis of MNPD variance for each population.
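A minimal sketch of a variance-based weighted mean is given here, under the assumption that each value is weighted by the inverse of its variance (the squared standard deviation); the text cites Sokal & Rohlf (1995) without spelling out the formula, so this reading is an assumption. The function name is hypothetical. The inputs are the MNPD values and standard deviations of the three Nigerian populations as read from Table 2.

```python
def inverse_variance_weighted_mean(values, sds):
    """Weighted mean with weights 1 / sd**2 (assumed weighting scheme)."""
    weights = [1.0 / (sd * sd) for sd in sds]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# MNPD and SD for Tiv, Idoma and Igala (Table 2)
nigeria_mnpd = [6.41, 6.62, 6.37]
nigeria_sd = [3.086, 3.192, 3.084]

wpmp_nigeria = inverse_variance_weighted_mean(nigeria_mnpd, nigeria_sd)
```

Under this assumption the result is close to the WPMP reported for Nigeria in Table 2 (6.463), which suggests, without proving, that the weighting is of this kind. The sketch does not attempt to reproduce the reported standard deviations.
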
Interpopulation genetic distances were obtained
according to Slatkin (1995) for STR haplotypes (Rst). A
graphical representation of the genetic distance matrix was
produced with SPSS 15.0 (SPSS for Windows, Rel.
11.2006. Chicago: SPSS Inc.) using a metric multidimen-
sional scaling method.
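The fit of a multidimensional scaling configuration is commonly summarized by a stress value. The sketch below computes Kruskal's stress formula 1 for a given configuration; whether SPSS reports exactly this variant for the analysis described here is an assumption, and the coordinates and distance matrices are toy values, not the Rst data.

```python
import math


def kruskal_stress1(coords, dissimilarities):
    """Stress-1: sqrt(sum (d_ij - delta_ij)^2 / sum d_ij^2) over pairs i < j."""
    num = 0.0
    den = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])  # distance in the MDS plot
            num += (d - dissimilarities[i][j]) ** 2
            den += d ** 2
    return math.sqrt(num / den)


# Toy configuration (hypothetical): a 3-4-5 right triangle
coords = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
exact = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]        # matches the plot exactly
distorted = [[0, 3, 4], [3, 0, 6], [4, 6, 0]]    # one distance misrepresented

stress_exact = kruskal_stress1(coords, exact)
stress_distorted = kruskal_stress1(coords, distorted)
```

A stress of zero means the plotted distances reproduce the input matrix exactly; larger values indicate a poorer low-dimensional representation.
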
In order to carry out a simultaneous exploration of
diversity among populations and the relative weight of
genetic variables, we conducted a principal component
analysis (PCA) for SNP haplogroup frequency data with
the R software package ade4 (Dray & Dufour 2007; R
Development Core Team 2008).
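As an illustration of a PCA on haplogroup frequencies, the sketch below extracts the first principal component of a centred, unscaled frequency matrix by power iteration. This is a swapped-in, simplified technique and not the ade4 routine; the choice of centring without scaling is an assumption, and the frequency matrix is a toy example with hypothetical values.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit loading vector, eigenvalue) of the leading PC."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centred = [[r[j] - means[j] for j in range(k)] for r in rows]
    # Sample covariance matrix of the haplogroup frequencies
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(k)]
           for a in range(k)]
    vec = [1.0 / (j + 1) for j in range(k)]  # arbitrary deterministic start
    eigenvalue = 0.0
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # start vector orthogonal to all variation
            break
        vec = [x / norm for x in nxt]
        eigenvalue = norm
    return vec, eigenvalue


# Toy data: 4 populations x 3 haplogroups (hypothetical frequencies)
freqs = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
]
loadings, eigenvalue = first_principal_component(freqs)
```

In this toy example all variation lies in the contrast between the first two haplogroups, so their loadings have equal magnitude and opposite sign, while the third haplogroup, which is constant across populations, contributes nothing.
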
To detect signals of population structure, a hierarchi-
cal analysis of molecular variance (AMOVA) was carried
out grouping the populations according to both geo-
graphical and linguistic criteria, with Arlequin 3.11 soft-
ware (Excoffier et al. 2005). Geographical groups were
defined on the basis of political borders, with the excep-
tion of the only population from Congo which was
included in the Gabonese dataset. Linguistic groups are
based on the Ethnologue linguistic classification
(Lewis 2009. Ethnologue: SIL International. Online ver-
sion: http://www.ethnologue.com/) and have been
divided into four main categories: (i) Benue-Congo,
neither Bantoid nor Bantu (Idoma and Igala); (ii) Benue-
Congo Bantoid (Tiv and Bamileke); (iii) Benue-Congo
Bantoid Bantu family A (Bakaka, Bassa, Ewondo,
Ngoumba, Fang, Makina and Benga); (iv) Benue-Congo
Bantoid Bantu family B (Duma, Kota, Ndumu, Nzebi
and Bateke) (Table S1, Supporting information).
To investigate the potential relationship between geo-
graphic distances and genetic variation, a spatial princi-
pal component analysis (sPCA) was carried out for SNP
haplogroup frequencies using the algorithm imple-
mented in the R software package adegenet (Jombart
2008; Jombart et al. 2008; R Development Core Team
2008). In essence, the method combines the spatial
autocorrelation of a set of allelic frequencies, measured
with Moran’s index (Moran’s I), with information on the
genetic variance among entities (individuals or
populations) in order to detect spatial patterns. The
spatial information used to compute Moran’s I is usually
stored in a symmetrical binary matrix in which pairs of
populations or individuals are coded as neighbours (1) or
non-neighbours (0). In our case, a matrix of inverse
pairwise distances was used instead, so that all
populations are neighbours and the spatial information is
converted into weights proportional to the inverse of the
spatial distances. Unlike conventional PCA, sPCA yields
components with both positive and negative eigenvalues,
since it optimizes the product of the genetic variance
among entities and their spatial autocorrelation. The most
informative components are the most positive ones
(associated with positive spatial autocorrelation) and the
most negative ones (associated with negative spatial
autocorrelation), which carry the information about the
global and the local structure of the sample, respectively.
A global structure implies that each sampling location is
genetically closer to its neighbours than to randomly
chosen locations, as happens with spatial groups, clines
or intermediate states. Conversely, a local structure is
characterized by stronger genetic differentiation among
neighbours than among random pairs of entities. The
component to take into consideration is the one whose
eigenvalue has the highest absolute value. To evaluate
the consistency of the detected geographical structures
against a random spatial distribution of the observed
genetic variance, a Monte-Carlo based test is applied
(Jombart et al. 2008). This test simulates a random
distribution of the genetic variability (H0 or null
hypothesis) on the connection network. The simulated
distribution represents the correlation of the randomized
genetic variables with the vectors of Moran’s I predicting
the global or the local structure, and the p-value is
obtained by comparing the observed value with this
distribution. If the p-value falls below the chosen
significance threshold, the spatial distribution of the
genetic variance is not random and the null hypothesis
can be rejected. We applied the test with 100 000
iterations.
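The two ingredients described above, Moran's I with inverse-distance weights and a Monte-Carlo test against random spatial arrangements, can be sketched for a single variable as follows. This is a simplified univariate illustration and not the multivariate test implemented in adegenet; the function names, coordinates and frequencies are hypothetical.

```python
import math
import random


def inverse_distance_weights(coords):
    """All sites are neighbours; weight = 1 / distance (0 on the diagonal)."""
    n = len(coords)
    return [[0.0 if i == j else 1.0 / math.dist(coords[i], coords[j])
             for j in range(n)] for i in range(n)]


def morans_i(values, w):
    """Moran's I of one variable for a spatial weight matrix w."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    s0 = sum(sum(row) for row in w)
    return (n / s0) * (num / den)


def permutation_test(values, w, n_perm=999, seed=0):
    """One-sided Monte-Carlo p-value for positive autocorrelation."""
    rng = random.Random(seed)
    observed = morans_i(values, w)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reassignment of values to sites
        if morans_i(shuffled, w) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy example: six sites on a transect with a frequency cline
sites = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
freq = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
weights = inverse_distance_weights(sites)
observed_i, p_value = permutation_test(freq, weights)
```

A cline such as the toy data gives a positive Moran's I (a global structure in the terminology used above), and the p-value is the proportion of random arrangements whose statistic equals or exceeds the observed one.
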
The BATWING software was used to estimate the fol-
lowing demographic and evolutionary parameters:
ancestral effective population size, time of the begin-
ning of the population demographic expansion, and
time to the most recent common ancestor (Wilson et al.
2003). The software is based on the coalescent theory
and can test three different demographic models with a
Bayesian approach: constant population size, growing
population size, and constant population size followed
by demographic growth. The last one seems to be the
most reasonable for populations that have undergone
an agricultural revolution. Moreover, it has been dem-
onstrated that this model is the most appropriate for
African populations (Laval et al. 2010). Consequently,
the whole data set was tested using the above-men-
tioned model. Since the BATWING coalescent model was
not designed to take gene flow into account (Wilson
et al. 2003), which is likely to be intensive among the
populations analysed (Destro-Bisol et al. 2004), we
decided not to estimate the time of the splitting of the
populations. Prior distributions were established to
cover a range of expectations which is concordant with
human population history (Wilson et al. 2003). For the
effective population size, a lognormal distribution (9, 1)
was used, whereas for the alpha and beta priors,
gamma distributions of (1, 200) and (0.5, 1), respec-
tively, were used. To obtain the most reliable evaluation
of the posterior mutation rate distribution, only 12 te-
tranucleotide loci (DYS456, DYS389I, DYS390, DYS458,
DYS19, DYS393, DYS391, DYS439, DYS635, GATA_H4,
DYS437, DYS438) were used, and a wide gamma
mutation prior distribution of (7, 7500) was assigned to
all STR loci, with a mean equal to 9.3 × 10⁻⁴, covering a
range between 10⁻³ and 10⁻⁴ in accordance with the
expected values of the Y chromosome STR mutation
rate of both observed and effective estimates (see the
YHRD.ORG 3.0 database for a summary of the main publica-
tions about Y chromosome STR mutation rates;
Zhivotovsky et al. 2004). This prior distribution is wider
than the ones used in previous studies which were
based on meioses, where the variance is very narrow
(Balaresque et al. 2010; Shi et al. 2010). SNP information
was integrated for the phylogenetic reconstruction to
discriminate possible STR haplotype homoplasies, but it
was not considered for posterior estimates. Chain con-
vergence was evaluated with three independent runs
(starting from different seeds) using two different diag-
nostics implemented in the R package coda (Plummer
et al. 2006; R Development Core Team 2009): the Gel-
man diagnostic (Gelman & Rubin 1992) and the Geweke
diagnostic (Geweke 1992). The number of samples was
2 × 10⁶ with treebetN = 10 and Nbetsamp = 20. The
mode values of the posterior distributions were calcu-
lated with the R package modeest (Poncet
2009; R Development Core Team 2009). To test the pres-
ence of genetic signatures of the ‘Western stream’ of the
Bantu expansion, the demographic parameters were
inferred in the two spatial groups of populations identi-
fied by the sPCA analysis (see Fig. 4c). Given the con-
trast between the sign of sPCA score for the Ndumu
from Gabon (see Fig. 4c), although close to zero, and
that of its neighbours, we repeated the analysis for both
groups with and without this population.
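The prior settings quoted above can be checked with a short sketch. It assumes that the gamma prior (7, 7500) is parameterized as shape and rate, which is consistent with the stated mean of 9.3 × 10⁻⁴, and that the lognormal prior (9, 1) gives the mean and standard deviation of the logarithm of the effective population size; the latter is an assumption, and the variable names are hypothetical.

```python
import math
import random

# Gamma prior on the STR mutation rate: shape 7, rate 7500 (assumed meaning)
shape, rate = 7.0, 7500.0
prior_mean = shape / rate            # about 9.3e-4 per locus per generation
prior_sd = math.sqrt(shape) / rate   # spread of the prior

# Monte-Carlo check; random.gammavariate takes shape and SCALE (= 1 / rate)
rng = random.Random(1)
draws = [rng.gammavariate(shape, 1.0 / rate) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)

# Lognormal(9, 1) prior on effective population size: median = exp(9)
ne_prior_median = math.exp(9.0)
```

Under these assumptions the mutation-rate prior is centred near 9.3 × 10⁻⁴ with a standard deviation of roughly 3.5 × 10⁻⁴, and the effective-size prior has a median of about 8100 individuals.
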
Results
Intrapopulation diversity of paternal lineages in Central Western Africa
The analysis of 20 Y chromosome biallelic markers has
shown that Central African samples are characterized
by the presence of several sub-branches within haplo-
group E (M96). Haplogroups E1b1a7a* (U174), E1b1a8a
(xE1b1a8a1) (U209), and E1b1a8a1 (U290) are the most
frequent, accounting for 75% of our data set (Table 1
and Fig. 1a). Haplogroups A, B and R are also found
at lower frequencies. As previously reported (Cruciani
et al. 2002; Berniell-Lee et al. 2009), haplogroups B2a
(M150) and R (xR1a) (M207) occur at frequencies from
moderate to high in Central Africa (from 5.4% to 40%).
The geographical haplogroup distribution is shown in
Fig. 1a, and its interpolation in Fig. 1b. Since the sam-
pling does not evenly cover the area, readers should
be aware that the representation is prone to over-inter-
pretation. Nonetheless, we believe that these maps are a
useful tool to visualize the phylogeographic patterns
inferred from the data which is presently available.
Within the haplogroup E (M96), the green component
corresponding to E1b1a7a* (U174) is prevalent in Nige-
ria and Gabon (χ² = 18.33, 0.05 > P > 0.001), while the
blue component representing E1b1a8a (U209) is signifi-
cantly more frequent in Cameroon (χ² = 32.64,
P < 0.001). It is worth noting that the sub-clade
E1b1a7a3 (P116) was only detected in Gabon and in one
population from Cameroon (Bassa), whereas E1b1a7a2
(P115) was only observed among Fang (both in light
green, see Fig. 1b). A phylogenetic reconstruction for
all the haplogroups is provided in Fig. S1 (Supporting
information).
The intrapopulation haplogroup diversity indices
range from 0.561 to 0.847 (Table 2), attaining values
which are comparable to or slightly higher than those
reported in previous studies (Beleza et al. 2005; Rosa
et al. 2007), as is to be expected given the increased res-
olution of our SNP panel. Nigerian samples exhibit the
lowest values of haplogroup diversity, which gradually
increases in Cameroonian and Gabonese samples. Con-
cerning Y chromosome STR haplotypes, the values of
intrapopulation haplotype diversity are greater than
97% in all populations with the exception of Bakaka
and Ewondo, thus achieving in most cases the power of
discrimination expected for forensic markers. The
WIMP value for Nigeria (2.914) is markedly lower than
the ones obtained for Cameroon and Gabon (4.264 and
4.219, respectively, Table 2). This result is due to the
presence of a single predominant haplogroup (E1b1a7a
(U174)) in Nigerians, suggesting a reduced variation in
the ancestral population and limited gene flow from
other regions. However, the distributions of MNPD and
WIMP values obtained for Nigeria are still partially
overlapping. Furthermore, regional WPMP and MNPD
values are similar in Gabon and Nigeria, whereas
WPMP is lower in Cameroon due to the greater hetero-
geneity among populations in terms of haplogroup
composition, although the difference from the MNPD
value is not significant.
Interpopulation diversity of paternal lineages in Central Western Africa
The multidimensional scaling (MDS) plot of Rst genetic
distances (Fig. 2) stresses the differentiation and hetero-
geneity of Cameroonian samples compared to the rest
of the populations. This is particularly evident for the
Bakaka, Ewondo and Ngoumba, who behave as outliers.
In contrast, Nigerians and, to a lesser extent, Gabonese
group together, and no statistically significant
genetic distance was observed between Nigerian and
Gabonese populations (Table S2, Supporting informa-
tion). As expected on the basis of their common eth-
nic affiliation, no significant differentiation can be
observed between Fang from Cameroon and Gabon
(Table S2). The only Congolese population (Bateke) is
close to groups from Gabon, reflecting their geographi-
cal proximity.
The differentiation among Cameroonian populations and
the relative homogeneity of Nigerians and Gabonese are
confirmed by the PC plot (Fig. 3). The Nigerian popula-
tions are grouped together up to the fifth component
(data not shown), reflecting their marked similarity in
haplogroup composition. Ngoumba are less distant
from other populations than in the MDS plot, probably
due to the fact that the haplogroup B, which is particu-
larly frequent in this population, does not give a high
contribution to the first two principal components (see
loading scores, Fig. S2, Supporting information).
Accordingly, their diversity from the rest of the dataset
is better highlighted by the third and fourth PCs (data
not shown). Conversely, the outlier position of the Ew-
ondo is further stressed, due to the prevalence of
E1b1a8a1(U290), which is the haplogroup that gives the
highest contribution to the second PC (see loading
scores, Fig. S2).
An AMOVA was performed to detect possible linguistic
and ⁄ or geographical structuring of genetic variation
(Table 3). A significant genetic heterogeneity was found
when all populations were taken as a single group
(8.17% for SNPs and 5.35% for STRs). The results
obtained for each geographical group indicate that
Cameroon is the main contributor to the observed het-
erogeneity, as predicted by PCA and genetic distances.
This is confirmed when using a jackknife procedure, by
Table 1 Y-chromosome haplogroup frequencies in the seventeen populations analysed

Region     Population   E1b1a7a (U174)*   Total (all haplogroups)
Nigeria    TIV          34                52
Nigeria    IDO          26                40
Nigeria    IGA          25                40
Cameroon   BAK          15                43
Cameroon   EWO          7                 26
Cameroon   BAS          3                 41
Cameroon   NGO          4                 15
Cameroon   FANC         6                 12
Cameroon   BAM          13                32
Congo      BATN         6                 19
Gabon      BEN          13                22
Gabon      DUM          11                32
Gabon      KOT          9                 21
Gabon      MAK          15                32
Gabon      NZE          2                 25
Gabon      NDU          12                33
Gabon      FANG         4                 20
Total                   205               505

Haplogroup          Total
A (M91)             6
B (M60)             3
B2a (M150)          24
B2b (50f2P)         1
E1b1a (M2)          15
E1b1a7 (M191)       6
E1b1a7a (U174)*     205
E1b1a7a2 (P115)     3
E1b1a7a3 (P116)     26
E1b1a8 (U175)       1
E1b1a8a (U209)      109
E1b1a8a1 (U290)     71
E2 (M75)            17
R (M207)            18
TOTAL               505

[Per-population counts for haplogroups other than E1b1a7a (U174)* cannot be
assigned to columns from the source layout; only their totals are listed.]

Loci M96 and P2 are basal nodes for E haplogroup and are not reported. List of
abbreviations following the order of the table: Tiv (Tiv), Ido (Idoma), Iga
(Igala), Bak (Bakaka), Ewo (Ewondo), Bas (Bassa), Ngo (Ngoumba), FanC (Fang
Cameroon), Bam (Bamileke), BatN (North Bateke), Ben (Benga), Dum (Duma), Kot
(Kota), Mak (Makina), Nze (Nzebi), Ndu (Ndumu), FanG (Fang Gabon).
Table 2 Intrapopulation diversity indices for Y chromosome data

             Haplogroups (SNPs)                           Haplotypes (STRs)
Population   N    N haplogroups   Haplogroup diversity    N haplotypes   Haplotype diversity   MNPD
Nigeria
TIV          52   9               0.564 (±0.079)          43             0.990 (±0.006)        6.41 (±3.086)
IDOMA        40   8               0.561 (±0.086)          37             0.996 (±0.006)        6.62 (±3.192)
IGALA        40   6               0.566 (±0.075)          37             0.996 (±0.006)        6.37 (±3.084)
Cameroon
BAKAKA       43   5               0.577 (±0.051)          22             0.952 (±0.015)        4.38 (±2.204)
EWONDO       26   6               0.646 (±0.075)          19             0.960 (±0.025)        5.49 (±2.731)
BASSA        41   7               0.651 (±0.072)          34             0.984 (±0.012)        6.09 (±2.961)
NGOUMBA      15   4               0.761 (±0.066)          14             0.990 (±0.028)        8.04 (±3.961)
FANGC        12   5               0.727 (±0.113)          12             1.000 (±0.034)        5.85 (±3.007)
BAMILEKE     32   4               0.707 (±0.039)          31             0.998 (±0.009)        5.93 (±2.908)
Congo
BATEKEN      19   6               0.801 (±0.055)          17             0.988 (±0.021)        6.46 (±3.197)
Gabon
BENGA        22   6               0.632 (±0.104)          17             0.974 (±0.022)        5.93 (±2.944)
DUMA         32   9               0.802 (±0.047)          27             0.987 (±0.012)        6.77 (±3.274)
KOTA         21   6               0.742 (±0.068)          19             0.990 (±0.018)        6.39 (±3.156)
MAKINA       32   6               0.729 (±0.062)          30             0.996 (±0.009)        6.76 (±3.270)
NZEBI        25   8               0.810 (±0.063)          24             0.996 (±0.012)        6.37 (±3.122)
NDUMU        33   9               0.822 (±0.047)          32             0.998 (±0.008)        7.13 (±3.431)
FANGG        20   5               0.847 (±0.047)          18             0.989 (±0.019)        6.50 (±3.893)

Region       MNPD/region       WPMP              WIMP
Nigeria      6.541 (±3.112)    6.463 (±0.555)    2.914 (±1.401)
Cameroon     6.131 (±2.931)    5.616 (±0.864)    4.264 (±1.158)
Congo        –                 –                 –
Gabon        6.621 (±3.142)    6.519 (±0.810)    4.219 (±1.164)

Fourteen Y-chromosome STRs have been used for the estimations. Loci DYS389II
and DYS385a/b were excluded from the estimates because of their phylogenetic
uncertainty, as recommended by Gusmao et al. (2006). MNPD, mean number of
pairwise differences; WPMP, weighted interpopulation mean pairwise using
relative variance; WIMP, weighted interlineage mean pairwise using relative
variance.
Fig. 2 Multidimensional scaling of the genetic distances of the
populations. The stress value (0.203) is acceptable according to
Sturrock & Rocha (2000).
Fig. 3 Principal component analysis based on haplogroup fre-
quencies.
which we observed that the percentage of molecular
variance explained at population level substantially
decreases in Cameroon after excluding the Ngoumba
population (from 10.50% to 4.18%, P = 0.027 for SNPs;
from 11.69% to 8.33%, P < 0.001 for STRs). The removal
of any other population does not lead to comparable
reductions (data not shown). The AMOVA using the geo-
graphical classification (Nigeria, Cameroon, Gabon and
Congo) shows significant variance among groups
(5.90% for SNPs, 1.97% for STRs). In this latter case
however, the proportion of variation due to differences
among groups is lower than that found among popula-
tions within groups. The significant percentage of the
variance detected among populations within groups
using both SNPs and STRs is due to the presence of
Cameroonian populations, as shown by regional AMOVAs
(Table 3). Conversely, no significant differentiation
among groups of populations was found when populations
were grouped according to their linguistic affiliation,
even after removing from the analysis the linguistically
heterogeneous group including Idoma and Igala
(data not shown). This suggests a lack of
correlation between paternal lineage distribution and
linguistic diversity.
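As a quick consistency check on the variance partition described above, the components reported in Table 3 for each grouping scheme should sum to approximately 100%, and the among-group share can be compared with the among-population share. The following is a minimal Python sketch that uses only the percentages printed in Table 3; the dictionary layout, the function names and the rounding tolerance of 0.05 are illustrative assumptions and are not part of the original analysis.

```python
# Percentages of variance copied from Table 3, ordered as
# (among groups, among populations within groups, within populations).
# The 0.05 tolerance for rounding error is an assumption made for this sketch.
TABLE3 = {
    "Linguistic groups": {
        "SNPs": (0.17, 8.04, 91.79),
        "STRs": (0.26, 5.05, 94.70),
    },
    "Geographical groups": {
        "SNPs": (5.90, 3.88, 90.22),
        "STRs": (1.97, 3.82, 94.21),
    },
}


def partition_is_complete(components, tolerance=0.05):
    """Return True if the variance components sum to 100% within tolerance."""
    return abs(sum(components) - 100.0) <= tolerance


def among_groups_is_smaller(components):
    """Return True if among-group variance is below among-population variance."""
    among_groups, among_populations, _within = components
    return among_groups < among_populations


for scheme, markers in TABLE3.items():
    for marker, components in markers.items():
        print(
            scheme, marker,
            "| sums to 100%:", partition_is_complete(components),
            "| among groups < among populations:",
            among_groups_is_smaller(components),
        )
```

With the geographical grouping, the among-group component is smaller than the among-population component for STRs but not for SNPs, which is worth keeping in mind when reading the comparison in the text.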
In order to obtain further insights into the geographic
distribution of the genetic diversity, a spatial principal
component analysis (sPCA) was performed using haplogroup
frequencies (Jombart et al. 2008). The plots identify two
groups of populations (Fig. 4): Nigeria, Bakaka and Bamileke
from Cameroon on the one hand, and the remaining populations
on the other, with the exception of Ndumu from South Gabon,
which shows a positive score (Fig. 4c). The strongest
genetic differentiation is found at the border between
these two geographic areas (as indicated by the increasing
density of white lines in Fig. 4b). The highest eigenvalue
obtained is the most positive one, which is associated with
the global structure. However, according to the test
of significance, the geographical distribution of the
genetic variability is compatible with a random global
structure (observed value 0.119; Monte Carlo test,
P = 0.156; see Fig. 4d).
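The logic of a Monte Carlo test of spatial structure can be illustrated with a short Python sketch. It is not the procedure used in this study: the global test associated with sPCA relies on its own statistic, whereas the sketch swaps in Moran's I on a single variable as a simpler stand-in. The toy frequencies, the neighbour graph and the function names are hypothetical, and the p-value uses the common (k + 1)/(n + 1) convention, where k is the number of permuted statistics at least as large as the observed one.

```python
import random


def morans_i(values, neighbours):
    """Moran's I with binary weights; neighbours maps site index -> list of indices."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    denominator = sum(d * d for d in deviations)
    numerator = 0.0
    weight_sum = 0
    for i, linked in neighbours.items():
        for j in linked:
            numerator += deviations[i] * deviations[j]
            weight_sum += 1
    return (n / weight_sum) * (numerator / denominator)


def monte_carlo_p_value(observed, simulated):
    """One-sided p-value: share of simulated statistics >= observed, with +1 correction."""
    exceed = sum(1 for s in simulated if s >= observed)
    return (exceed + 1) / (len(simulated) + 1)


def permutation_test(values, neighbours, n_permutations=999, seed=0):
    """Shuffle values across sites to build the null distribution of Moran's I."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbours)
    shuffled = list(values)
    simulated = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        simulated.append(morans_i(shuffled, neighbours))
    return observed, monte_carlo_p_value(observed, simulated)


# Hypothetical toy data: six sites along a line, with a frequency that
# increases from one end to the other (a clear spatial gradient).
toy_values = [0.10, 0.15, 0.30, 0.55, 0.70, 0.90]
toy_neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

observed, p_value = permutation_test(toy_values, toy_neighbours)
print(f"observed Moran's I = {observed:.3f}, Monte Carlo p = {p_value:.3f}")
```

A non-significant result, such as the P = 0.156 reported above, means that the observed statistic falls well within the distribution obtained by randomly reassigning values to locations.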
To infer demographic parameters, 16 individuals with
missing data were excluded from the dataset (giving a
total of 489 samples), while Bateke and populations
from Gabon were pooled on the basis of their geographical
closeness and lack of statistically significant
genetic differentiation. It should be noted that our
demographic estimates are associated with wide and partially
overlapping confidence intervals, a problem often
encountered when applying Bayesian methods. However,
the reliability of our results is supported by the
convergence of the three runs we performed on each
dataset and is further strengthened by a previous study
showing that the number of loci we used is sufficient to
achieve correct point estimates, although the variance
associated with the posterior distribution is high (Shi et al.
2010). The posterior mutation rate estimate agrees with
the one reported by Zhivotovsky et al. (2004) (6.97 × 10⁻⁴;
Fig. S3, Supporting information). A time since expansion
of ~8.0 kya was obtained for the whole dataset,
with an initial effective size of ~2800 individuals.
Approximate mode, median and mean posterior values
for the main parameters estimated are shown in
Table 4. The same simulation was performed on the
two sPCA groups of populations. Estimates for the spatial
group including Tiv, Idoma, Igala, Bakaka and
Fig. 4 Spatial Principal Component Analysis based on haplogroup frequencies. The represented component is the most positive one,
containing the information regarding the global pattern. (a) Relative geographical positions of populations under study. The reticula-
tion presented was chosen only for graphical reasons. This is because the matrix of distances used in the sPCA analysis would have
connected all possible population pairs, complicating the visualization of the objects within the figure. (b) Graphical interpolation of
population scores. The darkest regions represent positive scores relative to the first component, while the whitest regions represent
negative ones. The proximity of white lines is proportional to the degree of genetic differentiation. (c) Single population scores are
represented with black/white squares, with black associated with positive values and white with negative ones. Square size is proportional
to the absolute value standing for the degree of differentiation. (d) On the abscissa, values of spatial autocorrelation for
randomized allelic frequencies obtained through simulations (100 000 permutations); on the ordinate, frequency of class values.
Table 3 Analyses of the molecular variance (AMOVA)

                       Among groups       Among populations     Within populations
                       YSNPs    YSTRs     YSNPs      YSTRs      YSNPs       YSTRs
All samples                               8.17**     5.35**     91.83**     94.65**
Nigeria                                   −0.43†     0.90       100.43**    99.10**
Cameroon                                  10.50**    11.69**    89.50**     88.31**
Gabon                                     0.97       1.57       99.03**     98.43**
Linguistic groups      0.17     0.26      8.04**     5.05**     91.79**     94.70**
Geographical groups    5.90*    1.97*     3.88**     3.82**     90.22**     94.21**

Values are in percentage. All analyses have been performed using either haplogroup
(SNPs) or haplotype (STRs) information (see Materials and Methods for further details on
linguistic group assignation).
*P < 0.05; **P < 0.001.
†When haplotypes randomly drawn from different populations have a higher probability
of being identical compared to haplotypes taken from the same population, the AMOVA
algorithm may produce small negative values (Excoffier et al. 1992).
Table 4 Posterior estimations of demographic parameter values obtained using BATWING

                   NA       NA (95% CI)    t0        t0 (95% CI)     r         r (95% CI)       T         T (95% CI)
All populations
  Mode             2 800    1 500–6 700    7 970     2 400–48 000    0.0065    0.0025–0.0124    50 000    16 000–233 000
  Median           3 140                   10 180                    0.0068                     59 000
  Mean             3 500                   12 898                    0.0072                     71 600
Group 1
  Mode             1 804    905–4 600      10 550    2 400–83 400    0.0046    0.0024–0.0092    45 000    12 800–254 600
  Median           1 991                   13 600                    0.0049                     54 500
  Mean             2 226                   19 300                    0.0052                     67 600
Group 2
  Mode             3 360    2 100–8 300    6 100     2 024–25 300    0.0091    0.0047–0.0179    61 200    23 000–256 600
  Median           3 644                   6 610                     0.0096                     70 700
  Mean             4 023                   7 990                     0.0010                     84 200

NA, effective ancestral population size; t0, time to start of population growth; r, population growth rate; T, time to the most recent
common ancestor. Time is given in years. Group 1 corresponds to populations with a positive score in sPCA analysis, with the
exception of Ndumu population from Gabon (see Methods and Results for further details). Group 2 includes populations presenting
a negative score in sPCA analysis.
Bamileke point to a time since expansion of 10.55 kya,
while the most likely effective population size was
around 1800 individuals. A more recent time since
expansion (6.10 kya) and an almost double effective
population size (~3 800) were obtained for the other
group, composed of all Gabonese and some Cameroonian
populations (Table 4, Fig. 5).
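The contrast between the two sPCA groups can be made explicit with a few lines of Python. The sketch below uses the modal estimates and 95% intervals for NA and t0 as printed in Table 4; the variable and function names are illustrative assumptions, and the comparison is simple arithmetic on published point estimates rather than a formal test.

```python
# Modal posterior estimates and 95% intervals copied from Table 4 (time in years).
TABLE4 = {
    "Group 1": {"NA": 1804, "t0": 10550, "t0_ci": (2400, 83400)},
    "Group 2": {"NA": 3360, "t0": 6100, "t0_ci": (2024, 25300)},
}


def intervals_overlap(first, second):
    """Return True if two closed intervals (low, high) share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


# Ratio of modal ancestral effective sizes (Group 2 relative to Group 1).
size_ratio = TABLE4["Group 2"]["NA"] / TABLE4["Group 1"]["NA"]

# Difference between modal expansion times, expressed in thousands of years.
time_gap_kya = (TABLE4["Group 1"]["t0"] - TABLE4["Group 2"]["t0"]) / 1000

# Whether the 95% intervals for the expansion time overlap.
t0_overlap = intervals_overlap(TABLE4["Group 1"]["t0_ci"], TABLE4["Group 2"]["t0_ci"])

print(f"NA ratio (Group 2 / Group 1): {size_ratio:.2f}")
print(f"t0 gap (Group 1 - Group 2): {time_gap_kya:.2f} kya")
print(f"t0 intervals overlap: {t0_overlap}")
```

The overlap check reflects the caveat stated earlier: the point estimates differ, but the intervals around them are wide.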
As a methodological choice, the Ndumu population
was excluded from the analysis because the sign of its
sPCA score, although close to zero, contrasts with that
of its neighbours. In fact, it seemed unlikely
that a demographic expansion which occurred in
the Bantoid region could have also involved this distant
population. However, grouping it with the populations
settled in the forest would have been in contrast
with the use of sPCA as a method to define the groups on
which to perform demographic inferences. It is nonetheless
reassuring that the estimates of the time since expansion
changed only slightly when Ndumu was included in either
the black-square group (10.55 vs. 10.79 kya) or the
white-square group (6.10 vs. 5.92 kya). Finally, to
understand the demographic history of this population,
we analysed it separately, obtaining a time since
expansion of ~4.8 kya, in
agreement with the trend shown by the forest region.
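The size of the shift caused by including Ndumu can be expressed as a relative change. The Python sketch below uses the four expansion times quoted in the paragraph above; the function name and dictionary layout are illustrative assumptions.

```python
def relative_change(baseline, alternative):
    """Percentage change of `alternative` relative to `baseline`."""
    return (alternative - baseline) / baseline * 100.0


# Time since expansion in kya: (estimate without Ndumu, estimate with Ndumu).
EXPANSION_TIMES_KYA = {
    "black-square group": (10.55, 10.79),
    "white-square group": (6.10, 5.92),
}

changes = {
    group: relative_change(without_ndumu, with_ndumu)
    for group, (without_ndumu, with_ndumu) in EXPANSION_TIMES_KYA.items()
}

for group, change in changes.items():
    print(f"{group}: {change:+.2f}% when Ndumu is included")
```

Both shifts are below 3% in absolute value, which is small compared with the width of the intervals in Table 4.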
Fig. 5 A physical map of the region under study with sPCA
score for each population. In red, the Benue River Valley
region. In purple, the area of distribution of Bantoid languages.
In green, the upper bound of the Equatorial rainforest (from
Bartholome et al. 2002).
Discussion
A male perspective on the genetic structure of Bantu-speaking populations
As a contribution to the knowledge of the human pre-
history of the African continent south of the Sahara des-
ert, we surveyed a number of populations settled in a
broad transect encompassing the area where the Bantu
expansion is supposed to have originated (Benue River
Valley) and part of the western stream (Cameroon,
Congo and Gabon). In order to better exploit the poten-
tial usefulness of Y-chromosomal polymorphisms for
the analysis of the evolutionary history of Bantu-speaking
populations, we analysed both SNP and STR polymorphisms.
The substantial agreement among the MDS
using genetic distances, the PCA based on haplogroup
frequencies, and the AMOVA carried out using the two types of
polymorphisms indicates that our results provide an
adequate and robust picture of Y-chromosomal diversity
and that there is no substantial ascertainment bias
associated with the use of SNPs alone (Wilder et al.
2004a,b).
Our results do not show a clear relationship between
genetic variation and linguistic diversity. This is well
exemplified by the Nigerian populations, where the low
heterogeneity among the three populations surveyed
(as consistently shown by MDS, PCA, AMOVA and the
WIMP/MNPD ratio) contrasts with their different
languages, i.e. Bantoid, Yoruboid and Idomoid (see
Table S1). At the same time, we observed a high level
of genetic diversity among Cameroonian populations
despite the fact they have a common linguistic affilia-
tion (Bantu), the only exception being the Bamileke
who speak a Bantoid language. This is consistent with
previous regional studies on Y chromosome diversity
carried out in sub-Saharan Africa or in other continents
which failed to detect a robust correlation between
genetic and linguistic distances (Lane et al. 2002; Coia
et al. 2009; Mona et al. 2009; Veeramah et al. 2010).
However, cases have been shown where linguistic affili-
ation proved to be a good predictor of genetic diversity
both in Africa and elsewhere (Poloni et al. 1997; Hassan
et al. 2008; Mirabal et al. 2009; Cruciani et al. 2010).
This study adds new information to the current
knowledge of co-evolution between genetic and cultural
traits in sub-Saharan populations. In fact, the increased
level of resolution of the SNP panel used in this study
highlights previously undetected variation within E1b1a
(M2), the diagnostic haplogroup of Bantu-speaking pop-
ulations (Jobling et al. 2004; Beleza et al. 2005; Wood
et al. 2005; Berniell-Lee et al. 2009). In this way, we
were able to detect some noteworthy differences within
and among Bantu-speaking populations, mostly due to
haplogroups E1b1a7a (U174), E1b1a8a (U209) and
E1b1a8a1 (U290), which contribute to their high level of
interpopulation differentiation and to the presence of
distinct regional patterns of genetic variation. All these
findings contradict the current view of Bantu speakers
as a homogeneous group of populations whose gene
pools are mostly if not exclusively the result of a rela-
tively recent population expansion (Cavalli-Sforza et al.
1994; Berniell-Lee et al. 2009). In fact, the strongest sig-
nal of diversity is given by Cameroonian populations.
The presence of non-Bantu ethnic groups in this country
raises the possibility that the differentiation of
Cameroonian populations from other Bantu speakers could be
the result of differential admixture. However, such a
scenario is in contrast with previous studies on
Y-chromosome and nuclear loci, which do not support the
occurrence of gene flow between the Bantu speakers of
South Cameroon and the Afro-Asiatic and Adamawa populations
from the northern part of the country (Coia et al. 2009;
Tishkoff et al. 2009).
The lack of statistical support for the global structure
observed in the sPCA indicates that genetic affinity is
not consistently greater between neighbouring than dis-
tant populations. This is particularly evident for the
populations settled to the South of the Cameroonian
mountain range (Fig. 5), and could be the consequence
of the low male mobility due to the patrilocal tradition.
However, focusing on a narrower area, the same analy-
sis suggests a genetic change in Central Cameroon,
which approximately coincides with and could be
related to the presence of high mountain ranges (Ba-
menda, Bamileke, and Mambilla highlands, or western
highlands with a mean height of 2000 m; Fishpool &
Evans 2001). Further population sampling and addi-
tional genetic information are needed to confirm this
local pattern.
Demographic dynamics along the western stream of the Bantu expansion
In order to gain insights into the past demographic
dynamics of the western stream of the Bantu expansion,
we used a Bayesian coalescent approach. Our analysis
differs from previous studies on Bantu-speaking popu-
lations in that we performed demographic inferences
based on population data instead of single lineages
(Zhivotovsky et al. 2004; Berniell-Lee et al. 2009). This
choice was based on previous observations suggesting
that the frequency of Y-haplogroups might vary sub-
stantially across generations due to fluctuations in the
effective population size among lineages (Zhivotovsky
et al. 2006). Such perturbations, which could be due to
both stochastic and selective processes (Pritchard et al.
1999), could act as confounding factors for evolutionary
inferences based on single lineages.
Our results point to a general pre-agricultural expansion
time of ~8.0 kya in Central/Western Africa. This is
in accordance with previous studies on Y chromosome
variation in sub-Saharan Africa which have detected
signatures of pre-Neolithic expansions (Pritchard et al.
1999; Shi et al. 2010). However, some differences in time
estimates can be found across datasets. Our data point
to a more recent time frame (~8.0 kya) compared to
previous results obtained at a continental level
(~15.0 kya, Pritchard et al. 1999). This discrepancy could
be explained by the presence in their dataset of populations
such as Bantu farmers and hunter-gatherers, which
are known to have undergone an ancient separation and
experienced different demographic histories (Excoffier &
Schneider 1999; Patin et al. 2009; Batini et al. 2011). In
this regard, it is worth underlining that we obtained
considerable differences even in the local demographic
histories of populations which are more closely related than
those studied by Pritchard et al. (1999).
Concerning the hypotheses of the expansion of Bantu
languages, linguistic and archaeological knowledge sug-
gest the area between South East Nigeria and West
Cameroon as the origin of the Bantu expansion with a
time frame of 3–5 kya (Greenberg 1955, 1972; Oliver
1966a; Vansina 1984, 1995, 2006). However, our paternal
lineage estimates show older signatures of demographic
expansion. The results for populations from Nigeria
and part of those from Cameroon, indeed, suggest that
a population expansion occurred in the Bantoid area
before the diffusion of Bantu languages (Fig. 5). None-
theless, signatures of a more recent demographic expan-
sion that could be related to the spread of Bantu
languages were detected in the forest area. These results
seem to support the hypothesis of Guthrie
(1962) and Oliver (1966b), who postulated an early diffusion
of Bantu languages into the forest. According to
these authors, such an event may have been followed
by a demographic expansion and migration towards
eastern and western directions. In any case, the Bantu
language spread might not have been a direct conse-
quence of a single huge population migration (Lwanga-
Lunyiigo 1976; Ehret 2001; Schoenbrun 2001), since
population movements within sub-Saharan Africa were
probably much more complex and stepwise during the
last millennia.
In conclusion, the signatures we detected in the male
gene pool of the populations of Western Central Africa
depict an evolutionary scenario which is more complex
than suggested or implied by previous research. Our
study reveals previously undetected diversity for lineages
associated with the Bantu expansion, while pointing to a
high level of interpopulation genetic heterogeneity and
highlighting substantial differences in demographic his-
tory from one region to another. Undoubtedly, most of
the points discussed here require further investigations
based on increased samplings and using additional
genetic markers. Nonetheless, we hope that our study
may represent a first step towards a better understand-
ing of the complex genetic and demographic back-
ground behind the spread of the Bantu languages.
Author contributions
Study conception: V.MO., G.D.B., D.C. Field work: V.MO.,
V.MA., O.A. Molecular analysis: V.MO., G.F., C.B. Statistical
analysis: V.MO. Manuscript preparation: V.MO.,
G.D.B., D.C. All co-authors have reviewed the manuscript
prior to submission.
Acknowledgements
This study was made possible thanks to the contribution of all
the DNA donors from sub-Saharan Africa. The laboratory of
Molecular Anthropology of Rome and the University of Ibadan
(Nigeria) collaborated for the sampling in the Benue River Valley.
This study was supported by Spanish Ministry grant
CGL2007-61016/BOS and Generalitat de Catalunya grant
2009SGR1101. A special thank you must go to T. Jombart and
I. Wilson for their precious feedback concerning the methods
they developed. We are grateful to the anonymous reviewers
for their patience and suggestions which helped us improve
this work.
References
Alves-Silva J, da Silva Santos M, Guimaraes PE et al. (2000)
The ancestry of Brazilian mtDNA lineages. American Journal
of Human Genetics, 67, 444–461.
Bakel M (1981) The ‘‘Bantu’’ expansion: demographic models.
Current Anthropology, 22, 688–691.
Balaresque P, Bowden GR, Adams SM et al. (2010) A
predominantly neolithic origin for European paternal
lineages. PLoS Biology, 19, e1000285.
Bandelt HJ, Forster P, Sykes BC, Richards MB (1995)
Mitochondrial portraits of human populations using median
networks. Genetics, 141, 743–753.
Bandelt HJ, Forster P, Rohl A (1999) Median-joining networks
for inferring intraspecific phylogenies. Molecular Biology and
Evolution, 16, 37–48.
Bartholome E, Belward AS, Achard F et al. (2002) GLC 2000:
Global Land Cover Mapping for the Year 2000. EUR 20524 EN.
European Commission, Luxembourg.
Batini C, Lopes J, Behar DM et al. (2011) Insights into the
demographic history of African Pygmies from complete
mitochondrial genomes. Molecular Biology and Evolution, 28,
1099–1110.
Beleza S, Gusmao L, Amorim A, Carracedo A, Salas A (2005)
The genetic legacy of western Bantu migrations. Human
Genetics, 117, 366–375.
Berniell-Lee G, Calafell F, Bosch E et al. (2009) Genetic and
demographic implications of the Bantu expansion: insights
from human paternal lineages. Molecular Biology and
Evolution, 26, 1581–1589.
Cavalli-Sforza LL, Menozzi P, Piazza A (1994) The History and
Geography of Human Genes. Princeton University Press,
Princeton.
Coia V, Destro-Bisol G, Verginelli F et al. (2005) Brief
communication: mtDNA variation in North Cameroon: lack
of Asian lineages and implications for back migration from
Asia to sub-Saharan Africa. American Journal of Physical
Anthropology, 128, 678–681.
Coia V, Brisighelli F, Donati F et al. (2009) A multi-perspective
view of genetic variation in Cameroon. American Journal of
Physical Anthropology, 140, 454–464.
Cruciani F, Santolamazza P, Shen P et al. (2002) A back
migration from Asia to sub-Saharan Africa is supported by
high-resolution analysis of human Y-chromosome
haplotypes. American Journal of Human Genetics, 70, 1197–
1214.
Cruciani F, Trombetta B, Sellitto D et al. (2010) Human Y
chromosome haplogroup R-V88: a paternal genetic record of
early mid Holocene trans-Saharan connections and the
spread of Chadic languages. European Journal of Human
Genetics, 18, 800–807.
De Filippo C, Barbieri C, Whitten M et al. (2011) Y-
chromosomal variation in Sub-Saharan Africa: insights into
the history of Niger-Congo groups. Molecular Biology and
Evolution, 28, 1255–1269.
Destro-Bisol G, Donati F, Coia V et al. (2004) Variation of
female and male lineages in sub-Saharan populations: the
importance of sociocultural factors. Molecular Biology and
Evolution, 21, 1673–1682.
Destro-Bisol G, Jobling MA, Rocha J et al. (2010) Molecular
anthropology in the genomic era. Journal of Anthropological
Sciences, 88, 93–112.
Dray S, Dufour AB (2007) The ade4 package: implementing the
duality diagram for ecologists. Journal of Statistical Software,
22, 1–20.
Ehret C (2001) Bantu expansions: re-envisioning a central
problem of early African history. International Journal of
African Historical Studies, 34, 5–40.
Excoffier L, Schneider S (1999) Why hunter-gatherer
populations do not show signs of pleistocene demographic
expansions. Proceedings of the National Academy of Sciences,
USA, 96, 10597–10602.
Excoffier L, Smouse PE, Quattro JM (1992) Analysis of
molecular variance inferred from metric distances among
DNA haplotypes: application to human mitochondrial DNA
restriction data. Genetics, 131, 479–491.
Excoffier L, Laval G, Schneider S (2005) Arlequin (version 3.0):
an integrated software package for population genetics data
analysis. Evolutionary Bioinformatics Online, 1, 47–50.
Fishpool LDC, Evans MI. (2001). Important bird areas in Africa
and associated islands: priority sites for conservation. In:
Birdlife Conservation Series No. 11 (eds Fishpool LDC and
Evans MI), pp. 133–159. Pisces Publications and BirdLife
International, Newbury and Cambridge.
Gelman A, Rubin DB (1992) Inference from iterative simulation
using multiple sequences. Statistical Science, 7, 457–472.
Geweke J (1992) Evaluating the accuracy of sampling-based
approaches to calculating posterior moments. In: Bayesian
Statistics 4 (eds Bernardo JM, Berger JO, Dawid AP and Smith
AFM), pp. 169–193. Clarendon Press, Oxford.
Gill P, Jeffreys AJ, Werrett DJ (1985) Forensic application of
DNA ‘fingerprints’. Nature, 318, 577–579.
Greenberg JH (1949) Studies in African linguistic classification:
I. The Niger-Congo Family. Southwestern Journal of
Anthropology, 5, 79–100.
Greenberg JH (1955) Studies in African Linguistic Classification.
Compass Press, New Haven, Connecticut.
Greenberg JH (1972) Linguistic evidence regarding Bantu
Origins. Journal of African History, 13, 189–216.
Gusmao L, Butler JM, Carracedo A et al. (2006) DNA
Commission of the International Society of Forensic Genetics
(ISFG): an update of the recommendations on the use of Y-
STRs in forensic analysis. Forensic Science International, 157,
187–197.
Guthrie M (1962) Some developments in the prehistory of the
Bantu languages. Journal of African History, 3, 273–282.
Hammer MF, Karafet TM, Redd AJ et al. (2001) Hierarchical
patterns of global human Y-chromosome diversity. Molecular
Biology and Evolution, 18, 1189–1203.
Hassan HY, Underhill PA, Cavalli-Sforza LL et al. (2008) Y-
chromosome variation among Sudanese: restricted gene
flow, concordance with language, geography, and history.
American Journal of Physical Anthropology, 137, 316–323.
Jobling MA, Hurles ME, Tyler-Smith C (2004) Human
Evolutionary Genetics, Garland Science, New York and
Abingdon.
Johnston HH (1919) A Comparative Study of the Bantu and Semi-
Bantu Languages, vol. 2. Clarendon Press, Oxford.
Jombart T (2008) Adegenet: a R package for the multivariate
analysis of genetic markers. Bioinformatics, 24, 1403–1405.
Jombart T, Devillard S, Dufour AB, Pontier D (2008) Revealing
cryptic spatial patterns in genetic variability by a new
multivariate method. Heredity, 101, 92–103.
Karafet TM, Mendez FL, Meilerman MB et al. (2008) New
binary polymorphisms reshape and increase resolution of
the human Y chromosomal haplogroup tree. Genome
Research, 18, 830–838.
Lane AB, Soodyall H, Arndt S et al. (2002) Genetic
substructure in South African Bantu-speakers: evidence from
autosomal DNA and Y-chromosome studies. American
Journal of Physical Anthropology, 119, 175–185.
Laval G, Patin E, Barreiro LB, Quintana-Murci L (2010)
Formulating a historical and demographic model of recent
human evolution based on resequencing data from
noncoding regions. PLoS ONE, 5, e10284.
Lewis MP (ed.) (2009) Ethnologue: Languages of the World,
Sixteenth edition. SIL International, Dallas, Texas. Online
version: http://www.ethnologue.com/
Lwanga-Lunyiigo S (1976) The Bantu problem reconsidered.
Current Anthropology, 17, 282–286.
Marten L (2006) Bantu classification, Bantu Trees and
phylogenetic methods. In: Phylogenetic Methods and the
Prehistory of Languages (eds Peter F, Colin R), pp. 43–55.
McDonald Institute for Archaeological Research, Cambridge.
Meyer S, Weiss G, von Haeseler A (1999) Pattern of nucleotide
substitution and rate heterogeneity in the hypervariable
regions I and II of human mtDNA. Genetics, 152, 1103–1110.
Mirabal S, Regueiro M, Cadenas AM et al. (2009) Y-
chromosome distribution within the geo-linguistic landscape
of northwestern Russia. European Journal of Human Genetics,
17, 1260–1273.
Mitchell P (2010) Genetics and southern African prehistory: an
archaeological view. Journal of Anthropological Sciences, 88,
73–92.
Mona S, Grunz KE, Brauer S et al. (2009) Genetic admixture
history of Eastern Indonesia as revealed by Y-chromosome
and mitochondrial DNA analysis. Molecular Biology and
Evolution, 26, 1865–1877.
Oliver R (1966a) An inquiry into some problems of early Bantu
history. African Affairs, 65, 245–258.
Oliver R (1966b) The problem of the Bantu expansion. Journal
of African History, 7, 361–376.
Pakendorf B, Stoneking M (2005) Mitochondrial DNA and
human evolution. Annual Review of Genomics Human Genetics,
6, 165–183.
Patin E, Laval G, Barreiro LB et al. (2009) Inferring the
demographic history of African farmers and pygmy hunter-
gatherers using a multilocus resequencing data set. PLoS
Genetics, 5, e1000448.
Pereira L, Gusmao L, Alves C et al. (2002) Bantu and European
Y-lineages in Sub-Saharan Africa. Annals of Human Genetics,
66, 369–378.
Plaza S, Salas A, Calafell F et al. (2004) Insights into the
western Bantu dispersal: mtDNA lineage analysis in Angola.
Human Genetics, 115, 439–447.
Plummer M, Best N, Cowles K, Vines K (2006) CODA:
convergence diagnosis and output analysis for MCMC. R
News, 6, 7–11.
Poloni ES, Semino O, Passarino G et al. (1997) Human genetic
affinities for Y-chromosome P49a,f/TaqI haplotypes show
strong correspondence with linguistics. American Journal of
Human Genetics, 61, 1015–1035.
Poncet P (2009) modeest: Mode Estimation. R package version 1.09.
Pritchard JK, Seielstad MT, Perez-Lezaun A, Feldman MW
(1999) Population growth of human Y chromosomes: a study
of Y chromosome microsatellites. Molecular Biology and
Evolution, 16, 1791–1798.
R Development Core Team (2008) R: A language and
environment for statistical computing, R Foundation for
Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0,
URL http://www.R-project.org.
Rosa A, Ornelas C, Jobling MA et al. (2007) Y-chromosomal
diversity in the population of Guinea-Bissau: a multiethnic
perspective. BMC Evolutionary Biology, 27, 7–124.
Salas A, Richards M, De la Fe T et al. (2002) The making of the
African mtDNA landscape. American Journal of Human
Genetics, 71, 1082–1111.
Scheinfeldt LB, Soi S, Tishkoff SA (2010) Colloquium paper:
working toward a synthesis of archaeological, linguistic, and
genetic data for inferring African population history.
Proceedings of the National Academy of Sciences, USA,
107(Suppl 2), 8931–8938.
Schoenbrun D (2001) Representing the Bantu expansions:
What’s at stake? International Journal of African Historical
Studies, 34, 1–4.
Shi W, Ayub Q, Vermeulen M et al. (2010) A worldwide
survey of human male demographic history based on Y-SNP
and Y-STR data from the HGDP-CEPH populations.
Molecular Biology and Evolution, 27, 385–393.
Sims LM, Garvey D, Ballantyne J (2007) Sub-populations
within the major European and African derived haplogroups
R1b3 and E3a are differentiated by previously
phylogenetically undefined Y-SNPs. Human Mutation, 28, 97.
Slatkin M (1995) A measure of population subdivision based
on microsatellite allele frequencies. Genetics, 139, 457–462.
Sokal RR, Rohlf FJ (1995) Biometry, WH. Freeman and
Company, New York.
Sturrock K, Rocha J (2000) A multidimensional scaling stress
evaluation table. Field Methods, 12, 49–60.
Thomas MG, Parfitt T, Weiss DA et al. (2000) Y chromosomes
traveling south: the cohen modal haplotype and the origins
of the Lemba – the ‘‘Black Jews of Southern Africa’’.
American Journal of Human Genetics, 66, 674–686.
Tishkoff SA, Reed FA, Friedlaender FR et al. (2009) The genetic
structure and history of Africans and African Americans.
Science, 324, 1035–1044.
Underhill PA, Shen P, Lin AA et al. (2000) Y chromosome
sequence variation and the history of human populations.
Nature Genetics, 26, 358–361.
Underhill PA, Passarino G, Lin AA et al. (2001) The
phylogeography of Y chromosome binary haplotypes and
the origins of modern human populations. Annals of Human
Genetics, 65, 43–62.
Vansina J (1984) Western Bantu expansion. Journal of African
History, 25, 129–145.
Vansina J (1995) New linguistic evidence and ‘The Bantu
Expansion’. Journal of African History, 36, 173–195.
Vansina J (2006) Linguistic evidence for the introduction of
ironworking in Bantu-speaking Africa. History in Africa, 33,
321–361.
Veeramah KR, Connell BA, Pour NA et al. (2010) Little genetic
differentiation as assessed by uniparental markers in the
presence of substantial language variation in peoples of the
Cross River region of Nigeria. BMC Evolutionary Biology, 10, 92.
Wilder JA, Kingan SB, Mobasher Z, Pilkington MM, Hammer
MF (2004a) Global patterns of human mitochondrial DNA
and Y-chromosome structure are not influenced by higher
migration rates of females versus males. Nature Genetics, 36,
1122–1125.
Wilder JA, Mobasher Z, Hammer MF (2004b) Genetic evidence
for unequal effective population sizes of human females and
males. Molecular Biology and Evolution, 21, 2047–2057.
Wilson IJ, Weale ME, Balding DJ (2003) Inferences from DNA
data: population histories, evolutionary processes and
forensic match probabilities. Journal of Royal Statistical Society,
166, 155–201.
Wood ET, Stover DA, Ehret C et al. (2005) Contrasting patterns
of Y chromosome and mtDNA variation in Africa: evidence
for sex-biased demographic processes. European Journal of
Human Genetics, 13, 867–876.
Zhivotovsky LA, Underhill PA, Cinnioglu C et al. (2004) The
effective mutation rate at Y chromosome short tandem
repeats, with application to human population-divergence
time. American Journal of Human Genetics, 74, 50–61.
Zhivotovsky LA, Underhill PA, Feldman MW (2006) Difference
between evolutionarily effective and germ line mutation rate
due to stochastically varying haplogroup size. Molecular
Biology and Evolution, 23, 2268–2270.
V.MO. is mainly interested in the application of multivariate
and Bayesian methods to the study of population genetic structure
and demography in human as well as non-human populations.
Her current work focuses on co-evolutionary
processes at the community level, from the comparison of
interspecies phylogenies to the interaction of interspecies
population dynamics. G.F. is interested in forensic genetics, human
population genetics, species identification (botany and animal)
and the study of SNPs related to phenotypic traits. V.MA.'s
research experience concerns the parallels between linguistic and
biological evolution in human populations and the molecular
conservation biology of insects. C.B.'s main interests are
focused on the ancient history of human populations through the
study of genetic variation, with an effort to integrate human
evolutionary genetics within the broader context of anthropological
studies. O.A. is an entomologist mainly working on the
evolution of Anopheles vectors. G.D.B.'s research interests are
related to the microevolutionary history of populations living
south of the Sahara desert and the effects of socio-cultural factors
on genetic structure in human groups. D.C.'s research is
focused on the analysis of human genome diversity in order to
infer the (genomic and population) processes that have modelled
the current human variability and to establish their
(population and epidemiological) consequences.
Data accessibility
Individual SNP and STR genotypes are available at:
http://dx.doi.org/10.5061/dryad.9112.
Supporting information
Additional supporting information may be found in the online
version of this article.
Fig. S1 Network of individuals integrating SNP and STR hap-
lotype information. A) Phylogenetic network with individuals
assigned to haplogroups. B) Phylogenetic network with indi-
viduals assigned to populations.
Fig. S2 Loading scores of variables to: A) first principal compo-
nent and B) second principal component of the PCA shown on
Figure 3.
Fig. S3 Posterior distributions of mutation rate estimated with
Batwing software for 12 tetranucleotide loci (DYS456, DYS389I,
DYS390, DYS458, DYS19, DYS393, DYS391, DYS439, DYS635,
GATA_H4, DYS437, DYS438). Each curve corresponds to the
estimate obtained for the whole dataset (black), group1 (red),
group2 (green). The comprehensive mode value is 0.00066 with
a 0.05 to 0.95 range of 0.00036–0.00122.
Table S1 List of the populations with the principal sampling
location and its geographic coordinates.
Table S2 Matrix of genetic distances.
Please note: Wiley-Blackwell are not responsible for the content
or functionality of any supporting information supplied by the
authors. Any queries (other than missing material) should be
directed to the corresponding author for the article.