
Molecular Ecology (2011) 20, 2693–2708 doi: 10.1111/j.1365-294X.2011.05130.x

The Bantu expansion revisited: a new analysis of Y chromosome variation in Central Western Africa

VALERIA MONTANO,*† GIANMARCO FERRI,‡ VERONICA MARCARI,* CHIARA BATINI,§

OKORIE ANYAELE,– GIOVANNI DESTRO-BISOL** and DAVID COMAS†

*Dipartimento di Biologia Ambientale, Sapienza Università di Roma, P.le Aldo Moro 5, 00185 Rome, Italy, †Institut de Biologia Evolutiva (CSIC-UPF), CEXS-UPF-PRBB, Doctor Aiguader 88, 08003 Barcelona, Spain, ‡Department of Diagnostic and Laboratory Service and Legal Medicine, Section of Legal Medicine, University of Modena and Reggio Emilia, Italy, §Department of Genetics, University of Leicester, Leicester LE1 7RH, UK, –Department of Zoology, University of Ibadan, Ibadan, Oyo State, Nigeria, **Istituto Italiano di Antropologia, P.le Aldo Moro 5, 00185 Rome, Italy

Correspondence: David Comas, Fax: +34 93 3160901; E-mail: [email protected]

© 2011 Blackwell Publishing Ltd

Abstract

The current distribution of Bantu languages is commonly considered to be a consequence of a relatively recent population expansion (3–5 kya) in Central Western Africa. While there is a substantial consensus regarding the centre of origin of Bantu languages (the Benue River Valley, between South East Nigeria and Western Cameroon), the identification of the area from where the population expansion actually started, the relation between the processes leading to the spread of languages and peoples and the relevance of local migratory events remain controversial. In order to shed new light on these aspects, we studied Y chromosome variation in a broad dataset of populations encompassing Nigeria, Cameroon, Gabon and Congo. Our results reveal an evolutionary scenario which is more complex than had been previously thought, pointing to a marked differentiation of Cameroonian populations from the rest of the dataset. In fact, in contrast with the current view of Bantu speakers as a homogeneous group of populations, we observed an unexpectedly high level of interpopulation genetic heterogeneity and highlighted previously undetected diversity for lineages associated with the diffusion of Bantu languages (E1b1a (M2) sub-branches). We also detected substantial differences in local demographic histories, which are consistent with the hypotheses regarding an early diffusion of Bantu languages into the forest area and a subsequent demographic expansion and migration towards eastern and western Africa.

Keywords: Bantu languages, Central Africa, demographic expansion, Y chromosome

Received 23 September 2010; revision received 30 March 2011; accepted 12 April 2011

Introduction

The term Bantu refers to a family of languages which is widespread in most of sub-Saharan Africa and is currently spoken by almost 220 million people (Marten 2006). Despite their adoption by populations which are settled in a very wide territory encompassing a large portion of the continent from the equatorial belt to Southern Africa, Bantu languages are characterized by a

high degree of similarity even among the most geo-

graphically distant communities (Greenberg 1955, 1972;

Oliver 1966a). As a result of almost a century of linguis-

tic and archaeological studies, the distribution of Bantu

languages is thought to be the effect of a population

expansion (commonly referred to as the Bantu expan-

sion) which started from the Benue River Valley,

between South East Nigeria and Western Cameroon

(Johnston 1919; Bakel 1981; Vansina 1984, 1995). This is

mainly supported by the fact that Bantoid languages,

regarded as being ancestral to the Bantu ones, are pres-

ently spoken in this area (Greenberg 1949; Guthrie 1962;


Oliver 1966a; b; Lwanga-Lunyiigo 1976). A relatively

recent population growth and colonization (~3–5 kya)

of new territories is still accepted today by most schol-

ars as the most reasonable explanation for the geo-

graphical dispersal and relative homogeneity of Bantu

languages (Schoenbrun 2001). It has also been proposed

that the first steps of migration could have followed

two main routes which have been defined as the ‘Wes-

tern’ and ‘Eastern’ streams (Vansina 1984, 1995; Scho-

enbrun 2001). An alternative scenario was proposed by

Guthrie (1962). While agreeing with Greenberg and oth-

ers about the centre of origin of Bantu languages, he

proposed the Katanga region, in the South of the Demo-

cratic Republic of Congo, in the middle of the equato-

rial forest, as the area from where Bantu-speaking

populations spread towards Western and Eastern

Africa. However, some authors have highlighted the

reductionism of these hypotheses based on a single

huge population migration linked to the spread of lan-

guages, and have underlined the relevance of local

migration processes (Lwanga-Lunyiigo 1976; Ehret 2001;

Schoenbrun 2001).

Population genetic studies may clarify the dynamics

underlying the present distribution of Bantu-speaking

populations at both regional and sub-continental levels

(Mitchell 2010; Scheinfeldt et al. 2010). Unilinearly transmitted

polymorphisms of the Y chromosome are particularly

useful for this purpose, since they may be used

either to draw phylogeographic inferences or to detect

signatures of male driven demographic processes. As

an additional advantage, the widespread practice of pa-

trilocality among Bantu-speaking populations makes

the distribution of paternal lineages less prone to the

confounding effect of recent gene flow than maternal

lineages (Hammer et al. 2001; Destro-Bisol et al. 2004;

Wilder et al. 2004a,b; Berniell-Lee et al. 2009; Coia et al.

2009).

Despite its potential, Y chromosome variation has

been scantily explored in sub-Saharan Africa and has

been studied even less than the other unilinearly trans-

mitted marker, mitochondrial DNA (mtDNA) (Salas

et al. 2002; Pakendorf & Stoneking 2005; Destro-Bisol

et al. 2010). In fact, previous Y-chromosomal studies

have been carried out on a local geographic scale or

have investigated a limited number of geographically

dispersed Bantu-speaking populations (Beleza et al.

2005; Coia et al. 2005; Berniell-Lee et al. 2009), and

even the broadest datasets do not contain certain areas

of primary importance to test the hypotheses concern-

ing the Bantu expansion (Hammer et al. 2001; Under-

hill et al. 2001; Wood et al. 2005; De Filippo et al.

2011). Nonetheless, there is a substantial convergence

concerning the hypothesis that specific paternal lin-

eages, defined using single nucleotide polymorphisms

(SNPs) and short tandem repeats (STRs), could be a

genetic legacy of the Bantu expansion. This fact is sup-

ported by their distribution and prevalence among

Bantu speakers (Thomas et al. 2000; Underhill et al.

2000, 2001; Cruciani et al. 2002; Pereira et al. 2002;

Beleza et al. 2005; Wood et al. 2005; Berniell-Lee et al.

2009) and by estimates of time of expansion, as in the

case of haplogroups E1b1a7 (defined by M191) and

E1b1a (defined by M2), which have been dated back to

between 3.4 and 5.2 kya (Zhivotovsky et al. 2006;

Berniell-Lee et al. 2009). In general, previous studies

regarded the low level of variation occurring at the

Y chromosome and other genetic systems among

Bantu-speaking populations as a signature of a recent

population expansion (Bandelt et al. 1995; Alves-Silva

et al. 2000; Jobling et al. 2004; Plaza et al. 2004; Berni-

ell-Lee et al. 2009; Tishkoff et al. 2009). On the whole,

previous genetic investigations have highlighted the

agreement between the genetic structure of Bantu-

speaking populations and some generic predictions of

linguistic theories (Jobling et al. 2004; Zhivotovsky

et al. 2004; Berniell-Lee et al. 2009). However, genetic

studies should be more fruitfully considered as an

independent tool to clarify anthropological issues and

explore their complexity, since they may provide infor-

mation that can be compared and, eventually, inte-

grated with data and inferences from other disciplines.

Here, we present a study of Y chromosome variation

in a broad dataset encompassing Nigeria, Cameroon,

Gabon and Congo, focusing on the haplogroup E1b1a

(M2) and its sub-branches, which are the most frequent

lineages in sub-Saharan Africa. The populations sampled

are native groups settled either in the area where the Bantu

expansion is thought to have originated or in the

regions located at the putative origin of the Western

stream. The analysis of paternal lineages was based on

recently discovered SNPs (Wilder et al. 2004a; Sims

et al. 2007; Karafet et al. 2008), which make our level of

resolution higher than in previous studies on the

genetic legacy of Bantu expansion (Jobling et al. 2004;

Beleza et al. 2005; Wood et al. 2005; Berniell-Lee et al.

2009).

The availability of a large and tailored population

dataset together with a more in-depth dissection of

genetic variation made it possible to perform an analy-

sis of the relationships between genetic variation, geo-

graphical, and linguistic factors in the Bantu area.

Furthermore, we used genetic data to draw demo-

graphic inferences on the peopling processes in Central

Western Africa. Both these approaches disclose greater

complexity than highlighted by previous research on

the genetic legacy of the Bantu expansion, showing dif-

ferent genetic patterns among the populations under

study and signatures of ancient demographic events.


Materials and methods

Population sampling and Y chromosome genotyping

The dataset consists of a total of 505 unrelated male

individuals from 17 sub-Saharan African populations,

including both unstudied populations from Nigeria and

previously partially investigated groups from Camer-

oon, Gabon and Congo (Berniell-Lee et al. 2009; Coia

et al. 2009; Fig. 1a; Table S1). Pairwise distances

between sample collection sites (measured as air dis-

tances) range from 69.4 km for the nearest villages



(Gran Zambe and Kouambo, Cameroon) to 1 320 km

for the furthest villages (Idah in Nigeria and Ollebi in

Congo).

An appropriate consent form was signed by each DNA donor. DNA extraction from a cheek swab or from blood was performed with a standard phenol/chloroform protocol (Gill et al. 1985) and extraction products were quantified with the Quantifiler® Human DNA Quantification Kit (Applied Biosystems).

Twenty Y chromosome SNPs were genotyped in a hierarchical manner with two different methods: a probe hybridization approach with TaqMan® SNP Genotyping Assays (Applied Biosystems) and a multiple single-base extension reaction approach with the SNaPshot® Multiplex Kit (Applied Biosystems). Loci analysed with TaqMan probes include M96 (E), M2 (E1b1a), M191 (E1b1a7), M207 (R), M17 (R1a1), P116 (E1b1a7a3) and 50f2P (B2b). The real-time PCRs were performed with a 7900HT Fast Real-Time PCR System (Applied Biosystems) using program default conditions and adapting the number of cycles to probe performance, from a minimum of 40 to a maximum of 60. The rest of the SNPs were typed using the SNaPshot technique in three multiplexes [first: M91 (A), M60 (B), M150 (B2a); second: M75 (E2), P2 (E1b1), M215 (E1b1b), M154 (E1b1a4); third: U175 (E1b1a8), U174 (E1b1a7a), U209 (E1b1a8a), P9.2 (E1b1a7a1), P115 (E1b1a7a2), U290 (E1b1a8a1)]. The first PCR step was performed with a QIAGEN Multiplex PCR kit.

Fig. 1 (a) Map of sampling locations with pie charts of haplogroup frequencies for each population. (b) Maps of the distributions of the main haplogroups of the Y chromosome. Circles represent the geographic position of the populations. The intensity of shades is proportional to the values of interpolated haplogroup frequencies.

Seventeen Y chromosome STRs were typed using the AmpFℓSTR® Yfiler® PCR Amplification Kit (Applied Biosystems), designed for the loci DYS456, DYS389I, DYS390, DYS389II, DYS458, DYS19, DYS385 a/b, DYS393, DYS391, DYS439, DYS635, DYS392, Y GATA H4, DYS437, DYS438 and DYS448.

SNaPshot and Yfiler products were run in a 3130xl Genetic Analyser (Applied Biosystems) and analysed with GeneMapper software (Applied Biosystems) to assign individual genotypes. Derived alleles for loci M154, M215, P9.2 and M17 were not observed. Y chromosome haplogroup classification was based on Karafet et al. (2008).

Statistical analyses

A graphical representation of the geographical distribu-

tion of Y chromosome haplogroup frequencies was

drawn with Surfer 8.0 software (Golden Software Prod-

ucts).

A phylogenetic reconstruction of haplotype relation-

ships was inferred using the reduced median network

algorithm, whose output was used to calculate the med-

ian joining vectors, using the Network Software 4.5 (Ban-

delt et al. 1995, 1999). The network was built

introducing both SNP and STR data for each individual.

A weight of 99 was assigned to all the SNPs, while the

weight of each STR locus (ranging from 5 for DYS635 to

57 for DYS392) was calculated on the basis of its sample

variance according to Meyer et al. (1999). Sixteen individuals with missing loci were excluded. Loci DYS385 a/b and DYS389II were excluded due to their phylogenetic uncertainty. In fact, loci DYS385 a/b are amplified together in the same fragment and cannot be electrophoretically separated, which makes a correct alignment impossible. Similarly, locus DYS389II is co-amplified with DYS389I and its size can be calculated only indirectly, by subtracting the DYS389I allele (which is also amplified separately) from the total fragment (see Gusmao et al. 2006 for details on forensic applications).
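The indirect sizing of DYS389II described above amounts to a single subtraction. The following is a minimal sketch, assuming that allele sizes are expressed as repeat counts; the function name and the example genotype are hypothetical and do not come from the study.

```python
def dys389ii_specific_repeats(dys389i: int, dys389ii_total: int) -> int:
    """Return the repeat count of the DYS389II-specific segment.

    The co-amplified fragment reported as DYS389II also contains the
    DYS389I repeats, so the specific part is obtained by subtraction.
    """
    if dys389ii_total < dys389i:
        raise ValueError("total fragment cannot be shorter than DYS389I")
    return dys389ii_total - dys389i


# Hypothetical genotype: DYS389I = 13 repeats, total fragment = 30 repeats.
print(dys389ii_specific_repeats(13, 30))
```

Because the derived value depends on two measured alleles, a mutation at DYS389I also changes the total fragment, which is one way of seeing why the raw DYS389II score is awkward to use directly in a network.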

In order to describe intrapopulation diversity, we cal-

culated haplotype and haplogroup frequencies, haplo-

group and haplotype diversity, and mean number of

pairwise differences (MNPD), using Arlequin 3.11 (Ex-

coffier et al. 2005). The weighted intralineage mean

pairwise (WIMP) values were obtained by subdividing the

dataset of each region into lineages and calculating the

MNPD within each lineage. In this way, it is possible to

estimate the weighted average of the mean pairwise dif-

ferences among lineages, using the formula of weighted

mean based on variance (Sokal & Rohlf 1995). The

weighted interpopulation mean pairwise (WPMP) is the

weighted average of the mean pairwise differences

among populations of the same region, calculated on

the basis of MNPD variance for each population.
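As an illustration of a weighted mean based on variance, the sketch below assumes that each estimate is weighted by the inverse of its variance and that the ± values reported in Table 2 are standard deviations. Both points are assumptions rather than statements from the text, and the function name is hypothetical. The inputs are the MNPD estimates reported in Table 2 for the three Nigerian populations.

```python
def inverse_variance_weighted_mean(estimates, std_devs):
    """Weighted mean in which each estimate gets weight 1 / variance."""
    if len(estimates) != len(std_devs) or not estimates:
        raise ValueError("need one standard deviation per estimate")
    weights = [1.0 / (sd ** 2) for sd in std_devs]
    weighted_sum = sum(w * x for w, x in zip(weights, estimates))
    return weighted_sum / sum(weights)


# MNPD and its reported spread for Tiv, Idoma and Igala (Table 2).
mnpd = [6.41, 6.62, 6.37]
spread = [3.086, 3.192, 3.084]
wpmp_nigeria = inverse_variance_weighted_mean(mnpd, spread)
print(round(wpmp_nigeria, 3))
```

Under these assumptions the result is close to the WPMP value reported for Nigeria in Table 2 (6.463), which suggests, without proving, that this is the weighting scheme that was applied.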

Interpopulation genetic distances were obtained

according to Slatkin (1995) for STR haplotypes (Rst). A

graphical representation of the genetic distance matrix was

produced with SPSS 15.0 (SPSS for Windows, Rel.

11.2006. Chicago: SPSS Inc.) with a metric multidimen-

sional scaling method.

In order to carry out a simultaneous exploration of

diversity among populations and the relative weight of

genetic variables, we conducted a principal component

analysis (PCA) for SNP haplogroup frequency data with

the R software package ade4 (Dray & Dufour 2007; R

Development Core Team 2008).

To detect signals of population structure, a hierarchi-

cal analysis of molecular variance (AMOVA) was carried

out grouping the populations according to both geo-

graphical and linguistic criteria, with Arlequin 3.11 soft-

ware (Excoffier et al. 2005). Geographical groups were

defined on the basis of political borders, with the excep-

tion of the only population from Congo which was

included in the Gabonese dataset. Linguistic groups are based on the Ethnologue linguistic classification (Lewis 2009. Ethnologue: SIL International. Online version: http://www.ethnologue.com/) and have been divided into four main categories: (i) Benue-Congo, neither Bantoid nor Bantu (Idoma and Igala); (ii) Benue-Congo Bantoid (Tiv and Bamileke); (iii) Benue-Congo Bantoid Bantu family A (Bakaka, Bassa, Ewondo, Ngoumba, Fang, Makina and Benga); (iv) Benue-Congo Bantoid Bantu family B (Duma, Kota, Ndumu, Nzebi and Bateke) (Table S1, Supporting information).

To investigate the potential relationship between geographic distances and genetic variation, a spatial principal component analysis (sPCA) was carried out for SNP haplogroup frequencies using the algorithm implemented in the R software package adegenet (Jombart 2008; Jombart et al. 2008; R Development Core Team 2008). In essence, the method combines the spatial autocorrelation calculated on a set of allelic frequencies using Moran's index (Moran's I) with information regarding the genetic variance among entities (individuals or populations) in order to detect the presence of spatial patterns. The spatial information used for the computation of Moran's I is usually stored in a symmetrical binary matrix in which populations or individuals are coded as neighbours or non-neighbours, that is, 1 or 0, respectively. In our case, the inverse matrix of pairwise distances was considered, so that all populations are neighbours and the spatial information is converted into a matrix of weights which are proportional to inverse spatial distances. Unlike in conventional PCA, the components found by sPCA can have either positive or negative eigenvalues, since they optimize the product between the genetic variance among entities and their spatial autocorrelation. The most informative components are the most positive (associated with positive spatial autocorrelation) and the most negative (associated with negative spatial autocorrelation), which contain the information about the global and the local structure of the sample, respectively. A global structure implies that each sampling location is genetically closer to its neighbours than to randomly chosen locations, as happens with spatial groups, clines or intermediate states. Conversely, a stronger genetic differentiation among neighbours than among random pairs of entities characterizes the local structure. The component to take into consideration is the one with the highest absolute eigenvalue. To evaluate the consistency of the detected geographical structures versus a random spatial distribution of the observed genetic variance, a Monte Carlo-based test is applied (Jombart et al. 2008). This test simulates a random distribution of the genetic variability (H0 or null hypothesis) on the connection network and calculates a p-value from the dataset. The simulated distribution represents the correlation of the randomized genetic variables with the vectors of Moran's I predicting the global or local structure. If the value associated with the observed pattern exceeds the values simulated under the null hypothesis, that is, if the resulting p-value falls below the chosen significance threshold, the spatial distribution of the genetic variance is not random and the null hypothesis can be rejected. We applied the test with 100 000 iterations.
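The two ingredients described above, Moran's I with inverse-distance weights and a Monte Carlo test against a random spatial arrangement, can be sketched for a single variable as follows. This is not the adegenet implementation and it does not perform the eigen-decomposition of sPCA; the function names, the toy coordinates and the toy frequencies are all hypothetical, and sampling locations are assumed to be distinct.

```python
import math
import random


def morans_i(values, coords):
    """Moran's I with weights proportional to inverse pairwise distance."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0   # sum of w_ij * dev_i * dev_j over ordered pairs
    total_w = 0.0  # sum of all weights
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = 1.0 / math.dist(coords[i], coords[j])
            cross += w * dev[i] * dev[j]
            total_w += w
    return (n / total_w) * (cross / sum(d * d for d in dev))


def permutation_p_value(values, coords, n_perm=999, seed=0):
    """One-sided Monte Carlo test for positive spatial autocorrelation."""
    rng = random.Random(seed)
    observed = morans_i(values, coords)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reassignment of values to locations
        if morans_i(shuffled, coords) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: six locations on a line, one haplogroup frequency per location.
coords = [(float(i), 0.0) for i in range(6)]
cline = [0.1, 0.2, 0.3, 0.5, 0.6, 0.8]        # smooth gradient
alternating = [0.1, 0.8, 0.1, 0.8, 0.1, 0.8]  # neighbours differ most

print(morans_i(cline, coords) > 0)
print(morans_i(alternating, coords) < 0)
print(permutation_p_value(cline, coords))
```

The gradient gives a positive index (a global structure in the terminology above), whereas the alternating pattern gives a negative one (a local structure). The p-value is the proportion of random arrangements with an index at least as large as the observed one.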

The BATWING software was used to estimate the fol-

lowing demographic and evolutionary parameters:

ancestral effective population size, time of the begin-

ning of the population demographic expansion, and

time to the most recent common ancestor (Wilson et al.

2003). The software is based on the coalescent theory

and can test three different demographic models with a

Bayesian approach: constant population size, growing

population size, and constant population size followed


by demographic growth. The last one seems to be the

most reasonable for populations that have undergone

an agricultural revolution. Moreover, it has been dem-

onstrated that this model is the most appropriate for

African populations (Laval et al. 2010). Consequently,

the whole data set was tested using the above-men-

tioned model. Since the BATWING coalescent model was

not designed to take gene flow into account (Wilson

et al. 2003), which is likely to be intensive among the

populations analysed (Destro-Bisol et al. 2004), we

decided not to estimate the time of the splitting of the

populations. Prior distributions were established to

cover a range of expectations which is concordant with

human population history (Wilson et al. 2003). For the

effective population size, a lognormal distribution (9, 1)

was used, whereas for the alpha and beta priors,

gamma distributions of (1, 200) and (0.5, 1), respec-

tively, were used. To obtain the most reliable evaluation

of the posterior mutation rate distribution, only 12 te-

tranucleotide loci (DYS456, DYS389I, DYS390, DYS458,

DYS19, DYS393, DYS391, DYS439, DYS635, GATA_H4,

DYS437, DYS438) were used, and a wide gamma

mutation prior distribution of (7, 7500) was assigned to

all STR loci, with a mean equal to 9.3 × 10⁻⁴, covering a

range between 10⁻³ and 10⁻⁴ in accordance with the

expected values of the Y chromosome STR mutation

rate of both observed and effective estimates (see the

YHRD.ORG 3.0 database for a summary of the main publications

about Y chromosome STR mutation rates;

Zhivotovsky et al. 2004). This prior distribution is wider

than the ones used in previous studies which were

based on meioses, where the variance is very narrow

(Balaresque et al. 2010; Shi et al. 2010). SNP information

was integrated for the phylogenetic reconstruction to

discriminate possible STR haplotype homoplasies, but it

was not considered for posterior estimates. Chain con-

vergence was evaluated with three independent runs

(starting from different seeds) using two different diag-

nostics implemented in the R package coda (Plummer

et al. 2006; R Development Core Team 2009): the Gel-

man diagnostic (Gelman & Rubin 1992) and the Geweke

diagnostic (Geweke 1992). The number of samples was

2 × 10⁶ with treebetN = 10 and Nbetsamp = 20. The

mode values of the posterior distributions were calculated

with the R software package modeest (Poncet

2009; R Development Core Team 2009). To test the pres-

ence of genetic signatures of the ‘Western stream’ of the

Bantu expansion, the demographic parameters were

inferred in the two spatial groups of populations identified

by the sPCA (see Fig. 4c). Given the contrast

between the sign of the sPCA score for the Ndumu

from Gabon (see Fig. 4c), although close to zero, and

that of its neighbours, we repeated the analysis for both

groups with and without this population.
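The priors quoted above can be summarized with a few lines of arithmetic. The sketch assumes that the gamma prior is written as (shape, rate), which is consistent with the stated mean of 9.3 × 10⁻⁴ since 7/7500 ≈ 9.3 × 10⁻⁴, and that the lognormal prior (9, 1) gives the mean and standard deviation of the logarithm. The function names are hypothetical.

```python
import math


def gamma_mean_sd(shape, rate):
    """Mean and standard deviation of a gamma(shape, rate) distribution."""
    return shape / rate, math.sqrt(shape) / rate


def lognormal_median(mu):
    """Median of a lognormal distribution whose log has mean mu."""
    return math.exp(mu)


# Mutation rate prior for the STR loci: gamma (7, 7500).
mutation_mean, mutation_sd = gamma_mean_sd(7, 7500)
print(f"mutation rate prior: mean {mutation_mean:.2e}, sd {mutation_sd:.2e}")

# Effective population size prior: lognormal (9, 1).
print(f"effective size prior: median {lognormal_median(9):.0f}")
```

Reading the priors this way places most of the prior mass for the mutation rate between 10⁻⁴ and 10⁻³ per locus per generation, as described in the text, and centres the effective population size prior on a few thousand individuals.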


Results

Intrapopulation diversity of paternal lineages in Central Western Africa

The analysis of 20 Y chromosome biallelic markers has

shown that Central African samples are characterized

by the presence of several sub-branches within haplo-

group E (M96). Haplogroups E1b1a7a* (U174), E1b1a8a

(xE1b1a8a1) (U209), and E1b1a8a1 (U290) are the most

frequent, accounting for 75% of our data set (Table 1

and Fig. 1a). Haplogroups A, B and R are also found

at lower frequencies. As previously reported (Cruciani

et al. 2002; Berniell-Lee et al. 2009), haplogroups B2a

(M150) and R (xR1a) (M207) occur at frequencies from

moderate to high in Central Africa (from 5.4% to 40%).

The geographical haplogroup distribution is shown in

Fig. 1a, and its interpolation in Fig. 1b. Since the sampling does not evenly cover the area, readers should be aware that the representation is prone to over-interpretation. Nonetheless, we believe that these maps are a useful tool to visualize the phylogeographic patterns inferred from the data which are presently available. Within haplogroup E (M96), the green component corresponding to E1b1a7a* (U174) is prevalent in Nigeria and Gabon (χ² = 18.33, 0.05 > P > 0.001), while the blue component representing E1b1a8a (U209) is significantly more frequent in Cameroon (χ² = 32.64, P < 0.001). It is worth noting that the sub-clade E1b1a7a3 (P116) was only detected in Gabon and in one population from Cameroon (Bassa), whereas E1b1a7a2 (P115) was only observed among Fang (both in light green, see Fig. 1b). A phylogenetic reconstruction for all the haplogroups is provided in Fig. S1 (Supporting information).

The intrapopulation haplogroup diversity indices

range from 0.561 to 0.847 (Table 2), attaining values

which are comparable to or slightly higher than those

reported in previous studies (Beleza et al. 2005; Rosa

et al. 2007), as is to be expected given the increased res-

olution of our SNP panel. Nigerian samples exhibit the

lowest values of haplogroup diversity, which gradually

increases in Cameroonian and Gabonese samples. Concerning Y chromosome STR haplotypes, the values of intrapopulation haplotype diversity are greater than 97% in all populations with the exception of Bakaka and Ewondo, thus achieving in most cases the power of discrimination expected for forensic markers. The WIMP value for Nigeria (2.914) is markedly lower than the ones obtained for Cameroon and Gabon (4.264 and 4.219, respectively, Table 2). This result is due to the presence of a single predominant haplogroup (E1b1a7a* (U174)) in Nigerians, suggesting a reduced variation in the ancestral population and limited gene flow from

other regions. However, the distributions of MNPD and

WIMP values obtained for Nigeria are still partially

overlapping. Furthermore, regional WPMP and MNPD

values are similar in Gabon and Nigeria, whereas

WPMP is lower in Cameroon due to the greater hetero-

geneity among populations in terms of haplogroup

composition, although the difference from the MNPD

value is not significant.
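Diversity indices of the kind reported in Table 2 are commonly computed as Nei's unbiased gene diversity. The sketch below assumes that this is the estimator behind the haplogroup and haplotype diversity values (the text states only that Arlequin 3.11 was used); the function name and the example counts are hypothetical and are not taken from Table 1.

```python
def gene_diversity(counts):
    """Unbiased gene diversity: n / (n - 1) * (1 - sum of squared frequencies)."""
    n = sum(counts)
    if n < 2:
        raise ValueError("at least two individuals are required")
    sum_sq = sum((c / n) ** 2 for c in counts)
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical samples of 40 men each, typed for six haplogroups.
one_dominant_lineage = [26, 6, 4, 2, 1, 1]
evenly_spread_lineages = [8, 8, 7, 7, 5, 5]

print(round(gene_diversity(one_dominant_lineage), 3))
print(round(gene_diversity(evenly_spread_lineages), 3))
```

A sample dominated by one lineage yields a lower index than a sample of the same size in which lineages are evenly represented, which is the pattern invoked above to explain the low values observed in the Nigerian samples.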

Interpopulation diversity of paternal lineages in Central Western Africa

The multidimensional scaling (MDS) plot of Rst genetic distances (Fig. 2) stresses the differentiation and heterogeneity of Cameroonian samples compared to the rest of the populations. This is particularly evident for the Bakaka, Ewondo and Ngoumba, who behave as outliers. In contrast, Nigerians and, to a lesser extent, Gabonese group together, and no statistically significant genetic distance was observed between Nigerian and Gabonese populations (Table S2, Supporting information). As expected on the basis of their common ethnic

affiliation, no significant differentiation can be

observed between Fang from Cameroon and Gabon

(Table S2). The only Congolese population (Bateke) is

close to groups from Gabon, reflecting their geographi-

cal proximity.

The differentiation among Cameroonian populations and

the relative homogeneity of Nigerians and Gabonese is

confirmed by the PC plot (Fig. 3). The Nigerian popula-

tions are grouped together until the fifth component

(data not shown), reflecting their marked similarity in

haplogroup composition. Ngoumba are less distant

from other populations than in the MDS plot, probably

due to the fact that the haplogroup B, which is particu-

larly frequent in this population, does not give a high

contribution to the first two principal components (see

loading scores, Fig. S2, Supporting information).

Accordingly, their diversity from the rest of the dataset

is better highlighted by the third and fourth PCs (data

not shown). Conversely, the outlier position of the Ew-

ondo is further stressed, due to the prevalence of

E1b1a8a1(U290), which is the haplogroup that gives the

highest contribution to the second PC (see loading

scores, Fig. S2).

An AMOVA was performed to detect possible linguistic

and ⁄ or geographical structuring of genetic variation

(Table 3). A significant genetic heterogeneity was found

when all populations were taken as a single group

(8.17% for SNPs and 5.35% for STRs). The results

obtained for each geographical group indicate that

Cameroon is the main contributor to the observed het-

erogeneity, as predicted by PCA and genetic distances.

This is confirmed when using a jackknife procedure, by


Table 1 Y-chromosome haplogroup frequencies in the seventeen populations analysed

Populations in column order: Nigeria (Tiv, Ido, Iga); Cameroon (Bak, Ewo, Bas, Ngo, FanC, Bam); Congo (BatN); Gabon (Ben, Dum, Kot, Mak, Nze, Ndu, FanG).

Haplogroup         Tiv  Ido  Iga  Bak  Ewo  Bas  Ngo  FanC  Bam  BatN  Ben  Dum  Kot  Mak  Nze  Ndu  FanG  Total
E1b1a7a* (U174)     34   26   25   15    7    3    4     6   13     6   13   11    9   15    2   12     4    205
E1b1a8a1 (U290)      2    6    9    1   15    2    3     1    7     6    1    3    1    5    3    5     1     71
TOTAL               52   40   40   43   26   41   15    12   32    19   22   32   21   32   25   33    20    505

Haplogroups absent from at least one population (non-zero counts in column order; positions of empty cells not preserved):

A (M91): 1, 1, 2, 1, 1 (total 6)
B (M60): 1, 1, 1 (total 3)
B2a (M150): 1, 2, 1, 1, 6, 1, 2, 1, 6, 2, 1 (total 24)
B2b (50f2P): 1 (total 1)
E1b1a (M2): 4, 2, 1, 2, 1, 2, 1, 1, 1 (total 15)
E1b1a7 (M191): 2, 1, 1, 1, 1 (total 6)
E1b1a7a2 (P115): 1, 2 (total 3)
E1b1a7a3 (P116): 8, 3, 3, 2, 4, 2, 4 (total 26)
E1b1a8 (U175): 1 (total 1)
E1b1a8a (U209): 4, 1, 1, 24, 2, 23, 3, 10, 3, 4, 9, 5, 3, 10, 5, 2 (total 109)
E2 (M75): 3, 1, 2, 2, 1, 2, 1, 1, 2, 2 (total 17)
R (M207): 2, 2, 1, 1, 1, 1, 4, 6 (total 18)

Loci M96 and P2 are basal nodes for E haplogroup and are not reported. List of abbreviations following the order of the table: Tiv (Tiv), Ido (Idoma), Iga (Igala), Bak (Bakaka), Ewo (Ewondo), Bas (Bassa), Ngo (Ngoumba), FanC (Fang Cameroon), Bam (Bamileke), BatN (North Bateke), Ben (Benga), Dum (Duma), Kot (Kota), Mak (Makina), Nze (Nzebi), Ndu (Ndumu), FanG (Fang Gabon).



Table 2 Intrapopulation diversity indices for Y chromosome data

                     Haplogroups (SNPs)                    Haplotypes (STRs)
Population    N    N haplogroups  Haplogroup diversity  N haplotypes  Haplotype diversity  MNPD           MNPD/region     WPMP            WIMP
Nigeria                                                                                                   6.541 (±3.112)  6.463 (±0.555)  2.914 (±1.401)
TIV           52   9              0.564 (±0.079)        43            0.990 (±0.006)       6.41 (±3.086)
IDOMA         40   8              0.561 (±0.086)        37            0.996 (±0.006)       6.62 (±3.192)
IGALA         40   6              0.566 (±0.075)        37            0.996 (±0.006)       6.37 (±3.084)
Cameroon                                                                                                  6.131 (±2.931)  5.616 (±0.864)  4.264 (±1.158)
BAKAKA        43   5              0.577 (±0.051)        22            0.952 (±0.015)       4.38 (±2.204)
EWONDO        26   6              0.646 (±0.075)        19            0.960 (±0.025)       5.49 (±2.731)
BASSA         41   7              0.651 (±0.072)        34            0.984 (±0.012)       6.09 (±2.961)
NGOUMBA       15   4              0.761 (±0.066)        14            0.990 (±0.028)       8.04 (±3.961)
FANGC         12   5              0.727 (±0.113)        12            1.000 (±0.034)       5.85 (±3.007)
BAMILEKE      32   4              0.707 (±0.039)        31            0.998 (±0.009)       5.93 (±2.908)
Congo                                                                                                     –               –               –
BATEKEN       19   6              0.801 (±0.055)        17            0.988 (±0.021)       6.46 (±3.197)
Gabon                                                                                                     6.621 (±3.142)  6.519 (±0.810)  4.219 (±1.164)
BENGA         22   6              0.632 (±0.104)        17            0.974 (±0.022)       5.93 (±2.944)
DUMA          32   9              0.802 (±0.047)        27            0.987 (±0.012)       6.77 (±3.274)
KOTA          21   6              0.742 (±0.068)        19            0.990 (±0.018)       6.39 (±3.156)
MAKINA        32   6              0.729 (±0.062)        30            0.996 (±0.009)       6.76 (±3.270)
NZEBI         25   8              0.810 (±0.063)        24            0.996 (±0.012)       6.37 (±3.122)
NDUMU         33   9              0.822 (±0.047)        32            0.998 (±0.008)       7.13 (±3.431)
FANGG         20   5              0.847 (±0.047)        18            0.989 (±0.019)       6.50 (±3.893)

Fourteen Y-chromosome STRs have been used for the estimations. Loci DYS389II and DYS385 a/b were excluded from the estimates because of their phylogenetic uncertainty, as recommended by Gusmao et al. (2006). MNPD, mean number of pairwise differences; WPMP, weighted interpopulation mean pairwise using relative variance; WIMP, weighted interlineage mean pairwise using relative variance.



Fig. 2 Multidimensional scaling of the genetic distances of the

populations. The stress value (0.203) is acceptable according to

Sturrock & Rocha (2000).

Fig. 3 Principal component analysis based on haplogroup frequencies.


which we observed that the percentage of molecular

variance explained at population level substantially

decreases in Cameroon after excluding the Ngoumba

population (from 10.50% to 4.18%, P = 0.027 for SNPs;

from 11.69% to 8.33%, P < 0.001 for STRs). The removal

of any other population does not lead to comparable

reductions (data not shown). The AMOVA using the geo-

graphical classification (Nigeria, Cameroon, Gabon and

Congo) shows significant variance among groups

(5.90% for SNPs, 1.97% for STRs). In this latter case

however, the proportion of variation due to differences

among groups is lower than that found among popula-

tions within groups. The significant percentage of the


variance detected among populations within groups

using both SNPs and STRs is due to the presence of

Cameroonian populations, as shown by regional AMOVAs

(Table 3). Conversely, no significant differentiation

among groups of populations was found when popula-

tions were grouped according to their linguistic affilia-

tion, even after removing the group including Idoma

and Igala from the analysis, which is linguistically het-

erogeneous (data not shown). This suggests a lack of

correlation between paternal lineage distribution and

linguistic diversity.

In order to obtain further insights into the geographic

distribution of the genetic diversity, a sPCA was per-

formed using haplogroup frequencies (Jombart et al.

2008). The plots identify two groups of populations

(Fig. 4), Nigeria, Bakaka and Bamileke from Cameroon

on the one hand, and the remaining populations on the

other, with the exception of Ndumu from South Gabon

which shows a positive score (Fig. 4c). The strongest

genetic differentiation is found at the border between

these two geographic areas (as indicated by the increasing density of white lines in Fig. 4b). The highest eigenvalue obtained is the most positive one, which is associated with the global structure. According to the test

of significance, the geographical distribution of the

genetic variability was found to be compatible with a

random global structure, the P-value of the Monte-Carlo

test being 0.156 and the observed value 0.119 (see

Fig. 4d).
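The logic of the Monte-Carlo test can be illustrated with a minimal sketch. It is not the sPCA global test itself: as a simplifying assumption it uses Moran's I on a single variable as the measure of spatial autocorrelation, and the function names, the toy weight matrix and the number of permutations are all hypothetical. The P-value is the share of permuted datasets whose statistic is at least as large as the observed one, so a value such as 0.156 means that a comparable spatial structure arises fairly often by chance.

```python
import random


def morans_i(values, weights):
    """Moran's I of one variable for a symmetric spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


def permutation_test(values, weights, n_perm=999, seed=0):
    """One-sided test: is the observed autocorrelation larger than expected
    when the values are randomly reassigned to the sampling sites?"""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            hits += 1
    # The observed arrangement counts as one of the possible permutations.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical example: six sites along a transect, neighbours connected.
sites = 6
chain = [[1 if abs(i - j) == 1 else 0 for j in range(sites)]
         for i in range(sites)]
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # a smooth gradient along the transect
obs, p = permutation_test(scores, chain)
```

In this toy gradient neighbouring sites carry similar values, so the observed statistic is positive and few random reassignments match it.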

To infer demographic parameters, 16 individuals with

missing data were excluded from the dataset (giving a

total of 489 samples), while Bateke and populations

from Gabon were pooled on the basis of their geo-

graphical closeness and lack of statistically significant

genetic diversity. It should be noted that our demo-

graphic estimates are associated with wide and partially

overlapping confidence intervals, a problem often

encountered when applying Bayesian methods. How-

ever, the reliability of our results is supported by the

convergence for the three runs we performed on each

dataset and further strengthened by a previous study

showing that the number of loci we used is sufficient to

achieve correct point estimates, although the variance

associated to the posterior distribution is high (Shi et al.

2010). The posterior mutation rate estimate agrees with the one reported by Zhivotovsky et al. (2004) (6.97 × 10⁻⁴; Fig. S3, Supporting information). A time since expansion of ~8.0 kya for the whole dataset was obtained, with an initial effective size of ~2800 individuals.

Approximate mode, median and mean posterior values

for the main parameters estimated are shown in

Table 4. The same simulation was performed on the

two sPCA groups of populations. Estimates for the spa-

tial group including Tiv, Idoma, Igala, Bakaka and


Fig. 4 Spatial Principal Component Analysis based on haplogroup frequencies. The represented component is the most positive one,

containing the information regarding the global pattern. (a) Relative geographical positions of populations under study. The reticula-

tion presented was chosen only for graphical reasons. This is because the matrix of distances used in the sPCA analysis would have

connected all possible population pairs, complicating the visualization of the objects within the figure. (b) Graphical interpolation of

population scores. The darkest regions represent positive scores relative to the first component, while the whitest regions represent

negative ones. The proximity of white lines is proportional to the degree of genetic differentiation. (c) Single population scores are

represented with black/white squares, with black associated with positive values and white with negative ones. Square size is proportional to the absolute value, which stands for the degree of differentiation. (d) On the abscissa, values of spatial autocorrelation for randomized allelic frequencies obtained through simulations (100 000 permutations); on the ordinate, frequency of class values.

Table 3 Analyses of the molecular variance (AMOVA)

                       Among groups        Among populations    Within populations
                       YSNPs     YSTRs     YSNPs     YSTRs      YSNPs      YSTRs
All samples            –         –         8.17**    5.35**     91.83**    94.65**
Nigeria                –         –         −0.43†    0.90       100.43**   99.10**
Cameroon               –         –         10.50**   11.69**    89.50**    88.31**
Gabon                  –         –         0.97      1.57       99.03**    98.43**
Linguistic groups      0.17      0.26      8.04**    5.05**     91.79**    94.70**
Geographical groups    5.90*     1.97*     3.88**    3.82**     90.22**    94.21**

Values are in percentage. All analyses have been performed using either haplogroup (SNPs) or haplotype (STRs) information (see Materials and Methods for further details on linguistic group assignation).

*P < 0.05; **P < 0.001.

†When haplotypes randomly drawn from different populations have a higher probability of being identical compared to haplotypes taken from the same population, the AMOVA algorithm may produce small negative values (Excoffier et al. 1992).
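The origin of the negative percentage marked with † can be seen in a small sketch of a one-level AMOVA. It assumes a simple 0/1 distance between haplotypes (identical or different) and uses the usual variance-component estimators; the function name and the toy populations are hypothetical, and the figures are not taken from the dataset.

```python
from itertools import combinations


def one_level_amova(populations):
    """populations: dict mapping a population name to a list of haplotype labels.

    Returns (sigma2_among, sigma2_within, phi_st) for a 0/1 identity distance.
    """
    pooled = [h for haps in populations.values() for h in haps]
    n_total = len(pooled)
    n_pops = len(populations)

    def ssd(haps):
        # Sum of squared pairwise distances divided by the sample size.
        differing = sum(1 for a, b in combinations(haps, 2) if a != b)
        return differing / len(haps)

    ssd_total = ssd(pooled)
    ssd_within = sum(ssd(haps) for haps in populations.values())
    ssd_among = ssd_total - ssd_within

    msd_among = ssd_among / (n_pops - 1)
    msd_within = ssd_within / (n_total - n_pops)
    # Average sample size corrected for unequal population sizes.
    n0 = (n_total - sum(len(h) ** 2 for h in populations.values()) / n_total) \
        / (n_pops - 1)

    sigma2_within = msd_within
    sigma2_among = (msd_among - msd_within) / n0
    total = sigma2_among + sigma2_within
    phi_st = sigma2_among / total if total != 0 else float("nan")
    return sigma2_among, sigma2_within, phi_st


# Two populations with identical composition: haplotypes drawn from different
# populations match more often than haplotypes drawn from the same one.
same = one_level_amova({"A": ["h1", "h2"], "B": ["h1", "h2"]})

# Two populations fixed for different haplotypes: all variance is among them.
fixed = one_level_amova({"A": ["h1"] * 3, "B": ["h2"] * 3})
```

In the first toy case the among-population component is negative, so the within-population share exceeds 100%, which is the same pattern as the Nigeria row for SNPs.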



Table 4 Posterior estimations of demographic parameter values obtained using BATWING

           NA       NA (95% CI)    t0       t0 (95% CI)     r        r (95% CI)       T        T (95% CI)
All populations
  Mode     2 800    1 500–6 700    7 970    2 400–48 000    0.0065   0.0025–0.0124    50 000   16 000–233 000
  Median   3 140                   10 180                   0.0068                    59 000
  Mean     3 500                   12 898                   0.0072                    71 600
Group 1
  Mode     1 804    905–4 600      10 550   2 400–83 400    0.0046   0.0024–0.0092    45 000   12 800–254 600
  Median   1 991                   13 600                   0.0049                    54 500
  Mean     2 226                   19 300                   0.0052                    67 600
Group 2
  Mode     3 360    2 100–8 300    6 100    2 024–25 300    0.0091   0.0047–0.0179    61 200   23 000–256 600
  Median   3 644                   6 610                    0.0096                    70 700
  Mean     4 023                   7 990                    0.0010                    84 200

NA, effective ancestral population size; t0, time to start of population growth; r, population growth rate; T, time to the most recent

common ancestor. Time is given in years. Group 1 corresponds to populations with a positive score in sPCA analysis, with the

exception of Ndumu population from Gabon (see Methods and Results for further details). Group 2 includes populations presenting

a negative score in sPCA analysis.


Bamileke point to a time since expansion of 10.55 kya,

while the most likely effective population size was

around 1800 individuals. A more recent time since

expansion (6.10 kya) and an almost double effective population size (~3 800) were obtained for the other group, composed of all Gabonese and some Cameroonian populations (Table 4, Fig. 5).
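To make the parameters of Table 4 more concrete, the following sketch shows how an ancestral size, a time since expansion and a growth rate combine under a constant-then-exponential growth model. Several points are assumptions rather than statements from the text: that r is a per-generation rate, that the model has this simple two-phase form, and the generation time of 25 years, which is purely hypothetical. The function name is also illustrative.

```python
import math


def effective_size(years_ago, n_ancestral, growth_rate, t0_years,
                   generation_years):
    """Effective size at a given time before present.

    The population is assumed constant at n_ancestral until t0_years ago
    and to grow exponentially at growth_rate per generation afterwards.
    """
    if years_ago >= t0_years:
        return n_ancestral
    generations_of_growth = (t0_years - years_ago) / generation_years
    return n_ancestral * math.exp(growth_rate * generations_of_growth)


# Modal values for all populations from Table 4; the 25-year generation
# time is a hypothetical choice made only for this illustration.
GENERATION_YEARS = 25
at_onset = effective_size(7970, 2800, 0.0065, 7970, GENERATION_YEARS)
at_present = effective_size(0, 2800, 0.0065, 7970, GENERATION_YEARS)
```

Because the growth term is exponential, the implied present-day size is very sensitive to both r and t0, which is one reason why the wide credible intervals reported for these parameters matter.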

As a methodological choice, the Ndumu population was excluded from the analysis, because the sign of its sPCA score, although close to zero, contrasts with that of its neighbours. In fact, it seemed unlikely that a demographic expansion which occurred in the Bantoid region could have also involved this distant population. However, grouping it with the populations settled in the forest would have been in contrast with the use of sPCA as a method to define the groups on which to perform demographic inferences. It is nonetheless reassuring that, even when Ndumu is included in the black (10.55 vs. 10.79 kya) or in the white squared group (6.10 vs. 5.92 kya), the estimates of the time since expansion change only slightly. Finally, to understand the demographic history of this population, we analysed it separately, obtaining a time since expansion of ~4.8 kya, in agreement with the trend shown by the forest region.

Fig. 5 A physical map of the region under study with sPCA

score for each population. In red, the Benue River Valley

region. In purple, the area of distribution of Bantoid languages.

In green, the upper bound of the Equatorial rainforest (from

Bartholome et al. 2002).

Discussion

A male perspective on the genetic structure of Bantu-speaking populations

As a contribution to the knowledge of the human pre-

history of the African continent south of the Sahara des-

ert, we surveyed a number of populations settled in a

broad transect encompassing the area where the Bantu


expansion is supposed to have originated (Benue River

Valley) and part of the western stream (Cameroon,

Congo and Gabon). In order to better exploit the poten-

tial usefulness of Y-chromosomal polymorphisms for


the analysis of the evolutionary history of Bantu-speak-

ing populations, we analysed both SNP and STR poly-

morphisms. The substantial agreement among the MDS

using genetic distances, PCA based on haplogroup fre-

quencies, and AMOVA carried out using the two types of

polymorphisms indicates that our results provide an

adequate and robust picture of Y-chromosomal diver-

sity and there is no substantial ascertainment bias asso-

ciated with the use of SNPs alone (Wilder et al.

2004a,b).

Our results do not show a clear relationship between

genetic variation and linguistic diversity. This is well

exemplified by the Nigerian populations, where the low heterogeneity among the three populations surveyed (as consistently shown by MDS, PCA, AMOVA and the WIMP/MNPD ratio) contrasts with their different languages, i.e. Bantoid, Yoruboid and Idomoid (see Table S1). At the same time, we observed a high level

of genetic diversity among Cameroonian populations

despite the fact they have a common linguistic affilia-

tion (Bantu), the only exception being the Bamileke

who speak a Bantoid language. This is consistent with

previous regional studies on Y chromosome diversity

carried out in sub-Saharan Africa or in other continents

which failed to detect a robust correlation between

genetic and linguistic distances (Lane et al. 2002; Coia

et al. 2009; Mona et al. 2009; Veeramah et al. 2010).

However, cases have been shown where linguistic affili-

ation proved to be a good predictor of genetic diversity

both in Africa and elsewhere (Poloni et al. 1997; Hassan

et al. 2008; Mirabal et al. 2009; Cruciani et al. 2010).

This study adds new information to the current

knowledge of co-evolution between genetic and cultural

traits in sub-Saharan populations. In fact, the increased

level of resolution of the SNP panel used in this study

highlights previously undetected variation within E1b1a

(M2), the diagnostic haplogroup of Bantu-speaking pop-

ulations (Jobling et al. 2004; Beleza et al. 2005; Wood

et al. 2005; Berniell-Lee et al. 2009). In this way, we

were able to detect some noteworthy differences within

and among Bantu-speaking populations, mostly due to

haplogroups E1b1a7a (U174), E1b1a8a (U209) and

E1b1a8a1 (U290), which contribute to their high level of

interpopulation differentiation and to the presence of

distinct regional patterns of genetic variation. All these

findings contradict the current view of Bantu speakers

as a homogeneous group of populations whose gene

pools are mostly if not exclusively the result of a rela-

tively recent population expansion (Cavalli-Sforza et al.

1994; Berniell-Lee et al. 2009). In fact, the strongest sig-

nal of diversity is given by Cameroonian populations.

The presence of non-Bantu ethnic groups in this coun-

try raises the possibility that the diversity of Cameroo-

nian populations from other Bantus could be the result

of differential admixture. However, such a scenario is

in contrast with previous studies on Y-chromosome and

nuclear loci which do not support occurrence of gene

flow between the Bantu speakers of South Cameroon

and the Afro-Asiatic and Adamawa populations from

the northern part of the country (Coia et al. 2009; Tishk-

off et al. 2009).

The lack of statistical support for the global structure

observed in the sPCA indicates that genetic affinity is

not consistently greater between neighbouring than dis-

tant populations. This is particularly evident for the

populations settled to the South of the Cameroonian

mountain range (Fig. 5), and could be the consequence

of the low male mobility due to the patrilocal tradition.

However, focusing on a narrower area, the same analy-

sis suggests a genetic change in Central Cameroon,

which approximately coincides with and could be

related to the presence of high mountain ranges (Ba-

menda, Bamileke, and Mambilla highlands, or western

highlands with a mean height of 2000 m; Fishpool &

Evans 2001). Further population sampling and addi-

tional genetic information are needed to confirm this

local pattern.

Demographic dynamics along the western stream of the Bantu expansion

In order to gain insights into the past demographic

dynamics of the western stream of the Bantu expansion,

we used a Bayesian coalescent approach. Our analysis

differs from previous studies on Bantu-speaking popu-

lations in that we performed demographic inferences

based on population data instead of single lineages

(Zhivotovsky et al. 2004; Berniell-Lee et al. 2009). This

choice was based on previous observations suggesting

that the frequency of Y-haplogroups might vary sub-

stantially across generations due to fluctuations in the

effective population size among lineages (Zhivotovsky

et al. 2006). Such perturbations, which could be due to

both stochastic and selective processes (Pritchard et al.

1999), could act as confounding factors for evolutionary

inferences based on single lineages.

Our results point to a general pre-agricultural expansion time of ~8.0 kya in Central/Western Africa. This is in accordance with previous studies on Y chromosome variation in sub-Saharan Africa which have detected signatures of pre-Neolithic expansions (Pritchard et al. 1999; Shi et al. 2010). However, some differences in time estimates can be found across datasets. Our data point to a more recent time frame (~8.0 kya) compared to previous results obtained at a continental level (~15.0 kya, Pritchard et al. 1999). This discrepancy could

be explained by the presence in their dataset of popula-

tions such as Bantu farmers and hunter-gatherers, which



are known to have undergone an ancient separation and

experienced different demographic histories (Excoffier &

Schneider 1999; Patin et al. 2009; Batini et al. 2011). In

this regard, it is worth underlining that we obtained con-

siderable differences even in the local demographic his-

tories of populations which are more closely related than

those studied by Pritchard et al. (1999).

Concerning the hypotheses of the expansion of Bantu

languages, linguistic and archaeological knowledge sug-

gest the area between South East Nigeria and West

Cameroon as the origin of the Bantu expansion with a

time frame of 3–5 kya (Greenberg 1955, 1972; Oliver

1966a; Vansina 1984, 1995, 2006). However, our paternal

lineage estimates show older signatures of demographic

expansion. The results for populations from Nigeria

and part of those from Cameroon, indeed, suggest that

a population expansion occurred in the Bantoid area

before the diffusion of Bantu languages (Fig. 5). None-

theless, signatures of a more recent demographic expan-

sion that could be related to the spread of Bantu

languages were detected in the forest area. These results

seem to provide support to the hypothesis of Guthrie

(1962) and Oliver (1966b) who postulated an early diffu-

sion of Bantu languages into the forest. According to

these authors, such an event may have been followed

by a demographic expansion and migration towards

eastern and western directions. In any case, the Bantu

language spread might not have been a direct conse-

quence of a single huge population migration (Lwanga-

Lunyiigo 1976; Ehret 2001; Schoenbrun 2001), since

population movements within sub-Saharan Africa were

probably much more complex and stepwise during the

last millennia.

In conclusion, the signatures we detected in the male

gene pool of the populations of Western Central Africa

depict an evolutionary scenario which is more complex

than suggested or implied by previous research. Our

study reveals so far undetected diversity for lineages

associated to the Bantu expansion, while pointing to a

high level of interpopulation genetic heterogeneity and

highlighting substantial differences in demographic his-

tory from one region to another. Undoubtedly, most of

the points discussed here require further investigations

based on increased samplings and using additional

genetic markers. Nonetheless, we hope that our study

may represent a first step towards a better understand-

ing of the complex genetic and demographic back-

ground behind the spread of the Bantu languages.

Author contributions

Study conception: V.MO., G.D.B., D.C. Field work: V.MO.,

V.MA., O.A. Molecular analysis: V.MO., G.F., C.B. Statistical analysis: V.MO. Manuscript preparation: V.MO.,


G.D.B., D.C. All co-authors have reviewed the manuscript

prior to submission.

Acknowledgements

This study was made possible thanks to the contribution of all

the DNA donors from sub-Saharan Africa. The laboratory of

Molecular Anthropology of Rome and the University of Ibadan

(Nigeria) collaborated for the sampling in the Benue River Val-

ley. This study was supported by Spanish Ministry grant

CGL2007-61016 ⁄ BOS and Generalitat de Catalunya grant

2009SGR1101. A special thank you must go to T. Jombart and

I. Wilson for their precious feedback concerning the methods

they developed. We are grateful to the anonymous reviewers

for their patience and suggestions which helped us improve

this work.

References

Alves-Silva J, da Silva Santos M, Guimaraes PE et al. (2000)

The ancestry of Brazilian mtDNA lineages. American Journal

of Human Genetics, 67, 444–461.

Bakel M (1981) The ‘‘Bantu’’ expansion: demographic models.

Current Anthropology, 22, 688–691.

Balaresque P, Bowden GR, Adams SM et al. (2010) A

predominantly neolithic origin for European paternal

lineages. PLoS Biology, 19, e1000285.

Bandelt HJ, Forster P, Sykes BC, Richards MB (1995)

Mitochondrial portraits of human populations using median

networks. Genetics, 141, 743–753.

Bandelt HJ, Forster P, Rohl A (1999) Median-joining networks

for inferring intraspecific phylogenies. Molecular Biology and

Evolution, 16, 37–48.

Bartholome E, Belward AS, Achard F et al. (2002) GLC 2000:

Global Land Cover Mapping for the Year 2000. EUR 20524 EN.

European Commission, Luxembourg.

Batini C, Lopes J, Behar DM et al. (2011) Insights into the

demographic history of African Pygmies from complete

mitochondrial genomes. Molecular Biology and Evolution, 28,

1099–1110.

Beleza S, Gusmao L, Amorim A, Carracedo A, Salas A (2005)

The genetic legacy of western Bantu migrations. Human

Genetics, 117, 366–375.

Berniell-Lee G, Calafell F, Bosch E et al. (2009) Genetic and

demographic implications of the Bantu expansion: insights

from human paternal lineages. Molecular Biology and

Evolution, 26, 1581–1589.

Cavalli-Sforza LL, Menozzi P, Piazza A (1994) The History and

Geography of Human Genes. Princeton University Press,

Princeton.

Coia V, Destro-Bisol G, Verginelli F et al. (2005) Brief

communication: mtDNA variation in North Cameroon: lack

of Asian lineages and implications for back migration from

Asia to sub-Saharan Africa. American Journal of Physical

Anthropology, 128, 678–681.

Coia V, Brisighelli F, Donati F et al. (2009) A multi-perspective

view of genetic variation in Cameroon. American Journal of

Physical Anthropology, 140, 454–464.

Cruciani F, Santolamazza P, Shen P et al. (2002) A back

migration from Asia to sub-Saharan Africa is supported by


high-resolution analysis of human Y-chromosome

haplotypes. American Journal of Human Genetics, 70, 1197–

1214.

Cruciani F, Trombetta B, Sellitto D et al. (2010) Human Y

chromosome haplogroup R-V88: a paternal genetic record of

early mid Holocene trans-Saharan connections and the

spread of Chadic languages. European Journal of Human

Genetics, 18, 800–807.

De Filippo C, Barbieri C, Whitten M et al. (2011) Y-

chromosomal variation in Sub-Saharan Africa: insights into

the history of Niger-Congo groups. Molecular Biology and

Evolution, 28, 1255–1269.

Destro-Bisol G, Donati F, Coia V et al. (2004) Variation of

female and male lineages in sub-Saharan populations: the

importance of sociocultural factors. Molecular Biology and

Evolution, 21, 1673–1682.

Destro-Bisol G, Jobling MA, Rocha J et al. (2010) Molecular

anthropology in the genomic era. Journal of Anthropological

Sciences, 88, 93–112.

Dray S, Dufour AB (2007) The ade4 package: implementing the

duality diagram for ecologists. Journal of Statistical Software,

22, 1–20.

Ehret C (2001) Bantu expansions: re-envisioning a central

problem of early African history. International Journal of

African Historical Studies, 34, 5–40.

Excoffier L, Schneider S (1999) Why hunter-gatherer

populations do not show signs of pleistocene demographic

expansions. Proceedings of the National Academy of Sciences,

USA, 96, 10597–10602.

Excoffier L, Smouse PE, Quattro JM (1992) Analysis of

molecular variance inferred from metric distances among

DNA haplotypes: application to human mitochondrial DNA

restriction data. Genetics, 131, 479–491.

Excoffier L, Laval G, Schneider S (2005) Arlequin (version 3.0):

an integrated software package for population genetics data

analysis. Evolutionary Bioinformatics Online, 1, 47–50.

Fishpool LDC, Evans MI. (2001). Important bird areas in Africa

and associated islands: priority sites for conservation. In:

Birdlife Conservation Series No. 11 (eds Fishpool LDC and

Evans MI), pp. 133–159. Pisces Publications and BirdLife

International, Newbury and Cambridge.

Gelman A, Rubin DB (1992) Inference from iterative simulation

using multiple sequences. Statistical Science, 7, 457–472.

Geweke J (1992) Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In: Bayesian Statistics 4 (eds Bernardo JM, Berger JO, Dawid AP and Smith AFM), pp. 169–193. Clarendon Press, Oxford.

Gill P, Jeffreys AJ, Werrett DJ (1985) Forensic application of

DNA ‘fingerprints’. Nature, 318, 577–579.

Greenberg JH (1949) Studies in African linguistic classification:

I. The Niger-Congo Family. Southwestern Journal of

Anthropology, 5, 79–100.

Greenberg JH (1955) Studies in African Linguistic Classification.

Compass Press, New Haven, Connecticut.

Greenberg JH (1972) Linguistic evidence regarding Bantu

Origins. Journal of African History, 13, 189–216.

Gusmao L, Butler JM, Carracedo A et al. (2006) DNA

Commission of the International Society of Forensic Genetics

(ISFG): an update of the recommendations on the use of Y-

STRs in forensic analysis. Forensic Science International, 157,

187–197.

Guthrie M (1962) Some developments in the prehistory of the

Bantu languages. Journal of African History, 3, 273–282.

Hammer MF, Karafet TM, Redd AJ et al. (2001) Hierarchical

patterns of global human Y-chromosome diversity. Molecular

Biology and Evolution, 18, 1189–1203.

Hassan HY, Underhill PA, Cavalli-Sforza LL et al. (2008) Y-

chromosome variation among Sudanese: restricted gene

flow, concordance with language, geography, and history.

American Journal of Physical Anthropology, 137, 316–323.

Jobling MA, Hurles ME, Tyler-Smith C (2004) Human

Evolutionary Genetics, Garland Science, New York and

Abingdon.

Johnston HH (1919) A Comparative Study of the Bantu and Semi-

Bantu Languages, vol. 2. Clarendon Press, Oxford.

Jombart T (2008) Adegenet: a R package for the multivariate

analysis of genetic markers. Bioinformatics, 24, 1403–1405.

Jombart T, Devillard S, Dufour AB, Pontier D (2008) Revealing

cryptic spatial patterns in genetic variability by a new

multivariate method. Heredity, 101, 92–103.

Karafet TM, Mendez FL, Meilerman MB et al. (2008) New

binary polymorphisms reshape and increase resolution of

the human Y chromosomal haplogroup tree. Genome

Research, 18, 830–838.

Lane AB, Soodyall H, Arndt S et al. (2002) Genetic

substructure in South African Bantu-speakers: evidence from

autosomal DNA and Y-chromosome studies. American

Journal of Physical Anthropology, 119, 175–185.

Laval G, Patin E, Barreiro LB, Quintana-Murci L (2010)

Formulating a historical and demographic model of recent

human evolution based on resequencing data from

noncoding regions. PLoS ONE, 5, e10284.

Lewis M, Paul (ed.), (2009) Ethnologue: Languages of the World,

Sixteenth edition, Dallas, Tex, SIL International. Online

version: http://www.ethnologue.com/

Lwanga-Lunyiigo S (1976) The Bantu problem reconsidered.

Current Anthropology, 17, 282–286.

Marten L (2006) Bantu classification, Bantu Trees and

phylogenetic methods. In: Phylogenetic Methods and the

Prehistory of Languages (eds Peter F, Colin R), pp. 43–55.

McDonald Institute for Archaeological Research, Cambridge.

Meyer S, Weiss G, von Haeseler A (1999) Pattern of nucleotide

substitution and rate heterogeneity in the hypervariable

regions I and II of human mtDNA. Genetics, 152, 1103–1110.

Mirabal S, Regueiro M, Cadenas AM et al. (2009) Y-

chromosome distribution within the geo-linguistic landscape

of northwestern Russia. European Journal of Human Genetics,

17, 1260–1273.

Mitchell P (2010) Genetics and southern African prehistory: an

archaeological view. Journal of Anthropological Sciences, 88,

73–92.

Mona S, Grunz KE, Brauer S et al. (2009) Genetic admixture

history of Eastern Indonesia as revealed by Y-chromosome

and mitochondrial DNA analysis. Molecular Biology and

Evolution, 26, 1865–1877.

Oliver R (1966a) An inquiry into some problems of early Bantu

history. African Affairs, 65, 245–258.

Oliver R (1966b) The problem of the Bantu expansion. Journal

of African History, 7, 361–376.

Pakendorf B, Stoneking M (2005) Mitochondrial DNA and human evolution. Annual Review of Genomics and Human Genetics, 6, 165–183.



Patin E, Laval G, Barreiro LB et al. (2009) Inferring the

demographic history of African farmers and pygmy hunter-

gatherers using a multilocus resequencing data set. PLoS

Genetics, 5, e1000448.

Pereira L, Gusmao L, Alves C et al. (2002) Bantu and European

Y-lineages in Sub-Saharan Africa. Annals of Human Genetics,

66, 369–378.

Plaza S, Salas A, Calafell F et al. (2004) Insights into the

western Bantu dispersal: mtDNA lineage analysis in Angola.

Human Genetics, 115, 439–447.

Plummer M, Best N, Cowles K, Vines K (2006) CODA:

convergence diagnosis and output analysis for MCMC. R

News, 6, 7–11.

Poloni ES, Semino O, Passarino G et al. (1997) Human genetic

affinities for Y-chromosome P49a,f ⁄ TaqI haplotypes show

strong correspondence with linguistics. American Journal of


Poncet P (2009) modeest: Mode Estimation. R package version 1.09.

Pritchard JK, Seielstad MT, Perez-Lezaun A, Feldman MW

(1999) Population growth of human Y chromosomes: a study

of Y chromosome microsatellites. Molecular Biology and

Evolution, 16, 1791–1798.

R Development Core Team (2008) R: A language and

environment for statistical computing, R Foundation for

Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0,

URL http://www.R-project.org.

Rosa A, Ornelas C, Jobling MA et al. (2007) Y-chromosomal

diversity in the population of Guinea-Bissau: a multiethnic

perspective. BMC Evolutionary Biology, 27, 7–124.

Salas A, Richards M, De la Fe T et al. (2002) The making of the

African mtDNA landscape. American Journal of Human

Genetics, 71, 1082–1111.

Scheinfeldt LB, Soi S, Tishkoff SA (2010) Colloquium paper:

working toward a synthesis of archaeological, linguistic, and

genetic data for inferring African population history.

Proceedings of the National Academy of Sciences, USA,

107(Suppl 2), 8931–8938.

Schoenbrun D (2001) Representing the Bantu expansions: What’s at stake? International Journal of African Historical Studies, 34, 1–4.

Shi W, Ayub Q, Vermeulen M et al. (2010) A worldwide

survey of human male demographic history based on Y-SNP

and Y-STR data from the HGDP-CEPH populations.

Molecular Biology and Evolution, 27, 385–393.

Sims LM, Garvey D, Ballantyne J (2007) Sub-populations

within the major European and African derived haplogroups

R1b3 and E3a are differentiated by previously

phylogenetically undefined Y-SNPs. Human Mutation, 28, 97.

Slatkin M (1995) A measure of population subdivision based

on microsatellite allele frequencies. Genetics, 139, 457–462.

Sokal RR, Rohlf FJ (1995) Biometry, WH. Freeman and

Company, New York.

Sturrock K, Rocha J (2000) A multidimensional scaling stress

evaluation table. Field Methods, 12, 49–60.

Thomas MG, Parfitt T, Weiss DA et al. (2000) Y chromosomes

traveling south: the cohen modal haplotype and the origins

of the Lemba – the ‘‘Black Jews of Southern Africa’’.

American Journal of Human Genetics, 66, 674–686.

Tishkoff SA, Reed FA, Friedlaender FR et al. (2009) The genetic

structure and history of Africans and African Americans.

Science, 324, 1035–1044.


Underhill PA, Shen P, Lin AA et al. (2000) Y chromosome

sequence variation and the history of human populations.

Nature Genetics, 26, 358–361.

Underhill PA, Passarino G, Lin AA et al. (2001) The

phylogeography of Y chromosome binary haplotypes and

the origins of modern human populations. Annals of Human

Genetics, 65, 43–62.

Vansina J (1984) Western Bantu expansion. Journal of African

History, 25, 129–145.

Vansina J (1995) New linguistic evidence and ‘The Bantu

Expansion’. Journal of African History, 36, 173–195.

Vansina J (2006) Linguistic evidence for the introduction of

ironworking in Bantu-speaking Africa. History in Africa, 33,

321–361.

Veeramah KR, Connell BA, Pour NA et al. (2010) Little genetic

differentiation as assessed by uniparental markers in the

presence of substantial language variation in peoples of the

Cross River region of Nigeria. BMC Evolutionary Biology, 10, 92.

Wilder JA, Kingan SB, Mobasher Z, Pilkington MM, Hammer

MF (2004a) Global patterns of human mitochondrial DNA

and Y-chromosome structure are not influenced by higher

migration rates of females versus males. Nature Genetics, 36,

1122–1125.

Wilder JA, Mobasher Z, Hammer MF (2004b) Genetic evidence

for unequal effective population sizes of human females and

males. Molecular Biology and Evolution, 21, 2047–2057.

Wilson IJ, Weale ME, Balding DJ (2003) Inferences from DNA

data: population histories, evolutionary processes and

forensic match probabilities. Journal of Royal Statistical Society,

166, 155–201.

Wood ET, Stover DA, Ehret C et al. (2005) Contrasting patterns

of Y chromosome and mtDNA variation in Africa: evidence

for sex-biased demographic processes. European Journal of


Zhivotovsky LA, Underhill PA, Cinnioglu C et al. (2004) The

effective mutation rate at Y chromosome short tandem

repeats, with application to human population-divergence

time. American Journal of Human Genetics, 74, 50–61.

Zhivotovsky LA, Underhill PA, Feldman MW (2006) Difference

between evolutionarily effective and germ line mutation rate

due to stochastically varying haplogroup size. Molecular

Biology and Evolution, 23, 2268–2270.

V.MO. is mainly interested in the application of multivariate and Bayesian methods to the study of population genetic structure and demography in human as well as non-human populations. Her current work focuses on co-evolutionary processes at the community level, from the comparison of interspecies phylogenies to the interaction of interspecies population dynamics. G.F. is interested in forensic genetics, human population genetics, species identification (botanical and animal) and the study of SNPs related to phenotypic traits. V.MA.'s research experience concerns the parallels between linguistic and biological evolution in human populations and the molecular conservation biology of insects. C.B.'s main interests focus on the ancient history of human populations through the study of genetic variation, with an effort to integrate human evolutionary genetics within the broader context of anthropological studies. O.A. is an entomologist mainly working on the evolution of Anopheles vectors. G.D.B.'s research interests relate to the microevolutionary history of populations living south of the Sahara desert and the effects of socio-cultural factors on genetic structure in human groups. D.C.'s research focuses on the analysis of human genome diversity in order to infer the (genomic and population) processes that have shaped current human variability and to establish their (population and epidemiological) consequences.

Data accessibility

Individual SNP and STR genotypes are available in:

http://dx.doi.org/10.5061/dryad.9112.

Supporting information

Additional supporting information may be found in the online

version of this article.

Fig. S1 Network of individuals integrating SNP and STR haplotype information. A) Phylogenetic network with individuals assigned to haplogroups. B) Phylogenetic network with individuals assigned to populations.

Fig. S2 Loading scores of variables on: A) the first principal component and B) the second principal component of the PCA shown in Figure 3.

Fig. S3 Posterior distributions of mutation rate estimated with

Batwing software for 12 tetranucleotide loci (DYS456, DYS389I,

DYS390, DYS458, DYS19, DYS393, DYS391, DYS439, DYS635,

GATA_H4, DYS437, DYS438). Each curve corresponds to the

estimate obtained for the whole dataset (black), group1 (red),

group2 (green). The comprehensive mode value is 0.00066 with

a 0.05 to 0.95 range of 0.00036–0.00122.

Table S1 List of the populations with the principal sampling

location and its geographic coordinates.

Table S2 Matrix of genetic distances.

Please note: Wiley-Blackwell are not responsible for the content

or functionality of any supporting information supplied by the

authors. Any queries (other than missing material) should be

directed to the corresponding author for the article.

