It's been a while since the last Clusters Galore analysis, so I've decided to run one on my recently assembled dataset, restricted to individuals who belong to the Six main West Eurasian components.
To begin, I identified 945 individuals in my set whose combined admixture proportions in the Six exceeded 95%. I then ran MDS on this set, keeping 50 dimensions.
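As a rough illustration of the inclusion step, here is a minimal Python sketch of the "more than 95% in the Six" filter. The component labels (`c1` to `c6`), the individual IDs, and the proportions are all made up for the example; they are not taken from my dataset, and the real analysis was not necessarily done this way.

```python
# Hypothetical labels standing in for the Six components (assumption).
SIX = ("c1", "c2", "c3", "c4", "c5", "c6")

def in_the_six(proportions, threshold=0.95):
    """Return True if the Six components sum to more than `threshold`."""
    return sum(proportions.get(c, 0.0) for c in SIX) > threshold

# Made-up admixture proportions (fractions of 1) for three individuals.
individuals = {
    "ind_A": {"c1": 0.50, "c2": 0.30, "c3": 0.18, "other": 0.02},
    "ind_B": {"c1": 0.40, "c4": 0.40, "other": 0.20},
    "ind_C": {"c5": 0.60, "c6": 0.37, "other": 0.03},
}

# Keep only individuals who pass the inclusion threshold.
kept = [name for name, props in individuals.items() if in_the_six(props)]
print(kept)
```

Only the retained individuals would then go into the MDS step.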
One of the open issues in Clusters Galore analysis is how many MDS dimensions to retain. So far, I've applied a heuristic: choose the number of MDS dimensions that maximizes the number of clusters inferred by MCLUST. However, when I actually inspect the MDS plots, it often turns out that meaningful information seems to be present in even higher MDS dimensions. As a result, I've decided to pick the number of dimensions in the following manner.
The main idea is that data points in uninformative MDS dimensions will look largely like Gaussian noise. So, we can use a test of normality (I've chosen the Shapiro-Wilk test) to detect dimensions that appear not to be noise. Below are the p-values of this test for the different MDS dimensions:
Up to 22 dimensions, there is a strong non-Gaussian signal (all p-values less than 0.001). Hence, I used the first 22 dimensions in the MCLUST analysis. With these dimensions, the number of inferred clusters was estimated as 35. This is roughly a 6-fold increase in resolution (35 clusters versus 6 components) over the Six components inferred by ADMIXTURE.
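The selection rule can be sketched in Python as follows. This is only an illustration on synthetic data, not the code behind the results above: it assumes NumPy and SciPy are available, uses `scipy.stats.shapiro` as the normality test, and reads "up to 22 dimensions" as "the leading run of dimensions whose p-value is below 0.001". The helper `leading_informative` is a name invented for this example.

```python
import numpy as np
from scipy.stats import shapiro

def leading_informative(p_values, alpha=0.001):
    """Count the leading dimensions whose p-value is below alpha."""
    count = 0
    for p in p_values:
        if p >= alpha:
            break
        count += 1
    return count

rng = np.random.default_rng(0)
n = 945  # same sample size as in the post; the coordinates are synthetic

# Three "structured" dimensions: well-separated two-cluster mixtures.
labels = rng.integers(0, 2, size=(n, 3))
structured = labels * 6.0 + rng.normal(size=(n, 3))
# Three "noise" dimensions: plain Gaussian draws.
noise = rng.normal(size=(n, 3))
coords = np.hstack([structured, noise])

# Shapiro-Wilk p-value for each dimension (column).
p_values = []
for j in range(coords.shape[1]):
    _, p = shapiro(coords[:, j])
    p_values.append(p)

n_keep = leading_informative(p_values)
print(n_keep, [f"{p:.3g}" for p in p_values])
```

The first `n_keep` columns would then be passed on to the clustering step. Stopping at the first dimension that fails the threshold is a design choice of this sketch; one could instead keep every dimension with a small p-value, wherever it falls.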
The cluster totals for the different populations can be seen in the spreadsheet.
Important Caveat: Some populations (e.g., Finnish_D or Turkish_D) have many individuals who do not meet the "95% in the Six" inclusion threshold. Hence, results are not representative for them, and simply indicate the cluster assignments of the subsets that do meet the threshold. You can check whether individuals have been removed from the original dataset by comparing sample sizes in the Clusters Galore spreadsheet with those in the K12a one.
Here are some observations on the 35 clusters. I list the modal population (or region) for each one:
- Ashkenazi
- Scandinavian
- French
- British Isles
- Armenian
- S Italian/Sicilian
- Kurd
- Greek
- Cypriot
- Balto-Slavic
- Hungarian
- Balkan
- Sephardic
- Spanish
- Iberian
- North Italian/Tuscan
- Morocco Jews (main)
- Saudis
- Georgian/Abkhazian
- Basque
- Bedouin
- Druze #1
- Druze #2
- Druze (main)
- Mozabite (main)
- Mozabite #1
- Orkney
- Sardinian
- Azerbaijan Jews
- Iran/Iraq Jews
- Lezgins
- Morocco Jews #1
- Samaritan
- Yemen Jews
- Abkhazian