The PCA Matrix

The PCA matrix (P matrix) is an <m>n*p</m> matrix, where <m>n</m> is the number of genotypes and <m>p</m> is the number of principal components. It can be used in place of the Q matrix. The P matrix is used to control for population structure among the genotypes; ideally, the number of principal components used reflects the number of populations in the sample. In Matapax, the P matrix is calculated with the NIPALS algorithm, and only the first three principal components are used.
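
To illustrate how a P matrix of this shape can be obtained, the following is a minimal NIPALS sketch in Python with NumPy. It is not Matapax's own code: the function name nipals_pca, the column centring, the starting vector, the tolerance and the iteration limit are all assumptions made for the example, and the input is assumed to be a numeric genotype matrix with one row per genotype and one column per marker.

```python
import numpy as np


def nipals_pca(X, n_components=3, tol=1e-9, max_iter=500):
    """Sketch of NIPALS PCA (hypothetical helper, not Matapax code).

    Returns (scores, loadings); scores is the n x n_components matrix
    that plays the role of the P matrix.
    """
    X = np.asarray(X, dtype=float)
    # Assumption: markers (columns) are mean-centred before the analysis.
    R = X - X.mean(axis=0)
    n, m = R.shape
    scores = np.zeros((n, n_components))
    loadings = np.zeros((m, n_components))

    for k in range(n_components):
        # Start from the residual column with the largest variance.
        t = R[:, np.argmax(R.var(axis=0))].copy()
        for _ in range(max_iter):
            p = R.T @ t / (t @ t)        # project residual onto scores
            p /= np.linalg.norm(p)       # normalise the loading vector
            t_new = R @ p                # updated score vector
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        scores[:, k] = t
        loadings[:, k] = p
        R = R - np.outer(t, p)           # deflate: remove this component

    return scores, loadings


# Usage with a made-up 6 x 4 genotype matrix (values are illustrative only).
genotypes = np.array([
    [0, 1, 2, 0],
    [0, 1, 2, 1],
    [1, 0, 2, 0],
    [2, 2, 0, 1],
    [2, 1, 0, 2],
    [2, 2, 1, 2],
])
P, _ = nipals_pca(genotypes, n_components=3)
print(P.shape)
```

Because NIPALS extracts one component at a time and then deflates the data, the work stops after the requested components, which suits the case where only the first three are needed. The sign of each component is arbitrary, so scores may differ in sign from those of other PCA implementations.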

There are two common ways to estimate the number of populations.

  • The software STRUCTURE uses a model-based clustering method, with an assumed number of populations K, to estimate a matrix equivalent to the P matrix, called the Q matrix. Although the two matrices describe equivalent things, the Q matrix is much more computationally intensive to obtain. The number of populations is chosen by optimising the log probabilities produced when running STRUCTURE with different values of K. The exact procedure is outlined in the STRUCTURE documentation.
  • The number of populations can also be determined by inspecting the first two principal components in light of geographical information about the sample. Although this approach requires a great deal of biological knowledge, it is faster and potentially more accurate than the previous method. It is outlined in the supplementary material of Horton et al. (2012).