The visualization above groups, or clusters, the biotech startups of 2023 according to their focus areas. Each cluster color corresponds to a specific domain:
- Orange for advanced molecular techniques (Cluster 0).
- Blue for cell and gene therapies (Cluster 1).
- Green for AI-driven drug discovery (Cluster 2).
- Red for epigenetics and genomic medicine (Cluster 3).
A note on the numbers in the chart
The numbers on the x- and y-axes of the scatter plot were derived from a technique called Principal Component Analysis (PCA). They reflect the two directions along which the startups’ data varies the most. The coordinates are similar to x- and y-coordinates on a map, but instead of representing locations, they capture the characteristics that distinguish each startup in the biotech landscape. Just as two cities can be geographically close, two neighboring startups on this plot have more similar data features.
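To make the idea concrete, here is a minimal sketch of how PCA can turn several features per company into two plotting coordinates. The feature rows and function names are hypothetical, and power iteration is used only to keep the example free of external libraries. It is an illustration under those assumptions, not the analysis behind the chart.

```python
import math

def center(rows):
    """Subtract each column's mean so every feature averages zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def dominant_eigenpair(matrix, iters=500):
    """Approximate the largest eigenvalue and its eigenvector by power iteration."""
    d = len(matrix)
    v = [float(i + 1) for i in range(d)]  # arbitrary non-uniform starting vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue

def pca_2d(rows):
    """Project each row onto the two directions of greatest variance."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    pc1, lam1 = dominant_eigenpair(cov)
    # Deflation: remove the variance explained by pc1, then find the next direction.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2, _ = dominant_eigenpair(deflated)
    return [(sum(r[j] * pc1[j] for j in range(d)),
             sum(r[j] * pc2[j] for j in range(d))) for r in centered]

# Hypothetical rows: three made-up numeric features per startup.
startups = [
    [2.0, 1.0, 0.5],
    [4.0, 2.1, 0.4],
    [6.0, 2.9, 0.7],
    [8.0, 4.2, 0.3],
    [10.0, 5.0, 0.6],
]
for x, y in pca_2d(startups):
    print(f"{x:8.3f} {y:8.3f}")
```

The sign of each axis is arbitrary in PCA, so a plot produced this way may appear mirrored relative to another implementation without changing which points are neighbors.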
To arrive at the four clusters, we applied a series of data analytics and machine learning techniques. The first step was identifying promising companies based largely on funding milestones and the presence of prominent investors. The novelty of each company’s focus also played a role in the analysis. A technique known as K-means clustering then broke the startups into clusters. The “K” in “K-means” refers to the number of clusters. In our case, we chose four. The “means” in “K-means” refers to the method of determining the center of each cluster. Specifically, the algorithm, which is common in unsupervised machine learning, assigns each data point to the nearest cluster center and then recalculates each center as the mean of all the data points in that cluster. The algorithm iterates until the cluster assignments stabilize. At that point, each startup belongs to the cluster whose members it most resembles in terms of the selected data features. The result is the breakdown above.
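The assign-then-recalculate loop described above can be sketched in a few lines. This is a minimal illustration, not the code used for the chart: the points are made up, and the farthest-point seeding of the initial centers is an assumption chosen to keep the example deterministic, since the article does not say how the centers were initialized.

```python
import math

def farthest_point_init(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from all centers chosen so far (an assumed seeding strategy)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers

def kmeans(points, k, max_iters=100):
    """Return (assignments, centers) after the assignments stop changing."""
    centers = farthest_point_init(points, k)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # assignments have stabilized
        assignments = new_assignments
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignments, centers

# Hypothetical 2-D coordinates, e.g. the output of a PCA projection.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (0, 10), (1, 10), (0, 11),
    (10, 0), (11, 0), (10, 1),
]
labels, centers = kmeans(points, k=4)
print(labels)
print(centers)
```

The cluster numbers that come out of a run like this are arbitrary labels. Names such as "AI-driven drug discovery" are an interpretation applied afterward by looking at which companies landed in each group.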
More on vectors
The coordinates displayed when hovering over a company correspond to vector positions derived from its data features. Vectors, which capture direction and magnitude, are useful in machine learning and data science, where they help represent patterns, relationships and semantics in data. For example, researchers at Google developed Word2Vec to represent words as vectors, so that the geometry of the vectors reflects semantic relationships between words. Using Word2Vec embeddings, the operation “France” – “Paris” + “Berlin” might yield a vector close to “Germany.”
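The arithmetic in that example can be illustrated with toy vectors. The four-dimensional vectors below are hand-made for this sketch and are not real Word2Vec embeddings, which are learned from text and typically have hundreds of dimensions. The helper names are likewise hypothetical.

```python
import math

# Hand-made toy vectors (NOT real Word2Vec output).
# Assumed dimensions: [France-related, Germany-related, Spain-related, is-a-capital]
toy_vectors = {
    "France":  (1.0, 0.0, 0.0, 0.0),
    "Paris":   (1.0, 0.0, 0.0, 1.0),
    "Germany": (0.0, 1.0, 0.0, 0.0),
    "Berlin":  (0.0, 1.0, 0.0, 1.0),
    "Spain":   (0.0, 0.0, 1.0, 0.0),
    "Madrid":  (0.0, 0.0, 1.0, 1.0),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(vectors, positive, negative):
    """Add the positive words, subtract the negative ones, and return the
    closest remaining word by cosine similarity."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], query))

# "France" - "Paris" + "Berlin"
print(analogy(toy_vectors, positive=["France", "Berlin"], negative=["Paris"]))
# → Germany
```

Subtracting "Paris" from "France" cancels the France-related dimension and leaves a negative capital component. Adding "Berlin" then cancels that component and contributes the Germany-related one. With real embeddings the result is only approximately equal to the target word's vector, which is why the nearest neighbor is looked up rather than matched exactly.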
Another example of vectors in neural networks is FaceNet, which represents facial images as vectors. The technique works even when a face is rotated or shows a different expression or angle, because images of the same face remain close together in the vector space.
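A rough sketch of what "close in the vector space" means in practice: two embeddings are compared by distance, and a small distance suggests the same face. The three-number embeddings and the threshold below are invented for illustration. Real face embeddings come from a trained network and are much longer, and a usable threshold would have to be tuned on labeled data.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so only its direction matters."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def same_person(emb_a, emb_b, threshold=1.0):
    """Treat two embeddings as the same face if their squared Euclidean
    distance (after normalization) is below a hypothetical threshold."""
    a, b = l2_normalize(emb_a), l2_normalize(emb_b)
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return squared_distance < threshold

# Made-up embeddings standing in for a network's output.
frontal = [0.90, 0.10, 0.20]   # person A, facing the camera
rotated = [0.85, 0.15, 0.25]   # person A, head turned
stranger = [0.10, 0.90, 0.30]  # person B

print(same_person(frontal, rotated))   # → True
print(same_person(frontal, stranger))  # → False
```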
You’ll find more on the methodology of the biotech startup feature here.