
Data science methods, whether referred to as statistics, machine learning, or even artificial intelligence, provide powerful tools for identifying and describing high-level patterns in complex datasets. But how effective are these methods when applied to the cryptocurrency market? Can they be used to review, understand, and potentially predict crypto market trends? In what follows, we apply classic data science techniques to crypto trading data: methods that are not used very often in the crypto world but have proven highly valuable in other fields.

Cryptocurrencies come in all shapes and sizes: they differ quite a bit in how much they’re traded, how their prices move up and down, their overall market value, and how they respond to market shifts and outside events. Each coin has its own personality, so to speak, when it comes to market behavior.

A subset of data science methods, the unsupervised learning techniques, can identify groups and clusters within the data without the need for a predefined target variable. These methods can reveal natural groupings and dense regions, providing insight into the underlying structure of the dataset in terms of similarity.


We apply Principal Component Analysis (PCA), a powerful dimensionality reduction technique, to cryptocurrency market data to identify groups of similar coins based on their market behavior.


First, we need reliable crypto data. We will use Stereotic as the data source (the data is publicly available on Kaggle, or can be parsed directly from Stereotic). While more advanced options are available, we’ll stick with the widely used price candlestick data and trading volumes for the top 100 coins, collected at 60-second intervals since January 3, 2023.


Next, we’ll engineer simple but powerful features: basic statistics and windowing aggregates. For example:

  • how many days the token has been traded,
  • the daily or hourly average and standard deviation of volume or open price,
  • basic statistics of 60-minute rolling windows (average, standard deviation, min, max, Fano factor, i.e. variance divided by mean),
  • ratios of statistics across rolling time windows: the average price of the last hour compared to the previous hour, the standard deviation of the price in a day compared to the day before, etc.,
  • further aggregation of the windowing aggregates into time-independent features, e.g. the average of the price ratio (price change) of consecutive hours.
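A few of the features listed above can be sketched with pandas. This is only a minimal illustration, not the exact pipeline used here: the `close` and `volume` column names, the `window_features` helper, the feature names, and the synthetic candles are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def window_features(candles: pd.DataFrame) -> pd.Series:
    """Summarise one coin's 1-minute candles into time-independent features.

    `candles` is assumed to have a DatetimeIndex plus `close` and `volume`
    columns (hypothetical names chosen for this sketch).
    """
    close = candles["close"]
    volume = candles["volume"]

    # 60-minute rolling window statistics on the price
    roll = close.rolling("60min")
    roll_mean = roll.mean()
    fano = roll.var() / roll_mean  # Fano factor: variance divided by mean

    # Average price of each hour compared to the previous hour
    hourly_mean = close.resample("60min").mean()
    hour_ratio = hourly_mean / hourly_mean.shift(1)

    # Standard deviation of the price in a day compared to the day before
    daily_std = close.resample("1D").std()
    day_std_ratio = daily_std / daily_std.shift(1)

    hourly_volume = volume.resample("60min").sum()

    # Collapse every windowed series into a single number per coin
    return pd.Series({
        "days_traded": float(close.index.normalize().nunique()),
        "hourly_volume_mean": hourly_volume.mean(),
        "hourly_volume_std": hourly_volume.std(),
        "roll60_mean_avg": roll_mean.mean(),
        "roll60_std_avg": roll.std().mean(),
        "roll60_fano_avg": fano.mean(),
        "hour_ratio_avg": hour_ratio.mean(),
        "day_std_ratio_avg": day_std_ratio.mean(),
    })


# Synthetic random-walk candles stand in for real market data.
rng = np.random.default_rng(0)
index = pd.date_range("2023-01-03", periods=3 * 24 * 60, freq="min")


def fake_candles(start_price: float, step: float) -> pd.DataFrame:
    close = start_price + np.cumsum(rng.normal(0.0, step, len(index)))
    volume = rng.uniform(1.0, 10.0, len(index))
    return pd.DataFrame({"close": close, "volume": volume}, index=index)


# One row per coin, one column per feature
features = pd.DataFrame({
    "COIN_A": window_features(fake_candles(100.0, 0.01)),
    "COIN_B": window_features(fake_candles(1.0, 0.0001)),
}).T
print(features.round(4))
```

Each coin ends up as a single row of numbers, which is the shape a dimensionality reduction method expects. Extending the dictionary with more windows and statistics is how one would get to the roughly 100 features per coin.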

These features, around 100 per coin this time, effectively describe the price and volume movements of the coins. Here’s the point: PCA projects the 100 dimensions onto 2 dimensions that are easy to plot (though not directly interpretable on their own), bringing similar items together and pushing dissimilar ones apart. In the resulting plot, each dot is a token, and color and dot size represent average trading volume.
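The projection step itself can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the `pca_2d` helper is hypothetical, the feature table is random synthetic data, and standardising each feature before the decomposition is a choice made here for the example (it keeps large-valued features such as volume from dominating), not something the pipeline above is confirmed to do.

```python
import numpy as np


def pca_2d(features):
    """Project the rows of a (coins x features) table onto the first two
    principal components; also return their explained-variance ratios."""
    X = np.asarray(features, dtype=float)

    # Standardise each feature (an assumption of this sketch)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # constant columns simply become zeros
    Z = (X - mu) / sigma

    # The rows of Vt are the principal directions, ordered by variance
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    coords = Z @ Vt[:2].T  # equivalent to U[:, :2] * S[:2]
    explained = S ** 2 / np.sum(S ** 2)
    return coords, explained[:2]


# Synthetic stand-in: 100 coins, 100 features on very different scales
rng = np.random.default_rng(0)
feature_table = rng.normal(size=(100, 100)) * rng.uniform(0.1, 1000.0, size=100)

coords, explained = pca_2d(feature_table)
print(coords.shape)
print(explained)
```

Each row of `coords` is the position of one coin in the 2D scatter plot. The explained-variance ratios indicate how much of the total spread in the standardised features the two plotted axes actually capture, which is worth checking before reading too much into the picture.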

And here we go: we can identify distinct groups, as well as individual tokens that differ from the majority. It’s no surprise that Bitcoin stands far apart from the rest, and Ethereum and the stablecoins are also quite distinct. But what about NOT, TIA or BONK? I would suggest examining them in detail. Additionally, if we zoom in:

Why are PEPE and BNB so close? Is SOL really unique, like DOGE on the other side?
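Questions like these can be made a bit more concrete by measuring distances in the projected space. The sketch below is only an illustration: the helper names are hypothetical, and the coordinates are made-up placeholders rather than the actual PCA output for any real token.

```python
import numpy as np


def nearest_neighbours(names, coords):
    """For each token, return its closest other token in the 2D projection
    and the Euclidean distance to it."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a token is not its own neighbour
    nearest = dist.argmin(axis=1)
    return {
        name: (names[j], float(dist[i, j]))
        for i, (name, j) in enumerate(zip(names, nearest))
    }


def isolation_ranking(names, coords):
    """Order tokens from most to least isolated, using the distance to the
    nearest neighbour as the isolation score."""
    nn = nearest_neighbours(names, coords)
    return sorted(names, key=lambda name: nn[name][1], reverse=True)


# Placeholder coordinates for hypothetical tokens
names = ["COIN_A", "COIN_B", "COIN_C", "COIN_D"]
coords = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0]]

nn = nearest_neighbours(names, coords)
ranking = isolation_ranking(names, coords)
print(nn["COIN_A"])
print(ranking[0])
```

One caveat: two tokens that sit close together in the 2D projection can still differ along the components that were dropped, so any apparent pairing is best confirmed by comparing the full feature vectors.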


Checking today’s prices of some of these interesting coins on Stereotic reveals a potential use of their diversity: today, Bitcoin hit the $58K mark, and altcoins are generally bearish as well. BONK, however, escaped the same downtrend with a much smaller decrease. This might be related to its unique features, and it can be very useful when actively managing a crypto portfolio.

These PCA visualization groups might already be useful, providing unsupervised insights into what to check. Next time, we will go further and deploy other data science methods to uncover and describe more similarities, as well as to find and evaluate recurring patterns.
