Given the recent surge in NFT sales and prices, especially in generative NFTs such as those from Art Blocks, it would be interesting to see if the number of sales and the price of a collection can be predicted. This article makes use of some simple statistical & machine learning models to predict whether a collection from Art Blocks minted before August 1st 2021 will have a resale in the secondary market in August; and how much the collection will be sold for in August. Credits to Flipside Crypto for providing the data needed to build these prediction models.
In general, there are two levels of predictions you can do:
- Collection level prediction: predict which collections will be sold in August and for how much. i.e. among all the collections / projects in Art Blocks, based on the features of each collection such as mint duration, number of tokens minted or average sale price, which collections will have sales in August and at what price.
- Token level prediction: predict which specific token within a collection will have sales in August and how much it would be sold for. i.e. within the Chromie Squiggle collection, based on the different traits and features, which specific token will sell in August and for how much.
Since each collection is different and will have different token level features and traits, the token sale or price prediction needs to be built for each individual collection. This article will only focus on the collection level prediction for all Art Blocks projects.
Art Blocks collection data consolidated by Flipside Crypto is transformed into two structures:
- Time-series data: contains time-dependent information of the collection i.e. sales volume and price. This data structure is used for regression models.
- Non-time series data: contains static non-time dependent information of the collection i.e. artist name, aspect ratio, curation status etc. This data structure is used for machine learning models i.e. Decision Tree, Random Forest, Gradient Boosting.
Data fields taken directly from the Flipside Crypto database, i.e. artist, aspect_ratio, curation_status etc., along with the fields I created as shown below, are used to train the predictive models.
COUNT_TOKEN: total number of tokens of the collection
DAYS_SINCE_MINT: number of days between the mint date (created_at_timestamp) and September 1st 2021
FEATURE_NUMBER: number of features of the collection
TRAITS_NUMBER: number of traits of the collection
MINT_CURRENCY: same as tx_currency
MINT_DURATION: number of minting days
AUGUST_SALE_COUNT: number of sales in August 2021 from the collection
AUGUST_SALE_PRICE: average sale price in August 2021 from the collection
YEAR_MONTH: the year and month of
SALE_COUNT: number of sales of the collection in the particular month
PRICE_USD: same as price_usd in the nft_events table
PRICE_RANGE: difference between the minimum and maximum price_usd of the collection in the particular month
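As a rough sketch of how the monthly fields above can be derived with pandas — the column names `collection`, `block_timestamp` and `price_usd` are assumptions standing in for the actual Flipside schema, and the rows are invented:

```python
import pandas as pd

# Toy stand-in for the Flipside sales events; the column names here are
# assumptions, not the exact nft_events schema
sales = pd.DataFrame({
    "collection": ["A", "A", "A", "B"],
    "block_timestamp": pd.to_datetime(
        ["2021-06-03", "2021-06-20", "2021-07-11", "2021-07-02"]),
    "price_usd": [120.0, 300.0, 450.0, 80.0],
})

# YEAR_MONTH buckets each sale into a calendar month
sales["YEAR_MONTH"] = sales["block_timestamp"].dt.to_period("M")

# One row per collection-month with the monthly fields described above
monthly = (
    sales.groupby(["collection", "YEAR_MONTH"])["price_usd"]
    .agg(SALE_COUNT="count",
         PRICE_USD="mean",
         PRICE_RANGE=lambda p: p.max() - p.min())
    .reset_index()
)
print(monthly)
```

The same groupby covers SALE_COUNT, the monthly average PRICE_USD and PRICE_RANGE in one pass.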
Which collection will have a resale next month?
The following models will make use of the historic data from November 2020 until end of July 2021, and try to predict which existing collection will have a resale in August 2021. The result of the prediction is a binary outcome (0 for no resale, 1 for resale). Three models are used here — Logistic regression, Decision Tree and Random Forest.
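A minimal sketch of this classification setup, using synthetic stand-in features rather than the real Flipside data — both the feature values and the rule generating the labels are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
n = 120

# Invented stand-in features: mint duration (days) and trait count
mint_duration = rng.integers(1, 60, n)
traits_number = rng.integers(1, 40, n)
# Assumed toy rule: longer mints with more traits tend to resell
resold = ((mint_duration > 10) & (traits_number > 5)).astype(int)

X = np.column_stack([mint_duration, traits_number])
clf = LogisticRegression(max_iter=1000).fit(X, resold)

# Rows = actual class, columns = predicted class (0 = no resale, 1 = resale)
print(confusion_matrix(resold, clf.predict(X)))
```

The confusion matrix is what the resale/no-resale percentages quoted below are read from.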
The logistic regression correctly predicts 99% of the resales and 83% of the no-resales, mis-predicting 2 resales as no-resales and 1 no-resale as a resale. The predictions are pretty good! The three mis-predictions are:
The most important variables contributing to the predictions are mint duration and the number of traits a collection has.
Without having to build a very deep tree, the Decision Tree with a depth of 4 already predicts 100% of the resales and no-resales correctly. The tree is quite simple and shows that the fewer the tokens and the shorter the minting duration a collection has, the less likely the collection is to have a resale.
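A shallow tree like this can be printed as plain if/else rules. The data below is synthetic and the labelling rule is invented to mimic the finding above, not derived from the real dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 150
count_token = rng.integers(10, 1000, n)
mint_duration = rng.integers(1, 60, n)
# Invented rule echoing the finding: small, short mints rarely resell
resold = ((count_token > 300) | (mint_duration > 20)).astype(int)

X = np.column_stack([count_token, mint_duration])
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, resold)

# A depth-limited tree stays readable as a handful of threshold rules
print(export_text(tree, feature_names=["COUNT_TOKEN", "MINT_DURATION"]))
```

Capping `max_depth` is what keeps the fitted tree interpretable.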
Random Forest is similar to a Decision Tree, except that it uses an ensemble method: it draws bootstrap sub-samples to build many decision trees and aggregates their votes to better train the model. Random Forest also predicts well, with a 100% accuracy score, but the downside is that it is not easily interpretable, i.e. you cannot plot one single tree to trace the decision path because a Random Forest contains hundreds or thousands of trees.
Since a simple decision tree already achieves 100% accuracy, and Random Forest is not easy to interpret, Random Forest is only used here as a benchmark. The most important features are the same as for the Decision Tree:
- mint duration
- number of tokens a collection has
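A sketch of the Random Forest benchmark on the same style of synthetic data, with an extra pure-noise column added to show how `feature_importances_` separates informative features from uninformative ones — all names and the labelling rule are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 150
count_token = rng.integers(10, 1000, n)
mint_duration = rng.integers(1, 60, n)
noise = rng.normal(size=n)  # deliberately uninformative feature
resold = ((count_token > 300) | (mint_duration > 20)).astype(int)

X = np.column_stack([count_token, mint_duration, noise])
# max_features=None lets the noise column compete at every split
rf = RandomForestClassifier(n_estimators=200, max_features=None,
                            random_state=0).fit(X, resold)

for name, imp in zip(["COUNT_TOKEN", "MINT_DURATION", "NOISE"],
                     rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

Even though no single tree is worth plotting, the aggregated importances still give the ranking quoted above.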
In conclusion, the Decision Tree seems to be the best model to use here in terms of both performance and interpretability. Here is a summary of each model's performance:
What is the sale price of a collection next month?
Similar tree-based regression models are used to predict the price of a collection in August 2021. You might ask why time-series regressions are not used. The issue is that a time-series model usually predicts the price trend of one particular item (e.g. a single stock), in this case a single collection. We have over 100 unique collections, all with different price histories, sale volumes and traits, so each collection would need its own model to fully capture its price trend in the prediction (which is a lot of models!).
Also, the price history is quite short for each collection. With less than 1 year of history and fewer than 10 monthly data points per collection, a time-series model is unlikely to give reliable and robust results. Using daily data would increase the sample size, but not all collections have a sale every day. A time-series model often requires equal time intervals, which means the days with no sales would have to be filled in with interpolation methods.
For these reasons, and because the data contains many categorical variables, tree-based regression models seem more suitable than time-series models.
To use the time-series data in a non-time-series tree-based model, the monthly average prices and monthly sale counts are included as features alongside the static ones. If there are k unique collections, each with m static features (i.e. aspect ratio, number of traits etc.) and n time-series features (i.e. the Jan, Feb and Mar average sale prices and counts), the model is trained on a feature matrix of size k by (m+n).
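One way to build that k-by-(m+n) matrix with pandas is to pivot the monthly rows wide and join them onto the static features. The tiny frames below are invented placeholders for the real data:

```python
import pandas as pd

# Invented monthly rows: one per collection-month
monthly = pd.DataFrame({
    "collection": ["A", "A", "B", "B"],
    "YEAR_MONTH": ["2021-06", "2021-07", "2021-06", "2021-07"],
    "PRICE_USD": [210.0, 450.0, 80.0, 95.0],
    "SALE_COUNT": [2, 1, 1, 3],
})
# Invented static features: one row per collection
static = pd.DataFrame({
    "collection": ["A", "B"],
    "TRAITS_NUMBER": [12, 5],
})

# Spread the time series wide: one column per month per time-series field
wide = monthly.pivot(index="collection", columns="YEAR_MONTH",
                     values=["PRICE_USD", "SALE_COUNT"])
wide.columns = [f"{field}_{month}" for field, month in wide.columns]

# k rows by (m static + n time-series) columns
features = static.merge(wide.reset_index(), on="collection")
print(features)
```

Each row of `features` is then one training example for the tree-based regressors.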
The initial Decision Tree Regression, trained without tuning any hyper-parameters, reaches an R-squared of 100% but a tree depth of 18. Such a deep tree risks over-fitting, so tree depth is plotted against R-squared over a range of depths, and the optimal depth is chosen at the elbow point of the plot, which is 5.
The tree is then rebuilt with a maximum depth of 5. The R-squared drops slightly to 99.7%, but the tree is much shallower.
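The depth-vs-fit trade-off can be sketched on synthetic data — the target rule here is invented, loosely standing in for a price driven mostly by one dominant feature:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 5))
# Invented target: dominated by one feature, plus a little noise
y = 100 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=5, size=n)

# In-sample R-squared rises with depth; the elbow of this curve is
# where extra depth stops paying for itself
for depth in [2, 3, 4, 5, 8, 12, None]:
    reg = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    print(depth, round(reg.score(X, y), 3))
```

An unconstrained tree will always reach a near-perfect in-sample fit, which is exactly why the elbow, not the maximum, is the better choice.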
The most important features in the prediction of August sale price are:
- July’s sale price
- July’s sale number
- December’s sale number
- curation status (i.e. curated, playground, factory)
The scatter plot shows how close the predicted prices are to the actuals. A perfect prediction would form a 45-degree diagonal line. The plot shows most of the points lie on the diagonal, except for some low-price predictions.
The Gradient Boosting Regressor is similar to the Decision Tree, except that it uses an ensemble method in which each step fits a new tree to the errors of the previous steps, training the model better. Since a simple decision tree already has a very high R-squared of 99.7% and is easily interpretable, Gradient Boosting is only shown here as an alternative for benchmarking.
As expected, the Gradient Boosting Regressor improves the performance to a 99.9% R-squared, since each stage greedily corrects the errors of the stages before it. The top 3 most important features are:
- July’s sale price
- July’s sale number
- March’s sale price
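A sketch of the boosting benchmark on the same style of synthetic data — the feature roles are invented and this is not the article's actual feature matrix:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 5))
# Invented target dominated by the first feature
y = 100 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=5, size=n)

# Each stage fits a small tree to the residuals of the stages before it
gbr = GradientBoostingRegressor(n_estimators=300, max_depth=3,
                                random_state=0).fit(X, y)
print(round(gbr.score(X, y), 4))
print(gbr.feature_importances_.round(3))
```

As with Random Forest, `feature_importances_` is the interpretability fallback when no single tree can be plotted.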
The predicted-vs-actual scatter plot shows an almost perfect line, except for a couple of under-predicted outliers visible in the log-scale plot.
(1) For the prediction of whether a collection will be resold in August, all the models show that the most important features are:
- the number of tokens a collection has: the fewer the tokens, the less likely a resale
- the duration of the minting event: the shorter the duration, the less likely a resale
(2) For the prediction of the resale price in August, all 3 tree-based regression models reach a high R-squared of more than 97%. The most important feature is July's sale price.
The Decision Tree is suitable for predicting both events, (1) and (2). The model performance is good, with an accuracy score of 100% for event (1) and an R-squared of 99.7% for event (2).
Limitations & Future Improvements
Although the models perform very well, this might be because only very few collections were not resold in August. With such an imbalanced class of mostly resales, a model could simply predict the most frequent class and still get a high accuracy score. The other limitations of the models are summarised, along with the imbalanced-class issue, below:
- Imbalanced classes with far more resales than no-resales: the model tends to predict the most frequent class to achieve a high accuracy score.
- Not enough data in the no-resale category for K-fold cross-validation or a training/testing split, so the models cannot be tested on out-of-sample data.
- Insufficient sales history to build a time-series model.
- Treating monthly data as features in the model would require recalibration every month.
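The accuracy caveat in the first bullet is easy to demonstrate: on imbalanced labels, a baseline that always predicts the majority class already looks accurate. The labels below are synthetic, skewed roughly like the resale data described above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(5)
n = 200
# Synthetic labels: ~95% resales (class 1), mirroring the imbalance caveat
y = (rng.random(n) < 0.95).astype(int)
X = rng.normal(size=(n, 3))  # features carry no signal at all here

# Always predicting "resale" already scores ~95% accuracy, so accuracy
# alone overstates model quality on imbalanced data
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(accuracy_score(y, baseline.predict(X)))
```

This is why the 100% accuracy score above should be read with the class imbalance in mind.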
The above limitations can all be mitigated with more data and a longer history, which will accumulate over time.
Python code for the analysis can be found here.