Given the recent surge in NFT sales and prices, especially in generative NFTs such as those from Art Blocks, it would be interesting to see if the number of sales and the price of a collection can be predicted. This article makes use of some simple statistical & machine learning models to predict whether a collection from Art Blocks minted before August 1st 2021 will have a resale in the secondary market in August; and how much the collection will be sold for in August. Credits to Flipside Crypto for providing the data needed to build these prediction models.
In general, there are two levels of predictions you can do:
- Collection level prediction: To predict which collection will be sold in August and how much it will be sold for. For example, among all the collections / projects in Art Blocks, based on features of each collection such as the duration of the mint, the number of tokens minted or the average sale price, which collections will have sales in August and for how much.
- Token level prediction: To predict which specific token within a collection will have sales in August and how much it will be sold for. For example, within the Chromie Squiggle collection, based on the different traits and features, which specific tokens will have sales in August and for how much.
Since each collection is different and will have different token level features and traits, the token sale or price prediction needs to be built for each individual collection. This article will only focus on the collection level prediction for all Art Blocks projects.
Data
Art Blocks collection data consolidated by Flipside Crypto is transformed into two structures:
- Time-series data: contains time-dependent information about the collection, e.g. sales volume and price. This data structure is used for regression models.
- Non-time-series data: contains static, non-time-dependent information about the collection, e.g. artist name, aspect ratio, curation status etc. This data structure is used for machine learning models, e.g. Decision Tree, Random Forest, Gradient Boosting.
Data fields taken directly from the Flipside Crypto database (e.g. artist, aspect_ratio, curation_status) along with the fields I created, shown below, are used to train the predictive models.
COUNT_TOKEN: total number of tokens of the collection
DAYS_SINCE_MINT: number of days between mint date (created_at_timestamp) and September 1st 2021.
FEATURE_NUMBER: number of features of the collection
TRAITS_NUMBER: number of traits of the collection
MINT_CURRENCY: same as tx_currency
MINT_DURATION: number of minting days
AUGUST_SALE_COUNT: number of sales in August 2021 from the collection
AUGUST_SALE_PRICE: average sale price in August 2021 from the collection
YEAR_MONTH: the year and month of block_timestamp
SALE_COUNT: number of sales of the collection in the particular month
PRICE_USD: same as price_usd in nft_events table
PRICE_RANGE: difference between the minimum and maximum price_usd of the collection in the particular month
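As a rough illustration of how the monthly fields above could be derived, here is a minimal pandas sketch. The sample rows and the collection column are invented for the example, and taking PRICE_USD as the monthly mean of price_usd is my assumption; this is not the code behind the article.

```python
import pandas as pd

# Hypothetical sale events. The column names follow the fields named above,
# but the rows (and the "collection" column) are invented for illustration.
events = pd.DataFrame({
    "collection": ["A", "A", "A", "B", "B"],
    "block_timestamp": pd.to_datetime([
        "2021-06-03", "2021-06-20", "2021-07-11", "2021-07-02", "2021-07-28",
    ]),
    "price_usd": [100.0, 300.0, 250.0, 50.0, 90.0],
})

# YEAR_MONTH: the year and month of block_timestamp
events["YEAR_MONTH"] = events["block_timestamp"].dt.strftime("%Y-%m")

# One row per collection and month.
monthly = (
    events.groupby(["collection", "YEAR_MONTH"])["price_usd"]
    .agg(
        SALE_COUNT="count",                       # number of sales in the month
        PRICE_USD="mean",                         # assumption: monthly average price
        PRICE_RANGE=lambda s: s.max() - s.min(),  # max minus min price in the month
    )
    .reset_index()
)
print(monthly)
```

Each (collection, month) pair becomes one row, which is the time-series structure described earlier.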
Which collection will have a resale next month?
The following models will make use of the historic data from November 2020 until end of July 2021, and try to predict which existing collection will have a resale in August 2021. The result of the prediction is a binary outcome (0 for no resale, 1 for resale). Three models are used here — Logistic regression, Decision Tree and Random Forest.
Logistic Regression
The logistic regression correctly predicts 99% of the resales and 83% of the no-resales. It misclassifies 2 resales as no-resales and 1 no-resale as a resale. The predictions are pretty good! The three mis-predictions are:
The most important variables that contribute to the predictions are mint duration and the number of traits a collection has.
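To show how the per-class percentages and the mis-prediction counts are read off a confusion matrix, here is a small scikit-learn sketch. The data is synthetic (generated from an invented rule), so the numbers it prints are unrelated to the Art Blocks results, and the scaling step is my own choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for collection-level features (NOT the Art Blocks data).
n = 120
mint_duration = rng.integers(1, 60, size=n)
traits_number = rng.integers(1, 20, size=n)
X = np.column_stack([mint_duration, traits_number])

# Invented rule: longer mints with more traits are more likely to be resold.
score = 0.08 * mint_duration + 0.15 * traits_number - 2.0 + rng.normal(0, 1, size=n)
y = (score > 0).astype(int)  # 1 = resale, 0 = no resale

# Scale the features, then fit a logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
pred = model.predict(X)  # in-sample predictions

# Rows are actual classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y, pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print("share of resales predicted correctly:", tp / (tp + fn))
print("share of no-resales predicted correctly:", tn / (tn + fp))
print("resales predicted as no-resales:", fn)
print("no-resales predicted as resales:", fp)
```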
Decision Tree Classifier
Without having to build a very deep tree, the Decision Tree with a depth of 4 already classifies 100% of the resales and no-resales correctly. The tree looks quite simple and shows that the fewer tokens and the shorter the minting duration a collection has, the less likely it is to have a resale.
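A depth-limited tree stays readable because its rules can be printed directly. The sketch below assumes scikit-learn and uses synthetic data with an invented resale rule; the thresholds it finds say nothing about the real collections.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Synthetic stand-in features (NOT the Art Blocks data).
n = 120
count_token = rng.integers(10, 1000, size=n)
mint_duration = rng.integers(1, 60, size=n)
X = np.column_stack([count_token, mint_duration])

# Invented rule: small collections that minted quickly have no resale.
y = ((count_token > 300) | (mint_duration > 20)).astype(int)

# Cap the depth at 4, as in the article.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X, y)

# Print the learned rules as indented if/else text.
rules = export_text(tree, feature_names=["COUNT_TOKEN", "MINT_DURATION"])
print(rules)
print("in-sample accuracy:", tree.score(X, y))
```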
Random Forest Classifier
Random Forest is similar to Decision Tree, except that it is an ensemble method: it draws bootstrap sub-samples of the data, builds a separate decision tree on each, and combines their predictions. Random Forest also predicts well, with a 100% accuracy score, but the downside is that it’s not easily interpretable, i.e. you cannot plot one single tree to trace the decision path, because a Random Forest contains hundreds or even thousands of trees.
Since a simple decision tree already reaches 100% accuracy, and Random Forest is not easy to interpret, it’s only used here as a benchmark. The most important features are the same as for the Decision Tree:
- mint duration
- number of tokens a collection has
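Although a forest cannot be plotted as one tree, it still exposes a feature importance ranking. Below is a minimal sketch with scikit-learn on synthetic data; the NOISE column and the resale rule are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Synthetic stand-in features (NOT the Art Blocks data); NOISE is pure noise.
n = 200
names = ["MINT_DURATION", "COUNT_TOKEN", "NOISE"]
X = np.column_stack([
    rng.integers(1, 60, size=n),
    rng.integers(10, 1000, size=n),
    rng.normal(size=n),
])

# Invented rule using only the first two features.
y = ((X[:, 0] > 20) | (X[:, 1] > 300)).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalised by scikit-learn to sum to 1.
ranked = sorted(
    zip(names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```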
In conclusion, Decision Tree seems to be the best model to use here in terms of performance and interpretability. Here is a summary of the model performance for each model:
What is the sale price of a collection next month?
Similar tree-based regression models are used to predict the price of a collection in August 2021. You might ask why time-series regressions are not used. The issue is that a time-series model usually predicts the price trend of one particular item (e.g. a single stock), which in this case is a single collection. We have over 100 unique collections, all with different price histories, sale volumes, traits etc., so each collection would need its own model to fully incorporate its price trend in the prediction (which is a lot of models!).
Also, the history is quite short for each collection. With less than 1 year of history and fewer than 10 monthly data points per collection, a time-series model is not likely to give reliable and robust results. Using daily data would increase the sample size, but not all collections have a sale every day. A time-series model often requires equal time intervals, which means the days with no sale would have to be filled in with interpolation methods.
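To make the daily-data problem concrete, the pandas sketch below puts three invented sale prices onto an evenly spaced daily index and fills the gaps by linear interpolation. This is only an illustration of the issue, not something done in the analysis.

```python
import pandas as pd

# Invented average sale prices on the only three days with a sale.
sales = pd.Series(
    [100.0, 130.0, 160.0],
    index=pd.to_datetime(["2021-07-01", "2021-07-04", "2021-07-07"]),
)

# Put the series on an equally spaced daily grid; days without a sale become NaN.
daily = sales.asfreq("D")
print("days without a sale:", int(daily.isna().sum()))

# Fill the gaps with straight lines between the neighbouring observed prices.
filled = daily.interpolate(method="linear")
print(filled)
```

Four of the seven days end up with prices that were never observed, which is why a short, sparse history makes a time-series model hard to trust here.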
Based on these reasons, and the fact that there are a lot of categorical variables in the data, tree-based regression models seem to be more suitable than time-series models.
Decision Tree Regression
In order to use the time-series data in a non-time-series tree-based model, the monthly average prices and monthly sale numbers are used as features in the model along with the other features. If there are k unique collections and each collection has m static features (e.g. aspect ratio, number of traits etc.) and n time-series features (e.g. Jan, Feb, Mar average sale price and number), the model will be trained on a feature matrix with the size of k by (m+n).
The initial Decision Tree Regression, trained without tuning any hyper-parameters, reaches an R-squared of 100% but has a tree depth of 18. Because such a deep tree is likely over-fitted, R-squared is plotted against a range of tree depths. The optimal tree depth is chosen as the elbow point of the plot, which is 5.
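The k by (m+n) matrix can be sketched by pivoting the monthly table to one column per month and joining it to the static table. The pandas example below uses invented values with k = 3, m = 2 and n = 4; filling months without a sale with 0 is my assumption, not necessarily what the original analysis did.

```python
import pandas as pd

# Static features: one row per collection (k = 3 collections, m = 2 features).
static = pd.DataFrame({
    "collection": ["A", "B", "C"],
    "aspect_ratio": [1.0, 1.5, 1.0],
    "TRAITS_NUMBER": [5, 8, 3],
})

# Monthly features in long format (invented values).
monthly = pd.DataFrame({
    "collection": ["A", "A", "B", "C", "C"],
    "YEAR_MONTH": ["2021-06", "2021-07", "2021-07", "2021-06", "2021-07"],
    "PRICE_USD": [200.0, 250.0, 70.0, 40.0, 45.0],
    "SALE_COUNT": [2, 1, 2, 3, 4],
})

# One column per (field, month): 2 fields x 2 months gives n = 4.
wide = monthly.pivot(
    index="collection", columns="YEAR_MONTH", values=["PRICE_USD", "SALE_COUNT"]
)
wide.columns = [f"{field}_{month}" for field, month in wide.columns]

# Assumption: a month without any sale is encoded as 0.
wide = wide.fillna(0)

# Join static and monthly features on the collection index.
features = static.set_index("collection").join(wide)
print(features.shape)  # (3, 6), i.e. k by (m + n)
print(features)
```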
Now the tree is rebuilt with a constraint of maximum tree depth of 5. The R-squared drops a little bit down to 99.7% but the tree is much shorter.
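The depth search can be sketched as a simple loop over max_depth values. The example below assumes scikit-learn and uses synthetic data, so the depths and scores it prints will not match the 18, 5 or 99.7% reported here; the R-squared is in-sample, as in the article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)

# Synthetic regression data (NOT the Art Blocks data).
n = 150
X = rng.uniform(0, 10, size=(n, 3))
y = 50 * X[:, 0] + 10 * np.sin(X[:, 1]) + rng.normal(0, 5, size=n)

# An unconstrained tree grows until it reproduces the training data.
unconstrained = DecisionTreeRegressor(random_state=0).fit(X, y)
print("unconstrained depth:", unconstrained.get_depth())
print("unconstrained in-sample R-squared:", unconstrained.score(X, y))

# In-sample R-squared for a range of depth limits; the elbow is where the
# gain from one more level becomes small.
scores = {}
for depth in range(1, 11):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    scores[depth] = model.score(X, y)
    print(depth, round(scores[depth], 4))
```

Plotting scores against depth gives the elbow plot described above. With enough data, repeating the loop with a held-out set would be a stricter check than the in-sample score.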
The most important features in the prediction of August sale price are:
- July’s sale price
- July’s sale number
- December’s sale number
- curation status (i.e. curated, playground, factory)
The scatter plot shows how close the predicted price is to the actual price. A perfect prediction would form a 45-degree diagonal line. In the plot, most of the points are on the diagonal line, except for some low-price predictions.
Gradient Boosting
Gradient Boosting Regressor is similar to Decision Tree, except that it is an ensemble method: it builds trees in sequence, with each new tree learning from the errors of the previous step, in order to train the model better. Since a simple decision tree already has a very high R-squared of 99.7% and it’s easily interpretable, Gradient Boosting is only shown as an alternative choice for benchmarking here.
As expected, Gradient Boosting Regressor improves the performance to 99.9% R-squared, because each additional tree is fitted to the errors left by the trees before it. The top 3 most important features are:
- July’s sale price
- July’s sale number
- March’s sale price
The predicted price vs. actual in the scatter plot shows almost a perfect line, except for a couple of underpredicted outliers when looking at the log-scale plot.
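The step-by-step error correction can be observed by scoring the model after each added tree. The sketch below assumes scikit-learn with its default settings and uses synthetic prices generated from an invented relationship, so its scores are unrelated to the 99.9% above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)

# Synthetic stand-in features (NOT the Art Blocks data).
n = 150
july_price = rng.lognormal(mean=6, sigma=1, size=n)
july_count = rng.integers(1, 200, size=n)
X = np.column_stack([july_price, july_count])

# Invented relationship: the August price is the July price times some noise.
y = july_price * rng.lognormal(mean=0, sigma=0.2, size=n)

# Single depth-5 tree as the reference model.
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# Boosted ensemble with scikit-learn defaults (100 shallow trees).
gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

# In-sample R-squared after each boosting stage.
staged_r2 = [r2_score(y, stage_pred) for stage_pred in gbr.staged_predict(X)]

print("single tree R-squared:", tree.score(X, y))
print("boosting R-squared after 1, 10 and 100 trees:",
      staged_r2[0], staged_r2[9], staged_r2[-1])
```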
Summary Notes
(1) For the prediction of whether a collection will be resold in August, all the models show that the most important features are:
- the number of tokens a collection has: the fewer the tokens, the less likely a resale
- the duration of the minting event: the shorter the duration, the less likely a resale
The smaller number of tokens and shorter minting duration lead to a lower chance of resale in August.
(2) For the prediction of the resale price in August, all 3 tree-based regression models have a high R-squared of more than 97%. The most important feature is July’s sale price.
Decision Tree is suitable for predicting both events (1) and (2). The model performance is good, with an accuracy score of 100% for event (1) and an R-squared of 99.7% for event (2).
Limitations & Future Improvements
Although the models perform very well, this might be because only very few collections were not resold in August. With classes this imbalanced towards resales, a model could simply predict the most frequent class and still get a high accuracy score. There are also some other limitations in the models, which are summarised along with the class imbalance issue below:
- Imbalanced classes with too many resales: a model tends to predict the most frequent class to achieve a high accuracy score.
- Not enough data in the no-resale category to do K-fold cross-validation or a training/testing split, so the models cannot be tested on out-of-sample data.
- Insufficient sales history to build a time-series model.
- Treating monthly data as features in the model would require recalibration every month.
The above limitations can all be mitigated with more data and a longer history, which will become available as time goes on.
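The class imbalance point is easy to demonstrate with a baseline that always predicts the most frequent class. The class counts below (114 resold, 6 not resold) are invented for the example and are not the real figures; the sketch assumes scikit-learn.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Invented class counts: 114 collections resold (1), 6 not resold (0).
y = np.array([1] * 114 + [0] * 6)
X = np.zeros((len(y), 1))  # the baseline ignores the features entirely

# Always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)
bal_acc = balanced_accuracy_score(y, pred)
print("accuracy:", acc)               # 0.95
print("balanced accuracy:", bal_acc)  # 0.5
```

A model is only informative if it beats this kind of baseline, and balanced accuracy (the average of the per-class recalls) exposes the problem that plain accuracy hides.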
Python code for the analysis can be found here.