Survival Guide for Steam Games!

Motivation and Goal

Steam is a digital distribution platform developed by Valve Corporation for purchasing and playing video games. In 2017, Steam game sales revenue reached 4.3 billion US dollars, and the platform has roughly 22.5 million users in the US. Every year, more than 2,000 games are released on Steam. However, not all of them are as successful as PlayerUnknown's Battlegrounds, which has sold more than 50 million copies and has more than 400 million players all over the world. The average Steam game sells only 32,000 units! Thus, one big question for game developers is: can we predict the success of a game before its release?

The goal of this project is to develop a machine learning model that can predict the number of owners for Steam games. The number of owners can be considered an indicator of game sales and success.

This is a very challenging task, but it's also very interesting. I hope the analysis here is helpful and provides some insights for new video game development.

Data Analysis

With the help of an open dataset, we can first do some basic data analysis! The datasets I used are from Kaggle and were provided by Nik Davis. They were generated from the Steam Store and SteamSpy APIs, and include almost 30k games released before May 2019.

1. Owner Number

Figure 1. Owner Numbers for Different Steam Games: (Left) The distribution of owner numbers; (Right) The percentage of games with different numbers of owners

The owner number is the target of our model, and it can be considered an indicator of game sales and success. In our dataset, an owner number range is provided for each game. As shown in the figure above, the number of owners varies widely across games. Fewer than 1% of games, very successful ones such as Dota 2, Counter-Strike: Global Offensive, and PlayerUnknown's Battlegrounds, reach more than 100 million owners, while around 70% of games have fewer than 20,000 owners!

2. Explore the Features

To find the secret behind Steam game success, I explored the game type features and the numerical features.

There are three categorical features related to the game type: categories, genres, and steamspy_tag. The first two come from the Steam store and describe game categories and genres, while the last one comes from SteamSpy and is voted on by the community. Steamspy_tag is similar to genres, but it has many more types (genres has only 29 different types, while steamspy_tag has 339).

To analyze the effect of game type on game success, I checked the frequencies of categories and steamspy_tag for all games and for successful games (owner number > 1M), and drew word clouds to see whether the game type distribution changes significantly.
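The frequency comparison described above can be sketched in a few lines. This is only a minimal illustration, not the code used for the figures: the helper name tag_shares, the toy rows, and the assumption that tags are stored as one semicolon-separated string per game are all mine.

```python
from collections import Counter

def tag_shares(games, threshold=None):
    """Share of games carrying each tag, optionally only for games
    whose owner number is above `threshold`."""
    selected = [g for g in games
                if threshold is None or g["owners"] > threshold]
    if not selected:
        return {}
    # set() so a tag repeated within one game is counted once
    counts = Counter(tag for g in selected
                     for tag in set(g["tags"].split(";")))
    return {tag: n / len(selected) for tag, n in counts.items()}

# Hypothetical toy rows; the real dataset has almost 30k games.
games = [
    {"tags": "Action;Free to Play;Strategy", "owners": 150_000_000},
    {"tags": "Indie;Casual", "owners": 10_000},
    {"tags": "Indie;Adventure", "owners": 10_000},
    {"tags": "Action;FPS;Multiplayer", "owners": 35_000_000},
]

all_shares = tag_shares(games)                       # all games
hit_shares = tag_shares(games, threshold=1_000_000)  # successful games

# Tags whose share changes the most between the two groups
for tag in sorted(all_shares.keys() | hit_shares.keys()):
    print(tag, all_shares.get(tag, 0.0), hit_shares.get(tag, 0.0))
```

The resulting dictionaries map each tag to a share, which is the kind of frequency table a word cloud library would take as input.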

Figure 2. Word Cloud for Categories: (Left) Total Steam Games; (Right) Successful Games

Categories has 29 different types. Most games on Steam are single-player and have Steam achievements. This trend holds for successful games as well. The only slight difference is that a larger portion of successful games are multi-player, which makes sense since multi-player games require more players and might therefore have more owners.

Steamspy_tag has 339 different types, which can describe a game in detail. The most frequent tags include Indie, Action, Casual, Adventure, Strategy, Simulation, and Early Access. Among successful games, by contrast, the most frequent tags are Action, Free to Play, Multiplayer, FPS (First Person Shooter), Open World, and Strategy. Of these, Free to Play relates to the price of the game; Multiplayer, FPS, and Open World are more about the game setting; and Action and Strategy describe the game genre. These tags can appear at the same time. For example, the steamspy tags for Dota 2 are Action, Free to Play, and Strategy. The percentages of Indie, Casual, Adventure, Simulation, and Early Access drop for successful games. It seems that a game with these tags is riskier and less likely to succeed than a game with the Action tag.

Besides the categorical features, some features can be treated as numerical: required_age (0 means no requirement), achievements (the number of in-game achievements), percentage of positive ratings, price, release month, and release day. I also checked Spearman's rank correlation coefficient between each of them and the target.

Figure 4. Ranking Correlation between Numerical Features

Here, since the dataset only provides a range of owner numbers for each game, I used the midpoint of each range (the average of its minimum and maximum, which is also their median) to stand for the target owner number.
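As a small sketch of that conversion, assuming each range is stored as a string of the form "min-max" (the function name owner_midpoint is hypothetical):

```python
def owner_midpoint(owner_range):
    """Convert an owner range such as "20000-50000" to its midpoint.

    Assumes the range is written as two integers joined by a hyphen.
    """
    low, high = (int(part) for part in owner_range.split("-"))
    return (low + high) / 2

print(owner_midpoint("0-20000"))
print(owner_midpoint("20000-50000"))
```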

None of the numerical features has a significant correlation with the number of owners; the highest correlation, 0.15, is achieved by achievements.

Also, there is no strong correlation between any pair of numerical features; the highest coefficient, 0.19, is between price and achievements, which is still very low.

One thing to note is that the percentage of positive ratings here comes from reviews. However, it could also be obtained before the final release through reviews from pre-release or internal testing with a small group of users.
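For readers unfamiliar with Spearman's coefficient: it is the Pearson correlation computed on ranks rather than raw values, with tied values sharing the average of their ranks. The sketch below is a plain-Python illustration with made-up numbers, not the code behind Figure 4; in practice a library routine such as scipy.stats.spearmanr would be the usual choice.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy values: achievements vs. owner-range midpoints
achievements = [0, 12, 30, 5, 50]
owners = [10_000, 35_000, 10_000, 10_000, 150_000]
print(spearman(achievements, owners))
```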

Model Development

Based on the categorical and numerical features, I attempted to use machine learning methods to predict the number of owners for each game. I framed it as a classification task, where the goal is to predict which owner range a game falls into.

Besides the features mentioned above, I also generated two new features, developer_famous and publisher_famous, to capture the effect of developer/publisher reputation on game sales. (To determine whether a developer/publisher was famous before the target game's release, I computed the average owner number of that developer/publisher over its other games released before the target game's release date; if that number was in the top 25% of all developers/publishers at that time, I treated that developer/publisher as famous for the target game.)
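One way to read the rule in parentheses is sketched below for developers (publishers would work the same way). This is an illustration under my own assumptions, not the project's implementation: the function name famous_flags, the dictionary keys, and the handling of edge cases (a developer with no earlier games is never famous; at least one developer always counts as "top") are all hypothetical choices. It is also quadratic in the number of games, which is fine for a sketch but slow for 30k rows.

```python
import math
from collections import defaultdict
from datetime import date

def famous_flags(games, top_fraction=0.25):
    """For each game, flag whether its developer ranked in the top fraction
    of developers, by average owners of games released strictly earlier."""
    flags = []
    for game in games:
        # Only games released before the target game may be used,
        # so the feature never looks into the future.
        history = defaultdict(list)
        for other in games:
            if other["release_date"] < game["release_date"]:
                history[other["developer"]].append(other["owners"])
        averages = {dev: sum(v) / len(v) for dev, v in history.items()}
        own = averages.get(game["developer"])
        if own is None:  # developer has no earlier releases
            flags.append(False)
            continue
        ranked = sorted(averages.values(), reverse=True)
        n_top = max(1, math.ceil(len(ranked) * top_fraction))
        flags.append(own >= ranked[n_top - 1])
    return flags

# Hypothetical toy rows
games = [
    {"developer": "Big", "release_date": date(2015, 1, 1), "owners": 5_000_000},
    {"developer": "Small1", "release_date": date(2015, 2, 1), "owners": 10_000},
    {"developer": "Small2", "release_date": date(2015, 3, 1), "owners": 10_000},
    {"developer": "Small3", "release_date": date(2015, 4, 1), "owners": 10_000},
    {"developer": "Big", "release_date": date(2016, 1, 1), "owners": 1_000_000},
    {"developer": "Small1", "release_date": date(2016, 2, 1), "owners": 10_000},
    {"developer": "New", "release_date": date(2016, 3, 1), "owners": 10_000},
]
flags = famous_flags(games)
print(flags)
```

Only the second game from "Big" is flagged here, since at its release date "Big" has the highest average among the four developers with a track record.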

XGBoost classification is used to build the model. Since the dataset is very imbalanced, with around 70% of instances having fewer than 20k owners, I used normalized weights (sample_weight) for the different classes to emphasize errors on games with higher owner numbers. Each weight was determined from the average owner number of its class and normalized by the maximum weight over all classes. The training, validation, and test sets were generated randomly with an 8:1:1 split. To avoid overfitting, early stopping with patience=50 was applied using the validation set. To improve the final performance, ensemble models were trained with different random seeds, and the final predictions were obtained from an ensemble of 10 models.
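The weighting and splitting steps can be sketched without any ML library. The snippet below is an assumption-laden illustration: the function names, the three example classes, and the use of range midpoints as each class's "average owner number" are mine, and the actual training code may differ.

```python
import random

def class_weights(class_avg_owners):
    """Weight each class by its average owner number, scaled so the
    largest weight is 1."""
    largest = max(class_avg_owners.values())
    return {label: avg / largest for label, avg in class_avg_owners.items()}

def split_indices(n, seed=0):
    """Shuffle indices 0..n-1 and split them 8:1:1 into
    train / validation / test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

# Hypothetical classes: label -> average owner number of that range
weights = class_weights({0: 10_000, 1: 35_000, 2: 75_000})

# One weight per training instance, looked up from its class label
labels = [0, 0, 1, 2, 0]
sample_weight = [weights[y] for y in labels]

train_idx, val_idx, test_idx = split_indices(100, seed=42)
print(weights, len(train_idx), len(val_idx), len(test_idx))
```

A per-instance list like sample_weight is the kind of value that can be passed to the sample_weight argument of a classifier's fit method. Repeating the split and training with ten different seeds would give the ten ensemble members.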

1. Performance and Error Analysis

Figure 5. Performance and Error Analysis: (Left) Model performance; (Right) Confusion Matrix

I checked model performance using accuracy, TPR (true positive rate), and FPR (false positive rate). Here, the threshold used for TPR and FPR is 20k, meaning that active (positive) instances are games with more than 20k owners. Our model achieves 70% and 69% accuracy on the validation and test sets, respectively, and achieves similar performance on TPR. Thus, our model makes mistakes on around 30% of instances. Since FPR is relatively low on the test set, most of the error should come from failed predictions on active instances. To further analyze the error on our test set, I also computed the normalized confusion matrix. It shows a result similar to TPR/FPR: our model does not perform well for games with more than 20k owners (active instances, with no more than 40% accuracy). However, compared with the highest rank correlation obtained by a single numerical feature (0.15), our model still performs much better at owner number prediction.
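To make the TPR/FPR definitions concrete, here is a minimal sketch that binarizes true and predicted owner numbers at the 20k threshold. The function name and the toy values are hypothetical, and the predicted owner numbers are assumed to be the midpoints of the predicted ranges.

```python
def tpr_fpr(true_owners, predicted_owners, threshold=20_000):
    """TPR and FPR when 'active' means more than `threshold` owners."""
    actual = [o > threshold for o in true_owners]
    predicted = [o > threshold for o in predicted_owners]
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)          # active, predicted active
    fn = sum(a and not p for a, p in pairs)      # active, missed
    fp = sum(not a and p for a, p in pairs)      # inactive, predicted active
    tn = sum(not a and not p for a, p in pairs)  # inactive, predicted inactive
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical toy values (midpoints of owner ranges)
true_owners = [10_000, 10_000, 35_000, 75_000, 150_000]
pred_owners = [10_000, 35_000, 10_000, 75_000, 150_000]
tpr, fpr = tpr_fpr(true_owners, pred_owners)
print(tpr, fpr)
```

In this toy case two of the three active games are recovered and one of the two inactive games is wrongly flagged as active.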

2. Feature Importance

Feature importance is a good way to interpret the developed model. Here, I used average gain to measure feature importance, which reflects the average performance gain achieved by each feature. I only show the top 10 features. The percentage of positive ratings has the highest importance, which indicates the big effect of positive reviews. Note that the percentage of positive ratings here is obtained after game release, but it could also be obtained before the final release through pre-release or internal testing with a small group of reviewers. After the percentage of positive ratings, release day, number of achievements, price, and release month are also very important. It's quite surprising to find that the release date matters for game success. In addition, the categories and publisher reputation also have a relatively important effect on model prediction.

Summary

If you want to build a successful game, you should consider the game genre during development (try to develop an action game), set a reasonable price (free to play if you can), find a famous publisher, and try to achieve a good percentage of positive ratings from internal testing before the final release. Finally, don't forget to choose a good release date for your game!

Future Work

There is still a lot of room to improve our current model. In fact, the success of a game is mostly determined by its quality, including the animation, user experience, graphic design, story design, etc. However, I haven't included any of these features in the current analysis, and only considered external effects on the game. In future work, I will try to get more features related to quality.

Reference

All code is accessible on my GitHub: https://github.com/jenniening/Steam_Game
