
SofiaKauser/Steam-Project

Presentation: https://docs.google.com/presentation/d/1lfNAoeC02s2sYz1Dq-R_4PVU07mP-dPtWjHLV1DGSi4/edit?usp=sharing

Proposal

We began by considering what data would be relevant to a developer such as Skyhook Games, creator of simulator software such as ‘Lawn Mowing Simulator’ and ‘Train Simulator’. We found datasets on Kaggle, but also realised we could use Steam’s own API to gather recent or historic data.
We wondered whether any features of the data would enable us to recommend strategies for player engagement.
Skyhook responded and said they did not use “machine learning to assess game development”, which gave us an opportunity to come up with ideas for investigations and models we could create.
We decided to see whether machine learning could find features that model and predict player engagement, so that we could feed the results back to them.
Valve originally launched Steam to deliver and update their own creations, such as the multiplayer ‘Counter-Strike’; however, it has grown into ‘the largest digital distribution platform for PC gaming, estimated around 75% of the market share in 2013’ (https://en.wikipedia.org/wiki/Steam_(service)).

Existing data sources (secondary):

  • GetNewsForApp
  • GetGlobalAchievementPercentagesForApp
  • GetPlayerSummaries
  • GetFriendList
  • GetPlayerAchievements
  • GetUserStatsForGame
  • GetOwnedGames
  • GetRecentlyPlayedGames
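
As an illustration of how one of these methods can be queried, here is a minimal sketch for GetNewsForApp, which as far as we know can be called without an API key. It is not the code used in this project; the helper names (`news_url`, `parse_news`, `fetch_news`) and the default parameter values are assumptions made for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

STEAM_API = "https://api.steampowered.com"


def news_url(appid, count=20, maxlength=300):
    """Build a GetNewsForApp request URL for one app ID."""
    query = urlencode({"appid": appid, "count": count,
                       "maxlength": maxlength, "format": "json"})
    return f"{STEAM_API}/ISteamNews/GetNewsForApp/v2/?{query}"


def parse_news(payload):
    """Reduce the JSON response to (unix_timestamp, feed_label, title) tuples."""
    items = payload.get("appnews", {}).get("newsitems", [])
    return [(item["date"], item.get("feedlabel", ""), item["title"])
            for item in items]


def fetch_news(appid, count=20):
    """Download and parse the latest news items for one app (network call)."""
    with urlopen(news_url(appid, count), timeout=30) as response:
        return parse_news(json.load(response))
```

The per-user methods in the list (for example GetOwnedGames or GetFriendList) additionally require an API key and a public profile, so they would need a `key` parameter added to the query.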

Suggested directions

  • Linking system requirements data with sales or review scores
  • Comparing player numbers over time with news data
  • Building a social network model of users, to find patterns in reviews
  • Finding features that could be used to train a model to predict review scores and popularity, e.g. from other similar games
  • Predicting average playtime from other data

Files in this GitHub repository

Network Analysis

The purpose of this exploration was to find out whether closely related users show similar patterns of behaviour (e.g. games owned, reviews, playtime, achievements).

image

Enough (but not all) users have their profiles set to ‘public’ to make this interesting. However, having got to this point, scraping data for these users felt a long way from being able to train a machine learning algorithm on anything meaningful that we could use to help direct a developer. Here, the nodes are coloured by privacy setting, and sized by distance from the source node.
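
The distance-from-source sizing can be computed with a breadth-first search over friend lists. The sketch below is an assumption about how such a crawl might look, not the project's own code: `get_friends` is a hypothetical callable (for example a wrapper around GetFriendList) that returns a list of friend IDs, or None when the profile is private.

```python
from collections import deque


def friend_distances(source, get_friends, max_depth=2):
    """Breadth-first search over friend lists.

    get_friends is a hypothetical callable returning a list of friend IDs
    for a user, or None when the profile is private.
    Returns ({user: hops from source}, set of users found to be private).
    """
    distance = {source: 0}
    private = set()
    queue = deque([source])
    while queue:
        user = queue.popleft()
        if distance[user] >= max_depth:
            continue  # users at the depth limit are recorded but not queried
        friends = get_friends(user)
        if friends is None:
            private.add(user)
            continue
        for friend in friends:
            if friend not in distance:
                distance[friend] = distance[user] + 1
                queue.append(friend)
    return distance, private
```

Because users at the depth limit are never queried, their privacy status stays unknown; a plot would need a third colour for them.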

image

News Analysis

image

Does the ebb and flow of interest in a game depend on its presence in the media? What about community announcements, content updates, and the like? This could be undertaken as a frequency analysis of news items, or perhaps some natural language processing on the content of news items.

image

The graphs clearly share similar features, even if they do not match month for month.

image

As an initial experiment, a neural network with hyperparameter tuning reached 64% accuracy when the data was filtered down to just one game. I felt this was an encouraging start, and hoped that with more data the algorithm would learn more features and reach a score above 75%.

Raw Data:

image

Data cleaning steps:

  • Handled the data in relational database style;
  • Identified and removed testing periods from the data;
  • Cleaned out inadequate data sources;
  • Dropped initial rows, which would feature unreasonably high percentage increases since the game had only just launched;
  • Fixed data types;
  • Binned the change into binary labels: 0 (for any decrease) or 1 (for any increase);
  • Cleaned out news sources with too few contributions.
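
Two of the steps above, dropping the initial rows and the binary binning, can be sketched as follows. This is an illustrative assumption rather than the project's code: the function name, the default of three skipped rows, and the choice to label an unchanged month as 0 are all made up for the example.

```python
def label_monthly_change(avg_players, skip_initial=3):
    """Turn a chronological list of monthly average player counts into
    (percent_change, label) pairs, where label is 1 for any increase and
    0 otherwise (an unchanged month is labelled 0, which is an assumption).

    The first skip_initial changes are dropped, because the jump right
    after launch would dominate the percentage changes.
    """
    rows = []
    for previous, current in zip(avg_players, avg_players[1:]):
        if previous == 0:
            continue  # percentage change is undefined for a zero baseline
        change = 100.0 * (current - previous) / previous
        rows.append((change, 1 if change > 0 else 0))
    return rows[skip_initial:]
```

For example, `label_monthly_change([10, 100, 150, 120, 120, 180], skip_initial=1)` discards the 900% launch jump and keeps the labels for the remaining four months.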

Code for reading in data from the API and joining it to the existing dataframe:

image image
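
Since the original code is only shown as screenshots, here is a rough stdlib-only sketch of the same idea: counting news items per month and joining the counts onto monthly player figures. The function names and the dictionary layout are assumptions; the timestamps are Unix times such as the `date` field of GetNewsForApp items.

```python
from collections import Counter
from datetime import datetime, timezone


def monthly_news_counts(timestamps):
    """Count news items per (year, month) from Unix timestamps (UTC)."""
    months = (datetime.fromtimestamp(ts, tz=timezone.utc) for ts in timestamps)
    return Counter((m.year, m.month) for m in months)


def join_news_to_players(players_by_month, timestamps):
    """Left-join news counts onto monthly player figures.

    players_by_month maps (year, month) to average players; months with
    no news items get a count of 0.
    """
    counts = monthly_news_counts(timestamps)
    return [
        {"month": month, "avg_players": players,
         "news_items": counts.get(month, 0)}
        for month, players in sorted(players_by_month.items())
    ]
```

Keeping every month from the player data, even those without news, avoids silently dropping the quiet periods that the comparison depends on.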

Results

Accuracy scores for predicting whether the change was an increase or a decrease:

  • Logistic regression: 53%
  • Random forest: 52%
  • Neural network: 55%

With all three scores close to 50%, the link between monthly users and news frequency is just not strong.

I went on to experiment in other ways, smoothing the data (with a moving average or a Gaussian-style blur) and trying linear regression (disastrous…), but I did not find any higher scores.
Perhaps heavier smoothing could help to find a link, but the initial blur I tried did not improve the accuracy score.
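
For reference, the two kinds of smoothing mentioned here can be sketched in a few lines. This is a generic illustration, not the code used in the experiment; the window size, the `sigma` value and the 3-sigma truncation are assumptions.

```python
import math


def moving_average(values, window=3):
    """Trailing moving average; early points use a shorter window."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def gaussian_smooth(values, sigma=1.0):
    """Gaussian-weighted average, truncated at 3 sigma and
    renormalised at the edges of the series."""
    radius = max(1, int(3 * sigma))
    out = []
    for i in range(len(values)):
        total = weight_sum = 0.0
        for j in range(max(0, i - radius), min(len(values), i + radius + 1)):
            w = math.exp(-((j - i) ** 2) / (2 * sigma ** 2))
            total += w * values[j]
            weight_sum += w
        out.append(total / weight_sum)
    return out
```

Note that the Gaussian version looks at future months as well as past ones, whereas the trailing moving average only uses values up to the current month, which matters if the smoothed series is used as a model input.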

Perhaps finer-grained player numbers, rather than monthly averages, would help, as news items could have a momentary, day-to-day effect rather than one spread across a month.

Median Playtime Analysis

We decided to see whether any other features could drive user engagement, this time represented by the median playtime of a game.

image

As an exploratory investigation, we first plotted the various features of the data set to see whether any were linearly correlated. We were keen to look at ‘Ratings’ and ‘Price’ as markers for predicting success, as we thought a company would value this data.

We also looked at converting text data into numeric data as a means of predicting game success. For this we used the tags and categories columns, but the machine learning models trained on them scored poorly (0.03).

image
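
One common way to turn tag or category text into numeric features is multi-hot encoding, sketched below. This is an assumption about the approach, not the project's code: it presumes each game's tags are stored as a single string separated by `;`, and the function name and `min_count` parameter are invented for the example.

```python
def multi_hot(tag_strings, sep=";", min_count=1):
    """Encode separator-delimited tag strings as 0/1 columns.

    Returns (vocabulary, rows), where rows[i][k] is 1 if game i carries
    vocabulary[k]. Tags seen in fewer than min_count games are dropped.
    """
    tag_sets = [{t.strip() for t in s.split(sep) if t.strip()}
                for s in tag_strings]
    counts = {}
    for tags in tag_sets:
        for tag in tags:
            counts[tag] = counts.get(tag, 0) + 1
    vocabulary = sorted(t for t, n in counts.items() if n >= min_count)
    rows = [[1 if tag in tags else 0 for tag in vocabulary]
            for tags in tag_sets]
    return vocabulary, rows
```

Raising `min_count` trims rare tags, which keeps the number of columns manageable when there are many distinct tags.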

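Model scores for median playtime:

  • Scaled logistic regression: training score 0.78, test score 0.76
  • Scaled random forest classifier: training score 1.0, test score 0.75

The random forest’s perfect training score alongside a lower test score suggests it is overfitting the training data.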

Conclusions

The data we have been looking at are too noisy to make anything but broad predictions from.

So what makes a game popular?
What metrics keep players engaged?
Maybe it’s not something you can put a number on…

Files NOT in this GitHub repository

System requirements analysis

A lack of consistency in how system requirements are laid out made it impossible for us to build them into a reliable database.

image

Pictured: system requirements for Lawn Mowing Simulator, Counter-Strike: Global Offensive, and Vampire: The Masquerade: Bloodlines

Without a systematic structure for this information, it turned out to be too difficult to parse.

Machine Learning on User Reviews

We tried to see if user reviews could be scraped from the Steam website, and then for each reviewer, build up a picture of other games they liked and disliked (Steam has a binary review system: thumbs up or thumbs down). This would enable us to train a logistic regression algorithm that could predict whether or not a particular user would like a given game, based on their other likes and dislikes. This would then enable us to identify users likely to appreciate the game, or perhaps find other games that had a similar profile.

image

However, the data had a self-selection problem: many users only give one type of review. Furthermore, the vast number of games available through Steam means that many users will have zero games in common with each other.
At a practical level, the website was also very good at refusing calls from web-scraping algorithms, which reduced the size of the dataset we could work from.
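
Both data problems can be measured before any model is trained. The sketch below is illustrative only and assumes reviews are available as (user, game, liked) tuples, a layout chosen for the example; it counts the games shared by each pair of users and lists the users who only ever give one type of review.

```python
from itertools import combinations


def shared_game_counts(reviews):
    """reviews: iterable of (user, game, liked) with liked as True/False.

    Returns {(user_a, user_b): number of games both reviewed} for every
    pair of users, including pairs with nothing in common.
    """
    games_by_user = {}
    for user, game, _liked in reviews:
        games_by_user.setdefault(user, set()).add(game)
    return {
        (a, b): len(games_by_user[a] & games_by_user[b])
        for a, b in combinations(sorted(games_by_user), 2)
    }


def single_class_users(reviews):
    """Users whose reviews are all thumbs up or all thumbs down."""
    votes = {}
    for user, _game, liked in reviews:
        votes.setdefault(user, set()).add(bool(liked))
    return {user for user, seen in votes.items() if len(seen) == 1}
```

The pairwise count grows quadratically with the number of users, so this form is only practical for a small sample.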
