Analyzing Cyberpunk 2077 Related Tweets using Twitter Streaming API and Apache Flink
Ever since I used Apache Flink, always wanted to use it with Twitter Streaming API and for the subject, I chose Cyberpunk 2077. I’m also hyped about Cyberpunk 2077 and while waiting for the release playing with Flink looked awesome to me. Most of the necessary libraries are already present and ready in maven central.
In order to use Twitter’s Streaming API we need an application tokens. Thanks to my developer account on Twitter, I have created some apps a couple of years ago and since Twitter still supports old apps, I don’t need to apply for the new application. However, I should apply for future applications (todo for me).
First of all, I deployed Flink job manager and task manager components in a single instance using docker. To do that simply followed the link here https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/standalone/docker.html#start-a-session-cluster and created a Session Cluster. It was pretty simple and straightforward so if needed another task slot, I just need to deploy taskmanager to same network with rest of the components. So far Flink is ready and waits for our code.
I followed the Twitter Connector guide to write necessary configurations to configure the stream. https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/twitter.html
Datastream is ready and it’s retrieving the tweets however some of them are unwanted such as I don’t want to retweets and Twitter API also sends RT’s like a new Tweet with the “RT” keyword.
"RT @OriginalTweetOwner Hey @CyberPunkGame..."
, since we already have the main tweet I don’t want to add RT’s in the data.
For this blog, added the cyberpunk
and cyberpunk 2077
keywords to track field according to Twitter Stream API documentation.
I don’t want to catch any unrelated stuff so filtered out those using Flink’s FilterFunction.
Side note here Twitter also publishes deleted tweets in this Streaming API’s and the consumers have to follow this message and delete necessary tweets in their usages as well.
We have the data and we processed to our desired DTO, now we need to store and run some analysis. For storing data I used ElasticSearch and since ElasticSearch sinks are also ready from Flink packages.
ElasticsearchSink.Builder esSinkBuilder = new ElasticsearchSink.Builder<>(
HttpHost,
new ElasticsearchSinkFunction() {
public IndexRequest createIndexRequest(MyDTO element)
…
…
.filter(new FilterFunction() {
public boolean filter(TwitterData data) throws Exception {
if (data.getUser() != null) {
return true
} ..
}
})
Local build and testings looks good we just need to upload our jar into Flink. I uploaded and set necessary parameters through UI.
Let’s check data in Elasticsearch
Created some visualizations and added them to Dashboard. In the below picture we can see the last 24h data in Kibana.
Yep most used keywords besides Cyberpunk, are RTX and 3080 and when checking the tweets it seems there was a giveaway using that tags.
In the below picture, it can be seen the unique tweet senders, and it’s peaked at around 16:00 GMT because Cyberpunk 2077 preloading started at that time. We can also see the most used languages are: English, Portuguese, French and Spanish according the last 24H data.
Elasticsearch X-Pack also provides Machine Learning and Anomaly Detection features, and it has 30 day trial. The below picture shows the unique tweet senders by tweet language. It found an anomaly in the below picture
When we take a closer look by selecting the distinct users which sends English tweets we can see something happened around 17:00GMT. (PS: it was the preloading start @ Steam and Epic)