Crypto Tweets Analysis with Python

Hello everyone and welcome.

As a personal project, I decided to analyze several datasets on subjects that interest me and to share the results with you.

Among the most important topics of our time is Web3.

Cryptocurrencies are an unavoidable topic that has caused much ink to flow and stirs up strong passions. Some see decentralized finance as a way to speculate and make money, others see it as a danger to the economy, and still others see in this revolution the basis of tomorrow's finance.

I decided to analyze the debates on this topic, based on the comments published on one of the biggest debate platforms: Twitter.

The question I will try to answer is the following:

Can crypto tweets be performance indicators of a crypto-asset?

Let's start by importing the packages we will need for our study:

In [5]:
import re
import pandas as pd
import csv
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from collections import Counter
import pycountry
import numpy as np
import matplotlib.pyplot as plt
import itertools
import plotly.express as px
import warnings
import seaborn as sns
import nltk
import string
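from nltk.corpus import stopwords
from textblob import TextBlob    # used for the sentiment analysis below
from wordcloud import WordCloud  # used for the word cloud below
warnings.filterwarnings('ignore')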

Importing the dataset

The dataset I'm going to use contains 80,000 tweets containing the word "crypto". The data was scraped on August 28 and 29, 2022. Here is the link to the dataset if you want to reuse it: https://www.kaggle.com/datasets/tleonel/crypto-tweets-80k-in-eng-aug-2022

Here is our dataset:

In [6]:
import csv
df = pd.read_csv("crypto-query-tweets.csv", encoding='utf-8')

Variables description

date_time - Date and time the tweet was sent

username - The username that sent the tweet

user_location - Location entered in the account's location information on Twitter

user_description - Text added to "about" in the account

verified - Whether the user has the blue "verified by Twitter" check mark

followers_count - Number of followers.

following_count - Number of accounts followed by the person who sent the tweet

tweet_like_count - Number of people who liked the tweet

tweet_retweet_count - Number of people who retweeted the tweet

tweet_reply_count - Number of people who replied to the tweet

tweet_quote_count - Number of people who quoted the tweet

tweet_text - Text sent in the tweet

Features treatment

Let's start by analyzing the data we have to identify some indications and trends:

In [7]:
df.describe()
Out[7]:
       followers_count  following_count  tweet_like_count  tweet_retweet_count  tweet_reply_count  tweet_quote_count
count     8.000000e+04     80000.000000      80000.000000         80000.000000       80000.000000       80000.000000
mean      9.674178e+03      1187.229825          4.538963             1.489763           1.770712           0.081562
std       2.392675e+05      3333.715460         55.696321            43.490939          27.518316           3.539066
min       0.000000e+00         0.000000          0.000000             0.000000           0.000000           0.000000
25%       4.900000e+01        42.000000          0.000000             0.000000           0.000000           0.000000
50%       2.370000e+02       305.000000          0.000000             0.000000           0.000000           0.000000
75%       1.575500e+03      1390.000000          2.000000             0.000000           1.000000           0.000000
max       2.553591e+07    211354.000000       7438.000000          7459.000000        4045.000000         948.000000

This summary of our columns tells us several things. The users in our dataset have between 0 and 25,535,910 followers (mean 9,674, median 237) and follow between 0 and 211,354 accounts (mean 1,187, median 305). It also tells us that a tweet receives about 4.5 likes and 1.5 retweets on average, although the median is 0 for both, so most tweets get no engagement at all.

The table also shows that some accounts have no followers and follow no one. In my personal experience, these are often "bot" or "fake" accounts that spam Twitter. Let's check the tweets of these accounts to make sure:
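
The large gap between the mean and the median of followers_count (9,674 versus 237 in the table above) points to a heavily right-skewed distribution. Here is a minimal sketch using only the standard library and made-up follower counts (they are not taken from the dataset), showing how one very large account pulls the mean far above the median:

```python
from statistics import mean, median

# Hypothetical follower counts: nine ordinary accounts and one very large one.
followers = [0, 12, 49, 120, 237, 300, 800, 1500, 3000, 2_000_000]

avg = mean(followers)    # sensitive to the single huge account
mid = median(followers)  # middle of the sorted values, robust to outliers

print(f"mean   = {avg:,.1f}")
print(f"median = {mid:,.1f}")
```

For this kind of data, the median is usually the more representative figure for a "typical" account.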

In [8]:
# Accounts that have no followers and follow no one
suspicious_mask = (df["followers_count"] == 0) & (df["following_count"] == 0)
tweets_fake_suspicion = df.loc[suspicious_mask]
for tweet in tweets_fake_suspicion["tweet_text"].head(5):
    print(tweet)
49That spaces was so bad for #crypto. Save the drama for fiat. best crypto discord group over 80Kmember and even ha… https://t.co/qt3stOwwkk
whitelist Detailed @MorghadeG @atowerises @crypto_davy_en @Ori_oba @k06_mr @NeenPapi69 @DEREBAN1025 @Stephan56248866 @GajananSabne @dirty30dirty
Crypto-Circus Review - Overview

Scammed by Crypto-Circus? Read the full Crypto-Circus review and know the possible activities of the scam brokers and if it is safe to trade here or not. Visit
https://t.co/RPM1GLvW2F

#CryptoCircusReview #CryptoCircusScam
@Futototomoosah1 @alarm_crypto_ that seems very interesting, gonna check it out.
@onecryptogirl @crypto_inez Shitt, why don't you post this to everyone???  https://t.co/mS9WaVDVea

We can see that of the first 4 tweets posted by "suspicious" users, 3 are ads with links that redirect the user to another site. I noticed the same pattern across the whole dataset, which supports my hypothesis. I therefore decided to simply remove these users from the dataset, because I consider that they do not represent the trends and opinions on crypto.

In [9]:
# Drop the suspicious rows by index label, then any row with missing counts
df = df.drop(index=tweets_fake_suspicion.index)
df = df.dropna(subset=["followers_count", "following_count"])
df.reset_index(drop=True, inplace=True)
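
To make the filtering rule concrete, here is a minimal sketch on a toy DataFrame. The rows are made up; only the column names come from the dataset. An account is flagged when both counts are zero, and flagged rows are removed with a boolean mask:

```python
import pandas as pd

# Toy data (hypothetical rows), using the same column names as the dataset.
toy = pd.DataFrame({
    "username": ["a", "b", "c", "d"],
    "followers_count": [0, 0, 150, 2300],
    "following_count": [0, 12, 0, 410],
})

# An account is flagged only when BOTH counts are zero.
is_suspicious = (toy["followers_count"] == 0) & (toy["following_count"] == 0)

# Keep everything that is not flagged and renumber the rows.
kept = toy[~is_suspicious].reset_index(drop=True)

print(f"flagged {int(is_suspicious.sum())} of {len(toy)} accounts")
print(kept)
```

Note that accounts "b" and "c" are kept: having only one of the two counts at zero is not enough to be flagged under this rule.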

Modalities Analysis

Let's start by analyzing the "verified" column which is an indicator of user certification.

In [10]:
sns.countplot(x="verified", data=df)
#plt.bar(df,height = "verified")
#px.bar(df, x='verified', color='verified', hover_name='verified', title='Verified users')
Out[10]:
<AxesSubplot:xlabel='verified', ylabel='count'>

Among the "real" users we kept, very few are verified, which means that very few users with "influence" posted a tweet containing the word "crypto". In my opinion, this can be explained by the fact that crypto-assets are still a niche topic that arouses distrust in many people. A verified profile will therefore prefer to address more popular topics.

Let's now analyze the location of the users who posted the tweets, starting with the distribution of locations.

In [11]:
len(df['user_location'].unique())
Out[11]:
10553

We see here that there are 10553 different locations. We are going to sort these locations by decreasing order of frequency in order to see the ones that appear the most.

To avoid duplicates and errors, we will change the list of locations to standardize the locations and remove missing values:

In [12]:
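def remove_all_extra_spaces(text):
    # Collapse runs of whitespace into single spaces and trim both ends
    return " ".join(text.split())

# Upper-case every location, drop missing values (which read 'NAN' once
# converted to an upper-case string), then normalize whitespace
locs = [x.upper() for x in df['user_location'].astype(str)]
locs = [x for x in locs if x != 'NAN']
locs = [remove_all_extra_spaces(x) for x in locs]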

To avoid long calculation times, we will take the 1000 most frequent locations. Here they are below:

In [13]:
freq_location = {k: v for k, v in sorted(Counter(locs).items(), key=lambda item: item[1],reverse = True)}
first_freq_loc = dict(itertools.islice(freq_location.items(), 1000))
Locations = first_freq_loc.keys()
Count = first_freq_loc.values()
print(list(Locations)[:20])
['CARDANO OCEAN', 'METAVERSE', 'UNITED STATES', 'INDONESIA', 'BELGIUM', 'CRYPTO-ALERTS.ETH', 'NIGERIA', 'INDIA', 'UNITED KINGDOM', 'USA', 'REPUBLIC OF THE PHILIPPINES', 'LONDON, ENGLAND', 'AUSTRALIA', 'DEXIT COMMUNITY', 'BLOCKCHAIN', 'İSTANBUL', 'GLOBAL', 'LOS ANGELES, CA', 'TO THE MOON 🌕', 'SINGAPORE']
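
For reference, `Counter.most_common(n)` from the standard library returns the n most frequent items directly, sorted from most to least frequent. A minimal sketch on made-up location strings (not taken from the dataset):

```python
from collections import Counter

# Hypothetical, already normalized location strings.
sample_locs = ["USA", "INDIA", "USA", "METAVERSE", "INDIA", "USA", "NIGERIA"]

# most_common(2) returns the two most frequent values as (value, count) pairs.
top2 = Counter(sample_locs).most_common(2)
print(top2)  # [('USA', 3), ('INDIA', 2)]
```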

We notice that some locations are not real places. We must therefore match these locations, sometimes written in an incomprehensible way and sometimes simply non-existent, with a real location. To do this, we discard "prank" values such as "Cardano Ocean" or "Metaverse" and only keep locations in which a country is clearly or partially mentioned.

In [14]:
countries = []

for i in Locations:
    try:
        # Keep the best fuzzy match for this location string
        countries.append(pycountry.countries.search_fuzzy(str(i))[0])
    except LookupError:  # the location was not found in the country registry
        countries.append(np.nan)
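
Fuzzy matching may or may not resolve city-level strings such as 'LONDON, ENGLAND' or 'LOS ANGELES, CA', both of which appear in the list above. One possible complement, sketched here as an assumption rather than as part of the original analysis, is a small hand-written alias table that is consulted before the fuzzy lookup. The `ALIASES` dictionary and the `to_country` helper are hypothetical names introduced for illustration:

```python
# Hypothetical alias table: keys are normalized (upper-case) location strings.
ALIASES = {
    "USA": "United States",
    "LOS ANGELES, CA": "United States",
    "LONDON, ENGLAND": "United Kingdom",
}

def to_country(location):
    """Return the country name for a known alias, or None if it is unknown."""
    return ALIASES.get(location.strip().upper())

for loc in ["usa", "London, England", "CARDANO OCEAN"]:
    print(loc, "->", to_country(loc))
```

Unknown strings return None, so they can still be passed on to the fuzzy search or discarded.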

We can then find the countries from which the word "crypto" was tweeted the most:

In [15]:
Countryfreq = pd.DataFrame({"Country": countries, "Number of tweets": list(Count)})
Countryfreq = Countryfreq.dropna()
Countryfreq.reset_index(drop=True,inplace=True)

CountryNames = [x.name for x in Countryfreq["Country"]]
Countryfreq["CountryName"] = CountryNames

# Sum the tweet counts per country; only the numeric column is aggregated
Countryfreq = Countryfreq.groupby("CountryName", as_index = False)["Number of tweets"].sum().sort_values(by="Number of tweets",ascending = False)
Countryfreq.reset_index(drop=True,inplace=True)

Countrycodes = [pycountry.countries.get(name=x).alpha_3 for x in Countryfreq["CountryName"]]
Countryfreq["CountryCode"] = Countrycodes
Countryfreq.index+=1

Countryfreq.head(10)
Out[15]:
    CountryName     Number of tweets  CountryCode
1   United States               1853  USA
2   Indonesia                    722  IDN
3   United Kingdom               667  GBR
4   Belgium                      616  BEL
5   Philippines                  600  PHL
6   Nigeria                      493  NGA
7   India                        420  IND
8   Australia                    297  AUS
9   Turkey                       278  TUR
10  Canada                       227  CAN

Let's now try to represent these numbers on an interactive map to better understand them.

In [16]:
fig = px.scatter_geo(
    Countryfreq, locations="CountryCode",color="CountryName",
    size="Number of tweets", hover_name="CountryName",
    projection="orthographic"
)
fig.show()

European, North American and Asian countries seem to be the biggest contributors of Crypto Tweets.

Let us now analyze the total share of each country:

In [17]:
fig = px.pie(
    Countryfreq.head(50), values='Number of tweets', names='CountryName',
    hole=0.5)
fig.update_layout(height=900, title='Number of tweets by country')
fig.show()

Among the most represented countries in these tweets, the United States comes first. In addition to being the country with the most active users on Twitter (69.3 million active users according to studies by We Are Social - Hootsuite), virtual currencies represent 11% of the country's exchange volumes (Landeau report). It is therefore not surprising to find this country in pole position.

The Indonesian government explained at the end of August that it was ready to set up a cryptocurrency exchange by the end of the year, which could explain the large number of tweets on the subject of crypto on August 28 and 29 (see https://www.cointribune.com/analyses/finance-decentralisee/crypto-le-gouvernement-indonesien-va-lancer-une-bourse/).

Among the other well-represented countries in this graph, we also find the United Kingdom (equal to the United States in exchange volume), Belgium, where 10% of households have invested in cryptocurrencies, and Nigeria (400 million dollars were exchanged in cryptocurrencies in Nigeria in 2020).

The results obtained here therefore seem consistent. However, we can wonder about the low figures for countries such as China or Russia, which are large users of crypto-investments. This absence can be explained by the fact that Russia does not even appear in the top 20 countries using Twitter, and that China has banned Twitter from its territory, which makes it difficult for Chinese users to access.

Now let's analyze the tweets themselves, what do they contain? Which crypto-currency is getting the most attention? What is the overall sentiment of users on this market?

Let's first look at the words that are used. To do this, we will use NLP (Natural Language Processing) techniques: we clean each tweet (removing mentions, links and punctuation), split it into tokens, remove English stop words, and lemmatize the words that remain. A token can be a word, a character, or a sub-word (for example, the word "higher" contains 2 sub-words: "high" and "er"). Punctuation marks such as "!", "." and ";" can also be tokens. Tokenization is a fundamental step in most NLP pipelines.
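
As a quick illustration of what a tokenizer produces, here is a minimal sketch using only the standard library `re` module on a made-up sentence. It is not the tokenizer used in the rest of this notebook; it only shows the difference between word-level tokens and tokens that also include punctuation:

```python
import re

sentence = "Bitcoin is up 5% today, but ETH is down!"

# Word-level tokens: runs of letters, digits or underscores.
word_tokens = re.findall(r"\w+", sentence)

# Word and punctuation tokens: every non-space symbol becomes its own token.
all_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(word_tokens)
# ['Bitcoin', 'is', 'up', '5', 'today', 'but', 'ETH', 'is', 'down']
print(all_tokens)
# ['Bitcoin', 'is', 'up', '5', '%', 'today', ',', 'but', 'ETH', 'is', 'down', '!']
```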

In [ ]:
def clean(tweet):
    # Replace mentions, non-alphanumeric characters and URLs with spaces,
    # then collapse whitespace and lower-case the result
    return ' '.join(re.sub(r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)', " ", tweet).split()).lower()

df["tweet_text"] = df["tweet_text"].apply(clean)

# The stop word list and the lemmatizer rely on NLTK's 'stopwords' and 'wordnet' data
lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words('english'))
stop.update(string.punctuation)
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()

def furnished(text):
    # Tokenize on whitespace, drop stop words, lemmatize the remaining tokens
    final_text = []
    for i in w_tokenizer.tokenize(text):
        if i.lower() not in stop:
            word = lemmatizer.lemmatize(i)
            final_text.append(word.lower())
    return ' '.join(final_text)

df["tweet_text"] = df["tweet_text"].apply(furnished)
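
To see what the cleaning pattern does in practice, here is a self-contained copy of the same regular expression applied to two made-up tweets (`clean_demo` is a name introduced for this illustration). It also shows a side effect worth knowing: a handle that contains an underscore is only partially removed, because the mention pattern does not include `_`.

```python
import re

# Same pattern as in clean(): mentions, non-alphanumeric characters, URLs.
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

def clean_demo(tweet):
    # Replace every match with a space, collapse whitespace, lower-case.
    return " ".join(re.sub(PATTERN, " ", tweet).split()).lower()

first = clean_demo("@alice Check $ADA!! https://t.co/abc #Cardano to the moon")
second = clean_demo("@crypto_davy_en nice project")

print(first)   # check ada cardano to the moon
print(second)  # davy en nice project
```

In the second example, only "@crypto" is treated as the mention, so the fragments "davy" and "en" stay in the text as ordinary words.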

Here is the result of the tokenization of our tweets:

In [ ]:
df["tweet_text"]
Out[ ]:
0        som cyptocurrency btc eth crypto w leone new h...
1                                         inside news done
2                          crypto prof pattern similar yes
3        cardano ada whale laced transaction output 2 1...
4        guy built great project heelped havve mmore co...
                               ...                        
37486    fintech devops coding data analytics technolog...
37487    thing balance sheet crypto confiscated guess c...
37488                                           hawkk done
37489    cardano ada whale laced transaction output 4 2...
37490    cardano ada whale laced transaction output 4 4...
Name: tweet_text, Length: 37491, dtype: object

Now that we've cleaned up and tokenized our tweets, let's take a look at which words appear most in our tweets with a word cloud:

In [ ]:
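# Join all tweets into one string: str(df.tweet_text) would only contain the
# truncated preview of the Series, not the full text of every tweet
wordcloud = WordCloud(background_color = 'white').generate(' '.join(df["tweet_text"]))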

plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad=0)
plt.savefig('top_cloud.png')

plt.show()

Here we see the words that appear the most in the tweets: the more often a word appears, the larger it is drawn.

We notice some cryptocurrencies such as ETH (Ethereum) or BTC (Bitcoin). This is to be expected, since these 2 coins are the best known and the largest in terms of market capitalization.

However, we notice that the word that appears the most is Cardano. Cardano is an open source blockchain, as well as a platform for executing smart contracts. Cardano's internal cryptocurrency is called ADA (a word also present in the word cloud). This project is led by Charles Hoskinson, a co-founder of Ethereum. We can thus see that this cryptocurrency was attracting a lot of attention, more so than larger cryptocurrencies such as ETH (Ethereum) or BTC (Bitcoin). Let's try to determine why.

Here we see the price of ADA between August 25 and 29. The price fell by about 3.48% between these 2 dates. Cardano, like the entire crypto market, is extremely volatile: a drop of 3.48% does not warrant panic among ADA holders, nor a prominent place in the crypto debate.

[Figure: ADA/USDT price chart (ADAUSDT_2022-10-10_16-34-53.png)]

So, what could justify such an effervescence around this project?

Thanks to the TextBlob package, let's take a look at the overall sentiment of these tweets in order to learn a little more:

In [ ]:
def get_tweet_sentiment(tweet):
    # TextBlob polarity lies in [-1, 1]; its sign decides the label
    polarity = TextBlob(clean(tweet)).sentiment.polarity
    if polarity > 0:
        return 'positive'
    elif polarity == 0:
        return 'neutral'
    else:
        return 'negative'

df['sentiment'] = df['tweet_text'].apply(get_tweet_sentiment)
df['sentiment'].value_counts()
Out[ ]:
neutral     19404
positive    13070
negative     5017
Name: sentiment, dtype: int64
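
The labelling rule used here treats any non-zero polarity as positive or negative. A variant sometimes used is a small neutral band around zero, so that very weak polarities are not over-interpreted. The sketch below is a hypothetical `label_polarity` helper with an assumed `threshold` parameter; it is not the rule that produced the counts above, although with `threshold=0.0` it reduces to it:

```python
def label_polarity(polarity, threshold=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    Scores whose absolute value is at most `threshold` are labelled neutral.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# With the default threshold, only an exact 0 is neutral.
print(label_polarity(0.05))                  # positive
# With a band of 0.1, weak scores become neutral.
print(label_polarity(0.05, threshold=0.1))   # neutral
print(label_polarity(-0.4, threshold=0.1))   # negative
```

The value of the threshold is a modelling choice and would need to be justified on the data.
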
In [ ]:
sns.countplot(x="sentiment", data=df)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f23bac2c590>

If we look at all the tweets, we notice that a large part of them are neutral, but that there are still many more positive tweets than negative ones. Let's now look at the tweets containing the word "cardano":

In [ ]:
Cardanodf = df[df['tweet_text'].str.contains('cardano')]
Cardanodf['sentiment'].value_counts()
Out[ ]:
negative    902
positive    108
neutral      70
Name: sentiment, dtype: int64
In [ ]:
sns.countplot(x="sentiment", data=Cardanodf)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f23bae5add0>

When we isolate the tweets containing the word "Cardano", we realize that an overwhelming majority of them are negative. This is consistent with the falling price of Cardano on these dates.

Google News allows us to filter news by publication date. Going back to the news of August 26, 27 and 28, we find the following: Cardano founder Charles Hoskinson said in a video released on August 26 that things were moving very quickly and that the Vasil hard fork would most likely happen in September. In the cryptocurrency world, a hard fork is a branching of the blockchain caused by a change in the consensus rules. By extension, the term is also used to refer to any non-backward-compatible change in the protocol that could cause a permanent split of the chain. In layman's terms, and to simplify, it is a copy of a blockchain with different parameters.

Following this announcement, the price of Cardano could have risen, but it continued to fall until the 29th. Taken together with the overall sentiment of the tweets, does this mean that Cardano users are not convinced by this fork? It is in any case an avenue worth exploring.
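
To put numbers on this contrast, the counts printed above can be converted into percentages. This is a small worked calculation using only the figures from the two `value_counts()` outputs; the `shares` helper is a name introduced for this illustration:

```python
# Counts copied from the two value_counts() outputs above.
overall = {"neutral": 19404, "positive": 13070, "negative": 5017}
cardano = {"negative": 902, "positive": 108, "neutral": 70}

def shares(counts):
    """Return each label's share of the total, in percent, rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

print(shares(overall))  # {'neutral': 51.8, 'positive': 34.9, 'negative': 13.4}
print(shares(cardano))  # {'negative': 83.5, 'positive': 10.0, 'neutral': 6.5}
```

About 13.4% of all tweets are labelled negative, against about 83.5% of the tweets that mention Cardano.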

Conclusion

A well-known adage in the crypto world says "Buy the rumor, sell the news". Twitter, with its large number of daily users, seems to be the perfect tool to flush out rumors and underground trading.

We have seen with our example that the price movement of a crypto-asset can match the overall sentiment on a debating platform like Twitter.

However, there are many parameters to take into account when doing this kind of analysis, because the cryptocurrency market remains extremely volatile and calls for some caution.

By extension, could we apply the same principle to other financial markets? This could be the subject of another study.

Thank you for your attention,

Adam