Kyle's Blog


Can Social Media Predict Asset Price Movement?

TradR.fun is an open source model and trading application that produces buy and sell signals based on social media activity and sentiment.

Overview of TradR

This project has 5 parts, with the final result being TradR.fun, the front-end app where users can see hourly updated buy and sell signals.

1. Web scraper / API — gather real-time social media and price data.

2. Feature engineering — what data to use and what to predict.

3. Build a model — maximize buy/sell signal accuracy.

4. Trade — make real-time trades.

5. App — real-time interactive app for users.




Program Control

Several modules are controlled from "Main_Control_Script.py". Once per hour the following are run:


RedditScraper.py
PriceApp.py
Analysis.py
/TradeMaker/Trade_Script.py
from github import Github
from time import sleep
import pandas as pd
import subprocess
import datetime
import glob
import os
import csv

count = 0

# How many times to run the loop (once per hour)
while count < 7:
    count = count + 1
    exec(open("RedditScraper.py").read())

    sleep(2)
# Get the latest scraper output, append it to the main file, and delete the old one
    list_of_files = glob.glob('./*.csv')  # all CSVs in the working directory
    latest_file = max(list_of_files, key=os.path.getctime)
    latestFile = pd.read_csv(latest_file, encoding='utf-8')
    latestFile = latestFile.iloc[0:1]
    latestFile.Hour = datetime.datetime.now()  # pd.datetime is deprecated; use the datetime module
    mainFile = pd.read_csv('./ScrappedData/ScrappedReddit.csv', encoding='utf-8')
    os.remove("./ScrappedData/ScrappedReddit.csv")
    
    sleep(2)

    latestFileNew = latestFile[mainFile.columns]
    out = pd.concat([mainFile, latestFileNew])
    out.to_csv('./ScrappedData/ScrappedReddit.csv', index=False)

    sleep(2)

# Run price scraper
    exec(open("PriceApp.py").read())

    sleep(2)

# Run analysis and make predictions
    exec(open("Analysis.py").read())

    sleep(2)


# Upload data to GitHub
    # Set user info (personal access token redacted; substitute your own)
    g = Github("YOUR-GITHUB-TOKEN")
    repo = g.get_user().get_repo('tradr')
    all_files = []
    contents = repo.get_contents("")
    while contents:
        file_content = contents.pop(0)
        if file_content.type == "dir":
            contents.extend(repo.get_contents(file_content.path))
        else:
            all_files.append(file_content.path)  # e.g. 'Data/PriceGrabber/PriceData.csv'


    # Upload Price Data
    with open('./PriceData.csv', 'r') as file:
        content = file.read()

    
    git_prefix = 'Data/PriceGrabber/'
    git_file = git_prefix + 'PriceData.csv'

    if git_file in all_files:
        contents = repo.get_contents(git_file)
        repo.update_file(contents.path, "committing files", content, contents.sha, branch="main")
        print(git_file + ' UPDATED')
    else:
        repo.create_file(git_file, "committing files", content, branch="main")
        print(git_file + ' CREATED')
        
    # Scrapped Reddit
    with open('./ScrappedData/ScrappedReddit.csv', 'r') as file:
        content = file.read()

    # Upload Scrapped Reddit Data
    git_prefix = 'Data/ScrappedData/'
    git_file = git_prefix + 'ScrappedReddit.csv'

    if git_file in all_files:
        contents = repo.get_contents(git_file)
        repo.update_file(contents.path, "committing files", content, contents.sha, branch="main")
        print(git_file + ' UPDATED')
    else:
        repo.create_file(git_file, "committing files", content, branch="main")
        print(git_file + ' CREATED')

    sleep(2)

# Make trades using SignalInput.csv

    main_path = '/home/iii/MEGA/NYC Data/tradr/Data/TradeMaker/'
    python_path = f"{main_path}venv/bin/python3"
    args = [python_path, f"{main_path}trade_script.py",
    "/home/iii/MEGA/NYC Data/tradr/Data/SignalInput.csv"]
    process_info = subprocess.run(args)
    print(process_info.returncode)

# Wait one hour (or whatever)
    sleep(3600)



The Pyramid of Death





1. Scraping Reddit

Selenium is used for scraping because it gives more flexibility and avoids the API's restrictions, though the official Reddit API could be tried as well.
The cryptocurrency community is neatly separated into forums by asset, which gives a more granular view. Because cryptocurrency is traded 24/7 and the forums are very active, a lot of data is available.
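
As a rough illustration, the hourly grab can be done with a few lines of Selenium. This is a minimal sketch, not the project's actual scraper: the use of old.reddit.com and the CSS selectors are assumptions, since Reddit's markup changes often.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://old.reddit.com/r/Bitcoin/")

# Post titles on the front page (selector is illustrative)
titles = [e.text for e in driver.find_elements(By.CSS_SELECTOR, "a.title")]

# The sidebar's "users here now" counter (selector is illustrative)
users_online = driver.find_element(By.CSS_SELECTOR, "p.users-online span.number").text

driver.quit()
print(len(titles), "posts;", users_online, "users online")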

Data scraped hourly:

All comment text
Current number of users
Number of posts in the last hour
Number of comments in the last hour
Number of votes in the last hour
Hourly price and volume data from the Nomics.com API

For the following assets/subreddits:

BTC, BCH, ETH, XMR, DASH
r/bitcoin, r/btc, r/ethereum, r/ethtrader, r/ethfinance, r/monero, r/xmrtrader

Price data is grabbed from the Nomics.com free API.
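
For illustration, a request for hourly candles might look like the sketch below. The endpoint and parameter names are assumptions based on the public Nomics docs and may differ; the key is a placeholder.

import requests

API_KEY = "YOUR-NOMICS-KEY"  # placeholder; free keys are issued by Nomics
resp = requests.get(
    "https://api.nomics.com/v1/candles",  # assumed endpoint; check the docs
    params={"key": API_KEY, "interval": "1h", "currency": "BTC"},
    timeout=10,
)
resp.raise_for_status()
candles = resp.json()  # expected: list of dicts with timestamp/open/high/low/close/volume
print(candles[-1])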

Code for the Reddit scraper is found here and for the price scraper here.

Features
Altogether, the following features are used:

Hour of day
Day of week
Number of users/hour
Number of posts/hour
Comments/hour
15 most significant words
NLP on comments
The target for the training data is the sign (+/-) of the percent price change over the next hour.
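
A minimal sketch of how such a feature table and target could be built with pandas and scikit-learn. The column names are hypothetical, assuming the scraped data and hourly close price have already been merged into one DataFrame.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical merged hourly frame with 'Hour', 'users', 'posts', 'comments',
# 'text' (all comment text) and 'close' (hourly close price) columns
df = pd.read_csv("merged_hourly.csv", parse_dates=["Hour"])

df["hour_of_day"] = df["Hour"].dt.hour
df["day_of_week"] = df["Hour"].dt.dayofweek

# Counts of the 15 most frequent words stand in here for "15 most significant words"
vec = CountVectorizer(max_features=15, stop_words="english")
words = pd.DataFrame(
    vec.fit_transform(df["text"].fillna("")).toarray(),
    columns=vec.get_feature_names_out(),
    index=df.index,
)
df = pd.concat([df, words], axis=1)

# Target: 1 if price rises over the next hour, 0 if it falls
df["target"] = (df["close"].pct_change().shift(-1) > 0).astype(int)
df = df.iloc[:-1]  # last row has no next-hour target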


Analysis
Random forest models were used both to identify the most significant words in the comments and to classify the +/- price signal from all the features above. Full code is found here. The output is an update to the file "SignalInput.csv", where 1 = Buy and 0 = Sell for each subreddit and asset.
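
A minimal sketch of the classification step, continuing from the hypothetical feature frame above; the hyperparameters are illustrative, not the project's tuned values.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

feature_cols = [c for c in df.columns if c not in ("Hour", "text", "close", "target")]
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["target"], shuffle=False, test_size=0.2  # no shuffle: time-ordered data
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))

# Predict on the latest hour and write the 1=Buy / 0=Sell signal out
signal = int(clf.predict(df[feature_cols].tail(1))[0])
pd.DataFrame({"signal": [signal]}).to_csv("SignalInput.csv", index=False)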


Trade

The target output of the analysis is an hourly updated array of buy/sell signals. After the user has put their API key in the config.py file, the signals are read in by "trade_script.py" and trades are made on Binance once per hour.

The basic algorithm for purchases and sales is: on a buy signal, 20% of the account balance is spent on that asset; on a sell signal, the entire holding of that asset is sold.
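
A hedged sketch of that rule using the python-binance client; the symbol naming and the use of USDT pairs are assumptions, and the real trade_script.py may differ.

from binance.client import Client

API_KEY, API_SECRET = "YOUR-KEY", "YOUR-SECRET"  # the real script reads these from config.py
client = Client(API_KEY, API_SECRET)

def act_on_signal(signal, asset="BTC", quote="USDT"):
    symbol = asset + quote
    if signal == 1:
        # Buy: spend 20% of the free quote-currency balance on the asset
        free_quote = float(client.get_asset_balance(asset=quote)["free"])
        client.create_order(
            symbol=symbol, side="BUY", type="MARKET",
            quoteOrderQty=round(free_quote * 0.2, 2),
        )
    else:
        # Sell: liquidate the entire position
        # (in practice the quantity must be rounded to the symbol's lot size)
        free_asset = float(client.get_asset_balance(asset=asset)["free"])
        if free_asset > 0:
            client.order_market_sell(symbol=symbol, quantity=free_asset)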


App
At https://TradR.fun a user can choose which features to include and rerun the model to try to get the best possible score and predictions.


Results
After about one month, performance is flat. The algorithm is very conservative and spends most of its time sitting in USD. Sell signals are much more common; only 10–20% of the time does the algorithm make a purchase.


Future Work
Continue to test accuracy with new data.

Tweak features.

Optimize the number of words.

Try time series models with price data included.

Use a sliding window to test at what time interval signals are most accurate (see the sketch after this list).

Add the ability for users to trade directly on TradR.fun.
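
For the sliding-window item, here is a sketch of what that test could look like: retrain on a trailing window of hours and score one step ahead at several horizons. This is a proposal, not existing project code; it reuses the hypothetical feature frame from the Features section.

from sklearn.ensemble import RandomForestClassifier

def sliding_window_accuracy(df, feature_cols, horizon=1, window=24 * 14):
    """Retrain on the trailing `window` hours, predict `horizon` hours ahead."""
    df = df.copy()
    df["target"] = (df["close"].pct_change(horizon).shift(-horizon) > 0).astype(int)
    df = df.iloc[:-horizon]
    hits = []
    for end in range(window, len(df)):
        train, test = df.iloc[end - window:end], df.iloc[end:end + 1]
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(train[feature_cols], train["target"])
        hits.append(clf.predict(test[feature_cols])[0] == test["target"].iloc[0])
    return sum(hits) / len(hits)

# Compare horizons, e.g. 1h, 4h, 24h, to see where signals are most accurate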

Thanks for reading and please get in touch at https://KyleBenzle.com or on Twitter.