Since 2014, I’ve been actively participating in a quirky little internet community centered around discussion of religious and philosophical topics, which internally refers to itself as “The Great Debate Community” (henceforth known as the “GDC”). I’ve made a number of good friends through this community–including my partner, Jakki–and have also met quite a few of them in real life, so the community has come to mean a lot to me over the years. Historically, the GDC has been primarily centered around three platforms: YouTube, Google+, and Google Hangouts. When it was announced in October of 2018 that Google+ would be shutting down the following year, I began working on a way to archive G+ profiles and comment threads, so as to preserve as much community history as possible which would otherwise be lost once the site shut down for good.
The end result of that project was Plustractor, which I began work on in November 2018 and had working to the point of being usable by early February. About halfway through the development process, Google moved up the date for the Google+ shutdown from August to April, with the APIs (which Plustractor relied upon) shutting down on March 7th, so it really ended up being a race against the clock to get it finished in time. The basic function of Plustractor was to use the Google+ API to scrape data about three kinds of G+ objects: people (user profiles), activities (posts and shares), and comments, and save them to a local SQLite database. I could use it to grab anything from the basic profile data of a single user up to every single post and comment under an entire list of profiles.
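To give a rough idea of what "save them to a local SQLite database" can look like, here is a minimal sketch using Python's built-in `sqlite3` module. This is not Plustractor's actual code or schema: the table layout, the column names, and the helper functions `open_archive`, `is_saved`, `save_object`, and `count` are all assumptions made up for illustration. The one real design point it shows is keying each table on the object's ID, so that saving the same object twice is harmless.

```python
import sqlite3

# Hypothetical schema: one table per kind of G+ object, keyed on the object's ID.
# These column names are illustrative assumptions, not the real archive layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS people (
    id TEXT PRIMARY KEY,
    display_name TEXT
);
CREATE TABLE IF NOT EXISTS activities (
    id TEXT PRIMARY KEY,
    actor_id TEXT NOT NULL,
    content TEXT
);
CREATE TABLE IF NOT EXISTS comments (
    id TEXT PRIMARY KEY,
    activity_id TEXT NOT NULL,
    actor_id TEXT NOT NULL,
    content TEXT
);
"""

# Columns accepted for each table; also serves as a whitelist of table names.
COLUMNS = {
    "people": ("id", "display_name"),
    "activities": ("id", "actor_id", "content"),
    "comments": ("id", "activity_id", "actor_id", "content"),
}


def open_archive(path=":memory:"):
    """Open (or create) the archive database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def is_saved(conn, table, object_id):
    """Return True if an object with this ID is already in the given table."""
    if table not in COLUMNS:
        raise ValueError(f"unknown table: {table}")
    row = conn.execute(
        f"SELECT 1 FROM {table} WHERE id = ?", (object_id,)
    ).fetchone()
    return row is not None


def save_object(conn, table, obj):
    """Insert one object; an ID that is already present is silently skipped."""
    if table not in COLUMNS:
        raise ValueError(f"unknown table: {table}")
    cols = COLUMNS[table]
    placeholders = ", ".join("?" for _ in cols)
    conn.execute(
        f"INSERT OR IGNORE INTO {table} ({', '.join(cols)}) VALUES ({placeholders})",
        tuple(obj[c] for c in cols),
    )
    conn.commit()


def count(conn, table):
    """Return the number of rows stored in the given table."""
    if table not in COLUMNS:
        raise ValueError(f"unknown table: {table}")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

With something like this, `save_object(conn, "people", {"id": "p1", "display_name": "Alice"})` can be called as many times as you like and the `people` table still ends up with a single row for `p1`.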
I constructed a list of 600 or so profiles of interest, mainly based on my own circles, those of a few other people, and the GDC wiki (which is no longer available as it was deleted for ToS violations…hmm, I wonder how that happened). I sorted these by priority, starting with major figures in the community, those who were most active on G+, those who ran regular hangouts, and community-relevant YouTubers with big channels who were frequent targets for discussion or drama…then going all the way down to lurkers and minor figures who were no longer active. Since the GDC is a fairly loose social network with no strict definition, I tried to cast as wide a net as possible. A few people did slip through the cracks, much to my frustration (indeed, I was scraping profiles right up until the day the APIs stopped working), but all in all I’m pretty happy with what I got.
The final stats for the archive are as follows: 598 scraped profiles (those that were on my list and had at least one post on G+), 52,044 people objects, 463,482 activity objects, and 1,101,232 comment objects, all contained in a database that weighed in at 4.1 GB uncompressed. The list of scraped profiles can be viewed here in this Google Sheet.
Now that Google+ has come to an end, I will be analyzing this dataset to the best of my abilities, and will be using this blog to document my findings and post any interesting data visualizations that I make along the way. I really think this project could be a goldmine of information about the community and its members, in addition to being a great learning experience for myself, as I want to pursue a career in data science.
Further down the road, I want to design some kind of web interface that will allow people to view and search through the archive data, in a format similar to Google+ (but better and more compact, because let’s face it, Google+ was a bit of a trainwreck interface-wise). At this time I will not be releasing the full database to the public, but if there’s something you’re looking for that was in a thread posted by one of the people on the list, let me know. If there’s a particular query you’d like me to run or you have suggestions for avenues of research, PLEASE do get in contact with me as I’d love to talk with you. I have a number of ideas already, but I’m a novice when it comes to analyzing social networks…suggestions from someone with expertise in this area would be awesome.
Thanks for reading…much more fun stuff to come. The best ways to get in contact with me are as follows:
Email: kdbuchik [a t] gmail dot com
Skype: kevbuc13
Discord: Kevin Bee#3332
A quick addendum about Plustractor itself: this was definitely one of the most challenging and involved programming projects I’ve ever done…over 1100 lines of actual Python code by the end, written over about 5 months of intermittent work. While nothing Plustractor did was all that complicated on a technical level, there were a lot of fiddly details to consider: making sure I wasn’t re-scraping posts or comments I’d already saved, picking up new posts and comments from profiles that I HAD already scraped, interfacing with Google Sheets and updating my list of profiles, etc. Google itself also threw a lot of obnoxious obstacles my way…around February, the API began throwing intermittent errors (and not just one type of error either) when requesting valid resources, and I needed to make patches to account for this and keep re-trying requests until they worked. Trust me when I say that the shutdown of Google+ was a LOT uglier than it appeared to the average user, with different things breaking on the back-end long before the actual site was shut down in April. (The fact that I was up against a hard deadline of March 7th didn’t make any of this easier.)
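For anyone curious what "keep re-trying requests until they worked" and "don't re-scrape what's already saved" might look like in practice, here is a small sketch. It is not the actual Plustractor code. `TransientAPIError`, `fetch_with_retry`, and `scrape_new` are hypothetical names invented for this example, and the `fetch` callables stand in for whatever real API request was being made. One deliberate difference: this version gives up after a fixed number of attempts and waits longer between each one (exponential backoff), rather than looping forever.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for the intermittent errors the API returned."""


def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    fetch is any zero-argument callable that performs one request.
    The delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay...
    After max_attempts failures the last error is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except TransientAPIError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


def scrape_new(object_ids, already_saved, fetch_one, **retry_options):
    """Fetch only the objects whose IDs are not in already_saved.

    already_saved is a set of IDs that gets updated as new objects arrive,
    so running the same scrape twice does not request anything twice.
    """
    fetched = []
    for object_id in object_ids:
        if object_id in already_saved:
            continue  # skip anything we already have
        result = fetch_with_retry(lambda: fetch_one(object_id), **retry_options)
        fetched.append(result)
        already_saved.add(object_id)
    return fetched
```

The `sleep` parameter is only there so the waiting can be swapped out (for example, with a function that does nothing) when trying the logic without real delays.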
So here’s a giant middle finger to the Mountain View Monstrosity for making my life miserable for about 4 months. I can say with confidence that the learning experience of solving a pretty big real-world problem was definitely worth it, though.