What information does the database contain?

The Google+ API allows (well, allowed) access to three types of objects: People, Activities, and Comments. (Click the links to see the API documentation for what each object contains.) Plustractor’s database had one table for each of these objects, with columns for most of the top-level properties returned by the API, plus a few additional columns for useful data that was buried in JSON blobs, to make it easier to access. The list of these columns can be found here, though a slightly different schema will be used in analysis: I wrote a script to clean up and reorganize the original database, rename a few columns, delete the ones that turned out to be empty (thanks Google!), and import everything into MySQL. The schema for that database (with full column descriptions) will be put up here once I get a chance to write it, as that’s the format in which I’ll eventually be releasing the dataset.
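To make the "one table per object, plus a few columns pulled out of JSON blobs" idea concrete, here is a minimal sketch. It is not Plustractor's actual schema: the table layout, the column names, the helper names `create_db` and `store_person`, and the use of an in-memory SQLite database are all assumptions for illustration, and the `image` / `url` keys in the sample record are placeholders for whatever nested field is being flattened.

```python
import json
import sqlite3

# Hypothetical layout: one table per object type, each keeping the raw
# JSON alongside a handful of extracted columns. Column names are
# illustrative only, not the real Plustractor schema.
SCHEMA = """
CREATE TABLE people (
    id TEXT PRIMARY KEY,
    display_name TEXT,
    image_url TEXT,      -- pulled out of a nested JSON object
    raw_json TEXT
);
CREATE TABLE activities (
    id TEXT PRIMARY KEY,
    author_id TEXT,
    published TEXT,
    raw_json TEXT
);
CREATE TABLE comments (
    id TEXT PRIMARY KEY,
    activity_id TEXT,
    author_id TEXT,
    published TEXT,
    raw_json TEXT
);
"""


def create_db():
    """Create an in-memory database with the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def store_person(conn, person):
    """Insert one People record, flattening the nested image URL."""
    image_url = (person.get("image") or {}).get("url")
    conn.execute(
        "INSERT OR REPLACE INTO people VALUES (?, ?, ?, ?)",
        (person["id"], person.get("displayName"), image_url, json.dumps(person)),
    )
```

Keeping the raw JSON next to the extracted columns means nothing is lost if a useful field turns out to have been overlooked, at the cost of a larger database.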

So, here are the specifics on what was actually scraped:

  • People records for each of the profiles in this spreadsheet
  • People records for a few additional profiles that I had on my list but that didn’t have any posts (mainly alts/sock accounts that I had circled)
  • Activity records for every public post or share created by any of those profiles (meaning shared with Public, in a public collection, or to a publicly-accessible group)
  • Comment records for every comment under those activities
  • People records for every author of one of those comments
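The order of operations implied by that list can be sketched roughly as follows. This is an illustration rather than the tool's real code: `crawl` and the three fetcher callables are hypothetical stand-ins for the API calls, and the `id` and `author_id` keys are assumed record fields.

```python
def crawl(seed_ids, fetch_person, fetch_public_activities, fetch_comments):
    """Collect records in the order described above.

    The three fetch_* arguments are hypothetical callables standing in
    for the (now defunct) API requests.
    """
    people, activities, comments = {}, {}, {}
    for pid in seed_ids:
        # People record for every profile on the list, even with no posts.
        people[pid] = fetch_person(pid)
        for act in fetch_public_activities(pid):
            activities[act["id"]] = act
            for com in fetch_comments(act["id"]):
                comments[com["id"]] = com
                author = com["author_id"]
                # Commenters get a People record only; their own posts
                # are not crawled unless they are also on the seed list.
                if author not in people:
                    people[author] = fetch_person(author)
    return people, activities, comments
```

Note that the crawl is only one level deep: a commenter's profile is recorded, but the crawl does not go on to follow their activities.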

In case that wasn’t clear, every single bit of profile data, post, and comment in my database was entirely public at the time of scraping. I had intended to implement the ability to scrape profiles using an OAuth token to get posts that were only shared with circles, etc., but there just wasn’t enough time (again, THANKS GOOGLE).

Besides non-public posts, the other thing that was not scraped was any pictures, video, or other media uploaded to Google+ by users. The API simply didn’t provide any way to get this content, and even if it had, including it would have made the database far too large. One exception was profile and cover photos: these are stored on Google’s servers as their own separate files, and their URLs were included in the API responses. I programmed my tool to download these files for each of the profiles on my list (but only for them; I was originally going to download profile and cover photos for every profile I touched, including commenters, but this ended up taking way too much space). Unfortunately, after the project was complete, I found that far fewer images had been downloaded than there should have been. All of the cover photos are gone now (as they were tied directly to Google+), but profile pictures seem to remain, so I’m going to make another go at grabbing as many of those as I can store.
