Learning about Machine Learning (and Artificial Intelligence): part 2 of n
Background: I’m definitely no expert in machine learning (ML) or artificial intelligence (AI), and I have no formal education in either, but I’m fascinated by the technologies. I’ve been learning about them and building apps with them on my own for the past few years.
In this series of posts, I’ll explain what I know, how I got to where I am, how I’ve used them in my projects, and chart a course for my future with them. Hopefully this process will push me to learn more and may help those who later head down this path. This post is a very quick introduction to the science and a brief overview of my current ML project.
The first post in this series quickly outlined how I started down this path.
In this post, I’ll outline the general architecture of my current approach to using ML to politically categorize Twitter users.
ML can be employed to help solve a few different types of problems. From Google’s Introduction to Machine Learning Problem Framing:
In basic terms, ML is the process of training a piece of software, called a model, to make useful predictions using a data set. This predictive model can then serve up predictions about previously unseen data. We use these predictions to take action in a product; for example, the system predicts that a user will like a certain video, so the system recommends that video to the user.
Often, people talk about ML as having two paradigms, supervised and unsupervised learning. However, it is more accurate to describe ML problems as falling along a spectrum of supervision between supervised and unsupervised learning.
Supervised and unsupervised learning describe the method used to train the ML model. In supervised learning, a model is trained by feeding it examples of how a set of inputs (or “features”) maps to a set of outputs. In unsupervised learning, the ML model learns on its own by detecting patterns in the data.
For my project, I am using supervised learning. Specifically, I train the ML model by feeding it Twitter user data (the input) and what I believe is the user’s political category, “left”, “right” or “neutral” (the output).
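To make that supervised setup concrete, here’s a toy sketch: each training example pairs a user’s input counts with the label I assigned by hand. The feature names and the tiny nearest-neighbour predictor are purely illustrative, not my actual model:

```python
# Toy supervised-learning setup: hand-labelled (features, label) pairs.
# Feature names and counts are invented for illustration.
training_data = [
    # ({input: count, ...}, hand-assigned political category)
    ({"#resist": 3, "#maga": 0}, "left"),
    ({"#resist": 0, "#maga": 5}, "right"),
    ({"#resist": 0, "#maga": 0}, "neutral"),
]

def predict(features):
    """Toy predictor: return the label of the most similar training example."""
    def similarity(example):
        # Overlap between the new user's counts and a training example's counts.
        return sum(min(features.get(k, 0), v) for k, v in example.items())
    best_example, best_label = max(training_data, key=lambda pair: similarity(pair[0]))
    return best_label
```

A real model replaces the similarity lookup with learned weights, but the shape of the problem is the same: labelled examples in, a category out.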
Determining the inputs for the model can be somewhat of an art. For simple problems, the inputs can be obvious. For example, if one were training an ML model to predict daylight hours, the inputs would include the date, time of day and location, and the output would be a simple boolean value such as “is daytime”.
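Framed as data, that daylight example might look something like the rows below (the encoding and values are illustrative; a trained model would learn the mapping from inputs to output):

```python
# Each row maps (date, time, location) inputs to a boolean "daytime" output.
daylight_examples = [
    # (day_of_year, minutes_since_midnight, latitude, longitude) -> daytime?
    ((172, 720, 40.7, -74.0), True),    # June 21, noon, New York
    ((172, 1380, 40.7, -74.0), False),  # June 21, 11 pm, New York
    ((355, 720, -33.9, 151.2), True),   # Dec 21, noon, Sydney (summer there)
]
```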
For my project, input data is constrained to data available from the Twitter API, which includes a user’s Twitter profile data (bio/description, @name, name, location, profile & banner images, followers and friends) and tweet data (words, emoji, urls, other Twitter users mentioned, #hashtags, images & videos sent, retweets, etc). That’s potentially a large number of inputs for each user.
If I had access to unlimited compute resources and time, I could use every available bit of information for each user to train the ML model. But because I’m doing this on my own dime, and not all user information is useful, I need some way to pick the smallest set of inputs that is relevant to the task but still yields acceptable training and prediction performance.
I’ve used what I call a histogram approach to choosing inputs. Generally, I tally how often an input (like a particular #hashtag) appears in all of my labelled data (users that I’ve categorized), along with the relative frequency of that input by category. If an input’s frequency is above a certain threshold and it appears predominantly in one category, then it’s included as an input.
For example, the #resist hashtag appears often, and far more often in users that I’ve categorized as “left” than in “right” or “neutral” users. Likewise, #maga appears more in “right” users. By analyzing all of the manually categorized users, I can produce a set of inputs that I believe will be most useful in predicting the political category of a Twitter user. I then limit the number of inputs to something reasonable for my available resources, typically between 1,000 and 2,000 inputs per user. I have automated the process of producing the input set: a new set of inputs is generated each day, using user data that is also updated continually.
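A rough sketch of that histogram selection in Python (the function name, thresholds, and data shapes are made up for illustration; this isn’t my actual code):

```python
from collections import Counter, defaultdict

def select_inputs(labeled_users, min_count=50, dominance=0.6, max_inputs=2000):
    """Pick inputs that are frequent overall AND concentrated in one category.

    labeled_users: iterable of (hashtags, category) pairs, one per labelled user.
    """
    totals = Counter()
    per_category = defaultdict(Counter)
    for hashtags, category in labeled_users:
        for tag in hashtags:
            totals[tag] += 1
            per_category[category][tag] += 1

    selected = []
    for tag, count in totals.most_common():
        if count < min_count:
            break  # most_common() is sorted, so nothing rarer will qualify
        # Fraction of this tag's occurrences held by its most dominant category.
        top_share = max(per_category[c][tag] for c in per_category) / count
        if top_share >= dominance:
            selected.append(tag)
    return selected[:max_inputs]
```

With this shape, a tag like #resist (frequent, concentrated in “left” users) survives both filters, while a tag spread evenly across categories is dropped no matter how common it is.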
Here’s an example of a JSON object that describes an input set with 1556 inputs, with input types of:
- emoji [186 inputs]
- hashtags 
- locations 
- n-grams  (2 and 3 word sequences)
- places  (Twitter geo objects)
- language sentiment  (Google’s Natural Language sentiment analysis)
- urls 
- user mentions 
- words 
- friends 
- media  (videos & photos)
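To give a feel for the shape of that object, here’s a hypothetical, much-abbreviated version built in Python (field names and values are invented; the real input set has 1556 inputs across the types listed above):

```python
import json

# Hypothetical sketch of the daily input-set object; the actual schema
# and contents are not reproduced in this post.
input_set = {
    "generated": "2019-04-01",           # regenerated each day
    "total_inputs": 1556,
    "inputs": {
        "emoji": ["🇺🇸", "🌊"],           # 186 in the real set
        "hashtags": ["#resist", "#maga"],
        "ngrams": ["fake news", "health care"],
        "words": ["progressive", "conservative"],
    },
}

print(json.dumps(input_set, ensure_ascii=False, indent=2))
```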
The next step is the construction and evolution/training of a neural net using the labelled data (manually categorized Twitter users). I’ll discuss this process in the next part of this series.