When using machine learning algorithms to obtain insights and predictions, there’s no one approach that fits every situation. That’s why students preparing for a profession that requires data analysis must learn not only a wide array of algorithms but also how to decide which one(s) are appropriate for the task at hand.
Experts don’t always agree on which algorithms to use when extracting information from large data sets. However, the following are some of the machine learning algorithms that are seen as essential for professionals who work with big data, including those in graduate applied economics programs.
Types of Machine Learning
Before moving into some of the most popular algorithms, it’s important to understand that machine learning can be broken down into three categories. They are:
Supervised learning. Using pairs of input-output data to learn the mapping function between the input and the output. The learned mapping can then be applied to new, similar inputs to predict their outputs. Examples include predicting labels such as male or female, or sick or healthy.
Unsupervised learning. In this case, only input data is given, and the algorithm searches for interesting underlying patterns in the overall data. A simple example of this would be analyzing millions of receipts to determine the probability that a person ordering a beer in a restaurant will also order an appetizer. A more complex example is identifying clusters or groups (of persons, articles, or treatments) that are similar enough to one another. Developing measures of ‘similar enough’ requires careful analysis and treatment of the data but can lead to identifying surprising subpopulations of interest. (A short code sketch contrasting supervised and unsupervised learning follows this list.)
Reinforcement learning. Learning the best course of action based on continuous feedback from past actions. For example, learning the best path through a video game by playing enough times to determine which actions earn the most rewards. This type of learning is typically used in robotics, with another well-known application being the algorithms behind self-driving cars.
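To make the first two categories concrete, here is a minimal sketch, assuming the scikit-learn library and tiny made-up arrays used purely for illustration: the supervised model is given both inputs and outputs, while the unsupervised model is given only inputs.

```python
# Minimal sketch contrasting supervised and unsupervised learning
# (assumes scikit-learn; the tiny arrays are invented for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # inputs
y = np.array([2.1, 3.9, 6.2, 8.1])          # known outputs (labels)

# Supervised: inputs and outputs are both provided; the model learns the mapping.
supervised = LinearRegression().fit(X, y)
print(supervised.predict([[5.0]]))          # predicted output for a new input

# Unsupervised: only inputs are provided; the model looks for structure (here, 2 clusters).
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(unsupervised.labels_)                 # cluster assigned to each input
```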
The following presents some examples of these big data concepts and machine learning algorithms. All of them involve supervised learning, except for K-means and Principal Component Analysis, which are unsupervised.
Linear Regression
This involves estimating the strength of the linear relationship between a set of input variables and an output variable. An example would be a linear regression model estimating the linear relationship between people’s height (input variable) and weight (output variable), including the impact of an additional inch of height on a person’s weight.
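Here is a minimal sketch of that height-and-weight example, assuming scikit-learn; the heights (in inches) and weights (in pounds) are made-up values used only for illustration.

```python
# Linear regression sketch (assumes scikit-learn; the data points are invented).
import numpy as np
from sklearn.linear_model import LinearRegression

heights = np.array([[60.0], [64.0], [66.0], [68.0], [70.0], [72.0], [74.0]])  # input variable
weights = np.array([115.0, 130.0, 140.0, 152.0, 160.0, 172.0, 185.0])         # output variable

model = LinearRegression().fit(heights, weights)

# The estimated slope is the impact of one additional inch of height on weight.
print(f"Estimated pounds per additional inch: {model.coef_[0]:.1f}")
print(f"Predicted weight at 69 inches: {model.predict([[69.0]])[0]:.1f}")
```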
Logistic Regression
Whereas linear regression is focused on continuous outputs, logistic regression involves using input variables to determine discrete, usually binary, outputs. For example, weight is a continuous value, while the number of people who go to a hospital’s emergency room each year with a myocardial infarction is a discrete value. An example of logistic regression would involve estimating the likelihood that a person will be hospitalized within 30 days, given their medical history (diagnoses, hospitalizations, medications, lab results, etc.).
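A minimal sketch in the spirit of that 30-day hospitalization example follows, assuming scikit-learn; the two features (prior hospitalizations and number of active diagnoses) and the labels are invented stand-ins for a real medical history.

```python
# Logistic regression sketch (assumes scikit-learn; features and labels are invented).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [prior hospitalizations in the past year, number of active diagnoses]
X = np.array([[0, 1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 6]])
y = np.array([0, 0, 0, 1, 1, 1])  # 1 = hospitalized within 30 days, 0 = not

model = LogisticRegression().fit(X, y)

# The model returns a probability, not just a yes/no answer.
new_patient = np.array([[2, 5]])
print(f"Estimated probability of hospitalization: {model.predict_proba(new_patient)[0, 1]:.2f}")
print(f"Predicted outcome (0 or 1): {model.predict(new_patient)[0]}")
```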
Decision Trees
A decision tree is a chart that supports decision making. The chart maps out all possible (relevant) decisions and estimates the probability of an event happening given each decision path. In machine learning, the tree is learned from data by repeatedly splitting on the input variables that best separate the outcomes.
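Here is a minimal sketch of a decision tree classifier, assuming scikit-learn; the two features (an age and a yes/no indicator) and the labels form a toy dataset used only to show the learned chart of decisions.

```python
# Decision tree sketch (assumes scikit-learn; the toy dataset is invented).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[25, 0], [30, 1], [45, 0], [50, 1], [60, 1], [65, 0]])  # e.g., [age, smoker]
y = np.array([0, 0, 0, 1, 1, 1])                                      # whether the event occurred

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the chart of decisions the tree learned ...
print(export_text(tree, feature_names=["age", "smoker"]))
# ... and the estimated probability of the event for a new case, given its path.
print(tree.predict_proba([[55, 1]]))
```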
Naive Bayes
This method uses Bayes’ Theorem to build classifiers. Why naive? Because this big data concept deliberately (and naively) assumes that all input variables are independent of one another. Even so, it has proven to be a very dependable predictive model. It’s used to determine the probability that something will happen, given that another event has already happened.
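A minimal sketch of a Naive Bayes classifier, assuming scikit-learn’s GaussianNB and a tiny invented dataset used only for illustration:

```python
# Naive Bayes sketch (assumes scikit-learn; the dataset is invented).
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 20.0], [1.5, 22.0], [2.0, 25.0],
              [6.0, 60.0], [7.0, 65.0], [8.0, 70.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # two classes of observations

# Each feature is (naively) treated as independent of the others within a class.
model = GaussianNB().fit(X, y)

# Probability of each class for a new observation, given the evidence.
print(model.predict_proba([[5.0, 55.0]]))
```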
K-means
K-means is a machine learning algorithm used for unsupervised learning. It involves placing similar data points into a predetermined number of clusters, based on some measure of distance between them; an iterative “tuning” process then tries to identify the optimal position of the “means”, or central points, of these clusters. Once the central point of each cluster is determined, each (new) item is assigned to the cluster whose center point it is closest to. The number of clusters is generally not known a priori and will depend on the purpose of the problem being solved.
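A minimal sketch of K-means, assuming scikit-learn; the two-dimensional points and the choice of three clusters are made up for illustration.

```python
# K-means sketch (assumes scikit-learn; the points and cluster count are invented).
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.8],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.2],
                   [1.0, 9.0], [1.5, 8.5], [2.0, 9.1]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

print(kmeans.cluster_centers_)       # the "means", i.e., the central point of each cluster
print(kmeans.labels_)                # the cluster each original point was assigned to
print(kmeans.predict([[1.2, 1.5]]))  # assign a new point to the nearest center
```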
Principal Component Analysis
Principal Component Analysis (PCA) uses an “orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components,” according to the data analysis-focused website KDnuggets. In practice, the first few principal components often capture most of the variation in the data, which makes PCA a popular way to reduce the number of variables before further analysis.
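A minimal sketch of PCA, assuming scikit-learn; the correlated three-variable dataset is generated at random purely for illustration.

```python
# PCA sketch (assumes scikit-learn; the random, correlated data is invented).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
# Three variables, two of which are strongly correlated with each other.
X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.1, size=100), rng.normal(size=100)])

# Transform the correlated variables into linearly uncorrelated principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(X)

print(pca.explained_variance_ratio_)  # share of the variance captured by each component
print(components.shape)               # the data re-expressed with 2 components instead of 3
```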
Beginners should become familiar with these machine learning algorithms and big data concepts as they learn to work with large data sets. Depending on the data challenge you are attempting to solve, each one can be the right approach.