# Important Definitions

Term | Definition |
---|---|

A/B testing | A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival. A/B testing aims to determine not only which technique performs better but also to understand whether the difference is statistically significant. A/B testing usually considers only two techniques using one measurement, but it can be applied to any finite number of techniques and measures. |

accuracy | The fraction of predictions that a classification model got right. In multi-class classification, accuracy is the number of correct predictions divided by the total number of examples. |
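
As a minimal sketch of the accuracy definition above, the following Python computes the fraction of predictions that match their labels. The function name `accuracy` and the sample data are illustrative assumptions, not part of any library.

```python
def accuracy(labels, predictions):
    """Return the fraction of predictions that equal their labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)


# Hypothetical three-class example: 3 of the 4 predictions are correct.
labels = ["cat", "dog", "bird", "dog"]
predictions = ["cat", "dog", "dog", "dog"]
print(accuracy(labels, predictions))  # 0.75
```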

activation function | A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer. |

active learning | A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning. |

AdaGrad | A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate . For a full explanation, see this paper . |

AR | Abbreviation for augmented reality . |

area under the PR curve | See PR AUC (Area under the PR Curve) . |

area under the ROC curve | See AUC (Area under the ROC curve) . |

artificial general intelligence | A non-human mechanism that demonstrates a broad range of problem solving, creativity, and adaptability. For example, a program demonstrating artificial general intelligence could translate text, compose symphonies, and excel at games that have not yet been invented. |

artificial intelligence | A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence. |

AUC (Area under the ROC Curve) | An evaluation metric that considers all possible classification thresholds . |

average precision | A metric for summarizing the performance of a ranked sequence of results. Average precision is calculated by taking the average of the precision values for each relevant result (each result in the ranked list where the recall increases relative to the previous result). |
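
One way to read the average precision definition is sketched below. It assumes the ranked results are given as a list of booleans (`True` for a relevant result) and that every relevant result appears somewhere in the list; the function name `average_precision` is hypothetical.

```python
def average_precision(relevance):
    """Average the precision measured at each relevant result.

    relevance: booleans in ranked order, True where the result is relevant
    (that is, where recall increases relative to the previous result).
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank: relevant results so far / results so far.
            precisions.append(hits / rank)
    if not precisions:
        return 0.0
    return sum(precisions) / len(precisions)


# Relevant results at ranks 1 and 3: precisions are 1/1 and 2/3.
print(average_precision([True, False, True, False]))
```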

backpropagation | The primary algorithm for performing gradient descent on neural networks . First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph. |

bag of words | A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents the following three phrases identically: |

baseline | A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model . |

batch | The set of examples used in one iteration (that is, one gradient update) of model training . |

batch normalization | Normalizing the input or output of the activation functions in a hidden layer . Batch normalization can provide the following benefits: |

batch size | The number of examples in a batch. For example, the batch size of SGD is 1, while the batch size of a mini-batch is usually between 10 and 1,000. Batch size is usually fixed during training and inference; however, TensorFlow does permit dynamic batch sizes. |

Bayesian neural network | A probabilistic neural network that accounts for uncertainty in weights and outputs. A standard neural network regression model typically predicts a scalar value; for example, a model predicts a house price of 853,000. By contrast, a Bayesian neural network predicts a distribution of values; for example, a model predicts a house price of 853,000 with a standard deviation of 67,200. A Bayesian neural network relies on Bayes' Theorem to calculate uncertainties in weights and predictions. A Bayesian neural network can be useful when it is important to quantify uncertainty, such as in models related to pharmaceuticals. Bayesian neural networks can also help prevent overfitting . |

bias (math) | An intercept or offset from an origin. Bias (also known as the bias term) is referred to as $b$ or $w_0$ in machine learning models. For example, bias is the $b$ in the following formula: $y' = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$ |

binary classification | A type of classification task that outputs one of two mutually exclusive classes . For example, a machine learning model that evaluates email messages and outputs either "spam" or "not spam" is a binary classifier . |

binning | See bucketing . |

boosting | A machine learning technique that iteratively combines a set of simple and not very accurate classifiers (referred to as "weak" classifiers) into a classifier with high accuracy (a "strong" classifier) by upweighting the examples that the model is currently misclassifying. |

broadcasting | Expanding the shape of an operand in a matrix math operation to dimensions compatible for that operation. For instance, linear algebra requires that the two operands in a matrix addition operation must have the same dimensions. Consequently, you can't add a matrix of shape (m, n) to a vector of length n. Broadcasting enables this operation by virtually expanding the vector of length n to a matrix of shape (m,n) by replicating the same values down each column. |
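
The sketch below imitates broadcasting in plain Python so that the "virtual expansion" is visible: the length-n vector is reused for every row of the (m, n) matrix rather than copied into a full matrix first. The helper name `add_broadcast` is an assumption for illustration; libraries such as NumPy perform this expansion automatically.

```python
def add_broadcast(matrix, vector):
    """Add a length-n vector to every row of an (m, n) matrix."""
    n = len(vector)
    if any(len(row) != n for row in matrix):
        raise ValueError("every matrix row must have the same length as the vector")
    # The same vector values are applied to each row (replicated down each column).
    return [[a + b for a, b in zip(row, vector)] for row in matrix]


matrix = [[1, 2, 3],
          [4, 5, 6]]       # shape (2, 3)
vector = [10, 20, 30]      # length 3
print(add_broadcast(matrix, vector))  # [[11, 22, 33], [14, 25, 36]]
```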

bucketing | Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically based on value range. For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, all temperatures between 0.0 and 15.0 degrees could be put into one bin, 15.1 to 30.0 degrees could be a second bin, and 30.1 to 50.0 degrees could be a third bin. |
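
The temperature example can be sketched as follows. The function name `bucketize` and the choice to return a one-hot list (one binary feature per bin) are assumptions made for illustration; the bin edges are the ones given in the definition.

```python
def bucketize(temperature, upper_bounds=(15.0, 30.0, 50.0)):
    """Return a one-hot list marking the bin that contains `temperature`.

    Bins follow the definition above: 0.0-15.0, 15.1-30.0, and 30.1-50.0.
    """
    if temperature < 0.0 or temperature > upper_bounds[-1]:
        raise ValueError("temperature is outside the binned range")
    one_hot = [0] * len(upper_bounds)
    for index, upper in enumerate(upper_bounds):
        if temperature <= upper:
            one_hot[index] = 1
            break
    return one_hot


print(bucketize(15.0))  # [1, 0, 0]
print(bucketize(15.1))  # [0, 1, 0]
print(bucketize(42.3))  # [0, 0, 1]
```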

calibration layer | A post-prediction adjustment, typically to account for prediction bias . The adjusted predictions and probabilities should match the distribution of an observed set of labels. |

candidate sampling | A training-time optimization in which a probability is calculated for all the positive labels, using, for example, softmax, but only for a random sample of negative labels. For example, if we have an example labeled beagle and dog, candidate sampling computes the predicted probabilities and corresponding loss terms for the beagle and dog class outputs in addition to a random subset of the remaining classes (cat, lollipop, fence). The idea is that the negative classes can learn from less frequent negative reinforcement as long as positive classes always get proper positive reinforcement, and this is indeed observed empirically. The motivation for candidate sampling is a computational efficiency win from not computing predictions for all negatives. |

categorical data | Features having a discrete set of possible values. For example, consider a categorical feature named house style, which has a discrete set of three possible values: Tudor, ranch, colonial. By representing house style as categorical data, the model can learn the separate impacts of Tudor, ranch, and colonial on house price. |

checkpoint | Data that captures the state of the variables of a model at a particular time. Checkpoints enable exporting model weights , as well as performing training across multiple sessions. Checkpoints also enable training to continue past errors (for example, job preemption). Note that the graph itself is not included in a checkpoint. |

class | One of a set of enumerated target values for a label. For example, in a binary classification model that detects spam, the two classes are spam and not spam . In a multi-class classification model that identifies dog breeds, the classes would be poodle , beagle , pug , and so on. |

classification model | A type of machine learning model for distinguishing among two or more discrete classes. For example, a natural language processing classification model could determine whether an input sentence was in French, Spanish, or Italian. Compare with regression model . |

classification threshold | A scalar-value criterion that is applied to a model's predicted score in order to separate the positive class from the negative class . Used when mapping logistic regression results to binary classification . For example, consider a logistic regression model that determines the probability of a given email message being spam. If the classification threshold is 0.9, then logistic regression values above 0.9 are classified as spam and those below 0.9 are classified as not spam . |

class-imbalanced dataset | A binary classification problem in which the labels for the two classes have significantly different frequencies. For example, a disease dataset in which 0.0001 of examples have positive labels and 0.9999 have negative labels is a class-imbalanced problem, but a football game predictor in which 0.51 of examples label one team winning and 0.49 label the other team winning is not a class-imbalanced problem. |

clipping | A technique for handling outliers . Specifically, reducing feature values that are greater than a set maximum value down to that maximum value. Also, increasing feature values that are less than a specific minimum value up to that minimum value. |
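
A minimal sketch of clipping, assuming the feature values arrive as a plain list of numbers; the name `clip` and the bounds used below are illustrative.

```python
def clip(values, minimum, maximum):
    """Pull values below `minimum` up and values above `maximum` down."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    return [min(max(value, minimum), maximum) for value in values]


# Hypothetical feature values with two outliers (-5 and 250).
print(clip([-5, 3, 12, 250], minimum=0, maximum=100))  # [0, 3, 12, 100]
```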

co-adaptation | When neurons predict patterns in training data by relying almost exclusively on outputs of specific other neurons instead of relying on the network's behavior as a whole. When the patterns that cause co-adaptation are not present in validation data, then co-adaptation causes overfitting. Dropout regularization reduces co-adaptation because dropout ensures neurons cannot rely solely on specific other neurons. |

confusion matrix | An NxN table that summarizes how successful a classification model's predictions were; that is, the correlation between the label and the model's classification. One axis of a confusion matrix is the label that the model predicted, and the other axis is the actual label. N represents the number of classes . In a binary classification problem, N=2. For example, here is a sample confusion matrix for a binary classification problem: |
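
To make the NxN layout concrete, the sketch below counts (actual, predicted) pairs for a binary spam problem. The orientation (rows are actual labels, columns are predicted labels), the function name `confusion_matrix`, and the sample data are all assumptions for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Return an NxN list of lists: rows = actual label, columns = predicted label."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


classes = ["spam", "not spam"]
actual = ["spam", "spam", "not spam", "not spam", "not spam"]
predicted = ["spam", "not spam", "not spam", "spam", "not spam"]

for label, row in zip(classes, confusion_matrix(actual, predicted, classes)):
    print(label, row)
```

With this orientation, the diagonal holds the correct predictions and the off-diagonal cells hold the false negatives and false positives.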

continuous feature | A floating-point feature with an infinite range of possible values. Contrast with discrete feature . |

convenience sampling | Using a dataset not gathered scientifically in order to run quick experiments. Later on, it's essential to switch to a scientifically gathered dataset. |

convergence | Informally, often refers to a state reached during training in which training loss and validation loss change very little or not at all with each iteration after a certain number of iterations. In other words, a model reaches convergence when additional training on the current data will not improve the model. In deep learning , loss values sometimes stay constant or nearly so for many iterations before finally descending, temporarily producing a false sense of convergence. |

convex function | A function in which the region above the graph of the function is a convex set . The prototypical convex function is shaped something like the letter U . For example, the following are all convex functions: |

convex optimization | The process of using mathematical techniques such as gradient descent to find the minimum of a convex function . A great deal of research in machine learning has focused on formulating various problems as convex optimization problems and in solving those problems more efficiently. |

convex set | A subset of Euclidean space such that a line drawn between any two points in the subset remains completely within the subset. For instance, the following two shapes are convex sets: |

cost | Synonym for loss . |

counterfactual fairness | fairness metric |

crash blossom | A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in natural language understanding . For example, the headline Red Tape Holds Up Skyscraper is a crash blossom because an NLU model could interpret the headline literally or figuratively. |

cross-entropy | A generalization of Log Loss to multi-class classification problems . Cross-entropy quantifies the difference between two probability distributions. See also perplexity . |

cross-validation | A mechanism for estimating how well a model will generalize to new data by testing the model against one or more non-overlapping data subsets withheld from the training set . |

data analysis | Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be particularly useful when a dataset is first received, before one builds the first model. It is also crucial in understanding experiments and debugging problems with the system. |

DataFrame | A popular datatype for representing datasets in pandas . A DataFrame is analogous to a table. Each column of the DataFrame has a name (a header), and each row is identified by a number. |

data set or dataset | A collection of examples . |

decision boundary | The separator between classes learned by a model in binary or multi-class classification problems. For example, in the following image representing a binary classification problem, the decision boundary is the frontier between the orange class and the blue class: |

decision threshold | Synonym for classification threshold . |

decision tree | A model represented as a sequence of branching statements. For example, the following over-simplified decision tree branches a few times to predict the price of a house (in thousands of USD). According to this decision tree, a house larger than 160 square meters, having more than three bedrooms, and built less than 10 years ago would have a predicted price of 510 thousand USD. |

deep model | A type of neural network containing multiple hidden layers . |

deep neural network | Synonym for deep model . |

dense feature | A feature in which most values are non-zero, typically a Tensor of floating-point values. Contrast with sparse feature . |

dense layer | Synonym for fully connected layer . |

depth | The number of layers (including any embedding layers) in a neural network that learn weights. For example, a neural network with 5 hidden layers and 1 output layer has a depth of 6. |

dimension reduction | Decreasing the number of dimensions used to represent a particular feature in a feature vector, typically by converting to an embedding . |

dimensions | Overloaded term having any of the following definitions: |

discrete feature | A feature with a finite set of possible values. For example, a feature whose values may only be animal , vegetable , or mineral is a discrete (or categorical) feature. Contrast with continuous feature . |

discriminative model | A model that predicts labels from a set of one or more features. More formally, discriminative models define the conditional probability of an output given the features and weights; that is: |

discriminator | A system that determines whether examples are real or fake. |

dropout regularization | A form of regularization useful in training neural networks . Dropout regularization works by removing a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks. For full details, see Dropout: A Simple Way to Prevent Neural Networks from Overfitting . |

dynamic model | A model that is trained online in a continuously updating fashion. That is, data is continuously entering the model. |

early stopping | A method for regularization that involves ending model training before training loss finishes decreasing. In early stopping, you end model training when the loss on a validation dataset starts to increase, that is, when generalization performance worsens. |

embeddings | A categorical feature represented as a continuous-valued feature. Typically, an embedding is a translation of a high-dimensional vector into a low-dimensional space. For example, you can represent the words in an English sentence in either of the following two ways: |

embedding space | The d-dimensional vector space that features from a higher-dimensional vector space are mapped to. Ideally, the embedding space contains a structure that yields meaningful mathematical results; for example, in an ideal embedding space, addition and subtraction of embeddings can solve word analogy tasks. |

empirical risk minimization (ERM) | Choosing the function that minimizes loss on the training set. Contrast with structural risk minimization . |

ensemble | A merger of the predictions of multiple models . You can create an ensemble via one or more of the following: |

epoch | A full training pass over the entire dataset such that each example has been seen once. Thus, an epoch represents N / batch size training iterations , where N is the total number of examples. |

equality of opportunity | fairness metric |

equalized odds | fairness metric |

example | One row of a dataset. An example contains one or more features and possibly a label . See also labeled example and unlabeled example . |

false negative (FN) | An example in which the model mistakenly predicted the negative class . For example, the model inferred that a particular email message was not spam (the negative class), but that email message actually was spam. |

false positive (FP) | An example in which the model mistakenly predicted the positive class . For example, the model inferred that a particular email message was spam (the positive class), but that email message was actually not spam. |

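false positive rate (FPR) | The x-axis in an ROC curve. The false positive rate is defined as follows: $\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}$, where FP is the number of false positives and TN is the number of true negatives. |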
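
The false positive rate is conventionally computed as false positives divided by all actual negatives, FP / (FP + TN). The sketch below assumes labels and predictions are booleans (`True` for the positive class); the function name is hypothetical.

```python
def false_positive_rate(labels, predictions):
    """Compute FP / (FP + TN) for boolean labels and predictions."""
    pairs = list(zip(labels, predictions))
    false_positives = sum(1 for y, p in pairs if p and not y)
    true_negatives = sum(1 for y, p in pairs if not p and not y)
    negatives = false_positives + true_negatives
    if negatives == 0:
        raise ValueError("FPR is undefined without negative examples")
    return false_positives / negatives


# Four actual negatives, one of which is wrongly predicted positive.
labels = [True, False, False, False, False]
predictions = [True, True, False, False, False]
print(false_positive_rate(labels, predictions))  # 0.25
```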

feature | An input variable used in making predictions . |

feature cross | A synthetic feature formed by crossing (taking a Cartesian product of) individual binary features obtained from categorical data or from continuous features via bucketing . Feature crosses help represent nonlinear relationships. |

feature engineering | The process of determining which features might be useful in training a model, and then converting raw data from log files and other sources into said features. In TensorFlow, feature engineering often means converting raw log file entries to tf.Example protocol buffers. See also tf.Transform . |

feature extraction | Overloaded term having either of the following definitions: |

feature set | The group of features your machine learning model trains on. For example, postal code, property size, and property condition might comprise a simple feature set for a model that predicts housing prices. |

feature vector | The list of feature values representing an example passed into a model. |

federated learning | A distributed machine learning approach that trains machine learning models using decentralized examples residing on devices such as smartphones. In federated learning, a subset of devices downloads the current model from a central coordinating server. The devices use the examples stored on the devices to make improvements to the model. The devices then upload the model improvements (but not the training examples) to the coordinating server, where they are aggregated with other updates to yield an improved global model. After the aggregation, the model updates computed by devices are no longer needed, and can be discarded. |

feedback loop | In machine learning, a situation in which a model's predictions influence the training data for the same model or another model. For example, a model that recommends movies will influence the movies that people see, which will then influence subsequent movie recommendation models. |

feedforward neural network (FFN) | A neural network without cyclic or recursive connections. For example, traditional deep neural networks are feedforward neural networks. Contrast with recurrent neural networks , which are cyclic. |

few-shot learning | A machine learning approach, often used for object classification, designed to learn effective classifiers from only a small number of training examples. |

fine tuning | Performing a secondary optimization to adjust the parameters of an already trained model to fit a new problem. Fine tuning often refers to refitting the weights of a trained unsupervised model to a supervised model. |

full softmax | See softmax . Contrast with candidate sampling . |

fully connected layer | A hidden layer in which each node is connected to every node in the subsequent hidden layer. |

GAN | Abbreviation for generative adversarial network . |

generalization | Refers to your model's ability to make correct predictions on new, previously unseen data as opposed to the data used to train the model. |

generalization curve | A loss curve showing both the training set and the validation set . A generalization curve can help you detect possible overfitting . For example, the following generalization curve suggests overfitting because loss for the validation set ultimately becomes significantly higher than for the training set. |

generalized linear model | A generalization of least squares regression models, which are based on Gaussian noise , to other types of models based on other types of noise, such as Poisson noise or categorical noise. Examples of generalized linear models include: |

generative adversarial network (GAN) | A system to create new data in which a generator creates data and a discriminator determines whether that created data is valid or invalid. |

generative model | Practically speaking, a model that does either of the following: |

generator | The subsystem within a generative adversarial network that creates new examples . |

gradient | The vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent. |

gradient descent | A technique to minimize loss by computing the gradients of loss with respect to the model's parameters, conditioned on training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of weights and bias to minimize loss. |
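
The iterative adjustment described above can be sketched for a single weight. The loss $(w - 3)^2$, the starting point, the learning rate, and the step count are arbitrary assumptions chosen so the result is easy to check; the function name `gradient_descent` is illustrative.

```python
def gradient_descent(gradient_fn, initial_weight, learning_rate, steps):
    """Repeatedly move the weight against the gradient of the loss."""
    weight = initial_weight
    for _ in range(steps):
        # The gradient points toward steepest ascent, so subtract it to descend.
        weight -= learning_rate * gradient_fn(weight)
    return weight


# Loss (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
final_weight = gradient_descent(
    gradient_fn=lambda w: 2 * (w - 3),
    initial_weight=0.0,
    learning_rate=0.1,
    steps=100,
)
print(final_weight)  # very close to 3
```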

ground truth | The correct answer. Reality. Since reality is often subjective, expert raters typically are the proxy for ground truth. |

hashing | In machine learning, a mechanism for bucketing categorical data , particularly when the number of categories is large, but the number of categories actually appearing in the dataset is comparatively small. |

heuristic | A quick solution to a problem, which may or may not be the best solution. For example, "With a heuristic, we achieved 86% accuracy. When we switched to a deep neural network, accuracy went up to 98%." |

hidden layer | A synthetic layer in a neural network between the input layer (that is, the features) and the output layer (the prediction). Hidden layers typically contain an activation function (such as ReLU ) for training. A deep neural network contains more than one hidden layer. |

hinge loss | A family of loss functions for classification designed to find the decision boundary as distant as possible from each training example, thus maximizing the margin between examples and the boundary. KSVMs use hinge loss (or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as follows: $\text{loss} = \max(0, 1 - (y \cdot y'))$, where $y'$ is the raw output of the classifier model and $y$ is the true label, either $-1$ or $+1$. |
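
A minimal sketch of the standard binary hinge loss, max(0, 1 - y * y'), assuming the true label y is encoded as -1 or +1 and y' is the classifier's raw output. The function name `hinge_loss` is illustrative.

```python
def hinge_loss(label, raw_output):
    """Binary hinge loss for a label in {-1, +1} and a raw model output."""
    if label not in (-1, 1):
        raise ValueError("label must be -1 or +1")
    return max(0.0, 1.0 - label * raw_output)


print(hinge_loss(1, 2.0))   # 0.0: correct side, outside the margin
print(hinge_loss(1, 0.5))   # 0.5: correct side, but inside the margin
print(hinge_loss(-1, 0.5))  # 1.5: wrong side of the boundary
```

The loss is zero only when the example is classified correctly with a margin of at least 1, which is what pushes the boundary away from the training examples.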

holdout data | Examples intentionally not used ("held out") during training. The validation dataset and test dataset are examples of holdout data. Holdout data helps evaluate your model's ability to generalize to data other than the data it was trained on. The loss on the holdout set provides a better estimate of the loss on an unseen dataset than does the loss on the training set. |

hyperparameter | The "knobs" that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter. |

hyperplane | A boundary that separates a space into two subspaces. For example, a line is a hyperplane in two dimensions and a plane is a hyperplane in three dimensions. More typically in machine learning, a hyperplane is the boundary separating a high-dimensional space. Kernel Support Vector Machines use hyperplanes to separate positive classes from negative classes, often in a very high-dimensional space. |

i.i.d. | Abbreviation for independently and identically distributed . |

imbalanced dataset | Synonym for class-imbalanced dataset . |

independently and identically distributed (i.i.d.) | Data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on values that have been drawn previously. I.i.d. data is the ideal gas of machine learning: a useful mathematical construct but almost never exactly found in the real world. For example, the distribution of visitors to a web page may be i.i.d. over a brief window of time; that is, the distribution doesn't change during that brief window and one person's visit is generally independent of another's visit. However, if you expand that window of time, seasonal differences in the web page's visitors may appear. |

inference | In machine learning, often refers to the process of making predictions by applying the trained model to unlabeled examples . In statistics, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. (See the Wikipedia article on statistical inference .) |

input layer | The first layer (the one that receives the input data) in a neural network . |

instance | Synonym for example . |

interpretability | The degree to which a model's predictions can be readily explained. Deep models are often non-interpretable; that is, a deep model's different layers can be hard to decipher. By contrast, linear regression models and wide models are typically far more interpretable. |

inter-rater agreement | A measurement of how often human raters agree when doing a task. If raters disagree, the task instructions may need to be improved. Also sometimes called inter-annotator agreement or inter-rater reliability . See also Cohen's kappa , which is one of the most popular inter-rater agreement measurements. |

IoU | Abbreviation for intersection over union . |

iteration | A single update of a model's weights during training. An iteration consists of computing the gradients of the loss with respect to the parameters on a single batch of data. |

Keras | A popular Python machine learning API. Keras runs on several deep learning frameworks, including TensorFlow, where it is made available as tf.keras . |

Kernel Support Vector Machines (KSVMs) | A classification algorithm that seeks to maximize the margin between positive and negative classes by mapping input data vectors to a higher dimensional space. For example, consider a classification problem in which the input dataset has a hundred features. To maximize the margin between positive and negative classes, a KSVM could internally map those features into a million-dimension space. KSVMs use a loss function called hinge loss. |

L1 loss | Loss function based on the absolute value of the difference between the values that a model is predicting and the actual values of the labels. L1 loss is less sensitive to outliers than L2 loss. |
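
The outlier sensitivity mentioned above can be seen in a small comparison. This sketch assumes both losses are summed over the examples (absolute differences for L1, squared differences for L2); the function names and the data are illustrative.

```python
def l1_loss(labels, predictions):
    """Sum of absolute differences between labels and predictions."""
    return sum(abs(y - y_hat) for y, y_hat in zip(labels, predictions))


def l2_loss(labels, predictions):
    """Sum of squared differences between labels and predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions))


labels = [1, 2, 3, 4]
predictions = [1, 2, 3, 14]  # the last prediction is off by 10

print(l1_loss(labels, predictions))  # 10
print(l2_loss(labels, predictions))  # 100
```

A single error of 10 contributes 10 to the L1 loss but 100 to the L2 loss, so one outlier dominates the squared loss far more.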

L1 regularization | A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model. Contrast with L2 regularization. |

L2 loss | See squared loss. |

L2 regularization | A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. (Contrast with L1 regularization.) L2 regularization often improves generalization in linear models. |

label | In supervised learning, the "answer" or "result" portion of an example . Each example in a labeled dataset consists of one or more features and a label. For instance, in a housing dataset, the features might include the number of bedrooms, the number of bathrooms, and the age of the house, while the label might be the house's price. In a spam detection dataset, the features might include the subject line, the sender, and the email message itself, while the label would probably be either "spam" or "not spam." |

labeled example | An example that contains features and a label . In supervised training, models learn from labeled examples. |

lambda | Synonym for regularization rate . |

layer | A set of neurons in a neural network that process a set of input features, or the output of those neurons. |

learning rate | A scalar used to train a model via gradient descent. During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step . |

least squares regression | A linear regression model trained by minimizing L2 loss. |

linear model | A model that assigns one weight per feature to make predictions . (Linear models also incorporate a bias .) By contrast, the relationship of weights to features in deep models is not one-to-one. |

linear regression | Using the raw output ($y'$) of a linear model as the actual prediction in a regression model. The goal of a regression problem is to make a real-valued prediction. For example, if the raw output ($y'$) of a linear model is 8.37, then the prediction is 8.37. |

logistic regression | A classification model that uses a sigmoid function to convert a linear model's raw prediction ($y'$) into a value between 0 and 1. You can interpret the value between 0 and 1 in either of the following two ways: |

logits | The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class. |
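
The step from logits to normalized probabilities can be sketched with a plain-Python softmax. Subtracting the largest logit before exponentiating is a common numerical-stability choice assumed here, not something the definition requires; the sample logits are made up.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)
    # Shifting by the largest logit avoids overflow and leaves the result unchanged.
    exponentials = [math.exp(z - largest) for z in logits]
    total = sum(exponentials)
    return [e / total for e in exponentials]


probabilities = softmax([2.0, 1.0, 0.1])  # hypothetical logits for three classes
print(probabilities)
print(softmax([0.0, 0.0]))  # [0.5, 0.5]
```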

Log Loss | The loss function used in binary logistic regression . |

log-odds | The logarithm of the odds of some event. |

loss | A measure of how far a model's predictions are from its label . Or, to phrase it more pessimistically, a measure of how bad the model is. To determine this value, a model must define a loss function. For example, linear regression models typically use mean squared error for a loss function, while logistic regression models use Log Loss . |

loss curve | A graph of loss as a function of training iterations . For example: |

loss surface | A graph of weight(s) vs. loss. Gradient descent aims to find the weight(s) for which the loss surface is at a local minimum. |

machine learning | A program or system that builds (trains) a predictive model from input data. The system uses the learned model to make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model. Machine learning also refers to the field of study concerned with these programs or systems. |

majority class | The more common label in a class-imbalanced dataset . For example, given a dataset containing 99% non-spam labels and 1% spam labels, the non-spam labels are the majority class. |

matplotlib | An open-source Python 2D plotting library. matplotlib helps you visualize different aspects of machine learning. |

Mean Absolute Error (MAE) | An error metric calculated by taking an average of absolute errors. In the context of evaluating a model’s accuracy, MAE is the average absolute difference between the expected and predicted values across all training examples. Specifically, for $n$ examples, for each value $y$ and its prediction $\hat{y}$, MAE is defined as follows: $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$ |
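
The MAE definition can be sketched in a few lines of plain Python. The function name `mean_absolute_error` and the sample values are illustrative assumptions.

```python
def mean_absolute_error(labels, predictions):
    """Average of the absolute differences between labels and predictions."""
    errors = [abs(y - y_hat) for y, y_hat in zip(labels, predictions)]
    return sum(errors) / len(errors)


# Hypothetical labels and predictions: absolute errors are 1, 0, and 2.
mae = mean_absolute_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```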

Mean Squared Error (MSE) | The average squared loss per example. MSE is calculated by dividing the sum of the squared losses by the number of examples. The values that TensorFlow Playground displays for "Training loss" and "Test loss" are MSE. |

Metrics API (tf.metrics) | A TensorFlow API for evaluating models. For example, tf.metrics.accuracy determines how often a model's predictions match labels. When writing a custom Estimator , you invoke Metrics API functions to specify how your model should be evaluated. |

mini-batch | A small, randomly selected subset of the entire batch of examples run together in a single iteration of training or inference. The batch size of a mini-batch is usually between 10 and 1,000. It is much more efficient to calculate the loss on a mini-batch than on the full training data. |

mini-batch stochastic gradient descent | A gradient descent algorithm that uses mini-batches . In other words, mini-batch stochastic gradient descent estimates the gradient based on a small subset of the training data. Regular stochastic gradient descent uses a mini-batch of size 1. |
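
A minimal sketch of mini-batch stochastic gradient descent, assuming a one-weight model $y = wx$ trained on mean squared error. The function name, the synthetic data, and the hyperparameter values are all assumptions chosen for illustration. Setting `batch_size=1` gives regular stochastic gradient descent.

```python
import random


def minibatch_sgd(xs, ys, batch_size=4, learning_rate=0.05, epochs=50, seed=0):
    """Fit y ≈ w * x by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # draw new random mini-batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of MSE with respect to w, estimated from this batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


# Synthetic, noise-free data generated from y = 3x.
xs = [i / 10 for i in range(1, 21)]
ys = [3.0 * x for x in xs]
w = minibatch_sgd(xs, ys)
```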

minimax loss | A loss function for generative adversarial networks , based on the cross-entropy between the distribution of generated data and real data. |

minority class | The less common label in a class-imbalanced dataset . For example, given a dataset containing 99% non-spam labels and 1% spam labels, the spam labels are the minority class. |

ML | Abbreviation for machine learning . |

model | The representation of what a machine learning system has learned from the training data. Within TensorFlow, model is an overloaded term, which can have either of the following two related meanings: |

model capacity | The complexity of problems that a model can learn. The more complex the problems that a model can learn, the higher the model’s capacity. A model’s capacity typically increases with the number of model parameters. For a formal definition of classifier capacity, see VC dimension . |

model training | The process of determining the best model . |

Momentum | A sophisticated gradient descent algorithm in which a learning step depends not only on the derivative in the current step, but also on the derivatives of the step(s) that immediately preceded it. Momentum involves computing an exponentially weighted moving average of the gradients over time, analogous to momentum in physics. Momentum sometimes prevents learning from getting stuck in local minima. |
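
One common way to write the momentum update is sketched below, assuming the exponentially weighted average form with a decay factor `beta`. The function name, the default values, and the toy objective $f(w) = w^2$ are assumptions for illustration.

```python
def momentum_step(weight, velocity, gradient, learning_rate=0.1, beta=0.9):
    """One update in which velocity is a moving average of past gradients."""
    velocity = beta * velocity + (1 - beta) * gradient
    weight = weight - learning_rate * velocity
    return weight, velocity


# Minimize f(w) = w**2, whose derivative is 2 * w, starting from w = 5.
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, gradient=2 * w)
```

Because each step reuses the previous velocity, the update direction depends on the derivatives from earlier steps as well as the current one.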

multi-class classification | Classification problems that distinguish among more than two classes. For example, there are approximately 128 species of maple trees, so a model that categorized maple tree species would be multi-class. Conversely, a model that divided emails into only two categories (spam and not spam) would be a binary classification model. |

multi-class logistic regression | Using logistic regression in multi-class classification problems. |

multinomial classification | Synonym for multi-class classification . |

NaN trap | When one number in your model becomes a NaN during training, which causes many or all other numbers in your model to eventually become a NaN. |

natural language understanding | Determining a user's intentions based on what the user typed or said. For example, a search engine uses natural language understanding to determine what the user is searching for based on what the user typed or said. |

negative class | In binary classification , one class is termed positive and the other is termed negative. The positive class is the thing we're looking for and the negative class is the other possibility. For example, the negative class in a medical test might be "not tumor." The negative class in an email classifier might be "not spam." See also positive class . |

neural network | A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden ) consisting of simple connected units or neurons followed by nonlinearities. |

neuron | A node in a neural network , typically taking in multiple input values and generating one output value. The neuron calculates the output value by applying an activation function (nonlinear transformation) to a weighted sum of input values. |
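
A single neuron can be sketched as a weighted sum followed by an activation function. The `bias` term and the choice of ReLU as the default activation are assumptions added for this example.

```python
def relu(x):
    """Rectified Linear Unit: 0 for non-positive input, the input otherwise."""
    return max(0.0, x)


def neuron(inputs, weights, bias=0.0, activation=relu):
    """Apply an activation function to the weighted sum of the inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Hypothetical inputs and weights.
active = neuron([1.0, 2.0], [0.5, 1.0], bias=0.25)
inactive = neuron([1.0, 2.0], [0.5, -1.0], bias=0.25)
```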

NLU | Abbreviation for natural language understanding . |

node (neural network) | A neuron in a hidden layer . |

noise | Broadly speaking, anything that obscures the signal in a dataset. Noise can be introduced into data in a variety of ways. For example: |

normalization | The process of converting an actual range of values into a standard range of values, typically -1 to +1 or 0 to 1. For example, suppose the natural range of a certain feature is 800 to 6,000. Through subtraction and division, you can normalize those values into the range -1 to +1. |
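
The subtraction and division mentioned above can be sketched as a linear mapping. The function name `normalize` is an assumption. The range 800 to 6,000 comes from the example in the definition.

```python
def normalize(value, lo, hi):
    """Linearly map a value from the range [lo, hi] to the range [-1, +1]."""
    return 2 * (value - lo) / (hi - lo) - 1


low_end = normalize(800, 800, 6000)
midpoint = normalize(3400, 800, 6000)
high_end = normalize(6000, 800, 6000)
```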

numerical data | Features represented as integers or real-valued numbers. For example, in a real estate model, you would probably represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as numerical data indicates that the feature's values have a mathematical relationship to each other and possibly to the label. For example, representing the size of a house as numerical data indicates that a 200 square-meter house is twice as large as a 100 square-meter house. Furthermore, the number of square meters in a house probably has some mathematical relationship to the price of the house. |

NumPy | An open-source math library that provides efficient array operations in Python. pandas is built on NumPy. |

objective | A metric that your algorithm is trying to optimize. |

objective function | The mathematical formula or metric that a model aims to optimize. For example, the objective function for linear regression is usually squared loss . Therefore, when training a linear regression model, the goal is to minimize squared loss. |

offline inference | Generating a group of predictions , storing those predictions, and then retrieving those predictions on demand. Contrast with online inference . |

one-hot encoding | A sparse vector in which one element is set to 1 and all other elements are set to 0. |
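
A minimal sketch of building a one-hot vector, assuming the category has already been mapped to an integer index. The function name `one_hot` and the three-word vocabulary are hypothetical.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index must be in the range [0, size)")
    vector = [0] * size
    vector[index] = 1
    return vector


# Hypothetical vocabulary of three categories.
vocabulary = ["animal", "vegetable", "mineral"]
encoded = one_hot(vocabulary.index("vegetable"), len(vocabulary))
```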

one-shot learning | A machine learning approach, often used for object classification, designed to learn effective classifiers from a single training example. |

one-vs.-all | Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers, one binary classifier for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classifiers: animal vs. not animal, vegetable vs. not vegetable, and mineral vs. not mineral. |

online inference | Generating predictions on demand. Contrast with offline inference . |

optimizer | A specific implementation of the gradient descent algorithm. TensorFlow's base class for optimizers is tf.train.Optimizer . Popular optimizers include: |

outliers | Values distant from most other values. In machine learning, any of the following are outliers: |

output layer | The "final" layer of a neural network. The layer containing the answer(s). |

overfitting | Creating a model that matches the training data so closely that the model fails to make correct predictions on new data. |

pandas | A column-oriented data analysis API. Many machine learning frameworks, including TensorFlow, support pandas data structures as input. See the pandas documentation for details. |

parameter | A variable of a model that the machine learning system trains on its own. For example, weights are parameters whose values the machine learning system gradually learns through successive training iterations. Contrast with hyperparameter . |

parameter update | The operation of adjusting a model's parameters during training, typically within a single iteration of gradient descent . |

partial derivative | A derivative in which all but one of the variables is considered a constant. For example, the partial derivative of f(x, y) with respect to x is the derivative of f considered as a function of x alone (that is, keeping y constant). The partial derivative of f with respect to x focuses only on how x is changing and ignores all other variables in the equation. |
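
A partial derivative can be approximated numerically by nudging one variable while holding the other constant. The sketch below uses a central difference. The function name `partial_x`, the step size `h`, and the sample function are assumptions for illustration.

```python
def partial_x(f, x, y, h=1e-6):
    """Approximate the partial derivative of f(x, y) with respect to x."""
    # Only x is perturbed; y is held constant.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)


def f(x, y):
    return x ** 2 * y + y ** 3


# Analytically, the partial derivative with respect to x is 2 * x * y.
approx = partial_x(f, 3.0, 2.0)
```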

partitioning strategy | The algorithm by which variables are divided across parameter servers . |

perceptron | A system (either hardware or software) that takes in one or more input values, runs a function on the weighted sum of the inputs, and computes a single output value. In machine learning, the function is typically nonlinear, such as ReLU , sigmoid , or tanh. For example, the following perceptron relies on the sigmoid function to process three input values: |

performance | Overloaded term with the following meanings: |

perplexity | One measure of how well a model is accomplishing its task. For example, suppose your task is to read the first few letters of a word a user is typing on a smartphone keyboard, and to offer a list of possible completion words. Perplexity, P, for this task is approximately the number of guesses you need to offer in order for your list to contain the actual word the user is trying to type. |

pipeline | The infrastructure surrounding a machine learning algorithm. A pipeline includes gathering the data, putting the data into training data files, training one or more models, and exporting the models to production. |

positive class | In binary classification , the two possible classes are labeled as positive and negative. The positive outcome is the thing we're testing for. (Admittedly, we're simultaneously testing for both outcomes, but play along.) For example, the positive class in a medical test might be "tumor." The positive class in an email classifier might be "spam." |

post-processing | Processing the output of a model after the model has been run. |

PR AUC (area under the PR curve) | Area under the interpolated precision-recall curve , obtained by plotting (recall, precision) points for different values of the classification threshold . Depending on how it's calculated, PR AUC may be equivalent to the average precision of the model. |

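precision | A metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class. That is: $\text{Precision} = \frac{TP}{TP + FP}$, where $TP$ is the number of true positives and $FP$ is the number of false positives. |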

precision-recall curve | A curve of precision vs. recall at different classification thresholds . |

prediction | A model's output when provided with an input example . |

prediction bias | A value indicating how far apart the average of predictions is from the average of labels in the dataset. |

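preprocessing | Processing data before it is used to train a model, for example by removing or re-expressing attributes that are correlated with sensitive attributes. |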

pre-trained model | Models or model components (such as embeddings) that have already been trained. Sometimes, you'll feed pre-trained embeddings into a neural network. Other times, your model will train the embeddings itself rather than rely on the pre-trained embeddings. |

prior belief | What you believe about the data before you begin training on it. For example, L2 regularization relies on a prior belief that weights should be small and normally distributed around zero. |

proxy (sensitive attributes) | An attribute used as a stand-in for a sensitive attribute. |

proxy labels | Data used to approximate labels not directly available in a dataset. |

quantile | Each bucket in quantile bucketing . |

quantile bucketing | Distributing a feature's values into buckets so that each bucket contains the same (or almost the same) number of examples. For example, the following figure divides 44 points into 4 buckets, each of which contains 11 points. In order for each bucket in the figure to contain the same number of points, some buckets span a different width of x-values. |
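
The idea can be sketched by sorting the values and cutting the sorted list into equal-count slices. The function name `quantile_buckets` and the skewed sample data are assumptions. The 44 points and 4 buckets mirror the example in the definition.

```python
def quantile_buckets(values, num_buckets):
    """Split values into buckets that hold the same (or almost the same) count."""
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for b in range(num_buckets):
        start = b * n // num_buckets
        end = (b + 1) * n // num_buckets
        buckets.append(ordered[start:end])
    return buckets


# 44 hypothetical points with a skewed distribution.
points = [i * i for i in range(44)]
buckets = quantile_buckets(points, 4)
sizes = [len(bucket) for bucket in buckets]
widths = [bucket[-1] - bucket[0] for bucket in buckets]
```

Every bucket holds 11 points, but the buckets cover ranges of different widths because the points are not evenly spread.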

quantization | An algorithm that implements quantile bucketing on a particular feature in a dataset . |

random forest | An ensemble approach to finding the decision tree that best fits the training data by creating many decision trees and then determining the "average" one. The "random" part of the term refers to building each of the decision trees from a random selection of features; the "forest" refers to the set of decision trees. |

rank (ordinality) | The ordinal position of a class in a machine learning problem that categorizes classes from highest to lowest. For example, a behavior ranking system could rank a dog's rewards from highest (a steak) to lowest (wilted kale). |

rater | A human who provides labels in examples . Sometimes called an "annotator." |

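recall | A metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify? That is: $\text{Recall} = \frac{TP}{TP + FN}$, where $TP$ is the number of true positives and $FN$ is the number of false negatives. |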
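
Precision and recall can both be computed from counts of true positives, false positives, and false negatives. The sketch below assumes binary labels encoded as 0 and 1. The function name and the sample data are illustrative assumptions.

```python
def precision_recall(labels, predictions):
    """Return (precision, recall) for binary labels and predictions (0 or 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    # Return 0.0 when a denominator is zero to avoid dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: 2 true positives, 1 false positive, 2 false negatives.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall = precision_recall(labels, predictions)
```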

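Rectified Linear Unit (ReLU) | An activation function with the following rules: if the input is negative or zero, the output is 0; if the input is positive, the output is equal to the input. |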

regression model | A type of model that outputs continuous (typically, floating-point) values. Compare with classification models , which output discrete values, such as "day lily" or "tiger lily." |

regularization | The penalty on a model's complexity. Regularization helps prevent overfitting. Different kinds of regularization include L1 regularization, L2 regularization, dropout regularization, and early stopping. |

regularization rate | A scalar value, represented as lambda ($\lambda$), specifying the relative importance of the regularization function. The following simplified loss equation shows the regularization rate's influence: $\text{minimize}(\text{loss function} + \lambda(\text{regularization function}))$ |

representation | The process of mapping data to useful features . |

ridge regularization | Synonym for L2 regularization. The term ridge regularization is more frequently used in pure statistics contexts, whereas L2 regularization is used more often in machine learning. |

ROC (receiver operating characteristic) Curve | A curve of true positive rate vs. false positive rate at different classification thresholds . See also AUC . |

Root Mean Squared Error (RMSE) | The square root of the Mean Squared Error . |
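
As a small sketch, RMSE can be computed by taking the square root of MSE. The function names and the sample values are assumptions for illustration.

```python
import math


def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return sum(squared) / len(squared)


def root_mean_squared_error(labels, predictions):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(labels, predictions))


# Hypothetical data: the only nonzero error is 3, so the squared losses sum to 9.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 6.0])
rmse = root_mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 6.0])
```

Unlike MSE, RMSE is expressed in the same units as the label.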

scalar | A single number or a single string that can be represented as a tensor of rank 0. For example, the following lines of code each create one scalar in TensorFlow: |

scaling | A commonly used practice in feature engineering to tame a feature's range of values to match the range of other features in the dataset. For example, suppose that you want all floating-point features in the dataset to have a range of 0 to 1. Given a particular feature's range of 0 to 500, you could scale that feature by dividing each value by 500. |

scikit-learn | A popular open-source machine learning platform. See www.scikit-learn.org. |

semi-supervised learning | Training a model on data where some of the training examples have labels but others don’t. One technique for semi-supervised learning is to infer labels for the unlabeled examples, and then to train on the inferred labels to create a new model. Semi-supervised learning can be useful if labels are expensive to obtain but unlabeled examples are plentiful. |

sentiment analysis | Using statistical or machine learning algorithms to determine a group's overall attitude—positive or negative—toward a service, product, organization, or topic. For example, using natural language understanding , an algorithm could perform sentiment analysis on the textual feedback from a university course to determine the degree to which students generally liked or disliked the course. |

serving | A synonym for inferring . |

shape (Tensor) | The number of elements in each dimension of a tensor. The shape is represented as a list of integers. For example, the following two-dimensional tensor has a shape of [3,4]: |

sigmoid function | A function that maps logistic or multinomial regression output (log odds) to probabilities, returning a value between 0 and 1. The sigmoid function has the following formula: $y = \frac{1}{1 + e^{-z}}$, where $z$ is the regression output (the log-odds). |
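
The sketch below implements the sigmoid function together with its inverse, the log-odds, to show that one undoes the other. The function names are assumptions. The two-branch form of `sigmoid` is a common trick to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite the formula so math.exp cannot overflow.
    e = math.exp(z)
    return e / (1.0 + e)


def log_odds(p):
    """Return the logarithm of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Odds of 9 to 1 correspond to a probability of 0.9.
probability = sigmoid(math.log(9))
```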

softmax | A function that provides probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0. For example, softmax might determine that the probability of a particular image being a dog is 0.9, a cat 0.08, and a horse 0.02. (Also called full softmax.) |
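
A minimal softmax sketch in plain Python, assuming the input is a list of logits with one value per class. The function name and the sample logits are illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert a list of raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the largest logit improves numerical stability
    # and leaves the mathematical result unchanged.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three classes: dog, cat, horse.
probs = softmax([4.0, 1.6, 0.2])
```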

sparse feature | Feature vector whose values are predominately zero or empty. For example, a vector containing a single 1 value and a million 0 values is sparse. As another example, words in a search query could also be a sparse feature—there are many possible words in a given language, but only a few of them occur in a given query. |

sparse representation | A representation of a tensor that only stores nonzero elements. |

sparse vector | A vector whose values are mostly zeroes. See also sparse feature . |

sparsity | The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that vector or matrix. For example, consider a 10x10 matrix in which 98 cells contain zero. The calculation of sparsity is as follows: $\text{sparsity} = \frac{98}{100} = 0.98$ |
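
The calculation can be sketched for a matrix stored as a list of rows. The function name `sparsity` and the two nonzero values are assumptions. The 10x10 matrix with 98 zero cells mirrors the example in the definition.

```python
def sparsity(matrix):
    """Fraction of entries in a list-of-rows matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


# A 10x10 matrix in which 98 cells contain zero.
matrix = [[0] * 10 for _ in range(10)]
matrix[0][0] = 7
matrix[5][3] = 2
result = sparsity(matrix)
```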

squared hinge loss | The square of the hinge loss . Squared hinge loss penalizes outliers more harshly than regular hinge loss. |

squared loss | The loss function used in linear regression. (Also known as L2 Loss.) This function calculates the square of the difference between a model's predicted value for a labeled example and the actual value of the label. Due to squaring, this loss function amplifies the influence of bad predictions. That is, squared loss reacts more strongly to outliers than L1 loss. |

static model | A model that is trained offline. |

stationarity | A property of data in a dataset, in which the data distribution stays constant across one or more dimensions. Most commonly, that dimension is time, meaning that data exhibiting stationarity doesn't change over time. For example, data that exhibits stationarity doesn't change from September to December. |

step | A forward and backward evaluation of one batch . |

step size | Synonym for learning rate . |

stochastic gradient descent (SGD) | A gradient descent algorithm in which the batch size is one. In other words, SGD relies on a single example chosen uniformly at random from a dataset to calculate an estimate of the gradient at each step. |

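structural risk minimization (SRM) | An algorithm that balances two goals: building the most predictive model (for example, lowest loss) and keeping the model as simple as possible (for example, strong regularization). |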

supervised machine learning | Training a model from input data and its corresponding labels . Supervised machine learning is analogous to a student learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, the student can then provide answers to new (never-before-seen) questions on the same topic. Compare with unsupervised machine learning . |

synthetic feature | A feature not present among the input features, but created from one or more of them. Kinds of synthetic features include: |

target | Synonym for label . |

temporal data | Data recorded at different points in time. For example, winter coat sales recorded for each day of the year would be temporal data. |

test set | The subset of the dataset that you use to test your model after the model has gone through initial vetting by the validation set. |

tower | A component of a deep neural network that is itself a deep neural network without an output layer. Typically, each tower reads from an independent data source. Towers are independent until their output is combined in a final layer. |

training | The process of determining the ideal parameters comprising a model. |

training set | The subset of the dataset used to train a model. |

transfer learning | Transferring information from one machine learning task to another. For example, in multi-task learning, a single model solves multiple tasks, such as a deep model that has different output nodes for different tasks. Transfer learning might involve transferring knowledge from the solution of a simpler task to a more complex one, or involve transferring knowledge from a task where there is more data to one where there is less data. |

true negative (TN) | An example in which the model correctly predicted the negative class . For example, the model inferred that a particular email message was not spam, and that email message really was not spam. |

true positive (TP) | An example in which the model correctly predicted the positive class . For example, the model inferred that a particular email message was spam, and that email message really was spam. |

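true positive rate (TPR) | Synonym for recall. That is: $\text{TPR} = \frac{TP}{TP + FN}$, where $TP$ is the number of true positives and $FN$ is the number of false negatives. |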

underfitting | Producing a model with poor predictive ability because the model hasn't captured the complexity of the training data. Many problems can cause underfitting, including: |

unlabeled example | An example that contains features but no label . Unlabeled examples are the input to inference . In semi-supervised and unsupervised learning, unlabeled examples are used during training. |

upweighting | Applying a weight to the downsampled class equal to the factor by which you downsampled. |

validation | A process used, as part of training , to evaluate the quality of a machine learning model using the validation set . Because the validation set is disjoint from the training set, validation helps ensure that the model’s performance generalizes beyond the training set. |

validation set | A subset of the dataset—disjoint from the training set—used in validation . |

Wasserstein loss | One of the loss functions commonly used in generative adversarial networks , based on the earth-mover's distance between the distribution of generated data and real data. |

weight | A coefficient for a feature in a linear model, or an edge in a deep network. The goal of training a linear model is to determine the ideal weight for each feature. If a weight is 0, then its corresponding feature does not contribute to the model. |

wide model | A linear model that typically has many sparse input features . We refer to it as "wide" since such a model is a special type of neural network with a large number of inputs that connect directly to the output node. Wide models are often easier to debug and inspect than deep models. Although wide models cannot express nonlinearities through hidden layers , they can use transformations such as feature crossing and bucketization to model nonlinearities in different ways. |

width | The number of neurons in a particular layer of a neural network . |