
Overfitting
Overfitting is a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points. It generally takes the form of an overly complex model built to explain idiosyncrasies in the data under study. The opposite problem, underfitting, occurs when a model is too simple, with too few features and too little data to be effective. Ways to prevent overfitting include cross-validation, in which the training data is split into folds and the model is run on each fold; ensembling, in which predictions from two or more separate models are combined; data augmentation, in which the available data set is made more diverse; and data simplification, in which the model is streamlined.

What Is Overfitting?
Overfitting is a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points. As a result, the model is useful in reference only to its initial data set, and not to any other data sets.
Overfitting the model generally takes the form of making an overly complex model to explain idiosyncrasies in the data under study. In reality, the data often studied has some degree of error or random noise within it. Thus, attempting to make the model conform too closely to slightly inaccurate data can infect the model with substantial errors and reduce its predictive power.
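The effect can be sketched in a few lines of plain Python (a hypothetical illustration, not a financial model): a one-nearest-neighbor predictor memorizes its training data, noise included, so it scores a perfect zero error in-sample but loses to a simple fitted line on new data drawn from the same process.

```python
import random

random.seed(0)

# Synthetic data: the true relationship is y = 3x, plus random noise.
def make_data(n):
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [3 * x + random.gauss(0, 0.5) for x in xs]
    return xs, ys

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

# Simple model: ordinary least-squares line fit (two parameters).
n = len(x_train)
mean_x = sum(x_train) / n
mean_y = sum(y_train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
         / sum((x - mean_x) ** 2 for x in x_train))
intercept = mean_y - slope * mean_x

def line_predict(x):
    return slope * x + intercept

# Overfit model: 1-nearest-neighbor, which memorizes the training set,
# noise and all.
def nn_predict(x):
    return min(zip(x_train, y_train), key=lambda p: abs(p[0] - x))[1]

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

print("train MSE  line:", mse(line_predict, x_train, y_train))
print("train MSE  1-NN:", mse(nn_predict, x_train, y_train))  # exactly 0: memorized
print("test  MSE  line:", mse(line_predict, x_test, y_test))
print("test  MSE  1-NN:", mse(nn_predict, x_test, y_test))    # roughly double the line's
```

The memorizing model "explains" the random noise in the training set, which is exactly the error the definition above describes: its apparent accuracy does not carry over to any other data set.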

Understanding Overfitting
A common problem, for instance, is using computer algorithms to search extensive databases of historical market data for patterns. Given enough study, it is often possible to develop elaborate rules that appear to predict stock market returns with close accuracy.
When applied to data outside the sample, however, such rules often prove to be merely the overfitting of a model to what were in reality chance occurrences. In all cases, it is important to test a model against data outside the sample used to develop it.
How to Prevent Overfitting
Ways to prevent overfitting include cross-validation, in which the data used to train the model is split into folds or partitions and the model is run on each fold; the error estimates are then averaged into an overall figure. Other methods include ensembling, in which predictions from two or more separate models are combined; data augmentation, in which the available data set is made more diverse; and data simplification, in which the model is streamlined to avoid overfitting.
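The cross-validation procedure described above can be sketched in plain Python, with a toy line-fitting model standing in for a real one (all names here are illustrative): the data is split into k folds, the model is fit k times with one fold held out, and the held-out errors are averaged into a single estimate.

```python
import random

random.seed(1)

# Toy data: y = 2x plus random noise.
xs = [random.uniform(0, 10) for _ in range(50)]
data = [(x, 2 * x + random.gauss(0, 1)) for x in xs]

def fit_line(points):
    # Ordinary least-squares slope and intercept.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx

def k_fold_cv(points, k=5):
    random.shuffle(points)
    folds = [points[i::k] for i in range(k)]  # k roughly equal partitions
    errors = []
    for i in range(k):
        held_out = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        slope, intercept = fit_line(train)
        mse = sum((slope * x + intercept - y) ** 2
                  for x, y in held_out) / len(held_out)
        errors.append(mse)
    return sum(errors) / k  # overall error estimate, averaged across folds

print("5-fold CV error estimate:", k_fold_cv(list(data)))
```

Because every point is held out exactly once, the averaged estimate reflects performance on data the model did not see, which is what makes cross-validation a guard against overfitting.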
Financial professionals must always be aware of the dangers of overfitting or underfitting a model based on limited data. The ideal model should be balanced.
Overfitting in Machine Learning
Overfitting is also a factor in machine learning. It may emerge when a model has been trained to scan for specific patterns in one data set but produces incorrect results when the same process is applied to a new set of data. This stems from errors in the model that was built, which likely shows low bias and high variance. The model may have had redundant or overlapping features, making it needlessly complicated and therefore ineffective.
Overfitting vs. Underfitting
A model that is overfitted may be too complicated, making it ineffective. But a model can also be underfitted, meaning it is too simple, with too few features and too little data to build an effective model. An overfit model has low bias and high variance, while an underfit model is the opposite: it has high bias and low variance. Adding more features to a too-simple model can help reduce bias.
Overfitting Example
For example, a university that is seeing a college dropout rate that is higher than what it would like decides it wants to create a model to predict the likelihood that an applicant will make it all the way through to graduation.
To do this, the university trains a model on a data set of 5,000 applicants and their outcomes. It then runs the model on the original data set of 5,000 applicants, and the model predicts the outcomes with 98% accuracy. But to test the model's accuracy, the university also runs it on a second data set of 5,000 more applicants. This time, the model is only 50% accurate, because it was fit too closely to the first 5,000 applications.
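The pattern in this example can be mimicked with a toy sketch in plain Python (purely hypothetical data: one numeric feature per applicant, and graduation outcomes that are pure chance, so there is no real pattern to learn). A model that memorizes the training applicants scores perfectly on them but does no better than a coin flip on new applicants:

```python
import random

random.seed(2)

# Hypothetical stand-in for admissions data: each "applicant" is one
# numeric feature paired with a random 0/1 graduation outcome.
def make_applicants(n):
    return [(random.uniform(0, 100), random.randint(0, 1)) for _ in range(n)]

train = make_applicants(500)
fresh = make_applicants(500)

# Overfit "model": 1-nearest-neighbor, which simply memorizes the
# outcome of the most similar training applicant.
def predict(feature):
    return min(train, key=lambda p: abs(p[0] - feature))[1]

def accuracy(dataset):
    return sum(predict(f) == outcome for f, outcome in dataset) / len(dataset)

print("accuracy on training applicants:", accuracy(train))  # 1.0: memorized
print("accuracy on new applicants:", accuracy(fresh))       # near 0.5: a coin flip
```

The gap between in-sample and out-of-sample accuracy, not the in-sample number itself, is what reveals the overfitting.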
Related terms:
Business Valuation: Methods & Examples
Business valuation is the process of estimating the value of a business or company.
Heteroskedasticity
In statistics, heteroskedasticity happens when the standard deviations of a variable, monitored over a specific amount of time, are nonconstant.
Model Risk
Model risk occurs when a financial model used to measure a firm's market risks or value transactions fails or performs inadequately.
Monte Carlo Simulation
Monte Carlo simulations are used to model the probability of different outcomes in a process that cannot easily be predicted.
Predictive Analytics
Predictive analytics is the use of statistics and modeling techniques to determine future performance based on current and historical data.
Residual Standard Deviation
The residual standard deviation describes the difference in standard deviations of observed values versus predicted values in a regression analysis.
Residual Sum of Squares (RSS)
The residual sum of squares (RSS) is a statistical technique used to measure the variance in a data set that is not explained by the regression model.
Stock Market
The stock market consists of exchanges or OTC markets in which shares and other financial securities of publicly held companies are issued and traded.