The basic principle is the same as the ${\chi}^2$ goodness-of-fit test.
* Between categorical variables
The standard approach is to compute expected counts and look at the
distribution of the sum of squared differences between the expected counts and the
observed counts (normalized by the expected counts).
* Between numerical variables
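For the categorical case, here's a minimal sketch of computing expected counts from a contingency table and the ${\chi}^2$ statistic (the function name and the toy table are made up for illustration):

```python
import numpy as np

def chi2_statistic(observed):
    """Chi-squared statistic for a contingency table:
    sum of (observed - expected)^2 / expected."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # expected counts under independence of the two variables
    expected = row_totals * col_totals / observed.sum()
    return ((observed - expected) ** 2 / expected).sum()

# A table whose rows are proportional looks independent,
# so the statistic comes out 0.
print(chi2_statistic([[10, 20], [30, 60]]))  # 0.0
```

The larger the statistic, the stronger the evidence against independence of the two categorical variables.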
In our earlier posts, here and here, we found to our dismay that our natural inclination to pick the top mutual fund performers of the past 1 and 3 years hasn't worked out too well.
That leaves us with the obvious question:
What actually goes wrong when we pick the top funds of the past few years?
Below is a representation of the best performing sectors year over year. What do you notice?
Sector performance varies significantly from year to year, and the top and bottom sectors change dramatically almost every year.
Sample this:
One of the standard problems in ML with meta-modelling algorithms (algorithms that run multiple statistical models over the given data and identify the best-fitting model, e.g. random forest, or the rarely practical genetic algorithm) is that they might favour overly complex models that overfit the given training data but perform poorly on live/test data.
The way these meta-modelling algorithms work is that they have an objective function (usually the RMS error of the stats/sub-model on the data) and pick whichever model yields the lowest value of it. So we can just add a complexity penalty (one obvious idea is the degree of the polynomial the model uses to fit, but how does that work when comparing against exponential functions?) and the objective function becomes RMS(Error) + Complexity_penalty(model).
Now, with the right choice of error function and complexity penalty, this can find models that perform worse than more complex models on the training data, but perform better in the live scenario.
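Here's a small sketch of that penalized objective, using polynomial degree as the complexity measure (the function name, the per-degree penalty weight, and the toy data are all illustrative choices, not recommendations):

```python
import numpy as np

def select_model(x, y, max_degree=6, penalty_weight=0.05):
    """Pick the polynomial degree minimizing
    RMS(Error) + Complexity_penalty(model),
    where the penalty here is penalty_weight * degree."""
    best_score, best_degree, best_coeffs = None, None, None
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, degree)
        rmse = np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
        score = rmse + penalty_weight * degree
        if best_score is None or score < best_score:
            best_score, best_degree, best_coeffs = score, degree, coeffs
    return best_degree, best_coeffs

# A quadratic signal: degrees above 2 barely reduce the error,
# so the penalty steers selection to degree 2.
x = np.linspace(0, 1, 30)
y = 1 + 2 * x + 3 * x ** 2
degree, _ = select_model(x, y)
print(degree)  # 2
```

Without the penalty term, the highest degree tried would always win on training error alone.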
The idea of a complexity penalty itself is not new. I don't dare say ML borrowed it from scientific experimentation methods or some such, but the idea that a more complex theory or model should be penalized relative to a simpler one is very old. Here's a better-written post on it.
Related Post: https://softwaremechanic.wordpress.com/2016/08/12/bayesians-vs-frequentistsaka-sampling-theorists/
Okay, just kidding; while that's kinda true, I was just pranking y'all. What I want to
talk about is a stats/math/machine learning method used when trying to find clusters in a
given dataset. The [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering))
is basically a measure/method for interpretation and validation of the consistency of a cluster analysis.
Ugh.. the original sentence on Wikipedia is so long, with all its 10-letter words, that I couldn't
even type it again. (The attempt above was simplified while typing on the fly.)
The basic issue is that, during a cluster analysis, we need to settle on a few things:
* A measure of distance within, across and between clusters and the points in them
The elbow method is a visual method for one of these choices: picking the number of clusters. Basically, it's a
ratio: the between-cluster variance divided by the overall variance. So it tells how much (or
what %) of the total variance is explained by choosing "n" clusters.
The name "elbow method" comes from visually plotting the number of clusters vs. this ratio (% of
variance explained) and finding the point where there's an acute bend (with the number of clusters
on the X-axis), then picking the number of clusters at that point.
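A rough sketch of that plot's ingredients, with a toy k-means (the function names, data, and fixed seed are all illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; a minimal sketch, not production code."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers

def variance_explained(X, labels, centers):
    """Fraction of total variance explained by the clustering:
    1 - (within-cluster sum of squares / total sum of squares)."""
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    within_ss = sum(((X[labels == j] - c) ** 2).sum()
                    for j, c in enumerate(centers))
    return 1.0 - within_ss / total_ss

# Two well-separated blobs: the ratio jumps sharply at k=2,
# then flattens out -- that's the elbow.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(10, 1, size=(20, 2))])
for k in range(1, 5):
    labels, centers = kmeans(X, k)
    print(k, variance_explained(X, labels, centers))
```

Plotting k against `variance_explained` and eyeballing where the curve bends gives the elbow.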
An F-test is any statistical test that uses the F-distribution.
It is often used when comparing statistical models that have been fitted to a data set.. Ahh.. that
sounds no different from the F-score then.. Maybe just different
fields (statistics and machine learning) have different naming conventions?? Anyway, two different
F-words.. So let's just say F-score/test?? Why two names for the same thing, and move on…
Typical null hypotheses:
* The means of a given set of normally distributed populations, all having the same standard deviation, are equal (used in ANOVA).
* A proposed regression model fits the data well.
* A data set in a regression analysis follows the simpler of two proposed linear models that are nested within each other.
It (the non-regression type) is also used as a test of homoskedasticity.
Formula: $F = \frac{\text{explained variance}}{\text{unexplained variance}}$ or $F = \frac{\text{between-group variability}}{\text{within-group variability}}$. Ok, that doesn't sound like the F-score.
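For the ANOVA-style F-test, here's a minimal numpy sketch (the function name and toy groups are illustrative):

```python
import numpy as np

def anova_f(groups):
    """One-way ANOVA F statistic:
    between-group variability / within-group variability."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Groups with clearly different means give a large F;
# near-identical means give a tiny one.
print(anova_f([[1, 2, 3], [11, 12, 13]]))   # 150.0
print(anova_f([[1, 2, 3], [1.1, 2.1, 3.1]]))  # ~0.015
```

Under the null hypothesis (all group means equal), this statistic follows an F-distribution with (k-1, n-k) degrees of freedom.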
Formula (for regression models): $F = \frac{(RSS_1 - RSS_2)/(p_2 - p_1)}{RSS_2/(n - p_2)}$, where $RSS_i$ is the residual sum of squares of model $i$, $p_i$ the number of parameters in it, and $n$ the number of data points.
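For the regression-model comparison, here's a sketch that fits two nested linear models by least squares and computes the F statistic (all names and data here are illustrative):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(((y - X @ beta) ** 2).sum())

def nested_f(rss_simple, p_simple, rss_full, p_full, n):
    """F statistic for comparing nested regression models:
    ((RSS1 - RSS2) / (p2 - p1)) / (RSS2 / (n - p2))."""
    return (((rss_simple - rss_full) / (p_full - p_simple))
            / (rss_full / (n - p_full)))

# Data with real curvature: the quadratic term earns its keep,
# so the F statistic comes out large.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = 1 + x + 2 * x ** 2 + rng.normal(0, 0.05, size=40)
X_linear = np.column_stack([np.ones_like(x), x])
X_quadratic = np.column_stack([np.ones_like(x), x, x ** 2])
f = nested_f(rss(X_linear, y), 2, rss(X_quadratic, y), 3, len(y))
print(f > 10)  # True
```

A large F here says the extra parameter of the richer model reduces the residuals by more than chance alone would explain.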