Why you need to improve your training data, and how to do it

Pete Warden's blog

Photo by Lisha Li

Andrej Karpathy showed this slide as part of his talk at Train AI and I loved it! It captures the difference between deep learning research and production perfectly. Academic papers are almost entirely focused on new and improved models, with datasets usually chosen from a small set of public archives. Everyone I know who uses deep learning as part of an actual application spends most of their time worrying about the training data instead.

There are lots of good reasons why researchers are so fixated on model architectures, but it does mean that there are very few resources available to guide people who are focused on deploying machine learning in production. To address that, my talk at the conference was on “the unreasonable effectiveness of training data”, and I want to expand on that a bit in this blog post, explaining why data is so important…



The Bubble Under the Mathematical Rug

Math with Bad Drawings

Don’t freak out, but we’re surrounded by normal distributions.

They’re in our heights; our weights; our sampling means; our fever-dreams; our Galton Boards…


Every normal is a variation on the same bell-curved theme. Just specify two parameters—the mean, i.e., the center of the distribution, and the variance, which measures its breadth—and you’ve got a normal distribution. They’re one big clan, with a strong family resemblance.

But—for me, at least—this raises a question: Who is the matriarch of the family? Which normal distribution is the founding member, the Mitochondrial Eve, the universal common ancestor?


The Battle of the Sexes is Bullshit: A Review of Stephen Marche’s The Unmade Bed (2018)

Committing Sociology

At the dramatic climax of Traffic (2000), Michael Douglas’s character, the guy in charge of the War on Drugs, breaks down in the middle of a press conference and goes off-script: “If there is a War on Drugs then our own families have become the enemy. How can you wage war on your own family?” The overarching message of Stephen Marche’s The Unmade Bed (2018) is of a similar stamp: namely, that the martial language employed by “social justice warriors” and “men’s rights activists” is a toxic dead-end. The Battle of the Sexes is bullshit: “Rather than enrich the realm of politics with the difficult business of intimate life, identity politics flattens the personal until it fits into established intellectual categories.” If the hawkish ideologues who fan the flames of the “Gender Wars” in Social Media Land are to be believed, then our own families have become the enemy. But…


Life on the Poincaré Disk

Point at Infinity

Just at this time I left Caen, where I was then living, to go on a geological excursion under the auspices of the school of mines. The changes of travel made me forget my mathematical work. Having reached Coutances, we entered an omnibus to go some place or other. At the moment when I put my foot on the step the idea came to me, without anything in my former thoughts seeming to have paved the way for it, that the transformations I had used to define the Fuchsian functions were identical with those of non-Euclidean geometry. I did not verify the idea; I should not have had time, as, upon taking my seat in the omnibus, I went on with a conversation already commenced, but I felt a perfect certainty. On my return to Caen, for conscience’ sake I verified the result at my leisure.

-Henri Poincaré, Science and…


Math Classes Every College Should Teach

Math with Bad Drawings

Math 40: Trying to Visualize a Fourth Dimension. Syllabus includes Flatland, the Wikipedia page for “hypercube,” long hours of squinting, and self-inflicted head injuries.

Math 99: An Irritating Introduction to Proof. The term begins with five weeks of the professor responding to every question with, “But how do you knoooooooow?” If anyone is still enrolled at that point, we’ll have to wing it, since no one has ever lasted that long.

Math 101: Binary. An introductory study of the binary numeral system. Also listed as Math 5.


Topological maps or topographic maps?

David Richeson: Division by Zero

While surfing the web the other day I read an article in which the author refers to a “topological map.” I think it is safe to say that he meant to write “topographic map.” This is an error I’ve seen many times before.

A topographic map is a map of a region that shows changes in elevation, usually with contour lines indicating different fixed elevations. This is a map that you would take on a hike.

A topological map is a continuous function between two topological spaces—not the same thing as a topographic map at all!

I thought for sure that there was no cartographic meaning for topological map. It turns out, however, that there is.

A topological map is a map that is only concerned with relative locations of features on the map, not on exact locations. A famous example is the graph that we use to…


Python’s Weak Performance Matters

Meta Rabbit

Here is an argument I used to make, but now disagree with:

Just to add another perspective, I find many “performance” problems in
the real world can often be attributed to factors other than the raw
speed of the CPython interpreter. Yes, I’d love it if the interpreter
were faster, but in my experience a lot of other things dominate. At
least they do provide low hanging fruit to attack first.

[…]

But there’s something else that’s very important to consider, which
rarely comes up in these discussions, and that’s the developer’s
productivity and programming experience.[…]

This is often undervalued, but shouldn’t be! Moore’s Law doesn’t apply
to humans, and you can’t effectively or cost efficiently scale up by
throwing more bodies at a project. Python is one of the best languages
(and ecosystems!) that make the development experience fun, high
quality, and very efficient.

(from Barry Warsaw)

I…


Seat Belts, Condoms and the Indian Equity Investor

The Eighty Twenty Investor

The Seat Belt mystery


In the early 1970s, when the use of seat belts was made mandatory in the US to improve driver safety, something strange happened.

Instead of road accident deaths coming down, they actually went up!

While the regulators were perplexed by this phenomenon, an economist by the name of Sam Peltzman came up with a controversial answer.


He argued that although drivers faced lower risks due to the additional safety that a seat belt provides, many of them compensated for that added safety by driving more recklessly (driving faster, not paying as much attention, etc.).

“The safer they make the cars, the more risks the driver is willing to take”

This meant that bystanders – pedestrians, bicyclists etc – would receive no safety benefit from the seat belts but would rather suffer as a result of increased recklessness.

He termed this…


Data Science: Structured thinking, a collection of guides.

Inspired by this. Read it first: http://www.analyticsvidhya.com/blog/2013/06/art-structured-thinking-analyzing/

  1. Figure out the questions involved in the analytics project and decide which ones can be tackled
    separately, and which ones are intertwined with others, and which ones need to be answered first
    before tackling others. Then pick one.
    0.5 Lay out the data requirements and hypotheses before looking at what data is available.
  2. Actually look at the data summary (dataframe.describe()), which includes the mean, std, and quartiles; the mode needs a separate call such as dataframe.mode().
  3. Look for patterns in the summary. Think about what each of the values means for your question. What
    questions do they lead to? How do they modify your question?
  4. Figure out the ML problem use this.

  5. Go back to steps 1 and 2 and redo them with the ML problem in mind.
  6. See whether you have enough data (signal vs. noise), or whether you need more samples or more
    features (see http://scikit-learn.org/stable/modules/feature_selection.html).
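
As a rough illustration of step 2, here is a minimal pandas sketch of looking at the summary. The DataFrame and its column names ("age", "income") are made up for the example and are not from any of the linked guides.

```python
import pandas as pd

# Hypothetical toy data; the column names and values are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 35, 41, 58],
    "income": [28000, 52000, 52000, 61000, 90000],
})

# For numeric columns, describe() reports count, mean, std, min,
# the quartiles (25%, 50%, 75%) and max.
summary = df.describe()
print(summary)

# The mode is not part of describe() for numeric columns, so compute it
# separately. mode() can return several rows on ties; take the first.
modes = df.mode().iloc[0]
print(modes)
```

Reading the two printouts side by side is usually enough to start on step 3: a mean far from the median (the 50% row), or a max far above the 75% quartile, is a prompt for a new question about the data.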

First model-building time split:
1. Descriptive analysis of the data – 50% of the time
2. Data treatment (missing value and outlier fixing) – 40% of the time
3. Data modelling – 4% of the time
4. Estimation of performance – 6% of the time

Data Exploration steps:
Source Reference: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
Below are the steps involved to understand, clean and prepare your data for building your predictive model:

    1. Variable Identification
    2. Univariate Analysis
    3. Bi-variate Analysis
    4. Missing values treatment
    5. Outlier treatment
    6. Variable transformation
    7. Variable creation
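
The first three exploration steps can be sketched in pandas roughly as follows. This is only a sketch under assumptions: the data, the column names, and the choice of "churned" as the target are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 35, 41, 58, 29],
    "income": [28000, 52000, 49000, 61000, 90000, 33000],
    "segment": ["a", "b", "b", "a", "b", "a"],
    "churned": [0, 0, 1, 0, 1, 0],
})
target = "churned"  # assumed target variable for this example

# 1. Variable identification: separate the target, then split predictors by type.
predictors = df.drop(columns=[target])
numeric_cols = predictors.select_dtypes(include="number").columns.tolist()
categorical_cols = predictors.select_dtypes(exclude="number").columns.tolist()

# 2. Univariate analysis: look at one variable at a time.
numeric_summary = df[numeric_cols].describe()
segment_counts = df["segment"].value_counts()

# 3. Bivariate analysis: look at pairs of variables.
age_income_corr = df["age"].corr(df["income"])            # Pearson correlation
churn_by_segment = df.groupby("segment")[target].mean()   # target rate per category

print(numeric_cols, categorical_cols)
print(segment_counts)
print(round(age_income_corr, 3))
print(churn_by_segment)
```

Steps 4 to 7 (missing values, outliers, transformation, creation) change the data rather than just inspect it, so they are better done after these read-only checks.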

Missing Value Treatment:
    1. Deletion
    2. Mean / Mode / Median Imputation
    3. Prediction Model
    4. KNN Imputation
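
The first two options in the list above can be sketched with plain pandas. The data and column names are hypothetical, and the prediction-model and KNN options are left out here because they need a modelling library on top of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; column names are illustrative only.
df = pd.DataFrame({
    "age": [23.0, np.nan, 35.0, 41.0, 58.0],
    "segment": ["a", "b", None, "b", "b"],
})

# 1. Deletion: drop every row that has at least one missing value.
dropped = df.dropna()

# 2. Mean / mode / median imputation: fill gaps with a column-level statistic.
#    Median for the numeric column, mode (most frequent value) for the categorical one.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["segment"] = imputed["segment"].fillna(imputed["segment"].mode().iloc[0])

print(len(df), len(dropped))
print(imputed)
```

Deletion is the simplest choice but throws away two of the five rows here, which is why imputation is often preferred when the dataset is small.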

Outlier Treatment (first identify the cause of each outlier):
    1. Data Entry Error
    2. Measurement Error
    3. Experimental Error
    4. Intentional Outlier
    5. Data Processing Error
    6. Sampling Error
    7. Natural Outlier
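
Before an outlier can be assigned to one of the causes above, it has to be found. One common way to flag candidates is the 1.5 × IQR rule, sketched below on made-up values. The rule is a swapped-in convention for illustration, not something the notes above prescribe, and it only flags points; deciding the cause still takes judgment.

```python
import pandas as pd

# Hypothetical measurements; 95 is planted as an obvious outlier.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 95], name="measurement")

# Interquartile range: the spread of the middle 50% of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the quartiles are flagged as candidates.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
flagged = values[is_outlier]

print(lower, upper)
print(flagged)
```

A flagged value that turns out to be a data entry or processing error is usually corrected or dropped, while a natural outlier is real signal and is often kept or handled with a transformation.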