Exploring the ChestXray14 dataset: problems

Luke Oakden-Rayner

A couple of weeks ago, I mentioned I had some concerns about the ChestXray14 dataset. I said I would come back when I had more info, and since then I have been digging into the data. I’ve talked with Dr Summers via email a few times as well. Unfortunately, this exploration has only increased my concerns about the dataset.

WARNING: there are going to be lots of images today. If data and bandwidth are a problem for you, beware. Also, this piece is about 5000 words long. Sorry :)
DISCLAIMER: Since some people are interpreting this wrongly, I do not think this piece in any way reflects broader problems in medical deep learning, or suggests that claims of human performance are impossible. I’ve made a claim like that myself, in recent work. These results are specific to this dataset, and represent challenges we face with medical data. Challenges…



Word2vec (aka word embeddings)

Word2vec:

  • Word2vec takes a large body of text and converts it into a matrix with one row (a vector) per word.
  • It is a shallow neural network (2 layers).
  • There are two training methods: CBOW and skip-gram.

CBOW (continuous bag-of-words), which builds on the bag-of-words assumption:

  • A text is represented as the bag (multiset) of its words.
  • Grammar is disregarded.
  • Word order is disregarded, but multiplicity is kept.
  • The same assumption is also used in computer vision.
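As a minimal sketch of the bag-of-words assumption (not of CBOW training itself), the snippet below turns a sentence into a multiset of words. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Return the multiset of lowercased, whitespace-separated tokens in `text`."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a)       # counts per word; "the" appears twice (multiplicity is kept)
print(a == b)  # word order is discarded, so the two sentences look identical
```

The two sentences mean different things but produce the same bag, which is exactly the information the assumption throws away.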

Skip-gram, a generalization of n-grams:

  • An n-gram model is basically a Markov chain model of order n-1.
  • Used in protein sequencing, DNA sequencing, and computational linguistics (at character and word level).
  • Models sequences using the statistical properties of n-grams.
  • Predicts x_i based on x_(i-(n-1)), ..., x_(i-1).
  • In language modeling, independence assumptions are made so that each word depends only on the previous n-1 words (or characters, in character-level modeling).
  • The probability of a word conditional on the previous n-1 words follows a categorical distribution.
  • In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or n-grams.
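To make the n-gram model concrete, here is a small sketch that estimates P(word | previous n-1 words) from raw counts, without any smoothing. The function names `train_ngram` and `prob` and the toy sentence are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=2):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def prob(counts, context, word):
    """Maximum-likelihood estimate of P(word | context); 0.0 if never seen."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())

tokens = "the cat sat on the mat the cat ran".split()
bigram = train_ngram(tokens, n=2)

print(prob(bigram, ["the"], "cat"))  # "the" is followed by "cat" in 2 of 3 cases
print(prob(bigram, ["the"], "dog"))  # unseen pair, so the estimate is 0.0
```

The zero for the unseen pair is the problem that smoothing is meant to address.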

Bias-vs-Variance Tradeoff:

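  • Finding the right ‘n’ for a model depends on the bias-variance tradeoff we’re willing to make: a larger n captures more context (lower bias) but leaves fewer observations per n-gram (higher variance).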

Smoothing Techniques:

  • Smoothing addresses how to balance weight between infrequent n-grams.
  • Without smoothing, unseen n-grams get probability 0.0 by default.
  • Use pseudocounts for unseen n-grams (generally motivated by Bayesian reasoning on the sub-n-grams, for n smaller than the original n).
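The simplest pseudocount scheme is add-k (Laplace) smoothing, sketched here for a bigram model. This is only one possible choice and is simpler than the sub-n-gram reasoning mentioned above. The function name `smoothed_prob` and the toy sentence are assumptions made for illustration.

```python
from collections import Counter

def smoothed_prob(tokens, prev, word, k=1.0):
    """Add-k estimate of P(word | prev) for a bigram model."""
    vocab = set(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    # Every possible next word receives k pseudo-observations
    return (pair_counts[(prev, word)] + k) / (prev_counts[prev] + k * len(vocab))

tokens = "the cat sat on the mat".split()

print(smoothed_prob(tokens, "the", "cat"))  # seen bigram
print(smoothed_prob(tokens, "the", "sat"))  # unseen bigram, still non-zero
```

The smoothed estimates still sum to 1 over the vocabulary, so they remain a valid categorical distribution.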

  • Skip-grams also allow words to be skipped. A 1-skip bigram (2-gram) pairs words that are up to one word apart, so in a three-word sequence it also pairs the first and third words, skipping the second.

  • This could be useful for languages with a less strict subject-verb-object order than English.
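A short sketch of skip-bigram extraction follows. It assumes the common convention that a k-skip bigram allows up to k skipped words, so ordinary bigrams (k=0) are included. The function name `skip_bigrams` is hypothetical.

```python
def skip_bigrams(tokens, k=1):
    """Return word pairs separated by at most k skipped tokens."""
    pairs = []
    for i, first in enumerate(tokens):
        # j ranges over the next k + 1 positions after i
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs

print(skip_bigrams("the quick brown fox".split(), k=0))  # ordinary bigrams
print(skip_bigrams("the quick brown fox".split(), k=1))  # adds pairs that skip one word
```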

Alternative link

  • Depends on the distributional hypothesis.
  • Vector representations of words are called “word embeddings”.
  • The basic motivation: unlike the audio and visual domains, text is traditionally treated as discrete symbols, which leads to sparse encodings. Vector-based representations work around this.
  • Also called vector space models.
  • Two ways of training: (a) the CBOW (continuous bag-of-words) model predicts a target word given a group of context words; (b) skip-gram is the reverse, predicting the group of context words from a given word.
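The difference between the two training setups shows up in the training examples each one produces. The sketch below builds both kinds of example from a token list. The function names and the window size of 1 are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: predict each neighbouring word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_pairs(tokens, window=1):
    """(context words, target) pairs: predict the target word from its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens))
print(cbow_pairs(tokens))
```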

  • Trained using maximum likelihood.

  • Ideally, it maximizes the probability of the next word given the previous ‘h’ words, expressed as a softmax function.
  • However, calculating the softmax requires computing and normalizing each probability using the scores of every other word in the vocabulary for the current context, at every step.
  • Therefore a logistic regression (binary classification) objective function is used instead.
  • The way this is achieved is called negative sampling.
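A rough NumPy sketch of the negative-sampling objective for a single (center, context) pair is given below. The vocabulary size, embedding dimension, random vectors, and uniform noise sampling are all assumptions made for illustration; real implementations typically draw noise words from a frequency-based distribution and update the vectors by gradient descent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, true_vec, noise_vecs):
    """Binary-classification loss: the real pair is labelled 1, the noise pairs 0."""
    pos = np.log(sigmoid(true_vec @ center_vec))
    neg = np.sum(np.log(sigmoid(-(noise_vecs @ center_vec))))
    return -(pos + neg)

rng = np.random.default_rng(0)
vocab_size, dim, num_noise = 1000, 16, 5  # hypothetical sizes
in_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))   # vectors for center words
out_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))  # vectors for context words

center, context = 3, 7  # arbitrary word ids
noise = rng.integers(0, vocab_size, size=num_noise)  # uniform noise, for simplicity
loss = negative_sampling_loss(in_vecs[center], out_vecs[context], out_vecs[noise])
print(loss)
```

Each step touches only `num_noise + 1` output vectors rather than all `vocab_size` of them, which is the saving over the full softmax.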

A Data Driven Guide to Becoming a Consistent Billionaire

The Art and Science of Data

Did You Really Think All Billionaires Were the Same?

Recently, I became a bit obsessed with the one percent of the one percent – Billionaires. I was intrigued when I stumbled on articles telling us who and what billionaires really are. The articles said stuff like: Most entrepreneurs do not have a degree and the average billionaire was in their 30s before starting their business. I felt like this was a bit of a generalization and I’ll explain. Let’s take a look at Bill Gates and Hajime Satomi, the CEO of Sega. Both are billionaires but are they really the same? In the past decade, Bill Gates has been a billionaire every single year while Hajime has dropped off the Forbes’ list three times. Is it fair to put these two individuals in the same box, post nice articles and give nice stats when no one wants to be a…


Dirichlet Distribution

Dirichlet Distribution (DD):

  • A symmetric DD can be considered a distribution over distributions.
  • Each sample from a symmetric DD is a categorical distribution over K categories.
  • It generates samples that are similar discrete distributions.
  • It is parameterized by G0, a distribution over K categories, and \alpha, a scale factor.
import numpy as np
from scipy.stats import dirichlet

def stats(scale_factor, G0=[.2, .2, .6], N=10000):
    # Draw N samples from Dirichlet(scale_factor * G0) and summarize them
    samples = dirichlet(alpha=scale_factor * np.array(G0)).rvs(N)
    print("                          alpha:", scale_factor)
    print("              element-wise mean:", samples.mean(axis=0))
    print("element-wise standard deviation:", samples.std(axis=0))
    print()

for scale in [0.1, 1, 10, 100, 1000]:
    stats(scale)
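The sampled means and standard deviations can be checked against closed-form values. For alpha = scale * G0 with G0 summing to 1, the mean of component i is G0_i and its variance is G0_i (1 - G0_i) / (scale + 1). The sketch below compares that formula with SciPy's `dirichlet.mean` and `dirichlet.var`; the variable names are chosen for illustration.

```python
import numpy as np
from scipy.stats import dirichlet

G0 = np.array([0.2, 0.2, 0.6])
for scale in [0.1, 1, 10, 100, 1000]:
    alpha = scale * G0
    # Closed form for alpha = scale * G0: std_i = sqrt(G0_i * (1 - G0_i) / (scale + 1))
    closed_form_std = np.sqrt(G0 * (1 - G0) / (scale + 1))
    print(scale, dirichlet.mean(alpha), np.sqrt(dirichlet.var(alpha)), closed_form_std)
```

The mean stays at G0 for every scale, while the standard deviation shrinks as the scale grows, so larger scale factors give samples that sit closer to G0.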

Dirichlet Process (DP)

  • A way to generalize the Dirichlet distribution.
  • Generates samples that are distributions similar to the parameter H_0.
  • It also has a parameter \alpha that determines how much the samples vary from H_0.

  • A sample H of DP(\alpha, H_0) is constructed by drawing a countably infinite number of samples \theta_k from H_0 and then setting
    H = \sum_{k=1}^{\infty} \pi_k \delta(x - \theta_k)

    where the \pi_k are carefully chosen weights that sum to 1, and \delta is the Dirac delta function.
  • Since the samples from a DP are similar to the parameter H_0, one way to test whether a DP is generating your dataset is to check whether the distributions (of different attributes/dimensions) you get are similar to each other. Something like a permutation test could work, but understand its assumptions and caveats.

  • The code for an approximate sample from a Dirichlet process (via stick breaking) can be written as:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta, norm

def dirichlet_sample_approximation(base_measure, alpha, tol=0.01):
    # Break sticks until the weights account for at least 1 - tol of the mass
    betas = []
    pis = []
    betas.append(beta(1, alpha).rvs())
    pis.append(betas[0])
    while sum(pis) < (1. - tol):
        # exp(s) is the stick length still unused: the product of (1 - b) so far
        s = np.sum([np.log(1 - b) for b in betas])
        new_beta = beta(1, alpha).rvs()
        betas.append(new_beta)
        pis.append(new_beta * np.exp(s))
    pis = np.array(pis)
    thetas = np.array([base_measure() for _ in pis])
    return pis, thetas

def plot_normal_dp_approximation(alpha):
    plt.title("Dirichlet Process Sample with N(0,1) Base Measure")
    plt.suptitle("alpha: %s" % alpha)
    pis, thetas = dirichlet_sample_approximation(lambda: norm().rvs(), alpha)
    # Rescale the weights so the tallest spike matches the peak of the N(0,1) density
    pis = pis * (norm.pdf(0) / pis.max())
    plt.vlines(thetas, 0, pis)
    X = np.linspace(-4, 4, 100)
    plt.plot(X, norm.pdf(X))


  • The code for a Dirichlet process can be written as:

from numpy.random import choice
from scipy.stats import beta

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha

        self.cache = []    # values drawn so far
        self.weights = []  # stick length assigned to each cached value
        self.total_stick_used = 0.

    def __call__(self):
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights):
            # Reuse a previously drawn value
            return self.cache[i]
        # Otherwise break off a new piece of stick and draw a fresh value
        stick_piece = beta(1, self.alpha).rvs() * remaining
        self.total_stick_used += stick_piece
        self.weights.append(stick_piece)
        new_value = self.base_measure()
        self.cache.append(new_value)
        return new_value

    @staticmethod
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        return None

Statistics — Tests of independence

Tests of independence:

The basic principle is the same as the ${\chi}^2$ goodness-of-fit test.
* Between categorical variables

${\chi}^2$ test:

The standard approach is to compute the expected counts and use the distribution of the sum of squared differences between observed and expected counts, each divided by the expected count: ${\chi}^2 = \sum (O - E)^2 / E$.
* Between numerical variables

${\chi}^2$ test:

  • Between a categorical and a numerical variable?

Null Hypothesis:

  • The two variables are independent.
  • Always a right-tail test
  • The test statistic has a ${\chi}^2$ distribution if the assumptions are met:
  • The data are obtained from a random sample.
  • The expected frequency of each category must be at least 5.

Properties of the test:

  • The data are the observed frequencies.
  • The data is arranged into a contingency table.
  • The degrees of freedom are the degrees of freedom for the row variable times the degrees of freedom for the column variable. It is not one less than the sample size, it is the product of the two degrees of freedom.
  • It is always a right tail test.
  • It has a chi-square distribution.
  • The expected value is computed by taking the row total times the column total and dividing by the grand total
  • The value of the test statistic doesn’t change if the order of the rows or columns is switched.
  • The value of the test statistic doesn’t change if the rows and columns are interchanged (transpose of the matrix).
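
The properties above can be checked by hand on a small contingency table. The sketch below uses made-up counts (an assumption, not real data), computes the expected counts, statistic, degrees of freedom and right-tail p-value directly, and cross-checks them against `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x3 contingency table of observed counts (made up for illustration)
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

# Expected count = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

statistic = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(statistic, dof)  # right-tail probability

# Cross-check against SciPy (continuity correction disabled)
ref_stat, ref_p, ref_dof, ref_expected = chi2_contingency(observed, correction=False)
print(statistic, dof, p_value)
print(np.isclose(statistic, ref_stat), np.allclose(expected, ref_expected))

# Transposing the table leaves the statistic unchanged
transposed_stat = chi2_contingency(observed.T, correction=False)[0]
print(np.isclose(statistic, transposed_stat))
```

All expected counts here are at least 5, so the assumption listed earlier holds for this table.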