# Conditional probability brain teaser

This evening a question popped into my head. If we have three independent events A, B and C, is C independent of the intersection of A and B?

Let’s start with the basics. Two events A and B are independent if the probability of their intersection equals the product of their individual probabilities: $P(A \cap B) = P(A) P(B)$. So far nothing particularly hard.

What can we say about three events? When are three events, A, B and C, independent? This is pretty straightforward too. Three events are defined as independent when they are pairwise independent (so when $P(A \cap B) = P(A) P(B)$, $P(A \cap C) = P(A) P(C)$ and $P(B \cap C) = P(B) P(C)$) and, in addition to that, $P(A \cap B \cap C) = P(A) P(B) P(C)$. Note that pairwise independence doesn’t imply independence. This is a very important point. We can generalise the above formulation to n events as follows:

$P(\bigcap_{i \in S} A_{i}) = \prod_{i \in S}P(A_{i}),\hspace{35pt}\text{for every subset S of }\{1, 2, ...,n\}$

Again, so far nothing particularly hard.
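
The remark that pairwise independence doesn’t imply independence is easy to check by brute force. Here is a minimal sketch using a classic counterexample that is my own choice for illustration: two fair coin tosses, with A = "first toss is heads", B = "second toss is heads" and C = "both tosses agree". The helper names `prob` and `both` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin tosses, every outcome equally likely.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def both(*events):
    """Intersection of several events."""
    return lambda o: all(e(o) for e in events)

A = lambda o: o[0] == "H"   # first toss is heads
B = lambda o: o[1] == "H"   # second toss is heads
C = lambda o: o[0] == o[1]  # both tosses agree

# Check the three pairwise conditions, then the triple condition.
pairwise = all(
    prob(both(x, y)) == prob(x) * prob(y)
    for x, y in [(A, B), (A, C), (B, C)]
)
triple = prob(both(A, B, C)) == prob(A) * prob(B) * prob(C)

print(pairwise)  # True
print(triple)    # False
```

Every pair satisfies the product rule, but $P(A \cap B \cap C) = 1/4$ while $P(A) P(B) P(C) = 1/8$, so the three events are not independent.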

Now that we have brushed up on the basics, let’s go back to our initial question. Can we show that if A, B and C are independent events, then C is independent of the intersection of A and B? To me this isn’t intuitively obvious. I’m not saying that my intuition suggests the opposite, but as soon as I asked myself the question I had to write the following equations on a piece of paper to attack the problem.

The answer is yes, and here is an elegant proof. Let’s start by writing down the conditional probability (assuming $P(A \cap B) > 0$):

$P(C | A \cap B) = \frac{P(A \cap B \cap C)}{P(A \cap B)}$

Because we know that the three events are independent, we can substitute $P(A \cap B \cap C) = P(A) P(B) P(C)$ in the numerator and $P(A \cap B) = P(A) P(B)$ in the denominator:

$\frac{P(A) P(B) P(C)}{P(A \cap B)} = \frac{P(A) P(B) P(C)}{P(A) P(B)} = P(C)$

After cancelling, we are left with $P(C | A \cap B) = P(C)$, which says that knowing A and B both occurred does not change the probability of C. Equivalently, $P(A \cap B \cap C) = P(A \cap B) P(C)$, which is exactly the definition of C being independent of the intersection of A and B.
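
To double-check the result numerically, here is a small sketch that builds a sample space in which A, B and C are independent by construction and then compares $P(C | A \cap B)$ with $P(C)$ using exact fractions. The probabilities 1/2, 1/3 and 1/5 are arbitrary values chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical marginal probabilities for three independent events.
p = {"A": Fraction(1, 2), "B": Fraction(1, 3), "C": Fraction(1, 5)}

# Each outcome records which of A, B, C occurred. Its weight is the product
# of the matching marginals, which is what makes the events independent.
space = {}
for occurred in product([True, False], repeat=3):
    weight = Fraction(1)
    for name, happened in zip("ABC", occurred):
        weight *= p[name] if happened else 1 - p[name]
    space[occurred] = weight

def prob(pred):
    """Total weight of the outcomes satisfying the predicate."""
    return sum(w for outcome, w in space.items() if pred(outcome))

p_ab = prob(lambda o: o[0] and o[1])
p_abc = prob(lambda o: o[0] and o[1] and o[2])
p_c = prob(lambda o: o[2])
p_c_given_ab = p_abc / p_ab

print(p_c_given_ab == p_c)  # True
```

Using `Fraction` rather than floats keeps the comparison exact, so the equality check is not affected by rounding.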

Problem solved! That was an interesting one 🙂

# List comprehensions in Python

List comprehensions are one of the most interesting features of the Python programming language. They are however often feared because of their esoteric syntax or behaviour. I’m instead convinced they are quite easy to use, and in 95% of cases they do the job in a far more elegant fashion.

So what are list comprehensions? Wikipedia has a good definition for them:

A list comprehension is a syntactic construct available in some programming languages for creating a list based on existing lists. It follows the form of the mathematical set-builder notation (set comprehension) as distinct from the use of map and filter functions.

So, a list comprehension is a syntactic construct that allows you to build lists from other existing lists. Mm.. that sounds nice, but what can I do with that?

Let’s look at a little code snippet. Let’s say you want to build a list of integers from 0 to 10. One quick way to do it is to call ‘range’ and put its contents into a list object.

>>> list(range(11))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


Now I want to show you another way to achieve the same result using a list comprehension.

>>> [i for i in range(11)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


What I’m doing here is simply telling Python that I want a list of every element ‘i’ in ‘range(11)’. This sounds like overkill. Why should we complicate our lives by doing something like that? It’s also syntactically less immediate.

That’s true. But the power of list comprehensions shows in more advanced cases. Let’s say we want to build a list of all odd numbers from 0 to 20. You could create an empty list and, via a ‘for loop’, append the items if they are odd, like this:

>>> l = []
>>> for i in range(21):
...     if i % 2 != 0:
...         l.append(i)
...
>>> l
[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]


This is however ugly and not very pythonic as people like to say. A more elegant way of achieving the same result in just one line of code is to use a list comprehension:

>>> [i for i in range(21) if i % 2 != 0]
[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]


I don’t know about you, but I prefer this more concise way. We are basically saying: return every ‘i’ from ‘range(21)’ that passes the test ‘i % 2 != 0’. Again, this is a very basic example. Let’s go one step further.

Let’s say you want to return in a list all prime numbers which are less than 50. Mm.. this sounds like a tricky task. Well, thanks to list comprehension, it’s actually quite easy to do:

>>> nonPrimes = [j for i in range(2, 8) for j in range(i*2, 50, i)]
>>> primes = [x for x in range(2, 50) if x not in nonPrimes]
>>> print(primes)
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]


Well, that was not too bad! To compute a list of non-prime numbers we condensed a nested for loop into just one line of code. Then we used the result of that line (the nonPrimes list) to filter a list containing all integer values from 2 to 49. Personally I like this solution a lot.
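
For comparison, this is the nested loop that the one-line comprehension condenses. It is only a sketch to show the correspondence: the `for` clauses of a comprehension are read left to right, in the same order as the nested loops. The snake_case names and the `set` at the end are my own additions, not part of the snippet above.

```python
# The comprehension from the snippet above.
non_primes = [j for i in range(2, 8) for j in range(i * 2, 50, i)]

# The equivalent nested loop: the outer `for` comes first, the inner one second.
non_primes_loop = []
for i in range(2, 8):
    for j in range(i * 2, 50, i):
        non_primes_loop.append(j)

print(non_primes == non_primes_loop)  # True

# Optional tweak: membership tests on a set are faster than on a list,
# which starts to matter for larger limits.
non_prime_set = set(non_primes)
primes = [x for x in range(2, 50) if x not in non_prime_set]
print(primes)
```

The range `range(2, 8)` is enough here because any composite number below 50 has a factor no larger than 7.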

If list comprehensions are nice, what about dictionary comprehensions? These are also a nice trick to have in your bag. Let’s look at those.

A very stupid but quick example to get started is to create a dictionary whose keys are the integers from 0 to 10, with a meaningless string as the value associated with each of them.

>>> {k: "I'm the value associated to the key " + str(k) for k in range(11)}
{0: "I'm the value associated to the key 0",
1: "I'm the value associated to the key 1",
2: "I'm the value associated to the key 2",
3: "I'm the value associated to the key 3",
4: "I'm the value associated to the key 4",
5: "I'm the value associated to the key 5",
6: "I'm the value associated to the key 6",
7: "I'm the value associated to the key 7",
8: "I'm the value associated to the key 8",
9: "I'm the value associated to the key 9",
10: "I'm the value associated to the key 10"}


Well, that could have saved you quite a lot of time if your goal was to create such a useless dictionary 🙂

Now, let’s look at a more interesting problem. Let’s say you have a dictionary and your goal is to swap its keys and values. Well, once again, this is relatively easy to achieve through a dictionary comprehension.

>>> myDict = {1: 'cat', 2: 'dog', 3: 'rabbit'}
>>> {value: key for key, value in myDict.items()}
{'cat': 1, 'dog': 2, 'rabbit': 3}


This is just a quick tutorial on what can be done with list and dictionary comprehensions. I suggest you get familiar with this feature, because it is very commonly found in Python projects.

The good thing is you can get used to it in less than half an hour, and trust me, you won’t regret it!

# There is no such thing as Big Data

Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is “large enough,” you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc.). N is never enough because if it were “enough” you’d already be on to the next problem for which you need more data. (Andrew Gelman)

# Bayesian A/B Testing Framework

If you’re familiar with A/B testing but sometimes struggle with its limits, here is a quick guide I wrote to get started with PyMC and Bayesian A/B testing. Bayesian A/B testing has several advantages over the classical approach to hypothesis testing. The main ones are that it bakes in uncertainty in a more natural way and that it doesn’t limit its results to a very uninformative point estimate.

The tutorial is meant to be easy to read and lets you experiment with the code if you like. We intentionally avoided delving into how Markov Chain Monte Carlo methods work, and left that topic for a second article.
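
To give a flavour of what "more than a point estimate" means, here is a minimal sketch. It is not the code from the tutorial and it does not use PyMC or MCMC: it relies instead on the conjugate Beta-Binomial shortcut, which needs only Python’s standard library. The conversion numbers are made up, and the uniform Beta(1, 1) prior is an assumption.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical data: (conversions, visitors) for each variant.
data = {"A": (120, 1000), "B": (160, 1000)}

def posterior_samples(conversions, visitors, n=20000):
    """Draw from the Beta posterior of a conversion rate (uniform Beta(1, 1) prior)."""
    alpha = 1 + conversions
    beta = 1 + visitors - conversions
    return [random.betavariate(alpha, beta) for _ in range(n)]

samples_a = posterior_samples(*data["A"])
samples_b = posterior_samples(*data["B"])

# Probability that B's true rate beats A's, estimated from paired draws.
prob_b_better = sum(b > a for a, b in zip(samples_a, samples_b)) / len(samples_a)

# 95% credible interval for the uplift (B minus A), from the sorted differences.
diffs = sorted(b - a for a, b in zip(samples_a, samples_b))
low = diffs[int(0.025 * len(diffs))]
high = diffs[int(0.975 * len(diffs))]

print(f"P(B > A) = {prob_b_better:.3f}")
print(f"95% credible interval for uplift: [{low:.3f}, {high:.3f}]")
```

Instead of a single number, the output is a probability that B is better and a whole interval of plausible uplifts, which is the kind of result the Bayesian approach gives you.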

Bayesian A/B Testing Framework

# NYC BigApps 2015

I would like to draw everyone’s attention to NYC BigApps 2015. BigApps is a civic tech initiative of the City of New York. Developers, designers, and entrepreneurs are challenged to create functioning, marketable tech tools that could help solve pressing civic challenges. Whether you’re a novice or pro, a coder or a content specialist, building an app, a device, or another product altogether, BigApps provides resources that make it easy to make a difference.

There are four main challenges you can choose from. You can also team up with other individuals. The website has a nice social feature which allows you to contact and partner with other competitors, or ask for help if you need it.

Unfortunately, since the contest is open only to U.S. citizens, I won’t be able to participate directly (I’m a U.K. citizen). However, I’m thinking of unofficially taking part in the Connected Cities Challenge and using it as a good excuse to explore some open datasets which have been made available by the Mayor’s Office and other partners. To be honest, it is quite a shame that only U.S. citizens can participate in the contest.

# It’s time to start posting new content :)

Hello all!

Almost one year has passed since my last post here. During this time I’ve managed to get a position as a Data Analyst at a leading Digital Advertising company and become part of the Engineering Team. I also spent 6 months in New York for work and met a lot of new people. I must be honest and say that 2014 has been exceptionally good for me. I couldn’t ask for more.

We are almost halfway through 2015 and I feel like something is missing. During the last 6 months I’ve managed to complete a few other courses in Statistics, Java, and Python. Now it is time to showcase a few of the things I’ve learned and try to do something meaningful with my skills.

Over the coming weeks I’m going to post some of the things I’ve been working on and new projects I’m going to start soon. There will be a switch in topic from Analytics to Data Science. My goal is to land a position as a Data Scientist, so it’s time to do something about it and in the meantime use this channel to connect with other people with similar interests.

# How to implement a moving average in Apache Hive

I’ve been working for a couple of days on a way to implement a rolling average in Apache Hive. Finally, thanks to a reply on Stack Overflow by Alex Florescu, I’ve been able to achieve my goal.

Why would it be interesting to calculate a moving average in Hive? Well, there are many possible reasons. In my case I wanted to calculate a rolling count of unique users falling into a specific basket: over a moving 21-day window, how many unique users I had in market for a specific campaign.

Suppose you have a table in Hive named store.exposure which stores all the visits to a website (which in the industry we call sitelogic exposures). You can implement a moving count using a query similar to the following one:

-- Rolling 21-day total of daily unique users.
-- Note: a user seen on several days in the window is counted once per day.
select t2.date, sum(t1.users) as users_inmarket
from (select distinct to_date(datehour) date
      from store.exposure
      where datehour >= '2014-05-01 00:00:00') t2
inner join (select to_date(datehour) date, count(distinct userpid) users
            from store.exposure
            where datehour between '2014-05-01 00:00:00' and '2014-05-31 23:59:59'
            group by to_date(datehour)) t1
where t2.date is not null
  -- the current day plus the 20 preceding days gives a 21-day window
  and datediff(t2.date, t1.date) between 0 and 20
group by t2.date


To run a moving average you can do something like this:


-- 3-day moving average of daily unique users.
-- The first two days of the range have fewer than 3 days of data, so they are understated.
select t2.date, round(sum(t1.users)/3) as avg_unique_users
from (select distinct to_date(datehour) date
      from store.exposure
      where datehour >= '2014-05-25 00:00:00') t2
inner join (select to_date(datehour) date, count(distinct userpid) users
            from store.exposure
            where datehour between '2014-05-25 00:00:00' and '2014-05-31 23:59:59'
            group by to_date(datehour)) t1
where t2.date is not null
  -- the current day plus the 2 preceding days gives a 3-day window
  and datediff(t2.date, t1.date) between 0 and 2
group by t2.date
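
If you want to sanity-check the windowing logic outside Hive, the same idea can be sketched in a few lines of Python. This is only an illustration with made-up daily counts; the dictionary `daily_users` plays the role of the subquery t1, and the function name `moving_average` is hypothetical.

```python
from datetime import date, timedelta

# Hypothetical daily unique-user counts (the role played by subquery t1).
daily_users = {
    date(2014, 5, 25): 90,
    date(2014, 5, 26): 120,
    date(2014, 5, 27): 150,
    date(2014, 5, 28): 60,
    date(2014, 5, 29): 30,
}

def moving_average(daily, window_days):
    """Average over each day and the (window_days - 1) days before it.

    Like the query, it always divides by window_days, so the first days
    of the range (with fewer days available) are understated.
    """
    result = {}
    for day in sorted(daily):
        window = (day - timedelta(days=k) for k in range(window_days))
        total = sum(daily.get(d, 0) for d in window)
        result[day] = round(total / window_days)
    return result

averages = moving_average(daily_users, 3)
for day, avg in averages.items():
    print(day, avg)
```

The inner generator mirrors the `datediff(...) between 0 and 2` condition: for each date it picks up that day and the two days before it.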