First off, I'm not sure whether to post this here or in the coding thread. Anyways, onto my little complex problem. The statement is simple, but execution is far from it, AFAIK.
Basically, I want to generate a probability mass function. The probability concerns the following: say you have n ∈ ℕ\{0} events and you want to know the probability of exactly x events occurring, where 0 ≤ x ≤ n. Each event is a binary choice of true or false. The kicker? The probability of an event being true follows an arbitrary probability sequence p_k, where 1 ≤ k ≤ n is the occurrence index. In other words, if we call each event U_k ∈ {0,1}, then P(U_k = 1) = p_k.
So, if X = U_1 + U_2 + ... + U_n, then the probability mass function is f(x) = P(X = x). Easy to state on paper, but generating the values of f(x) is a pain. Now, if p_k = p for a constant probability p, this is a plain old binomial distribution. However, since the probability isn't constant, it becomes a lot more complicated.
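For reference, the constant-p special case is evaluated directly from the binomial formula (a minimal sketch; the function name is just illustrative):

```python
from math import comb

def binomial_pmf(n: int, p: float, x: int) -> float:
    """P(X = x) when every event has the same probability p of being true."""
    return comb(n, x) * p**x * (1 - p)**(n - x)
```

This is the baseline the general case should collapse to whenever all p_k are equal.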
There are two algorithms I have that work. The first one:
- Generate all unique permutations of x 1s and n-x 0s.
- For each permutation i, generate a value m_i = m_{i,n}:
  - Base case: m_{i,0} = 1
  - Recursive case: m_{i,j} = p_j * m_{i,j-1} when the jth symbol is a 1,
    and m_{i,j} = (1-p_j) * m_{i,j-1} when the jth symbol is a 0
- f(x) is thus the sum of all m_i.
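In Python, this first algorithm can be sketched like so (choosing which x positions hold the 1s is equivalent to enumerating the unique permutations; the function name is mine):

```python
from itertools import combinations

def pmf_bruteforce(p, x):
    """First algorithm: enumerate every way to place x 1s among n events,
    multiply the matching p_k / (1-p_k) factors, and sum the products."""
    n = len(p)
    total = 0.0
    for ones in combinations(range(n), x):  # each choice = one unique permutation
        chosen = set(ones)
        prob = 1.0
        for k in range(n):
            prob *= p[k] if k in chosen else 1.0 - p[k]
        total += prob
    return total
```

It is exact but visits all n!/(x!(n-x)!) permutations, which is what makes it blow up.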
The second one:
- Base cases:
  - m_{n,x,1} = p_n
  - m_{n,x,0} = 1 - p_n
  - m_{n,v,0} = m_{n,w,1} = 0 where v ≠ x and w ≠ x (the count must end at exactly x)
  - m_{z,v,0} = m_{z,w,1} = 0 where v > x, w > x, and z < n
- Recursive cases:
  - m_{i,j,0} = (1-p_i)(m_{i+1,j,0} + m_{i+1,j+1,1})
  - m_{i,j,1} = p_i(m_{i+1,j,0} + m_{i+1,j+1,1})
- f(x) = m_{1,0,0} + m_{1,1,1}
Technically, the fourth base case isn't necessary, but it's there to prevent unnecessary backtracking.
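The second recursion can be sketched as a memoized top-down function (a sketch under my reading of the indices: j is the count of 1s among events 1 through i, and b is the value of event i):

```python
from functools import lru_cache

def pmf_recursive(p, x):
    """Second algorithm: m(i, j, b) is the probability that event i takes
    value b, events 1..i contain j ones, and events i+1..n bring the
    total count of ones to exactly x."""
    n = len(p)  # p[i-1] holds p_i

    @lru_cache(maxsize=None)
    def m(i, j, b):
        if j > x:                       # too many 1s already: dead branch
            return 0.0
        pi = p[i - 1]
        if i == n:                      # base cases: total must equal x
            if j != x:
                return 0.0
            return pi if b == 1 else 1.0 - pi
        w = pi if b == 1 else 1.0 - pi  # weight for event i's own value
        return w * (m(i + 1, j, 0) + m(i + 1, j + 1, 1))

    return m(1, 0, 0) + m(1, 1, 1)
```

With memoization there are only O(n * x) distinct states, so this version is already far faster than the permutation count suggests.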
In any case, both of these have horrible computational complexities. Just for the first one, the number of permutations is n!/(x!(n-x)!), and each of them needs to be parsed, so the complexity is at least n * n!/(x!(n-x)!), which is O(n * 2^n) in the worst case since the binomial coefficient peaks around x = n/2. Sure, that's easy when n is small, but I'm interested in n around 10-100. For n = 30 I couldn't be arsed to wait any longer; 20 is the maximum I've been willing to wait for.
Then, to spin it around even further, say I kept x fixed but varied n. What would the probability mass function g(n) be? Is it as simple as generating all the values with f(x) as n varies and then dividing all the values by the total sum?
Is there a way to make this faster? Knowing the structures of the probability mass functions would be nice, but I don't necessarily care what the functions f(x) and g(n) look like, as long as I can generate the values in a reasonable amount of time.
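The fix-x-vary-n construction can at least be mechanized in one incremental pass over the events (a sketch; whether normalizing by the total sum is the right model depends on what distribution over n is actually intended, so this just carries out the proposed construction):

```python
def g_of_n(p, x, n_max):
    """Compute f_n(x) = P(X_n = x) for n = 1..n_max in a single pass,
    then normalize by the total, as proposed. Requires len(p) >= n_max."""
    dp = [1.0]                      # dp[j] = P(first k events sum to j); k = 0
    f_vals = []
    for k in range(n_max):
        pk = p[k]
        new = [0.0] * (len(dp) + 1)
        for j, c in enumerate(dp):
            new[j] += (1.0 - pk) * c      # event k+1 is false
            new[j + 1] += pk * c          # event k+1 is true
        dp = new
        f_vals.append(dp[x] if x < len(dp) else 0.0)
    total = sum(f_vals)
    return [v / total for v in f_vals]
```

Each step extends the prefix distribution by one event, so all the f_n(x) values come out of one O(n_max^2) sweep instead of n_max separate runs.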
EDIT:
Alright, I might have a solution using probability generating functions. I'm a bit unfamiliar with this field, so if anyone would be willing to chip in, be my guest.
Let U_k be an indicator variable for the kth event such that U_k ∈ {0,1} and P(U_k = 1) = p_k. The PGF of U_k is G_{U_k}(z) = 1 - p_k + p_k z.
Now, let X_n = U_1 + U_2 + U_3 + ... + U_n. Since the U_k are independent of each other,
G_{X_n}(z) = G_{U_1}(z) G_{U_2}(z) G_{U_3}(z) ... G_{U_n}(z).
And boy do the tests show rapid speed increase!
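Concretely, the speedup comes from multiplying the PGF factors together as polynomials in z, which yields every value f(x) at once in O(n^2) (a sketch; the function name is mine):

```python
def pmf_pgf(p):
    """Multiply out G(z) = prod_k (1 - p_k + p_k z) coefficient by
    coefficient; the returned list satisfies coeffs[x] = P(X = x)."""
    coeffs = [1.0]                      # the constant polynomial 1
    for pk in p:
        new = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            new[i] += (1.0 - pk) * c    # constant term of the factor
            new[i + 1] += pk * c        # z term of the factor
        coeffs = new
    return coeffs
```

Each of the n factor multiplications touches at most n+1 coefficients, hence O(n^2) overall, which is why n in the 10-100 range becomes trivial.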
From there,
P(X_n = x) = (d^x/dz^x) G_{X_n}(z) evaluated at z = 0, all divided by x!.
In other words, P(X_n = x) is just the coefficient of z^x in G_{X_n}(z), which should be easier to manage and gives a representation of the probability mass function.