Re: Re[2]: Summary, was Re: Every time ..., was Re: General form
Looking at it from a completely different viewpoint ... a large consumer service
organization (tens of millions of customers) several years ago was looking at its
information infrastructure. It determined:
1) had something like 6000 unique databases around the corporation
2) there was possibly 95% commonality in the data across all the databases
(things like names, addresses, phone numbers, etc)
3) possibly 20% of the data was "dirty" ... i.e. not accurate for one reason or
another
"Dirty" data could be because it was originally entered incorrectly and/or it
had been incorrectly updated (i.e. a person moved and their address wasn't
consistently updated ... and/or the name of the person at a specific address
wasn't consistently updated ... i.e. a large number of different databases, some
indexed by address, others indexed by name).
This doesn't directly give the lifetime of the associated attributes ... but
some big piece of the (20%) "dirty data" is proportional to the lifetime of the
attributes (approximate frequency of change) and the window to consistently
propagate changes through the internal corporate databases.
Let's say for half the population ... the name attribute never changes in 95% of
the cases ... for the other half of the population ... the name attribute
changes one or more times in 90?% of the cases. For the population as a whole,
an address attribute changes 10-20 times. Name change probability can be
calculated for the population as a whole ... but since there is such a strongly
bimodal distribution for name changes ... it can be worthwhile to calculate the
values separately for the two sets.
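As a rough illustration of why the two sets are worth separating, the sketch
below uses the figures above (5% of one half, and the tentative 90% of the other
half, ever change name). The even 50/50 split and the exact 0.90 value are
assumptions taken from the example, not measured data.

```python
# Hypothetical figures from the example: (share of population, P(name ever changes)).
groups = {
    "name_rarely_changes": (0.5, 0.05),
    "name_often_changes": (0.5, 0.90),  # the "90?%" figure, taken at face value
}

# Probability for the population as a whole: share-weighted average.
whole_population = sum(share * p for share, p in groups.values())

print(f"whole population: {whole_population:.3f}")
for name, (share, p) in groups.items():
    print(f"{name}: {p:.2f}")
```

The whole-population figure lands near the middle and describes neither set
well, which is the argument for modeling the two sets separately.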
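If there are three independent variables with avg. lifetimes of 2 years, 5 years,
and 10 years ... then the combined rate of change per year (roughly, the
probability that at least one of the variables changes in the next year,
treating the individual yearly probabilities as additive) is:
1/2 + 1/5 + 1/10 = .5 + .2 + .1 = .8
The avg. expected lifetime of the collection of the three variables is the
reciprocal of that, approximately 1.25 years:
1/(1/t0 + 1/t1 + 1/t2) = 1/.8 = 1.25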
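A minimal sketch of that arithmetic, with all function names made up for
illustration. It also compares the simple sum with two exact "at least one
change within a year" figures. Both exact figures rest on added assumptions:
either each attribute changes in a given year with probability 1/lifetime, or
changes arrive as a constant-rate (exponential) process. Neither model is stated
in the discussion above.

```python
import math

def combined_rate(lifetimes):
    """Sum of per-attribute change rates (changes per year)."""
    return sum(1.0 / t for t in lifetimes)

def combined_lifetime(lifetimes):
    """Expected lifetime of the whole collection: 1/(1/t0 + 1/t1 + ...)."""
    return 1.0 / combined_rate(lifetimes)

def prob_any_change_yearly(lifetimes):
    """Assumption: each attribute changes in a year with probability 1/t."""
    p_none = 1.0
    for t in lifetimes:
        p_none *= 1.0 - 1.0 / t
    return 1.0 - p_none

def prob_any_change_exponential(lifetimes, years=1.0):
    """Assumption: changes arrive at a constant rate (exponential lifetimes)."""
    return 1.0 - math.exp(-combined_rate(lifetimes) * years)

lifetimes = [2.0, 5.0, 10.0]
print(f"combined rate:      {combined_rate(lifetimes):.2f} per year")
print(f"combined lifetime:  {combined_lifetime(lifetimes):.2f} years")
print(f"P(any), yearly:     {prob_any_change_yearly(lifetimes):.2f}")
print(f"P(any), exponential:{prob_any_change_exponential(lifetimes):.2f}")
```

The simple sum is best read as a rate (and an upper bound on the one-year
probability). The 1.25-year lifetime comes out the same either way, since it
depends only on the summed rates.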
Dependent variables ... say 90% of the time that the address changes ... the
phone number also changes ... can be treated separately. If the avg. address
lifetime is 5 years and the avg. phone number lifetime is 10 years (for
situations where the phone number changes independently of the address) ... then
the probability that either will change next year is:
.2 + .1 = .3
This effectively treats only the probability of independent change events (i.e.
dependent variable events don't increase the overall probability that there will
be a change in the collection of variables). The expected lifetime of a unique
collection of attribute values is inversely proportional to the probability of
something changing. This is only identical to the individual attribute lifetime
calculation when the attributes are totally independent; it is not
(necessarily) proportional to the individual attribute lifetimes. What matters
is the probability that there are one or more changes to the values of the
collection ... which is the combination of the independent probabilities. (A
calculation involving the number of variables in a multi-variable change event
... i.e. dependent variables ... would only be involved if the avg. number of
attributes changed per change event was being calculated ... it doesn't enter
into the calculation of whether there is a change event.)
The lifetime of the address/phone combination then is about 3.3 years (1/.3). If
I have a street address, a city address, a state address and a zipcode ... and
they all change dependently ... whether there are two dependent changes or four
dependent changes doesn't increase the overall probability that at least one
thing changes.
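The address/phone case can be sketched by counting independent change events
rather than attributes. The event names below are hypothetical, and the model
assumes the dependent phone change always rides along with an address change
(90% of the time) without adding a change event of its own.

```python
# Independent change events: (rate per year, expected attributes changed per event).
events = {
    # Address changes every 5 years on average; the phone follows 90% of the time.
    "address_change": (1.0 / 5.0, 1.0 + 0.9),
    # Phone changes on its own every 10 years on average.
    "phone_only_change": (1.0 / 10.0, 1.0),
}

# Only independent events count toward "did anything change?".
event_rate = sum(rate for rate, _ in events.values())
lifetime = 1.0 / event_rate

# Dependent changes matter only for how much changes per event.
avg_attrs_per_event = sum(rate * n for rate, n in events.values()) / event_rate

print(f"change events per year:       {event_rate:.2f}")
print(f"lifetime of the combination:  {lifetime:.2f} years")
print(f"avg. attributes per event:    {avg_attrs_per_event:.2f}")
```

Raising the 90% dependency to 100%, or adding more dependent fields such as city
and zipcode, changes only the last figure. The event rate and the lifetime stay
the same.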
For detailed modeling ... in addition to strong bi-modal & multi-modal
distributions ... there are likely 2nd order effects ... e.g. somebody having
changed jobs twice in the past five years increases the probability that they
will change jobs in the next five years (although if they have just changed jobs
in the last month, it reduces the probability that they will change jobs in the
next year).
The calculation is the probability of whether there will be any change (whether
for a single attribute or some collection of attributes) ... from which the avg.
expected lifetime can be derived. However, the avg. expected lifetime of a
collection of unique attribute values is only indirectly related to the expected
individual lifetimes of those attributes.