Statistical Methods
Five statistical methods for identifying collocations are available in this
program: Dunning's log likelihood, (pointwise) mutual information,
chi-square, the cubic association ratio (MI3), and the Frager and McGowan
coefficient. (See Manning and Schütze, 1999, and Oakes, 1998, for more details.)
1. Dunning's log likelihood
The test compares two hypotheses about W1 and W2:
Hypothesis 1: P(W2|W1) = p = P(W2|not W1)
Hypothesis 2: P(W2|W1) = p1 ≠ p2 = P(W2|not W1)
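Assuming a binomial distribution, the log likelihood ratio is calculated
as follows (in the form given by Manning and Schütze, 1999):
    log λ = log L(c12, c1, p) + log L(c2 - c12, N - c1, p)
            - log L(c12, c1, p1) - log L(c2 - c12, N - c1, p2)
where L(k, n, x) = x^k * (1 - x)^(n - k).
Note that c1 is the frequency of W1, c2 is the frequency of W2, c12
is the frequency of the bigram W1-W2, N is the total number of words in the
corpus, p = c2/N, p1 = c12/c1, and p2 = (c2 - c12) / (N - c1).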
To assess statistical significance, multiply the result by -2 and compare
it with the chi-square table at one degree of freedom. The higher the
resulting value, the stronger the evidence that W2 is a collocate of W1.
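As an illustration of the calculation above, the following Python sketch
computes the log likelihood ratio from the four counts. It assumes the
binomial likelihood L(k, n, x) = x^k * (1 - x)^(n - k) used by Manning and
Schütze (1999). The function names and the example counts are hypothetical
and are not taken from this program.

```python
import math


def log_binomial_likelihood(k, n, x):
    """Return log(x^k * (1 - x)^(n - k)), treating 0 * log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(x)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - x)
    return total


def log_likelihood_ratio(c1, c2, c12, n):
    """Log likelihood ratio (log lambda) for the bigram W1-W2.

    c1, c2: frequencies of W1 and W2; c12: frequency of the bigram;
    n: total number of words. Assumes 0 < c1 < n and 0 < c2 < n.
    """
    p = c2 / n                   # Hypothesis 1: one shared probability
    p1 = c12 / c1                # Hypothesis 2: P(W2 | W1)
    p2 = (c2 - c12) / (n - c1)   # Hypothesis 2: P(W2 | not W1)
    return (log_binomial_likelihood(c12, c1, p)
            + log_binomial_likelihood(c2 - c12, n - c1, p)
            - log_binomial_likelihood(c12, c1, p1)
            - log_binomial_likelihood(c2 - c12, n - c1, p2))


# Hypothetical counts, for illustration only.
score = -2.0 * log_likelihood_ratio(c1=100, c2=100, c12=50, n=1000)
print(score)  # compare with the chi-square table at one degree of freedom
```

The log likelihood ratio itself is never positive, so the value multiplied
by -2 is never negative, and it is zero when p1 and p2 are equal.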
2. Pearson's Chi-square
Collocation between W1 and W2 is calculated according to the frequency
of bigram occurrences in each cell of the table. "W1 - W2" is the number of
times the bigram W1 - W2 is found in the corpus. "not W1 - W2" is the number
of times W2 is preceded by a word other than W1. "W1 - not W2" is the number
of times W1 is followed by a word other than W2. "not W1 - not W2" is the
number of remaining bigrams.
             |  W1            |  not W1
    ---------+----------------+-------------------
    W2       |  W1 - W2       |  not W1 - W2
    not W2   |  W1 - not W2   |  not W1 - not W2
Chi-square is calculated using the formula below, where Oi,j is the
observed frequency in cell (i, j) of the table, and Ei,j is the expected
frequency in that cell when W1 and W2 occur together only by chance:
    X^2 = sum over i, j of (Oi,j - Ei,j)^2 / Ei,j
The expected frequency in each cell is equal to
(row total * column total) / grand total.
The higher the value, the more significant the collocation between
W1 and W2.
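The calculation can be sketched in Python as follows. This is a minimal
illustration, not this program's own code: the function name and the counts
are hypothetical, c1 is taken to be the number of bigrams that start with W1,
c2 the number that end with W2, and every row and column total is assumed to
be greater than zero.

```python
def chi_square(c1, c2, c12, n_bigrams):
    """Pearson's chi-square for the 2-by-2 table of bigram counts."""
    # Observed cells: rows are W2 / not W2, columns are W1 / not W1.
    observed = [
        [c12, c2 - c12],
        [c1 - c12, n_bigrams - c1 - c2 + c12],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected frequency = (row total * column total) / grand total
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


# Hypothetical counts, for illustration only.
print(round(chi_square(c1=100, c2=100, c12=50, n_bigrams=1000), 2))  # 197.53
```

With these counts the expected frequency of the "W1 - W2" cell is 10, while
the observed frequency is 50, which is what drives the large value.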
3. Mutual Information
This statistical method compares the probability of finding the two words
together with the probability expected if the two words were independent of
each other:
    I(x', y') = log2( P(x' y') / (P(x') * P(y')) )
If x' and y' form a collocation, P(x' y') is likely to be much
greater than P(x') * P(y'). Thus, the higher the value, the more likely
it is that the two words form a collocation.
However, mutual information tends to give high values for rare events.
For example, when W1 and W2 always occur together, MI is higher the lower
their frequency is.
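The effect can be seen in a short Python sketch. It assumes that the
probabilities are estimated as relative frequencies (count / N) and that the
logarithm is taken to base 2. The function name and the counts are
hypothetical.

```python
import math


def mutual_information(c1, c2, c12, n):
    """Pointwise mutual information, log2(P(W1 W2) / (P(W1) * P(W2)))."""
    # With P(W1 W2) = c12/n, P(W1) = c1/n and P(W2) = c2/n, the ratio
    # simplifies to (c12 * n) / (c1 * c2).
    return math.log2((c12 * n) / (c1 * c2))


# Two hypothetical pairs that always occur together (c1 == c2 == c12)
# in a corpus of 1024 words: one seen once, the other seen 16 times.
rare = mutual_information(1, 1, 1, 1024)
frequent = mutual_information(16, 16, 16, 1024)
print(rare, frequent)  # 10.0 6.0
```

When the two words always occur together, the value reduces to log2(N / c12),
so the pair seen only once scores higher than the pair seen 16 times.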
When calculating statistical values for 3-, 4-, and 5-word sequences, a
pseudo-bigram transformation is used to estimate the value (Silva and
Lopes, 1999).
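One way to read the pseudo-bigram transformation, following the fair
dispersion normalization of Silva and Lopes (1999), is to split an n-word
sequence at every possible point into a left part and a right part, and to
replace P(x') * P(y') with the average of the products over all splits. The
Python sketch below applies that idea to mutual information. It is an
assumption about the general approach, not a description of how this program
implements it; the function names and the probabilities are hypothetical.

```python
import math


def average_split_product(ngram, prob):
    """Average of P(left) * P(right) over every split point of the n-gram."""
    n = len(ngram)
    products = [prob[ngram[:i]] * prob[ngram[i:]] for i in range(1, n)]
    return sum(products) / (n - 1)


def pseudo_bigram_mi(ngram, prob):
    """Mutual information of an n-gram treated as a pseudo-bigram."""
    return math.log2(prob[ngram] / average_split_product(ngram, prob))


# Hypothetical probabilities, keyed by tuples of words.
prob = {
    ("a",): 0.5,
    ("c",): 0.5,
    ("a", "b"): 0.25,
    ("b", "c"): 0.25,
    ("a", "b", "c"): 0.25,
}
print(pseudo_bigram_mi(("a", "b", "c"), prob))  # 1.0
```

For a 2-word sequence there is only one split point, so the value is the
same as ordinary mutual information.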
References
- Manning, Christopher and Hinrich Schütze. 1999. Foundations of Statistical
  Natural Language Processing. Cambridge, MA: MIT Press.
- Oakes, Michael P. 1998. Statistics for Corpus Linguistics. Edinburgh:
  Edinburgh University Press.
- Silva, Joaquim Ferreira da, and Gabriel Pereira Lopes. 1999. A Local Maxima
  Method and a Fair Dispersion Normalization for Extracting Multi-word Units
  from Corpora. In Proceedings of the 6th Meeting on the Mathematics of
  Language, Orlando, July 23-25.