Monday, 25 July 2011

Phrases in English (PIE)

What can PIE do?

Explore the distribution of words and phrases in English via various query interfaces:
  • N-grams are sequences of n words, where n ranges from 1 to 8 and "word" means any token assigned a PoS tag by the CLAWS parser. For example, the most frequent 1-gram in the BNC data is "the", and "the end of the" tops the list of 4-grams.
  • Phrase-frames are sets of variants of an n-gram that are identical except for one word, represented here by the wildcard symbol *. The most frequent (and most productive, i.e. having the greatest number of variants) 4-frame is "the * of the", with 5652 variants such as "the end of the", "the rest of the", "the top of the", "the nature of the" etc.
  • PoS-grams are patterns of Part of Speech tags assigned to word forms, without reference to the specific lexical entities. When ordered by types, the most frequent "3-PoS-gram" is ART ADJ NOUN, e.g. "the other hand". When ordered by tokens, however, the 3-PoS-gram PREP ART NOUN, as in "at the end", is more frequent.
  • Char-grams are sequences of n letters. Their distribution can be studied by position (initial, medial, final) as well as by frequency in tokens or types. Unsurprisingly, "the" is the most frequent 3-char-gram by tokens (8,222,751 tokens, 1007 types), but "ing" has the most distinct types (2,991,683 tokens, 9416 types).
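
To make the definitions above concrete, here is a minimal Python sketch that counts n-grams, phrase-frames and char-grams on a toy sentence. It is not how PIE itself is implemented: the function names (ngrams, phrase_frames, char_grams) and the sample text are made up for illustration, tokens are simply split on whitespace instead of being tagged by CLAWS, and PoS-grams are left out because they would need a tagger.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_frames(ngram_counts):
    """Group n-grams that are identical except for one slot, marked '*'.

    Returns {frame: Counter({filler_word: token_count})}; the number of
    distinct fillers is the number of variants of the frame.
    """
    frames = defaultdict(Counter)
    for gram, count in ngram_counts.items():
        for slot in range(len(gram)):
            frame = gram[:slot] + ("*",) + gram[slot + 1:]
            frames[frame][gram[slot]] += count
    return frames


def char_grams(word_counts, n):
    """Count n-letter sequences by tokens (occurrences) and by types (distinct words)."""
    by_tokens, by_types = Counter(), Counter()
    for word, count in word_counts.items():
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        for g in grams:
            by_tokens[g] += count      # every occurrence in running text
        for g in set(grams):
            by_types[g] += 1           # each distinct word counted once
    return by_tokens, by_types


# Toy corpus (hypothetical, not BNC data), tokenised on whitespace.
text = "at the end of the day the rest of the team met at the top of the hill"
tokens = text.split()

unigram_counts = Counter(ngrams(tokens, 1))
fourgram_counts = Counter(ngrams(tokens, 4))
frames = phrase_frames(fourgram_counts)
char_tokens, char_types = char_grams(Counter(tokens), 3)

frame = ("the", "*", "of", "the")
print(unigram_counts.most_common(1))
print(" ".join(frame), "->", sorted(frames[frame]))
print("the:", char_tokens["the"], "tokens,", char_types["the"], "types")
```

On this toy sentence the frame "the * of the" collects the fillers "end", "rest" and "top", mirroring on a small scale the variants listed above. Ordering frames by the number of fillers corresponds to ordering by types, while summing the filler counts corresponds to ordering by tokens.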

http://phrasesinenglish.org/
