What can PIE do?
Explore the distribution of words and phrases in English via various query interfaces:
- N-grams are sequences of n words, where n ranges from 1 to 8 and a "word" is a token of any lexical entity assigned a PoS tag by the CLAWS tagger. For example, the most frequent 1-gram in the BNC data is "the", and "the end of the" tops the list of 4-grams.
- Phrase-frames are sets of variants of an n-gram that are identical except for one word, represented here by the wildcard symbol *. The most frequent (and most productive, i.e. having the greatest number of variants) 4-frame is "the * of the", with 5,652 variants such as "the end of the", "the rest of the", "the top of the", "the nature of the", etc.
- PoS-grams are patterns of Part of Speech tags assigned to word forms without reference to the specific lexical entities. When ordered by types, the most frequent "3-PoS-gram" is ART ADJ NOUN, e.g. "the other hand". On the other hand, when ordered by tokens, the 3-PoS-gram PREP ART NOUN, as in "at the end", is more frequent.
- Char-grams are sequences of n letters. Their distribution can be studied by position (initial, medial, final) as well as by frequency in tokens or types. Unsurprisingly, "the" is the most frequent 3-char-gram by tokens (8,222,751 tokens, 1,007 types), but "ing" has the most distinct types (2,991,683 tokens, 9,416 types).
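To make the four notions above concrete, here is a minimal Python sketch that computes them on a toy sentence. It is only an illustration, not how PIE itself is implemented: the function names (ngrams, phrase_frames, char_grams) are hypothetical, the toy data is made up, and two details are assumptions: that the wildcard may stand in any slot of a frame, and that a word exactly n letters long counts as "initial". PoS-grams need no extra function, since they are simply n-grams taken over the tag sequence instead of the word sequence.

```python
from collections import Counter, defaultdict


def ngrams(items, n):
    """Return every contiguous run of n items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def phrase_frames(ngram_counts):
    """Group n-grams that differ in exactly one slot.

    Returns {frame: Counter({filler: token_count})}, where a frame is the
    n-gram with one slot replaced by "*". Assumption: any slot may be the
    wildcard.
    """
    frames = defaultdict(Counter)
    for gram, count in ngram_counts.items():
        for slot in range(len(gram)):
            frame = gram[:slot] + ("*",) + gram[slot + 1:]
            frames[frame][gram[slot]] += count
    return frames


def char_grams(word, n):
    """Yield (gram, position) pairs for every n-letter sequence in word.

    Assumption: a gram at offset 0 is "initial" even when it is also the
    last gram in the word.
    """
    last = len(word) - n
    for i in range(last + 1):
        if i == 0:
            position = "initial"
        elif i == last:
            position = "final"
        else:
            position = "medial"
        yield word[i:i + n], position


# Toy data, invented for illustration only.
tokens = ("the end of the road and the top of the hill "
          "and the end of the day").split()

four_gram_counts = Counter(ngrams(tokens, 4))
frames = phrase_frames(four_gram_counts)
frame = ("the", "*", "of", "the")

# Types = number of distinct fillers (variants); tokens = total occurrences.
print(frame, "types:", len(frames[frame]), "tokens:", sum(frames[frame].values()))

# A PoS-gram is an n-gram over tags rather than words.
print(ngrams(["PREP", "ART", "NOUN"], 3))

print(list(char_grams("nothing", 3)))
```

Counting distinct fillers versus summing their occurrences is the same types/tokens distinction that separates the two orderings described for PoS-grams and char-grams.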
http://phrasesinenglish.org/