
Text segmentation

1.

Text segmentation

2.

• Before any real processing is done, text needs
to be segmented into at least basic linguistic units
such as words, punctuation marks, and numbers.
• This process is called tokenization, and the
segmented units are called word tokens.
• Ex: In addition, she was there.
• After segmentation:
In addition , she was there .
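
A minimal sketch of this kind of tokenization in Python, offered as an illustration rather than a reference implementation. The token pattern is an assumption: it treats digit groups, words (with optional internal apostrophes or hyphens), and single punctuation marks as tokens, and it will not handle every case (abbreviations, URLs, and so on).

```python
import re

# Assumed token pattern (not from the slides):
#   1. a number, optionally with , or . between digit groups
#   2. a word, optionally with internal apostrophes or hyphens
#   3. any single character that is neither a word character nor white space
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:['’-]\w+)*|[^\w\s]")

def tokenize(text):
    """Return the list of word tokens found in text."""
    return TOKEN.findall(text)

print(" ".join(tokenize("In addition, she was there.")))
# In addition , she was there .
```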

3.

Tokenization
• Tokenization and sentence splitting can be
described as ‘low-level’ segmentation, which is
performed at the initial stage of text
processing. These tasks are usually handled by
regular expressions written in Perl or any other
programming language.

4.

Tokenization II
• High-level text segmentation, or
intra-sentential segmentation, involves
segmenting linguistic groups such as
named entities and noun groups.
• Inter-sentential segmentation involves
grouping of sentences and paragraphs into
discourse topics which are also called text
tiles.

5.

Word segmentation
• A word can occur many times in a text; each
occurrence is a word token.
• Word types are the distinct words of the
vocabulary.
• Ex. Shakespeare’s works contain more than
800,000 word tokens but only about 31,000
word types.
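
The token/type distinction can be sketched in a few lines of Python. This assumes the text is already tokenized and that types are compared case-insensitively; both are choices made for the illustration, not something the slides specify.

```python
from collections import Counter

def token_type_counts(tokens):
    """Return (number of tokens, number of distinct types)."""
    # Assumption: "The" and "the" count as the same type.
    counts = Counter(t.lower() for t in tokens)
    return sum(counts.values()), len(counts)

tokens = "to be or not to be".split()
n_tokens, n_types = token_type_counts(tokens)
print(n_tokens, n_types)  # 6 4
```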

6.

Tokenizing sentences
• Tokenizing sentences by inserting white space
is tedious. Moreover, once the white space is
added, the original text cannot be restored.
• SGML or XML markup is a cleaner strategy:
tokens are tagged in place, so the text can
easily be reverted to the original.
• Ex.
<w c=w>it</w> <w c=w>is</w> <w c=w>here</w> <w c=p>.</w>
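
A rough Python sketch of the markup idea, under stated assumptions: the `w` element and `c` attribute follow the example above, the attribute values are quoted (as XML requires), and the helper names `mark_up` and `strip_markup` are hypothetical. White space is left outside the tags, so removing the tags restores the original text exactly.

```python
import re
from xml.sax.saxutils import escape, unescape

# Every character is either white space, part of a word, or a single symbol.
PIECE = re.compile(r"\s+|\w+|[^\w\s]")

def mark_up(text):
    """Wrap each token in <w c="w"> (word) or <w c="p"> (punctuation)."""
    parts = []
    for piece in PIECE.findall(text):
        if piece.isspace():
            parts.append(piece)  # keep original spacing untouched
        else:
            cls = "w" if re.match(r"\w", piece) else "p"
            parts.append('<w c="%s">%s</w>' % (cls, escape(piece)))
    return "".join(parts)

def strip_markup(marked):
    """Remove the <w> tags and undo escaping to recover the original text."""
    return unescape(re.sub(r"</?w[^>]*>", "", marked))

original = "It is here."
marked = mark_up(original)
print(marked)
print(strip_markup(marked) == original)
```

Keeping the white space outside the tags is the design choice that makes the round trip lossless; a tokenizer that only inserts spaces has no way to tell which spaces it added.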

7.

Sentence segmentation
• Important for many text processing applications:
syntactic parsing, information extraction, text
alignment, machine translation, etc.

8.

• Accurate splitting, known as sentence boundary
disambiguation (SBD), requires analysis of the
local context around periods and other
punctuation marks.
• Compare:
• He stopped to see Dr. White.
• He stopped at Meadows Dr. White Falcon was still
open.
Which period is sentence-internal and which one is
sentence-terminal?

9.

Simplest algorithm for sentence
boundary disambiguation
• ‘period - space - capital letter’
• It marks as sentence boundaries all periods,
exclamation marks and question marks that are
followed by a space and a capital letter.
• Regex:
• [.?!][ ()”]+[A-Z]
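
A small Python sketch that applies this rule, mainly to show where it breaks. The pattern is the one given above; the function names are hypothetical. Run on the two "Dr." sentences compared earlier, the rule splits after "Dr." in both, which is right for the street name and wrong for the title.

```python
import re

# The 'period - space - capital letter' pattern from the slide.
BOUNDARY = re.compile(r"[.?!][ ()”]+[A-Z]")

def candidate_boundaries(text):
    """Index of each punctuation mark the rule flags as a sentence end."""
    return [m.start() for m in BOUNDARY.finditer(text)]

def split_sentences(text):
    """Split text after every flagged punctuation mark."""
    sentences, start = [], 0
    for pos in candidate_boundaries(text):
        sentences.append(text[start:pos + 1].strip())
        start = pos + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("He stopped at Meadows Dr. White Falcon was still open."))
# ['He stopped at Meadows Dr.', 'White Falcon was still open.']
print(split_sentences("He stopped to see Dr. White."))
# ['He stopped to see Dr.', 'White.']  (a false boundary after the title)
```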

10.

Part of speech tagging
• Criteria:
• 1- syntactic distribution
• 2- syntactic function
• 3- morphological and syntactic classes that
different parts of speech can be assigned to.

11.

Applications
• Preprocessing for later stages of analysis
• Large tagged text corpora (see the Mark Davies
corpus)
• Information technology applications: text
indexing and retrieval (nouns and adjectives are
better candidates for good indexing than adverbs,
verbs and pronouns)

12.

Parsing
• See the Stanford University parser online
(http://nlp.stanford.edu:8080/parser/index.jsp)
• Parsing uses a grammar to assign a syntactic
analysis to a string of words.
• Shallow parsing: partitioning the input into
chunks and identifying the headword of each
chunk.
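
A toy illustration of shallow parsing in Python, limited to noun chunks. It assumes the input is already part-of-speech tagged with Penn Treebank style tags (DT, JJ, NN...), and it uses a deliberately crude rule: a chunk is a run of determiner, adjective and noun tags, and its head is the last noun. The function name `np_chunks` is hypothetical.

```python
def np_chunks(tagged):
    """Return a list of (chunk words, headword) for simple noun chunks."""
    chunks, current = [], []

    def flush():
        # Close the chunk being built; keep it only if it contains a noun.
        nouns = [w for w, t in current if t.startswith("NN")]
        if nouns:
            chunks.append(([w for w, _ in current], nouns[-1]))
        current.clear()

    for word, tag in tagged:
        if tag in ("DT", "JJ") or tag.startswith("NN"):
            if tag == "DT":
                flush()  # a determiner always opens a new chunk
            current.append((word, tag))
        else:
            flush()      # any other tag ends the current chunk
    flush()
    return chunks

sentence = [("the", "DT"), ("old", "JJ"), ("parser", "NN"),
            ("reads", "VBZ"), ("a", "DT"), ("sentence", "NN")]
print(np_chunks(sentence))
# [(['the', 'old', 'parser'], 'parser'), (['a', 'sentence'], 'sentence')]
```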

13.

• Dependency parsing

14.

CFP: context-free parsing
• Context-free grammars are important in
linguistics for describing the structure of
sentences and words in natural language, and
in computer science for describing the
structure of programming languages and
other formal languages. (Wikipedia)
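
As an illustration of what parsing with a context-free grammar involves, here is a minimal recognizer sketch in Python. It uses the CYK algorithm, which is a choice made for this example and is not named in the slides. The grammar and lexicon are hypothetical toy data, and the grammar must be in Chomsky normal form (every rule is A -> B C or A -> word).

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form.
BINARY = {                      # (B, C) -> set of A such that A -> B C
    ("NP", "VP"): {"S"},
    ("DT", "NN"): {"NP"},
    ("VB", "NP"): {"VP"},
}
LEXICON = {                     # word -> set of A such that A -> word
    "the": {"DT"}, "a": {"DT"},
    "dog": {"NN"}, "cat": {"NN"},
    "sees": {"VB"},
}

def recognizes(words, start="S"):
    """Return True if the grammar derives the word sequence from start."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):            # length of the substring
        for i in range(n - span + 1):       # start of the substring
            j = i + span - 1                # end of the substring
            for k in range(i, j):           # split point
                for b, c in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((b, c), set())
    return start in table[0][n - 1]

print(recognizes("the dog sees a cat".split()))  # True
print(recognizes("dog the sees".split()))        # False
```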