Week 6: Language¶

CS50's Introduction to Artificial Intelligence with Python

Natural Language Processing¶

NLP encompasses tasks where AI receives human language as input.

Application	Description
Automatic summarization	Converting text into concise summaries
Information extraction	Pulling structured data from text corpora
Language identification	Determining which language text is written in
Machine translation	Converting text between languages
Named entity recognition	Identifying proper nouns like company names
Speech recognition	Transcribing spoken words to text
Text classification	Categorizing text by type or sentiment
Word sense disambiguation	Resolving words with multiple meanings

Syntax and Semantics¶

Syntax: Sentence structure and grammatical rules. Sentences can be syntactically correct yet ambiguous — "I saw the man with the telescope" could mean viewing someone who possesses a telescope, or viewing using one.

Semantics: Meaning. Different sentences can convey identical meaning despite structural differences. Perfectly grammatical sentences can be meaningless (e.g., Chomsky's "Colorless green ideas sleep furiously").

Context-Free Grammar¶

Formal Grammar: Systematic rules for generating sentences.

Context-Free Grammar (CFG): Abstracts sentences into structural representations.

Example: "She saw the city"¶

Word	Part of Speech
She	N (Noun)
saw	V (Verb)
the	D (Determiner)
city	N (Noun)

"the city" → Noun Phrase (NP)
"saw the city" → Verb Phrase (VP)
Full sentence S = NP + VP

NLTK Implementation¶

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP

    NP -> D N | N
    VP -> V | V NP

    D -> "the" | "a"
    N -> "she" | "city" | "car"
    V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)

sentence = input("Sentence: ").split()
try:
    for tree in parser.parse(sentence):
        tree.pretty_print()
        tree.draw()
except ValueError:
    print("No parse tree possible.")

N-grams¶

An n-gram is a sequence of n items from text.

Term	Description
Unigram	Single item
Bigram	Two consecutive items
Trigram	Three consecutive items

Character n-grams use characters; word n-grams use words.

N-grams enable prediction — smartphones suggest next words based on probability distributions from preceding sequences.

Tokenization¶

Splits character sequences into tokens (words or sentences).

Challenges: - Separating punctuation from words - Handling apostrophes in contractions ("o'clock") - Managing hyphens in compound words - Distinguishing sentence-ending periods from abbreviation periods ("Mr.")

Proper tokenization is essential before analyzing n-grams or any text features.

Markov Models for Text¶

Generate text by establishing probability distributions for tokens based on preceding ones. After training, a trigram-based model chooses the third word from distributions shaped by the first two words, then selects the fourth word using words two and three.

Produces grammatical-sounding sentences that lack actual semantic meaning or purpose.

Bag-of-Words Model¶

Represents text as unordered word collections, ignoring syntax and focusing on meaning. Effective for classification tasks like sentiment analysis.

Examples: - "My grandson loved it! So much fun!" → positive: "loved," "fun" - "Product broke after a few days." → negative: "broke" - "Kind of cheap and flimsy, not worth it." → negative: "cheap," "flimsy"

Naive Bayes¶

Computes P(sentiment | text) — probability that text expresses a particular sentiment.

Using Bayes' rule:

P(sentiment | words) ∝ P(words | sentiment) × P(sentiment)

Calculates:

P(positive) × P("word₁" | positive) × P("word₂" | positive) × ...

The "naive" assumption treats each word's probability as independent from others — unrealistic but effective.

Addressing Zero Probabilities¶

If a word never appears in positive training samples, its conditional probability becomes 0, incorrectly eliminating entire sentences.

Additive Smoothing: Adds value α to each distribution entry.

Laplace Smoothing: Adds 1 to each value — assumes all values appeared at least once.

Word Representation¶

One-Hot Representation¶

Each word gets a vector of vocabulary length with a single 1.

Example for "He wrote a book":

He:    [1, 0, 0, 0]
wrote: [0, 1, 0, 0]
a:     [0, 0, 1, 0]
book:  [0, 0, 0, 1]

Problems: - 50,000-word vocabulary → 50,000-dimensional vectors - Fails to capture similarity between synonyms ("wrote" and "authored")

Distributed Representation¶

Compresses meaning across fewer dimensions:

He:    [-0.34, -0.08, 0.02, -0.18, …]
wrote: [-0.27,  0.40, 0.00, -0.65, …]
book:  [-0.23, -0.16, -0.05, -0.57, …]

Smaller vectors
Synonyms have similar values — captures semantic similarity

Word Context Principle¶

J.R. Firth: "You shall know a word by the company it keeps."

Words appearing in similar contexts share meaning. Words completing "for ___ he ate" likely include breakfast, lunch, or dinner.

word2vec¶

Generates distributed word representations using Skip-Gram Architecture — a neural network predicting context from target words.

Architecture: - Input layer: One unit per target word - Hidden layer: 50–100 units generating distributed representations - Output layer: Predicts contextually similar words

Results — book's nearest neighbors: books, essay, memoir, essays, novella, anthology, blurb, autobiography, audiobook

Vector Arithmetic¶

Word vectors enable semantic operations:

king − man ≈ queen − woman
(ramen − japan) + america ≈ burritos

Neural Networks for Language¶

Translate sequences to sequences — essential for chatbots, Q&A, and translation.

Recurrent Neural Networks¶

Process sequences by maintaining hidden states:

Input enters network → creates initial hidden state
Next input combines with previous hidden state → new hidden state
Repeat until end token
Decoding phase generates output sequences

Problems: - All input information compresses into a single final state — inefficient for long sequences - Not all hidden states carry equal importance

Attention¶

Allows networks to prioritize important values. When answering "What is the capital of Massachusetts?", attention highlights "capital" and "Massachusetts" as most relevant.

Process: 1. Calculate attention scores 2. Multiply scores by hidden state values 3. Sum into a context vector 4. Decoder uses context vector for output generation

Sequential processing in RNNs bottlenecks performance with large datasets — motivating transformers.

Transformers¶

Process all input words simultaneously rather than sequentially.

Encoding¶

Words enter network simultaneously
Position encoding adds positional information, preserving word order
Self-attention steps determine each word's context
Multiple self-attention iterations refine understanding
Output: encoded representations ready for decoding

Decoding¶

Previous output word + position encoding enter network
Multiple self-attention steps process information
Attention steps receive encoded representations
Words attend to each other across sequences
Parallel processing enables speed and accuracy

The transformer architecture powers modern large language models and most state-of-the-art NLP systems.