Week 6: Language¶
CS50's Introduction to Artificial Intelligence with Python
Natural Language Processing¶
NLP encompasses tasks where AI receives human language as input.
| Application | Description |
|---|---|
| Automatic summarization | Converting text into concise summaries |
| Information extraction | Pulling structured data from text corpora |
| Language identification | Determining which language text is written in |
| Machine translation | Converting text between languages |
| Named entity recognition | Identifying proper nouns like company names |
| Speech recognition | Transcribing spoken words to text |
| Text classification | Categorizing text by type or sentiment |
| Word sense disambiguation | Resolving words with multiple meanings |
Syntax and Semantics¶
Syntax: Sentence structure and grammatical rules. Sentences can be syntactically correct yet ambiguous — "I saw the man with the telescope" could mean viewing someone who possesses a telescope, or viewing using one.
Semantics: Meaning. Different sentences can convey identical meaning despite structural differences. Perfectly grammatical sentences can be meaningless (e.g., Chomsky's "Colorless green ideas sleep furiously").
Context-Free Grammar¶
Formal Grammar: Systematic rules for generating sentences.
Context-Free Grammar (CFG): Abstracts sentences into structural representations.
Example: "She saw the city"¶
| Word | Part of Speech |
|---|---|
| She | N (Noun) |
| saw | V (Verb) |
| the | D (Determiner) |
| city | N (Noun) |
- "the city" → Noun Phrase (NP)
- "saw the city" → Verb Phrase (VP)
- Full sentence S = NP + VP
NLTK Implementation¶
import nltk
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> D N | N
VP -> V | V NP
D -> "the" | "a"
N -> "she" | "city" | "car"
V -> "saw" | "walked"
""")
parser = nltk.ChartParser(grammar)
sentence = input("Sentence: ").split()
try:
for tree in parser.parse(sentence):
tree.pretty_print()
tree.draw()
except ValueError:
print("No parse tree possible.")
N-grams¶
An n-gram is a sequence of n items from text.
| Term | Description |
|---|---|
| Unigram | Single item |
| Bigram | Two consecutive items |
| Trigram | Three consecutive items |
Character n-grams use characters; word n-grams use words.
N-grams enable prediction — smartphones suggest next words based on probability distributions from preceding sequences.
Tokenization¶
Splits character sequences into tokens (words or sentences).
Challenges: - Separating punctuation from words - Handling apostrophes in contractions ("o'clock") - Managing hyphens in compound words - Distinguishing sentence-ending periods from abbreviation periods ("Mr.")
Proper tokenization is essential before analyzing n-grams or any text features.
Markov Models for Text¶
Generate text by establishing probability distributions for tokens based on preceding ones. After training, a trigram-based model chooses the third word from distributions shaped by the first two words, then selects the fourth word using words two and three.
Produces grammatical-sounding sentences that lack actual semantic meaning or purpose.
Bag-of-Words Model¶
Represents text as unordered word collections, ignoring syntax and focusing on meaning. Effective for classification tasks like sentiment analysis.
Examples: - "My grandson loved it! So much fun!" → positive: "loved," "fun" - "Product broke after a few days." → negative: "broke" - "Kind of cheap and flimsy, not worth it." → negative: "cheap," "flimsy"
Naive Bayes¶
Computes P(sentiment | text) — probability that text expresses a particular sentiment.
Using Bayes' rule:
Calculates:
The "naive" assumption treats each word's probability as independent from others — unrealistic but effective.
Addressing Zero Probabilities¶
If a word never appears in positive training samples, its conditional probability becomes 0, incorrectly eliminating entire sentences.
Additive Smoothing: Adds value α to each distribution entry.
Laplace Smoothing: Adds 1 to each value — assumes all values appeared at least once.
Word Representation¶
One-Hot Representation¶
Each word gets a vector of vocabulary length with a single 1.
Example for "He wrote a book":
Problems: - 50,000-word vocabulary → 50,000-dimensional vectors - Fails to capture similarity between synonyms ("wrote" and "authored")
Distributed Representation¶
Compresses meaning across fewer dimensions:
He: [-0.34, -0.08, 0.02, -0.18, …]
wrote: [-0.27, 0.40, 0.00, -0.65, …]
book: [-0.23, -0.16, -0.05, -0.57, …]
- Smaller vectors
- Synonyms have similar values — captures semantic similarity
Word Context Principle¶
J.R. Firth: "You shall know a word by the company it keeps."
Words appearing in similar contexts share meaning. Words completing "for ___ he ate" likely include breakfast, lunch, or dinner.
word2vec¶
Generates distributed word representations using Skip-Gram Architecture — a neural network predicting context from target words.
Architecture: - Input layer: One unit per target word - Hidden layer: 50–100 units generating distributed representations - Output layer: Predicts contextually similar words
Results — book's nearest neighbors: books, essay, memoir, essays, novella, anthology, blurb, autobiography, audiobook
Vector Arithmetic¶
Word vectors enable semantic operations:
Neural Networks for Language¶
Translate sequences to sequences — essential for chatbots, Q&A, and translation.
Recurrent Neural Networks¶
Process sequences by maintaining hidden states:
- Input enters network → creates initial hidden state
- Next input combines with previous hidden state → new hidden state
- Repeat until end token
- Decoding phase generates output sequences
Problems: - All input information compresses into a single final state — inefficient for long sequences - Not all hidden states carry equal importance
Attention¶
Allows networks to prioritize important values. When answering "What is the capital of Massachusetts?", attention highlights "capital" and "Massachusetts" as most relevant.
Process: 1. Calculate attention scores 2. Multiply scores by hidden state values 3. Sum into a context vector 4. Decoder uses context vector for output generation
Sequential processing in RNNs bottlenecks performance with large datasets — motivating transformers.
Transformers¶
Process all input words simultaneously rather than sequentially.
Encoding¶
- Words enter network simultaneously
- Position encoding adds positional information, preserving word order
- Self-attention steps determine each word's context
- Multiple self-attention iterations refine understanding
- Output: encoded representations ready for decoding
Decoding¶
- Previous output word + position encoding enter network
- Multiple self-attention steps process information
- Attention steps receive encoded representations
- Words attend to each other across sequences
- Parallel processing enables speed and accuracy
The transformer architecture powers modern large language models and most state-of-the-art NLP systems.