Syntax: What is a legal program
Example: English sentences are always a noun phrase followed by a verb phrase.
Generally described by a formal grammar.
Semantics: What does the program mean
Example: cat is an annoying furry mammal.
Generally does not have a good formal definition, only a suite of test programs.
Prefix: Every operator precedes its operands.
No complex rules of precedence, no need for parentheses.
Example: * 30 20 = 600
Example: * + - 6 4 2 5 = 20
Alternative syntax: (+ 4 (* 3 2)) = 10
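The prefix rule can be sketched as a tiny recursive evaluator. This is a minimal illustration, not part of the original notes: it assumes whitespace-separated tokens, integer operands, and binary operators only (which is what lets it work without parentheses). The names `OPS` and `eval_prefix` are made up for the example.

```python
import operator

# Assumption: every operator is binary, so the arity is always known.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_prefix(text):
    """Evaluate a whitespace-separated prefix expression."""
    tokens = iter(text.split())

    def parse():
        tok = next(tokens)
        if tok in OPS:
            left = parse()    # first operand follows the operator
            right = parse()   # second operand follows the first
            return OPS[tok](left, right)
        return int(tok)

    result = parse()
    if next(tokens, None) is not None:
        raise ValueError("extra tokens after expression")
    return result

print(eval_prefix("* 30 20"))        # 600
print(eval_prefix("* + - 6 4 2 5"))  # 20
```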
Infix: Every operator is surrounded by its operands.
Need for precedence (My Dear Aunt Sally: multiply and divide before add and subtract).
Need to have associativity
The "normal" way of doing things.
Example: 1 + 2 * 3 - 4 = 3
Postfix: Every operator is preceded by its operands.
No complex rules of precedence, no need for parentheses.
Very easy implementation using stacks.
Example: 2 20 4 5 + - * = 22
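The stack implementation mentioned above might look like the following sketch. It assumes whitespace-separated tokens, integer operands, and binary operators only; `OPS` and `eval_postfix` are illustrative names.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_postfix(text):
    """Evaluate a whitespace-separated postfix expression with a stack."""
    stack = []
    for tok in text.split():
        if tok in OPS:
            right = stack.pop()   # the most recent operand is the right one
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]

print(eval_postfix("2 20 4 5 + - *"))  # 22
```

Operands are pushed as they arrive; each operator pops two values and pushes one result, so a well-formed expression leaves exactly one value on the stack.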
Precedence: Which operation to do first
Traditionally, * / then + -, but what about % and casting?
Can be changed by ()
Associativity
For the same operator, do you group from the left or from the right? (Is 4-2-1 equal to 1 or 3?)
Generally go left, but exponentiation goes right
Smalltalk has no precedence among its binary operators; they are all evaluated left-to-right (ick!)
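Precedence and associativity can both be captured in one table that drives an infix evaluator. The sketch below uses precedence climbing, which is one common technique and not necessarily the one intended in these notes. It assumes non-negative integer operands, uses `^` for exponentiation, and silently skips characters it does not recognize; all names are illustrative.

```python
import operator
import re

# operator: (precedence, associativity, function); higher binds tighter.
OPS = {
    "+": (1, "left", operator.add),
    "-": (1, "left", operator.sub),
    "*": (2, "left", operator.mul),
    "/": (2, "left", operator.truediv),
    "^": (3, "right", operator.pow),
}

def eval_infix(text):
    tokens = re.findall(r"\d+|[-+*/^()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def primary():
        tok = advance()
        if tok == "(":                 # parentheses override precedence
            value = expression(1)
            if advance() != ")":
                raise ValueError("expected ')'")
            return value
        return int(tok)

    def expression(min_prec):
        left = primary()
        while peek() in OPS and OPS[peek()][0] >= min_prec:
            prec, assoc, fn = OPS[advance()]
            # Left-associative: the right side may only use tighter operators.
            # Right-associative: the right side may reuse the same operator.
            next_min = prec + 1 if assoc == "left" else prec
            left = fn(left, expression(next_min))
        return left

    result = expression(1)
    if pos != len(tokens):
        raise ValueError("unexpected token " + tokens[pos])
    return result

print(eval_infix("1 + 2 * 3 - 4"))  # 3
print(eval_infix("4 - 2 - 1"))      # 1   (left-associative)
print(eval_infix("2 ^ 3 ^ 2"))      # 512 (right-associative)
```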
Arity: How many operands does the operator take
One: unary minus, sqrt, casting
Two: + - * /
Three: ?? (C's conditional operator ?: is one example)
Abstract Syntax Trees
These are directed graphs that describe an expression.
Each interior node is an operation; each leaf node is a constant or a variable.
Used to easily manipulate expressions.
For commutative operators, can swap the child pointers
Can eliminate constant subexpressions
Can reduce common subexpressions
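Two of these manipulations can be sketched on a small tree type. This is only an illustration: the node classes `Num`, `Var` and `BinOp` and the helpers `fold` and `swap` are made-up names, and the `Var` leaf is an assumption added so that the whole example does not fold away to a single constant.

```python
import operator
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Num:
    value: int

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Node"
    right: "Node"

Node = Union[Num, Var, BinOp]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
COMMUTATIVE = {"+", "*"}

def fold(node):
    """Replace every subtree whose operands are both constants by its value."""
    if isinstance(node, BinOp):
        left, right = fold(node.left), fold(node.right)
        if isinstance(left, Num) and isinstance(right, Num):
            return Num(OPS[node.op](left.value, right.value))
        return BinOp(node.op, left, right)
    return node

def swap(node):
    """Swap the children of a commutative operator; the meaning is unchanged."""
    if isinstance(node, BinOp) and node.op in COMMUTATIVE:
        return BinOp(node.op, node.right, node.left)
    return node

tree = BinOp("*", Var("x"), BinOp("+", Num(2), Num(3)))  # x * (2 + 3)
print(fold(tree))        # the tree for x * 5
print(swap(fold(tree)))  # the tree for 5 * x
```

Because the nodes are frozen dataclasses they compare by structure, so two equal subtrees can be detected with `==`, which is the starting point for finding common subexpressions.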
Lexical analysis is the art of grouping characters into meaningful words.
Definition: Keyword: A sequence of characters that is used as part of the language definition
Definition: reserved words are keywords that cannot be used as names (of variables and procedures, etc).
Example: Understand w h i l e as a keyword 'while'.
Example: Understand >= as 'greater-than-or-equal-to'.
Typically, tokens are separated by white space. However, main(int argc, char argv) has more tokens than whitespaces.
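A small regular-expression tokenizer shows why white space alone is not enough. This is a sketch under stated assumptions: the keyword set and the list of operators are an illustrative subset, not the definition of any real language, and `tokenize` is a made-up name.

```python
import re

KEYWORDS = {"while", "if", "else", "int", "char", "return"}  # illustrative subset

TOKEN_RE = re.compile(r"""
    (?P<number>\d+)
  | (?P<name>[A-Za-z_]\w*)
  | (?P<op>>=|<=|==|!=|[-+*/<>=(){},;])
  | (?P<space>\s+)
""", re.VERBOSE)

def tokenize(text):
    """Group characters into (kind, text) tokens, dropping white space."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "space":
            continue
        if kind == "name" and m.group() in KEYWORDS:
            kind = "keyword"          # reserved words are not ordinary names
        tokens.append((kind, m.group()))
    return tokens

print(tokenize("while x >= 10"))
print(len(tokenize("main(int argc, char argv)")))  # 8 tokens, 2 white spaces
```

The two-character alternatives such as `>=` are listed before the single-character class so that `>=` comes out as one token rather than `>` followed by `=`.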
Grammars have four parts
A terminal is a basic unit of the grammar that is not broken down any further, like 'if', 'then' and 'a real number'.
A token (usually called a nonterminal) is a unit of the grammar defined by one or more rules.
A rule (production) defines a token as a sequence of terminals and tokens.
The starting token is the token that represents the whole input, the top of the parse tree, and the place to start in parsing.
Grammars are generally written in Backus-Naur form (BNF)
Tokens are enclosed in <>
Terminals are written literally, optionally with quotes around them.
Pipes "|" mean 'or'.
The left-hand and right-hand sides of a rule are separated by '::=' or an arrow if drawn by hand.
Example grammar for real numbers
<real> ::= <int> . <fraction>
<int> ::= <digit> | <int><digit>
<fraction> ::= <digit> | <digit><fraction>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
But does this grammar parse 3.14?
Does it parse 3? Does it parse 0.3?
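One way to check such questions is to turn the rules into a recognizer. The sketch below is a hand-written reading of the grammar, with made-up function names: both the int and the fraction rule reduce to "one or more digits", so a single helper covers them.

```python
def parse_digits(text, pos):
    """Consume one or more digits starting at pos.

    Returns the position after the digits, or None if there are none.
    Both <int> and <fraction> amount to "one or more <digit>".
    """
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    return end if end > pos else None

def is_real(text):
    """True if text matches <real> ::= <int> . <fraction>."""
    pos = parse_digits(text, 0)                  # <int>
    if pos is None or pos >= len(text) or text[pos] != ".":
        return False                             # the '.' is mandatory
    pos = parse_digits(text, pos + 1)            # <fraction>
    return pos == len(text)

for sample in ["3.14", "3", "0.3", ".5", "3."]:
    print(sample, is_real(sample))
# 3.14 True, 3 False, 0.3 True, .5 False, 3. False
```

Under this reading, a bare integer such as 3 is rejected because the rule for a real number requires both the decimal point and at least one fraction digit.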
Example grammar for if
<S> ::= if <expression> then <S>
<S> ::= if <expression> then <S> else <S>
But how do you parse if E1 then if E2 then S1 else S2?
With which if does the else correspond?
A grammar is ambiguous if there is more than one parse tree possible for the same input.
Generally this leads to more than one meaning for the same input
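Ambiguity can be made concrete by enumerating every parse tree. The sketch below does this for the if grammar on the classic input "if E1 then if E2 then S1 else S2". It assumes, beyond what the grammar states, that any single word other than if, then and else stands for an expression or a simple statement; `all_parses` and `parse_trees` are made-up names.

```python
def all_parses(tokens, pos=0):
    """Yield every (tree, next_pos) that the if grammar allows from pos."""
    if pos >= len(tokens):
        return
    if tokens[pos] == "if":
        if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
            return
        cond = tokens[pos + 1]
        for then_branch, p in all_parses(tokens, pos + 3):
            # Rule 1: if <expression> then <S>
            yield ("if", cond, then_branch), p
            # Rule 2: if <expression> then <S> else <S>
            if p < len(tokens) and tokens[p] == "else":
                for else_branch, q in all_parses(tokens, p + 1):
                    yield ("if", cond, then_branch, else_branch), q
    elif tokens[pos] not in ("then", "else"):
        yield tokens[pos], pos + 1        # assumed simple statement

def parse_trees(text):
    """Return only the trees that consume the whole input."""
    tokens = text.split()
    return [tree for tree, end in all_parses(tokens) if end == len(tokens)]

trees = parse_trees("if E1 then if E2 then S1 else S2")
print(len(trees))  # 2
for tree in trees:
    print(tree)
```

One tree attaches the else to the outer if and the other attaches it to the inner if, so the same input has two meanings.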
Language designers generally either
Change the grammar to prevent this (every if statement ends with a 'fi').
Add additional non-grammar rules (every else corresponds to the closest if).
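The "closest if" rule falls out naturally from a recursive-descent parser that grabs an else as soon as it sees one. The following is a sketch, assuming space-separated input in which any word other than if, then and else is an expression or a simple statement; `parse_statement` is a made-up name.

```python
def parse_statement(tokens, pos=0):
    """Return (tree, next_pos); an else binds to the nearest unmatched if."""
    if pos < len(tokens) and tokens[pos] == "if":
        cond = tokens[pos + 1]
        if tokens[pos + 2] != "then":
            raise ValueError("expected 'then'")
        then_branch, pos = parse_statement(tokens, pos + 3)
        # The innermost if is still active here, so it claims the else first.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_statement(tokens, pos + 1)
            return ("if", cond, then_branch, else_branch), pos
        return ("if", cond, then_branch), pos
    return tokens[pos], pos + 1           # assumed simple statement

tokens = "if E1 then if E2 then S1 else S2".split()
tree, end = parse_statement(tokens)
print(tree)  # ('if', 'E1', ('if', 'E2', 'S1', 'S2'))
```

The grammar itself is unchanged and still ambiguous; the parser simply never explores the other tree.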
Chomsky defined four grammar levels. See http://en.wikipedia.org/wiki/Chomsky_hierarchy.
The most powerful grammars can describe anything “computable”. The least powerful grammars describe useful things that can be computed within reasonable time and memory restrictions.
Languages generated by grammars are a superset of the languages that can be generated by regular expressions.
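A standard illustration of the gap, not taken from these notes, is balanced parentheses: the grammar `<B> ::= ( <B> ) <B> | empty` generates them, while no regular expression in the formal sense can, because matching requires unbounded counting. A recognizer for that grammar is a few lines of recursion; `parse_b` and `is_balanced` are made-up names.

```python
def parse_b(text, pos):
    """Return the position after a <B> starting at pos, or None on failure.

    <B> ::= ( <B> ) <B> | empty
    """
    if pos < len(text) and text[pos] == "(":
        after_inner = parse_b(text, pos + 1)
        if (after_inner is None or after_inner >= len(text)
                or text[after_inner] != ")"):
            return None                   # an opening '(' was never closed
        return parse_b(text, after_inner + 1)
    return pos                            # the empty alternative

def is_balanced(text):
    return parse_b(text, 0) == len(text)

print(is_balanced("(()())"))  # True
print(is_balanced("(()"))     # False
print(is_balanced(")("))      # False
```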
Definition of a Computer Language
It's a formal grammar describing the syntax, a set of operators