Notes About HTTP
HTTP Basics
HTTP is the hypertext transfer protocol. The basic scenario
is the following.
-
Server opens a socket and waits for a connection at some well-known port
number (generally 80).
-
Client connects to the server at that port on that host.
-
Client sends a request down the socket to the server.
-
Server sends a reply.
-
Server closes the connection.
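The five steps above can be sketched in a few lines of Python. This is only a rough illustration, not how a real server is written: the function names (serve_once, fetch, demo) are made up for this example, the server listens on a port picked by the operating system rather than port 80, and it assumes the whole request arrives in a single read.

```python
import socket
import threading

def serve_once(listener, body):
    """Server side: accept one connection, read the request, reply, close."""
    conn, _addr = listener.accept()          # wait for a client to connect
    with conn:
        conn.recv(4096)                      # read the request (assumed to fit in one read)
        conn.sendall(b"HTTP/1.0 200 OK\r\n\r\n" + body)
    # leaving the with-block closes the connection

def fetch(host, port, path):
    """Client side: connect, send a request, read until the server closes."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:                     # empty read: server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

def demo():
    """Run one request/reply exchange on the local machine."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener, b"hello\n"))
    server.start()
    reply = fetch("127.0.0.1", port, "/index.html")
    server.join()
    listener.close()
    return reply
```

Calling demo() runs both halves in one process. The client knows the reply is complete only because the server closes the connection.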
Under HTTP/1.0, if the client wants more than one item from a server, it must
open a new connection for each item requested. HTTP/1.1 allows multiple
requests per connection. Most clients can also make multiple requests at
once, even using HTTP versions less than 1.1. They do this by having
multiple simultaneous connections.
HTTP Versions
Version 0.9 was the basic protocol.
-
Make a socket
-
Send filename
-
Receive file.
Bad: You had to guess the file type from the filename. You
had to guess when the file was last modified (bad for caching). You
could not request just the metadata, or just part of the file. You
had to make a new socket for every request.
Version 1.0
-
Added the HEAD command
-
Added metadata (header lines)
-
Added cookies!!
Good: You can now use the Last-Modified and If-Modified-Since
modifiers. You are told the file type. You can get just the
metadata.
Bad: You cannot get just part of a file. You must
make a new socket for every file.
Version 1.1
Good: You can reuse sockets. You can get just part of a file
(with the Range modifier).
Bad: I get no royalties.
HTTP/2
Good: Header compression. Binary format. Server can push content. Maybe 2x as fast!
Bad: Binary format means harder to debug. I get
no royalties.
HTTP Caching
Clients don't normally just request files. Instead, they normally
check the cache first, and only request files if the file is not
found in the cache. The cache is a store of recently requested files.
Sometimes the client will verify the cache contents with the server.
This incurs the latency penalty, but not the transfer penalty. Verification
is done with the HEAD method, or with a GET that carries an If-Modified-Since
modifier (see below).
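Here is a rough sketch of the check-the-cache-then-verify logic in Python. The class and the two callables passed to it (get_meta and get_file) are hypothetical stand-ins for a metadata-only request and a full request. A real client would send those over the network. The transfers counter exists only to show when the transfer penalty is paid.

```python
class Cache:
    """A store of recently requested files, verified against the server."""

    def __init__(self, get_meta, get_file):
        self.get_meta = get_meta    # hypothetical: path -> last-modified stamp (metadata only)
        self.get_file = get_file    # hypothetical: path -> (last-modified stamp, body)
        self.store = {}             # path -> (stamp, body)
        self.transfers = 0          # number of full file transfers performed

    def request(self, path):
        if path in self.store:
            cached_stamp, body = self.store[path]
            # Verify with the server: costs latency, but no file transfer.
            if self.get_meta(path) == cached_stamp:
                return body
        # Not cached, or the cached copy is stale: pay latency plus transfer.
        stamp, body = self.get_file(path)
        self.transfers += 1
        self.store[path] = (stamp, body)
        return body
```

This sketch verifies on every request. A client that only verifies sometimes would skip the get_meta call for entries it still trusts.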
When caching works, it has several advantages:
-
Improves response time
-
Reduces network load
-
Reduces server load
-
Improves performance of OTHER clients and OTHER requests
Caching does have several disadvantages:
-
Slows response time when it fails
-
Makes hit counts hard to measure
-
Takes substantial disk/memory resources
HTTP Proxies
A proxy is a server/client combination that sits between the original server
and the original client. In other words, the picture changes from
the thing on the left to the thing on the right. Proxies are useful
for implementing network security, for (sometimes) improving performance,
and for solving some network addressing/routing problems. Most clients
do not use proxies, however.
Client <----> Server          Client <----> Proxy <----> Server
One common use of a proxy is to put a proxy at each gateway in order
to cache files and reduce traffic across the network. One
study I read said that if all interior Internet gateways had a proxy server,
total Internet traffic could be reduced by 30%.
HTTP Requests
Requests go from the client to the server, and ask the server to
perform some service. Each request starts with
a method, followed by a resource-indicator (generally a filename),
and a protocol-version. Optionally, there can be one or more
modifiers. There are three main methods, used in the examples below.
-
GET /index.html HTTP/1.0
retrieve the meta-data and the body of /index.html
-
HEAD /robots.txt HTTP/1.0
retrieve only the meta-data of /robots.txt
-
PUT /my/secret/file HTTP/1.0
create or modify the file on the server
All requests can have one or more modifiers. Examples include...
-
If-Modified-Since: Sat, 29 Oct 1994 19:43:21 GMT
-
Content-Length: 3472
-
Authorization: Basic Qwxyehsuzjehgsoiznshyebsn
The If-Modified-Since modifier tells the server to send the data only if
the data has changed since the given date. This is most useful for
clients that wish to cache.
In requests, the Content-Length modifier is used with methods that send a
body, such as PUT, and tells the length of the file body to follow. All PUT
requests must have a body.
The Authorization modifier encodes the user's name and password in a
base-64 encoding scheme. This scheme provides protection against
only the most casual snooping attempts, since base-64 encoding can be decoded
by anyone, with no secret password needed.
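To make the request layout concrete, here is a small Python sketch that assembles a request from a method, a resource, optional modifiers, and an optional body. The function name build_request is made up for this example. Lines on the wire end in a carriage return plus line feed, and a blank line separates the modifiers from the body. The sketch fills in Content-Length itself whenever a body is given.

```python
def build_request(method, resource, modifiers=None, body=b""):
    """Assemble the bytes of an HTTP/1.0 request (illustrative only)."""
    lines = [f"{method} {resource} HTTP/1.0"]     # method, resource, protocol-version
    modifiers = dict(modifiers or {})
    if body:
        # Tell the server how many body bytes follow the blank line.
        modifiers["Content-Length"] = str(len(body))
    for name, value in modifiers.items():
        lines.append(f"{name}: {value}")
    head = "\r\n".join(lines) + "\r\n\r\n"        # the blank line ends the modifiers
    return head.encode("ascii") + body
```

For example, build_request("GET", "/index.html", {"If-Modified-Since": "Sat, 29 Oct 1994 19:43:21 GMT"}) produces the request line, one modifier line, and the closing blank line.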
HTTP Responses
Responses come from the server to the client in response to client requests.
Each response is a series of lines describing the status (success or failure)
of the request, followed optionally by the meta-data for the requested
object and optionally the body of the file.
GET requests return a status code, and if successful the file meta-data
and the file data. The status code is the first line returned by
the server, the meta-data are the next few lines, and the body of the file
starts after the first blank line. For example,
HTTP/1.0 200 OK
Date: Wed, 22 Oct 1997 04:02:44 GMT
Server: Apache/1.1.1
Content-type: text/html
Content-length: 2919
Last-modified: Wed, 15 Oct 1997 18:14:24 GMT

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD> (Body continues here)
Responses to HEAD and PUT requests are just like the GET response, but without the body.
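A response can be taken apart the same way it is described above: status line first, then meta-data lines, then the body after the first blank line. The Python sketch below is one rough way to do it. The name parse_response is made up for this example, and it assumes a well-formed response with carriage return plus line feed line endings and a reason phrase on the status line.

```python
def parse_response(raw):
    """Split raw response bytes into (status code, reason, meta-data, body)."""
    # The body starts after the first blank line.
    head, _sep, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: protocol-version, numeric code, reason phrase.
    _version, code, reason = lines[0].split(" ", 2)
    meta = {}
    for line in lines[1:]:
        name, _colon, value = line.partition(":")
        meta[name.strip().lower()] = value.strip()   # names compared in lower case
    return int(code), reason, meta, body
```

For a HEAD response the returned body is simply empty.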
HTTP Response Codes
1xx: Informational - Not used, but reserved for future use
2xx: Success - The action was successfully received, understood, and accepted.
3xx: Redirection - Further action must be taken in order to complete the request
4xx: Client Error - The request contains bad syntax or cannot be fulfilled
5xx: Server Error - The server failed to fulfill an apparently valid request
The individual values of the numeric status codes defined for HTTP/1.0,
and an example set of corresponding Reason-Phrases, are presented below.
The reason phrases listed here are only recommended -- they may be replaced
by local equivalents without affecting the protocol.
Status-Code = "200" ; OK
            | "201" ; Created
            | "202" ; Accepted
            | "204" ; No Content
            | "301" ; Moved Permanently
            | "302" ; Moved Temporarily
            | "304" ; Not Modified
            | "400" ; Bad Request
            | "401" ; Unauthorized
            | "403" ; Forbidden
            | "404" ; Not Found
            | "500" ; Internal Server Error
            | "501" ; Not Implemented
            | "502" ; Bad Gateway
            | "503" ; Service Unavailable
            | extension-code
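Since the first digit of the code gives its class, a client can handle a code it has never seen (an extension-code) by looking at the class alone. A minimal Python sketch, with made-up names:

```python
# Class names taken from the 1xx..5xx list above.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_class(code):
    """Return the class of a numeric status code, based on its first digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")
```

For example, status_class(404) gives "Client Error" and status_class(304) gives "Redirection".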
HTTP Authorization
Authorization in HTTP is very flexible. Here is the "basic" scheme.
-
Client requests a URL
-
Server returns 401 Unauthorized, and must include a "challenge" and
a "realm".
-
Client asks user for a username and a password, concatenates them with
a colon between, and encodes them base 64. It sends a NEW request
with this data as the authorization.
-
Server, if it chooses, responds with the data.
Note that base64 encoding is reversible, and therefore any network snoop
can get the username and password. Realms are opaque identifiers.
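The Python sketch below shows both halves of that point: how the client builds the Authorization line, and how easily a snoop undoes it. The function names are made up for this example. The username and password are the sample pair used in the HTTP authentication specification.

```python
import base64

def basic_authorization(username, password):
    """Build the Authorization line for the "basic" scheme."""
    pair = f"{username}:{password}"                       # joined with a colon
    token = base64.b64encode(pair.encode("utf-8")).decode("ascii")
    return f"Authorization: Basic {token}"

def snoop(header_line):
    """Recover the username and password from a captured Authorization line."""
    token = header_line.split(" ")[-1]                    # last field is the base64 data
    pair = base64.b64decode(token).decode("utf-8")        # no secret needed to decode
    username, _colon, password = pair.partition(":")
    return username, password
```

basic_authorization("Aladdin", "open sesame") yields "Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==", and snoop on that line gives back the original pair.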
HTTP Performance
HTTP performance can be divided into several parts.
-
Network latency
-
Network bandwidth
-
Server latency
-
Server bandwidth
Latency means the time after the request is issued until the first byte
of the answer is received. Bandwidth is the rate at which data flows
after the first byte is received. For large files bandwidth across
the internet dominates total time (normal internet bandwidth is 4 to 40
KB/sec). Network latency is typically in the hundred millisecond
range. Server latency/bandwidth is hard to quantify, but depends on
many factors:
-
Server load
-
File type (cgi-bin files and database requests are slow)
-
File location (across network and deep inside subdirectories are slow)
-
Reference frequency (files that are recently accessed are fast)
Access to small files across the local net to our server (Euclid) can take
about 100 ms. Full downloads of very large files across the whole
internet can take hours.
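Using the definitions above, a back-of-the-envelope estimate of total time is the latency plus the file size divided by the bandwidth. The Python sketch below just does that arithmetic. The function name is made up, it treats 1 KB as 1000 bytes, and it lumps network and server effects into a single latency and a single bandwidth figure.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate seconds to fetch a file: wait for the first byte, then stream the rest."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# A 1,000,000-byte file at 40 KB/sec with 100 ms latency: bandwidth dominates.
large = transfer_time(1_000_000, 0.1, 40_000)   # about 25.1 seconds

# A 2,919-byte page under the same conditions: latency dominates.
small = transfer_time(2_919, 0.1, 40_000)       # about 0.17 seconds
```

The two cases show why bandwidth matters most for large files, while small files are mostly a matter of latency.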
If there is a modem anywhere in the download path then normally
modem performance dominates over other considerations.