Notes About HTTP

HTTP Basics

HTTP is the hypertext transfer protocol. The basic scenario is the following.

Server opens a socket and waits for a connection at some well-known port number (generally 80).
Client connects to the server at that port on that host.
Client sends a request down the socket to the server.
Server sends a reply.
Server closes the connection.
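
The five steps above can be sketched with plain sockets. This is only a minimal sketch under some assumptions: the names serve_once, fetch, and demo are made up for illustration, the server ignores what the client actually asks for, and it listens on an OS-chosen port on 127.0.0.1 rather than on port 80.

```python
import socket
import threading

def serve_once(listener, body):
    """Accept one connection, read the request, send a reply, then close."""
    conn, _addr = listener.accept()          # wait for a client to connect
    with conn:
        conn.recv(4096)                      # read the request (ignored in this sketch)
        conn.sendall(b"HTTP/1.0 200 OK\r\n\r\n" + body)   # send the reply
    # leaving the with-block closes the connection

def fetch(host, port, path):
    """Connect, send one request, and read until the server closes."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:                     # empty read: server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

def demo():
    # Port 0 asks the OS for any free port; a real server would use 80.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener, b"hello\n"))
    server.start()
    reply = fetch("127.0.0.1", port, "/index.html")
    server.join()
    listener.close()
    return reply
```

Calling demo() runs one complete exchange. The client knows the reply is finished only because the server closes the connection, which is why the basic scenario needs a new connection for every item.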

In this basic scenario, if the client wants more than one item from a server, it must build a connection for each item requested. HTTP/1.1 allows multiple requests per connection. Even with HTTP versions earlier than 1.1, most clients can make multiple requests at once by opening multiple simultaneous connections.

HTTP Versions

Version 0.9 was the basic protocol.

Make a socket
Send filename
Receive file.

Bad: You had to guess the file type from the filename. You had to guess when the file was last modified (bad for caching). You could not request just the metadata, or just part of the file. You had to make a new socket for every request.

Version 1.0

Added the HEAD command
Added metadata (header lines)
Added cookies!!

Good: You could now use the Last-Modified and If-Modified-Since header lines. You were told the file type. You could get just the metadata.
Bad: You cannot get just part of a file. You must make a new socket for every file.

Version 1.1

Lets you reuse sockets

Good: You can reuse sockets. You can get just part of a file (range requests).
Bad: I get no royalties.

HTTP/2

Complete rewrite

Good: Header compression. Binary format. Server can push content. Maybe 2x as fast!
Bad: Binary format means harder to debug. I get no royalties.

HTTP Caching

Clients don't normally just request files. Instead, they check the cache first, and request a file only if it is not found there. The cache is a store of recently requested files. Sometimes the client will verify the cache contents with the server. This incurs the latency penalty, but not the transfer penalty. Verification can be done with the HEAD method or with the If-Modified-Since modifier (see below).
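
The check-the-cache-first behavior can be sketched as follows. This is an illustration only: FakeServer and CachingClient are hypothetical names, the "server" is an in-memory dictionary that counts what it is asked for, and last-modified stamps are compared as plain strings.

```python
class FakeServer:
    """Hypothetical stand-in for a real server; counts what it is asked for."""
    def __init__(self, files):
        self.files = files            # path -> (last_modified, body)
        self.heads = 0                # metadata-only requests served
        self.gets = 0                 # full transfers served

    def head(self, path):
        self.heads += 1
        return self.files[path][0]

    def get(self, path):
        self.gets += 1
        return self.files[path]


class CachingClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}               # path -> (last_modified, body)

    def fetch(self, path, verify=False):
        entry = self.cache.get(path)
        if entry is not None:
            cached_stamp, cached_body = entry
            # Verifying costs one round trip (latency) but no body transfer.
            if not verify or self.server.head(path) == cached_stamp:
                return cached_body
        # Not cached, or the cached copy is stale: do the full transfer.
        self.cache[path] = self.server.get(path)
        return self.cache[path][1]
```

Fetching the same path twice without verification reaches the server only once. With verify=True, each fetch costs one metadata request, and a full transfer happens only when the stamp has changed.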

When caching works, it has several advantages:

Improves response time
Reduces network load
Reduces server load
Improves performance of OTHER clients and OTHER requests

Caching also has several disadvantages:

Slows response time when it fails
Makes hit counts hard to measure
Takes substantial disk/memory resources

HTTP Proxies

A proxy is a server/client combination that sits between the original server and the original client. In other words, the picture changes from the thing on the left to the thing on the right. Proxies are useful for implementing network security, for (sometimes) improving performance, and for solving some network addressing/routing problems. Most clients do not use proxies, however.

        Client <----> Server         Client <----> Proxy <-----> Server

One common use of a proxy is to put a proxy at each gateway in order to cache files, and reduce network traffic across the network. One study I read said that if all interior Internet gateways had a proxy server, total Internet traffic could be reduced by 30%.

HTTP Requests

Requests go from the client to the server, and ask the server to perform some service. Each request starts with a method, followed by a resource indicator (generally a filename) and a protocol version. Optionally, there can be one or more modifiers. There are three main methods, used in the examples below.

GET /index.html HTTP/1.0        retrieve the meta-data and the body of /index.html
HEAD /robots.txt HTTP/1.0       retrieve only the meta-data of /robots.txt
PUT /my/secret/file HTTP/1.0    create or modify the file on the server

All requests can have one or more modifiers. Examples include...

If-Modified-Since: Sat, 29 Oct 1994 19:43:21 GMT
Content-Length: 3472
Authorization: Basic Qwxyehsuzjehgsoiznshyebsn

The If-Modified-Since modifier tells the server to send the data only if the data has changed since the given date. This is most useful for clients that wish to cache.
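
A server-side sketch of that decision is shown here. The function name status_for_conditional_get is made up for illustration, and the sketch assumes both dates arrive in the format shown above. It returns 304 (Not Modified, one of the standard 3xx codes) when the data has not changed and 200 when it has.

```python
from email.utils import parsedate_to_datetime

def status_for_conditional_get(if_modified_since, last_modified):
    """Return 200 if the file changed after the client's date, else 304."""
    client_date = parsedate_to_datetime(if_modified_since)
    file_date = parsedate_to_datetime(last_modified)
    # Only a strictly newer file is worth sending again.
    return 200 if file_date > client_date else 304
```

With a 304 the server sends no body, so a client with a cached copy pays for the round trip but not for the transfer.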

The Content-Length modifier is used only for the PUT method, and gives the length in bytes of the file body that follows. All PUT requests must have a body.

The Authorization modifier encodes the user's name and password in base64. This scheme provides protection against only the most casual snooping attempts, since base64 can be decoded by anyone, without needing to know a secret password.
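
Putting the pieces together, a request is the request line, then the modifiers, then a blank line, then the body if there is one. The helper below is a sketch with a made-up name (build_request). It relies on the fact that lines on the wire end in CR LF, and it fills in Content-Length for PUT on its own.

```python
def build_request(method, resource, modifiers=None, body=b""):
    """Assemble an HTTP/1.0 request as bytes."""
    modifiers = dict(modifiers or {})
    if method == "PUT":
        # The length of the body that follows, in bytes.
        modifiers["Content-Length"] = str(len(body))
    lines = [f"{method} {resource} HTTP/1.0"]
    lines += [f"{name}: {value}" for name, value in modifiers.items()]
    # A blank line separates the request line and modifiers from the body.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body
```

For example, build_request("PUT", "/my/secret/file", body=b"hello") produces a request whose only modifier is Content-Length: 5.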

HTTP Responses

Responses come from the server to the client in response to client requests. Each response starts with a line describing the status (success or failure) of the request, followed optionally by the meta-data for the requested object and optionally by the body of the file.

GET requests return a status code, and if successful the file meta-data and the file data. The status code is the first line returned by the server, the meta-data are the next few lines, and the body of the file starts after the first blank line. For example,

HTTP/1.0 200 OK
Date: Wed, 22 Oct 1997 04:02:44 GMT
Server: Apache/1.1.1
Content-type: text/html
Content-length: 2919
Last-modified: Wed, 15 Oct 1997 18:14:24 GMT

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD> (Body continues here)

Responses to HEAD and PUT requests look just like the response to a GET request, without the body.
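
Splitting a response into its three parts follows directly from that layout. This is a sketch rather than a full parser: parse_response is a made-up name, meta-data names are lowercased for easy lookup, and the input is assumed to be a complete response with CR LF line endings.

```python
def parse_response(raw):
    """Split raw response bytes into (code, reason, metadata, body)."""
    # The body starts after the first blank line.
    head, _sep, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # The first line is the status line: version, numeric code, reason phrase.
    _version, code, reason = lines[0].split(" ", 2)
    metadata = {}
    for line in lines[1:]:
        name, _colon, value = line.partition(":")
        metadata[name.strip().lower()] = value.strip()
    return int(code), reason, metadata, body
```

For a HEAD response the same function works unchanged and simply returns an empty body.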

HTTP Response Codes

      1xx: Informational - Not used, but reserved for future use
      2xx: Success - The action was successfully received, understood, and accepted.
      3xx: Redirection - Further action must be taken in order to complete the request
      4xx: Client Error - The request contains bad syntax or cannot be fulfilled
      5xx: Server Error - The server failed to fulfill an apparently valid request
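
Because the class is carried by the first digit, a client can react sensibly even to a code it has never seen. A small sketch, with status_class as a made-up helper name:

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_class(code):
    """Map a numeric status code to its class using the hundreds digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")
```

For instance, status_class(404) gives "Client Error" and status_class(503) gives "Server Error".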

Each numeric status code defined for HTTP/1.0 comes with a reason phrase, such as OK for 200 or Not Found for 404. The reason phrases are only recommended; they may be replaced by local equivalents without affecting the protocol.

HTTP Authorization

Authorization in HTTP is very flexible. Here is the "basic" scheme.

Client requests a URL
Server returns 401 Unauthorized, and must include a "challenge" and a "realm".
Client asks user for a username and a password, concatenates them with a colon between, and encodes the result in base64. It sends a NEW request with this data as the authorization.
Server, if it chooses, responds with the data.

Note that base64 encoding is reversible, and therefore any network snoop can get the username and password. Realms are opaque identifiers.
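
Both halves of that point fit in a few lines. The function names basic_authorization and snoop are made up for illustration, and the username and password in the usage note are example values, not anything from a real server.

```python
import base64

def basic_authorization(username, password):
    """Build the Authorization line for the basic scheme."""
    pair = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(pair).decode("ascii")
    return f"Authorization: Basic {token}"

def snoop(header_line):
    """Recover the username and password from a captured Authorization line."""
    token = header_line.split(" ")[-1]
    pair = base64.b64decode(token).decode("utf-8")
    username, _colon, password = pair.partition(":")
    return username, password
```

For example, snoop(basic_authorization("Aladdin", "open sesame")) returns the original pair. No secret is needed to undo the encoding, so the basic scheme is only as private as the connection that carries it.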

HTTP Performance

HTTP performance can be divided into several parts.

Network latency
Network bandwidth
Server latency
Server bandwidth

Latency means the time after the request is issued until the first byte of the answer is received. Bandwidth is the rate at which data flows after the first byte is received. For large files, bandwidth across the internet dominates total time (normal internet bandwidth is 4 to 40 KB/sec). Network latency is typically in the hundred millisecond range. Server latency/bandwidth is hard to quantify, but depends on many factors:

Server load
File type (cgi-bin files and database requests are slow)
File location (across network and deep inside subdirectories are slow)
Reference frequency (files that are recently accessed are fast)

Access to small files across the local net to our server (Euclid) can take about 100 ms. Full downloads of very large files across the whole internet can take hours.
If there is a modem anywhere in the download path then normally modem performance dominates over other considerations.
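
As a rough model, total time is latency plus size divided by bandwidth. The sketch below uses a made-up function name (transfer_time) and ignores connection setup, server load, and everything else in the list above, so it only shows which term dominates.

```python
def transfer_time(size_kb, latency_s, bandwidth_kb_per_s):
    """Estimate seconds to fetch a file: wait for the first byte, then stream."""
    return latency_s + size_kb / bandwidth_kb_per_s

# Small file: latency is most of the total.
small = transfer_time(size_kb=2, latency_s=0.1, bandwidth_kb_per_s=40)

# Large file over a slow path: bandwidth is nearly all of the total.
large = transfer_time(size_kb=10_000, latency_s=0.1, bandwidth_kb_per_s=4)
```

With the figures quoted above, the 2 KB file takes about 0.15 seconds, two thirds of it latency, while the 10,000 KB file takes about 2,500 seconds (roughly 42 minutes), essentially all of it bandwidth.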