Hypertext Transfer Protocol (HTTP)

Introduction

The Hypertext Transfer Protocol (HTTP) was invented by a group of individuals, the most famous of whom is Tim Berners-Lee. The original HTTP 0.9 was superseded by HTTP 1.0, which RFC 1945 describes as "...an application-level protocol with the lightness and speed necessary for distributed, collaborative, hypermedia information systems". HTTP can be used for many things: delivering files, query results and CGI script output, as well as acting as a gateway to other Internet protocols such as SMTP, NNTP, FTP, Gopher and WAIS.

For reference, HTTP uses the Uniform Resource Identifier (URI) (RFC 1630), the Uniform Resource Locator (URL) and the Uniform Resource Name (URN) (see RFC 2396, which updates RFC 1738 and RFC 1808) to indicate the resource to which a Method is to be applied (see later). Messages are passed in a format similar to that used by Internet Mail and the Multipurpose Internet Mail Extensions (MIME), detailed in Part One (RFC 2045) and Part Two (RFC 2046); these supersede RFC 1521, RFC 1522 and RFC 1590.

The HTTP client is typically a Browser and sends Requests to an HTTP Server (also called a Web Server), which responds to the client(s). Servers normally listen on TCP port 80, although a different port can be configured.

An HTTP Proxy acts as both a server and a client. It forwards Requests to servers on behalf of clients. A client that uses a Proxy will include the full URL (Absolute URI) of the Resource instead of just the path, e.g.

GET http://www.cisco.com/path/resource.html HTTP/1.0

A URL is effectively a subset of a URI and is made up of the protocol, the Fully Qualified Domain Name of the host and the relative path to the resource on that host.

The Resource identified by a Request is determined by examining both the Request-URI and the Host header field.

1. If the Request-URI is an Absolute URI, the host is part of the Request-URI. Any Host header field value in the request MUST be ignored.

2. If the Request-URI is not an Absolute URI, and the request includes a Host header field, the host is determined by the Host header field value.
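
The two rules above can be sketched in a few lines of Python. This is only an illustration: the function name effective_host and the use of a plain dictionary for the headers are assumptions made for the example and are not part of HTTP itself.

```python
from urllib.parse import urlsplit


def effective_host(request_uri, headers):
    """Return the host that a Request is aimed at, or None if unknown.

    `headers` is a plain dict of header names to values (an assumption of
    this sketch); header names are compared case-insensitively.
    """
    parts = urlsplit(request_uri)
    if parts.scheme and parts.netloc:
        # Rule 1: Absolute URI, so any Host header field is ignored.
        return parts.netloc
    for name, value in headers.items():
        if name.lower() == "host":
            # Rule 2: not an Absolute URI, so the Host header field decides.
            return value.strip()
    return None


# A Request sent via a Proxy carries the host inside the Request-URI.
print(effective_host("http://www.cisco.com/path/resource.html",
                     {"Host": "ignored.example"}))   # www.cisco.com
# A Request sent directly to a server relies on the Host header.
print(effective_host("/path/index.html", {"Host": "www.firsthost.com:81"}))
```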

This document starts off describing HTTP 1.0 and then incorporates discussion of HTTP 1.1, which has additional features that are important for current and future Internet use. HTTP 1.1 incorporates the features of HTTP 1.0 and should be the version that every application now uses.

HTTP 1.0 Messages

HTTP is considered to be a Stateless protocol, meaning that the server retains no information about a client between one Request and the next. In HTTP 1.0 the connection does not remain open either: the server Response message to a client Request normally contains the Resource, so the connection is closed once the Response has been sent.

The format for Request and Response messages is as follows:

An initial line, which ends with a Carriage Return (CR - ASCII 13) and a Line Feed (LF - ASCII 10).
Zero or more lines called Header Lines.
A blank line, which is CRLF on its own.
An optional message body containing the data.

An Entity consists of meta-information in the form of Entity-header fields and content in the form of an Entity-body.

The Message is the basic unit of HTTP communication, consisting of a structured sequence of octets.
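
As a rough sketch of the message layout described above, the following Python function splits a raw message into its initial line, Header Lines and body. The function name parse_message and the sample POST request are invented for illustration, and the sketch deliberately ignores header values that are folded over several lines.

```python
def parse_message(raw):
    """Split a raw HTTP message (bytes) into initial line, headers and body."""
    # The blank line (CRLF on its own) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    initial_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Each Header Line has the form "Header-name: value".
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return initial_line, headers, body


raw = (b"POST /cgi-bin/form HTTP/1.0\r\n"
       b"Content-Type: text/plain\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello")
initial_line, headers, body = parse_message(raw)
print(initial_line)   # POST /cgi-bin/form HTTP/1.0
print(headers)
print(body)           # b'hello'
```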

Request Initial Line

There are the following three parts to the Request Initial line:

Method such as GET, POST, HEAD etc.
The local path of the Resource called a URI e.g. /path/index.html
HTTP version e.g. HTTP/1.0

So with this in mind the initial line of a Request will look something like

GET /path/index.html HTTP/1.0
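
A minimal sketch of how the three parts might be pulled out of such a line is shown below. The function name parse_request_line is an assumption made for the example, and the sketch simply rejects lines that do not have exactly three space-separated parts.

```python
def parse_request_line(line):
    """Split a Request initial line into (method, uri, (major, minor))."""
    method, uri, version = line.split(" ")   # raises ValueError if not 3 parts
    if not version.startswith("HTTP/"):
        raise ValueError("not an HTTP version: " + version)
    major, minor = version[len("HTTP/"):].split(".")
    return method, uri, (int(major), int(minor))


print(parse_request_line("GET /path/index.html HTTP/1.0"))
# ('GET', '/path/index.html', (1, 0))
```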

HTTP 1.0 Methods

GET - retrieves whatever information is identified by the Request-URI, be it data or the result of a script execution on the server.
HEAD - this is identical to the GET, however nothing is sent back in the Message Body (Entity Body). Only the Status and Header lines are returned so bandwidth is saved whilst the client is still able to gain information on the Resource's contents.
POST - this is used to send data to a server e.g. a CGI script, additional data to a database, a response to a web form or a newsgroup posting etc. A message body and extra headers describing the data are sent. A Request-URI is normally sent that indicates the program to be used to handle the data being sent. The response to a POST could be program output. A Content-Length field should be included in all POSTs.

Response Initial Line

There are the following three parts to the Response Initial line, called the Status line:

HTTP version
Response Status Code which takes the following format:
- 1xx - informational and not used in HTTP 1.0
- 2xx - success
- 3xx - redirects client to another URL
- 4xx - a client error
- 5xx - a server error
A Reason Phrase describing the status code
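
The sketch below shows one way that a client could split a Status line and use the first digit of the code to find its class. The names STATUS_CLASSES and parse_status_line are invented for this illustration.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def parse_status_line(line):
    """Split a Status line into (version, code, phrase, class name)."""
    # Split on the first two spaces only, as the phrase may contain spaces.
    version, code, phrase = line.split(" ", 2)
    code = int(code)
    return version, code, phrase, STATUS_CLASSES.get(code // 100, "unknown")


print(parse_status_line("HTTP/1.0 404 Not Found"))
# ('HTTP/1.0', 404, 'Not Found', 'client error')
```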

HTTP 1.0 Response Status Codes

The HTTP 1.0 Response Status Codes are:

200 - OK Response from the Server i.e. the Request was successful. The content of the Response varies depending on whether the Request was a GET, HEAD, POST or TRACE (HTTP 1.1 only).
201 - Created Response means that a Request has resulted in a new Resource being created
202 - Accepted Response means that a Request has been accepted but not yet processed
204 - No Content Response means that a Request has been dealt with, however there is no content for the server to send back, hence there is never a Message body.
301 - Moved-Permanently Response means that the Resource has been moved permanently to a new URL
302 - Found Response (called Moved Temporarily in HTTP 1.0) means that the Resource has been moved temporarily, so the client should continue to use the original URL for future Requests.
304 - Not Modified Response is sent if a client sends a conditional GET and the data has not been modified since the time given in the If-Modified-Since field.
400 - Bad Request Response means that the format of the Request is wrong.
401 - Unauthorized Response means that the client is not authorised to access the Resource. A WWW-Authenticate header field must be included in the Response.
403 - Forbidden Response means that the Request is forbidden whether authorisation is used or not.
404 - Not Found Response means that the Resource cannot be found
500 - Internal Server Error Response means that an unexpected internal server error has occurred, such as bad syntax in a CGI script.
501 - Not Implemented Response means that the server does not know how to carry out the Request
502 - Bad Gateway Response means that the server is acting as a Proxy or a Gateway and it cannot get a valid response from the upstream server.
503 - Service Unavailable Response means that the Resource is unavailable, perhaps due to CPU load.

Later on we will come across other codes that were included in HTTP 1.1.

HTTP 1.0 Headers

Header lines provide information about the HTTP message. The format of a Header line is Header-name: value followed by CRLF. The 'value' is sometimes called a 'token'. This structure follows that of E-mail and News as described in RFC 822, which uses Backus-Naur Form (BNF) notation. RFC 822 was later updated by RFC 1123. HTTP 1.0 has a choice of 16 different headers, all of which are optional. These are listed below:

Allow: - used by the GET and HEAD methods to inform the recipient of valid methods associated with the resource.
Authorization: - contains credentials used by the Requestor that allow it to be authenticated by the server
Content-Encoding: - indicates the encoding/compression method used on the Resource Entity body so that the client (receiver) knows what to use to decode the content.
Content-Length: - indicates the length in octets, of the Entity-Body (Message Body) in decimal.
Content-Type: - media (MIME) type of the data being sent in the Entity-Body e.g. text/html, image/gif etc.
Date: - the date and time that the message was sent in a format similar to Mon, 21 Jan 2002 21:08:57 GMT.
Expires: - the date and time that the message data should expire. The date is in the same format as Date:.
From: - the E-mail address of the Requestor
If-Modified-Since: - this includes a date in the format described earlier. The GET method uses this for conditional retrieval, good for saving on network bandwidth. If the requested resource has not been modified since the time specified in this field, a copy of the resource will not be returned from the server.
Last-Modified: - includes a date in the familiar format and indicates the date when the sender thinks the resource was last modified. If the recipient has a copy of this resource which is older than the date given by the Last-Modified field, that copy should be considered out of date.
Location: - this gives the exact URL of the resource identified by the Request-URI; in 3xx Responses it is the URL to which the client is redirected. The format is Location: http://www.somewhere.org/place.html.
Pragma: - this includes a Pragma Directive such as no-cache that is applied by every system en route. The no-cache directive is often used because it tells all systems (including proxies) that the Request should be forwarded on to the server even if a cached copy of the data exists somewhere along the way (ideal for real-time data retrieval such as Stock Market prices).
Referer: - this allows the Requestor to tell the server the URI of the resource from which the Request-URI was obtained. The server can use this information to create lists of links for more efficient caching etc.
Server: - this identifies the server software being used and is in the form program-name/x.xx.
User-Agent: - this identifies the software on the client being used to form the Requests. The format of the information is program-name/x.xx.
WWW-Authenticate: - this must be included in 401 response messages. The field value consists of at least one challenge that indicates the authentication scheme(s) and parameters applicable to the Request-URI.
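
To illustrate how the If-Modified-Since and Last-Modified headers in the list above work together, here is a small sketch of the decision a server might make for a conditional GET. The function name conditional_get_status is an assumption, and a real server would take the modification time from the resource itself.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_get_status(if_modified_since, last_modified):
    """Return 304 if the resource is unchanged since the given date, else 200.

    `if_modified_since` is the raw header value (None when the header is
    absent); `last_modified` is a timezone-aware datetime for the resource.
    """
    if if_modified_since is None:
        return 200                      # ordinary, unconditional GET
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200                      # an unparseable date is ignored
    if last_modified <= since:
        return 304                      # Not Modified: no body is sent
    return 200


header = "Mon, 21 Jan 2002 21:08:57 GMT"
older = datetime(2002, 1, 20, 12, 0, 0, tzinfo=timezone.utc)
newer = datetime(2002, 1, 22, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_get_status(header, older))   # 304
print(conditional_get_status(header, newer))   # 200
```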

HTTP 1.1 Messages

HTTP 1.1 adds features to HTTP 1.0 including:

Allows multiple transactions over one persistent connection
Cache support: all HTTP 1.1 Servers include a Date: header with every Response, so each Response is date-stamped for the cache.
Chunked Encoding allowing a response to be sent before the total length is known.
Multiple domains can be served from one IP address i.e. one server can host multiple web domains.
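
Chunked Encoding is perhaps the least obvious of these features, so a minimal decoding sketch follows. It assumes the chunk format defined in RFC 2616 (a hexadecimal size line, the chunk data, then CRLF, ending with a zero-sized chunk); the function name decode_chunked is invented for the example, and trailer headers after the last chunk are not processed.

```python
def decode_chunked(data):
    """Decode a chunked message body (bytes) into the original payload."""
    payload = b""
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # The size line is hexadecimal; anything after ';' is an extension.
        size_text = data[pos:line_end].split(b";")[0].strip()
        size = int(size_text.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break                       # last chunk; trailers would follow
        payload += data[pos:pos + size]
        pos += size + 2                 # skip the data and its trailing CRLF
    return payload


encoded = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
print(decode_chunked(encoded))          # b'Wikipedia'
```

Because each chunk carries its own length, the sender can start transmitting before it knows the total size of the Response.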

HTTP 1.1 Methods

HTTP 1.1 introduces new Methods, plus the idea of 'Safe' Methods. Methods that only retrieve information and data, such as GET and HEAD, are considered safe. Methods such as DELETE, PUT and POST are not safe because they can change data on the server and so can potentially cause harm. There is also the idea of a Method being Idempotent, meaning that the side effects of many identical Requests are the same as those of just one Request using that Method. The Methods GET, HEAD, PUT and DELETE are considered to be Idempotent.
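
One practical use of these properties is deciding whether a Request can be repeated automatically, for example after a dropped connection. The sketch below records only the Methods named in the paragraph above; the helper name may_retry_automatically is invented for illustration.

```python
SAFE_METHODS = frozenset({"GET", "HEAD"})
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE"})


def may_retry_automatically(method):
    """True if repeating the Request has the same side effects as one Request."""
    return method.upper() in IDEMPOTENT_METHODS


for method in ("GET", "PUT", "POST"):
    print(method, method in SAFE_METHODS, may_retry_automatically(method))
# GET True True
# PUT False True
# POST False False
```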

As well as GET, HEAD and POST (HTTP 1.0), the following additional Methods are available in HTTP 1.1:

OPTIONS: - this is designed to be used to obtain information about the communications options and requirements available within the Request-Response chain for a particular Request URI.
PUT: - this requests that an entity (message body) be put under the supplied Request URI i.e. created by the server. If the resource already exists then it is replaced with this 'updated' one. Note the difference between the PUT and POST methods!
DELETE: - this is a Request that the server delete the Resource indicated by the supplied URI.
TRACE: - this is used to provide an application-layer loop-back of the request message. The client can see what is being received at the other end of the request chain and use that data for testing or diagnostic information.
CONNECT: - this is reserved for use with a proxy that can dynamically switch to being a tunnel, such as for SSL tunnelling.

HTTP 1.1 Response Status Codes

In addition to the HTTP 1.0 Response Status Codes used in the Response Initial Line, the additional HTTP 1.1 Response Status Codes are:

100 - Continue Response used to ease bandwidth usage over slow links. It is sent by a server to a client during the process of the client sending a request, to let the client know that it has received the first part of the Request i.e. the Request Header. The server may not accept the Request Body which can be quite large, so there would be no point in sending it and thereby wasting bandwidth. A proper Response is still sent by the server once the full Request has been received.
101 - Switching Protocols Response sent by the server when the client has sent an Upgrade message header to switch protocols to a more advantageous protocol e.g. from HTTP 1.0 to HTTP 1.1.
200 - See HTTP 1.0 Response Status Codes earlier.
201 - See HTTP 1.0 Response Status Codes earlier.
202 - See HTTP 1.0 Response Status Codes earlier.
203 - Non-Authoritative Information Response sent by the Server when the meta-information in the Entity header is not the definitive set from the origin server but has been gathered from a local or third-party copy.
204 - See HTTP 1.0 Response Status Codes earlier.
205 - Reset Content Response means that the Request has been actioned and that a reset of the client document should occur so that information is not erroneously re-sent. There is no message body with this Response.
206 - Partial Content Response means that the Server has fulfilled a Partial GET Request. The Request must include a Range header and optionally an If-Range header. The Response must include a Content-Range header or a multipart/byteranges Content-Type header, a Date header and an ETag or a Content-Location header. This is used when content comes from a cache.
300 - Multiple Choices Response containing a list of locations for the resource requested along with the individual characteristics for each representation of the requested Resource. The client can then choose the most appropriate location from which to retrieve the resource.
301 - See HTTP 1.0 Response Status Codes earlier.
302 - See HTTP 1.0 Response Status Codes earlier.
303 - See Other Response means that the Response to the particular Request can be obtained from a different URI and a GET must be made to the new URI given in the Location field.
304 - See HTTP 1.0 Response Status Codes earlier.
305 - Use Proxy Response means that the Resource MUST be accessed via the Proxy given by the URI in the Location field.
306 - unused.
307 - Temporary Redirect Response means that the Resource is temporarily located under a different URI given in the Location field.
400 - See HTTP 1.0 Response Status Codes earlier.
401 - See HTTP 1.0 Response Status Codes earlier.
402 - Payment Required is reserved for future use.
403 - See HTTP 1.0 Response Status Codes earlier.
404 - See HTTP 1.0 Response Status Codes earlier.
405 - Method Not Allowed Response means that the Method being used for the Resource is not permitted. The Response includes a list of allowed Methods.
406 - Not Acceptable Response means that the client is sending a Request with unacceptable characteristics as far as the Resource is concerned. The Response should contain a list of available entity characteristics and locations so that the client can try again.
407 - Proxy Authentication Required Response means that the client must authenticate itself with the Proxy. The Proxy responds with a Proxy-Authenticate header with a challenge. The client makes an authorisation Request with a Proxy-Authorization header.
408 - Request Timeout Response means that the client failed to send a Request within the time allowed by the Server.
409 - Conflict Response means that there was a conflict with the state of the Resource e.g. a PUT Request making changes to a Resource conflict with another party's PUT Request on that same Resource. The Response should contain information to help resolve the problem.
410 - Gone Response means that the Resource is no longer available, nor is there a forwarding address.
411 - Length Required Response means that the Server has refused the Request because it requires a Content-Length header field.
412 - Precondition Failed Response means that a pre-condition in one of the Request's header fields evaluated to False. This is used when a client wants to make sure that it is accessing the correct Resource.
413 - Request Entity Too Large Response means that the Server is unwilling to process the Request because the Entity is too large. If the Server includes a Retry-After header, then the client can try again.
414 - Request-URI Too Long Response means that the URI in the Request is too long for the Server. This is rare!
415 - Unsupported Media Type Response means that the Entity in the Request is in a format that the Resource does not understand.
416 - Requested Range Not Satisfiable Response means that the client has included a Range Request header with a Range that does not overlap with that of the Resource.
417 - Expectation Failed Response means that the expectation given in a Request's Expect Request header was not met.
500 - See HTTP 1.0 Response Status Codes earlier.
501 - See HTTP 1.0 Response Status Codes earlier.
502 - See HTTP 1.0 Response Status Codes earlier.
503 - See HTTP 1.0 Response Status Codes earlier.
504 - Gateway Timeout Response means that the Server is acting as a Gateway or Proxy and has not received a timely response from an upstream server (such as FTP, LDAP, HTTP or DNS) that it needed in order to fully satisfy the client's Request.
505 - HTTP Version Not Supported Response means that the Client's Request is using a version of HTTP that the Server cannot or does not wish to support.

HTTP 1.1 Headers

HTTP 1.1 has a choice of 47 different headers of which the Header Host: must be present in Requests. HTTP 1.1 header fields are grouped into General-header, Request-header, Response-header, and Entity-header fields.

Below are listed the HTTP 1.1 headers that are extra to those HTTP 1.0 headers described earlier.

Accept: - used to specify the media types that are acceptable for the Response e.g. text/html.
Accept-Charset: - used to specify what character sets are acceptable for the Response.
Accept-Encoding: - similar to Accept, but specifies the content encodings that are acceptable in the Response e.g. gzip.
Accept-Language: - similar to Accept, but restricts the natural languages that are preferred in the Response. It uses Language-Ranges and Language-Tags to express the client's order of preference for natural languages. Language tags include:
- en-us - English-US
- en-gb - English-UK
- en-cockney - Cockney
- en - any English
- da - Danish
Accept-Ranges: - this is used by the Server to indicate that it accepts range requests for its Resources.
Age: - this is the sender's estimate of the amount of time since the response was generated (or revalidated) by the origin server. It is normally added by a cache.
Allow: - see the HTTP 1.0 headers.
Authorization: - see the HTTP 1.0 headers.
Cache-Control: - used to specify directives that MUST be obeyed by all caching mechanisms along the request/response chain. The directives specify behavior intended to prevent caches from interfering with the request or response. These directives override the default caching algorithms. Details on these directives can be found in RFC 2616. Directives include the following:
- no-cache - subsequent requests for this content must be forwarded to the origin server for revalidation.
- no-store - the whole message must not be stored
- max-age - this or the Expires header can be used to determine when content is stale.
- max-stale - (request only) the number of seconds over and above the max-age that the client is still prepared to accept the content
- min-fresh - (request only) the minimum freshness that the client is prepared to accept (seconds)
- no-transform - some proxies may perform file transformations such as .BMP to .GIF image file conversions in order to save disk space. This directive prevents that from happening in case certain application problems occur.
- only-if-cached - (request only) this can be set by the client if the network is struggling and you do not want to go to the origin server for content that is not in the cache. The client will then only receive content that is stored by the cache, or group of caches.
- cache-extension - additional cache control mechanisms can be added using this directive. If not understood by clients or caches the directives should be ignored.
- public - (response only) content can be cached by any cache.
- private - (response only) content intended for a single user and must not be cached by a shared cache.
- must-revalidate - (response only) if the cached response becomes stale, then the cache must do an end-to-end revalidation (i.e. direct with the origin server) every time thereafter.
- proxy-revalidate - (response only) same as 'must-revalidate' except that it does not apply to non-shared (user agent) caches.
- s-maxage - (response only) this determines the maximum age of content that can be served by a shared cache and overrides max-age.
Connection: - the sender specifies options that are desired for that particular connection and that MUST NOT be communicated by proxies over further connections. The close option means that the connection is closed after the Response and therefore prevents a Persistent Connection (see later).
Content-Encoding: - see the HTTP 1.0 headers.
Content-Language: - describes the natural languages of the intended audience for the enclosed entity. This uses the language tags described a little earlier.
Content-Length: - see the HTTP 1.0 headers.
Content-Location: - this may be used to supply the resource location for the entity enclosed in the message, when that entity is accessible from a location separate from the requested resource's URI.
Content-MD5: - this is an MD5 digest of the entity-body for the purpose of providing an end-to-end Message Integrity Check (MIC) of the entity-body.
Content-Range: - The Content-Range entity-header is sent with a partial entity-body to specify where in the full entity-body the partial body should be applied.
Content-Type: - see the HTTP 1.0 headers.
Date: - see the HTTP 1.0 headers.
ETag: - this provides the current value of the entity tag for the requested variant. The entity tag may be used for comparison with other entities from the same resource.
Expect: - this is used to indicate that particular server behaviours are required by the client.
Expires: - see the HTTP 1.0 headers.
From: - see the HTTP 1.0 headers.
Host: - indicates the Internet host and port number of the Resource being requested by the client. Each client Request MUST include the Host: header so that the server knows which domain the Request is for e.g.
```
GET /path/index.html HTTP/1.1
Host: www.firsthost.com:81
```
The default port is 80 so the port number does not need to be specified unless a different port is being used as illustrated here. If the Client does not include the Host: header, then the Server replies with a 400 Bad Request Response.
If-Match: - used to make Methods (e.g. GET) conditional on entity tags (ETags). This allows efficient updates of cached information with a minimum amount of transaction overhead, for example by obtaining only the most recent version of a Resource.
If-Modified-Since: - see the HTTP 1.0 headers.
If-None-Match: - similar to If-Match but the reverse, i.e. the Method is performed only if none of the Entities match the tags.
If-Range: - if a client has a partial copy of an entity in its cache, and wishes to have an up-to-date copy of the entire entity in its cache, it could use the Range: request-header with a conditional GET (using either or both of If-Unmodified-Since and If-Match). However, if the condition fails because the entity has been modified, the client would then have to make a second request to obtain the entire current entity-body. Instead of having two Requests, the Client can use an If-Range: header to roll both operations into one. The If-Range header can use either an entity tag or a Last-Modified date.
If-Unmodified-Since: - as well as the original If-Modified-Since: header, which is limited to the GET method, HTTP 1.1 has an If-Unmodified-Since: header which causes the server to perform the Request if the Resource HAS NOT been changed since the date specified. This is NOT limited to the GET method. If the Resource HAS been modified then the server responds with a 412 Precondition Failed message.
Last-Modified: - see the HTTP 1.0 headers.
Location: - see the HTTP 1.0 headers.
Max-Forwards: - this provides a mechanism with the TRACE and OPTIONS Methods to limit the number of proxies or gateways that can forward the request to the next inbound server. Each Proxy/Gateway updates this field as the Request is forwarded.
Pragma: - see the HTTP 1.0 headers.
Proxy-Authenticate: - this MUST be included as part of a 407 Proxy Authentication Required response. The field value consists of a challenge that indicates the authentication scheme and parameters applicable to the proxy for this Request-URI.
Proxy-Authorization: - this allows the client to identify itself to a Proxy which requires authentication.
Range: - HTTP retrieval requests using conditional or unconditional GET methods MAY request one or more sub-ranges of the entity, instead of the entire entity, using the Range request header, which applies to the entity returned as the result of the request. Since all HTTP entities are represented in HTTP messages as sequences of bytes, the concept of a byte range is meaningful for any HTTP entity.
Referer: - see the HTTP 1.0 headers.
Retry-After: - this can be used with a 503 Service Unavailable response to indicate how long the service is expected to be unavailable to the requesting client. Its value is either a date (in the same format as Date:) or a number of seconds.
Server: - see the HTTP 1.0 headers.
TE: - this indicates what extension transfer-encodings it is willing to accept in the response and whether or not it is willing to accept trailer fields in a Chunked Transfer-Encoding.
Trailer: - this indicates that the given set of header fields is present in the trailer of a message encoded with Chunked Transfer-Encoding.
Transfer-Encoding: - this indicates what encoding has been applied to the Message body. If a server wishes to use Chunked Transfer-Encoding when sending data to a client, then the header Transfer-Encoding: Chunked will be included in the response.
Upgrade: - this allows the client to specify what additional communication protocols it supports and would like to use if the server finds it appropriate to switch protocols.
User-Agent: - see the HTTP 1.0 headers.
Vary: - this indicates the set of request-header fields that fully determines, while the response is fresh, whether a cache is permitted to use the response to reply to a subsequent request without revalidation.
Via: - this MUST be used by Gateways and Proxies to indicate the intermediate protocols and recipients between the Client and the Server on Requests, and between the Origin Server and the Client on Responses.
Warning: - this is used to carry additional information about the status or transformation of a message which might not be reflected in the message itself. There may be a number of Warning headers each with a natural language text string detailing the warning. The Warning header uses Warning Codes such as:
- 110 Response is stale
- 111 Revalidation failed
- 112 Disconnected operation
- 113 Heuristic expiration, in any response whose age is more than 24 hours
- 199 Miscellaneous warning
- 214 Transformation applied
- 299 Miscellaneous persistent warning
WWW-Authenticate: - see the HTTP 1.0 headers.
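
As an illustration of how the Range and Content-Range headers in the list above relate to each other, the sketch below resolves a single byte range against an entity of known length. The function names resolve_byte_range and content_range are invented for the example; requests for several ranges at once are not handled.

```python
def resolve_byte_range(range_header, entity_length):
    """Resolve a header value such as 'bytes=0-499' against an entity.

    Returns (first, last) inclusive byte offsets, or None when the range
    does not overlap the entity (a server would answer 416 in that case).
    """
    unit, _, spec = range_header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        raise ValueError("only a single bytes range is handled in this sketch")
    first_text, _, last_text = spec.strip().partition("-")
    if first_text == "":
        # Suffix form 'bytes=-N': the final N bytes of the entity.
        suffix = int(last_text)
        if suffix == 0 or entity_length == 0:
            return None
        return max(entity_length - suffix, 0), entity_length - 1
    first = int(first_text)
    last = int(last_text) if last_text else entity_length - 1
    if last_text and last < first:
        raise ValueError("last byte position is before the first")
    if first >= entity_length:
        return None
    return first, min(last, entity_length - 1)


def content_range(first, last, entity_length):
    """Build the Content-Range value sent with a 206 Partial Content."""
    return "bytes %d-%d/%d" % (first, last, entity_length)


first, last = resolve_byte_range("bytes=500-", 1234)
print(content_range(first, last, 1234))   # bytes 500-1233/1234
```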

Persistent Connections

In HTTP 1.0 the TCP connection is closed after each Request and Response, i.e. after each resource GET. This takes up unnecessary time, bandwidth and CPU power when there are several resources being obtained from one server. By default, in HTTP 1.1, Persistent Connections are used so that a number of Requests can be sent over the same connection. Sending several Requests in a row without waiting for each Response is called Pipelining. Responses are then read in the same order that the Requests were sent. Errors can be corrected within the TCP session rather than by using separate TCP sessions. The header Connection: close stops this and causes the TCP connection to be closed after the Response.
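
The default behaviour described above can be sketched as a small decision function. The name connection_persists, the (major, minor) version tuple and the dictionary of lower-cased header names are all assumptions made for the example.

```python
def connection_persists(version, headers):
    """Decide whether the connection stays open after this message.

    `version` is a (major, minor) tuple and `headers` is a dict with
    lower-cased header names; both shapes are assumptions of this sketch.
    """
    options = headers.get("connection", "").split(",")
    if "close" in (option.strip().lower() for option in options):
        return False
    # HTTP 1.1 connections are persistent by default; HTTP 1.0 ones are not.
    return version >= (1, 1)


print(connection_persists((1, 1), {}))                       # True
print(connection_persists((1, 1), {"connection": "close"}))  # False
print(connection_persists((1, 0), {}))                       # False
```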

Caching

Caching of data is useful for saving bandwidth utilisation particularly with repeated requests for large amounts of data. Locally caching graphics, sound and text is far preferable to having the same content 'dragged' across the network time and time again. HTTP 1.1 has mechanisms built into it that aid caching. This makes sense since HTTP is designed for a distributed data-sharing environment. If much of the data is stored close to the Requestor then this helps to eliminate sending Requests across the network, plus the need to send full Responses diminishes.

The Expiration mechanism is used to reduce the number of network round-trips whereas the Validation mechanism is used to reduce the network bandwidth requirements.

For reasons of improved performance, the requirement for availability and the likelihood of disconnection, it is necessary to have a relaxed approach to Semantic Transparency, i.e. communication all the way through from client to origin server is not always possible but the service needs to have some value still. Some operations such as credit card transactions require complete semantic transparency so the client needs to know whether transparency has been reduced or not.

A correct cache MUST respond to a request with the most up-to-date response held by the cache. Whenever a cache returns a response that is neither first-hand nor fresh enough, it MUST attach a warning to that effect, using a Warning: header (see earlier for the Warning codes).

The Cache-Control header is used in Requests and Responses to convey specific directives to the cache from the client or the server. These directives can override the default caching mechanisms. Normally, the origin server and the intermediate caches decide when content is to expire (expiration information). The client, however, may wish to use directives within the Cache-Control header to set the maximum age of an unvalidated response that it is prepared to accept, or the maximum age of a 'Stale' response that it is prepared to accept.
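
A cache or client first has to split the Cache-Control value into its individual directives. The following is a minimal sketch of that step; the function name parse_cache_control is invented, and quoted arguments that themselves contain commas are not handled.

```python
def parse_cache_control(value):
    """Turn a Cache-Control value into a dict of directive names to values.

    Directives without an argument map to True; numeric arguments become
    integers (seconds); anything else is kept as a string.
    """
    directives = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, separator, argument = item.partition("=")
        name = name.strip().lower()
        argument = argument.strip().strip('"')
        if not separator:
            directives[name] = True
        elif argument.isdigit():
            directives[name] = int(argument)
        else:
            directives[name] = argument
    return directives


print(parse_cache_control("max-age=3600, must-revalidate"))
# {'max-age': 3600, 'must-revalidate': True}
```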

Expiration

Ideally, Requests from the Client to the Server are avoided altogether. For this purpose, the server issues an Expiration Time for a response, indicating a time in the future until which the information is unlikely to change. The cache uses this to satisfy Requests until that Expiration Time has passed. When an Origin server does not supply explicit Expiration times, the cache can perform some algorithms and look at other headers such as Last-Modified: to produce Heuristic Expiration times. Because the Origin server has no say in the Expiration times when Heuristic Expiration is used, this could compromise Semantic Transparency.

All caches and Origin servers should use NTP to maintain a synchronised clock with the rest of the world. Origin servers need to send a Date Header with every Response and the Age: Response Header is used by the cache to show the age of the Response message. The Age value is the sum of the time that the response has been resident in each of the caches along the path from the origin server, plus the amount of time it has been in transit along network paths.
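
The age of a cached Response can be estimated along the lines of the calculation given in RFC 2616 (section 13.2.3). The sketch below follows that calculation; the function name current_age and the use of plain numbers of seconds for every argument are assumptions made for the example.

```python
def current_age(age_value, date_value, request_time, response_time, now):
    """Estimate the age in seconds of a cached Response.

    age_value     - value of the Age: header received (0 if absent)
    date_value    - the Date: header, as seconds since the epoch
    request_time  - local time when the cache sent its Request
    response_time - local time when the cache received the Response
    now           - the current local time
    """
    apparent_age = max(0, response_time - date_value)
    corrected_received_age = max(apparent_age, age_value)
    response_delay = response_time - request_time      # time in transit
    corrected_initial_age = corrected_received_age + response_delay
    resident_time = now - response_time                # time spent in this cache
    return corrected_initial_age + resident_time


print(current_age(age_value=10, date_value=1000, request_time=1004,
                  response_time=1006, now=1066))       # 72
```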

To decide whether a Response is fresh or not, the concept of the Freshness Lifetime is used. The Freshness Lifetime is equal to the max-age value if this has been included as a directive in the Cache-Control header; otherwise it is the difference between the Expires value and the Date value. Normally, there are multiple paths between the client and the cache(s) and origin server, so multiple Responses are likely with different expiration times. The most recently generated Response is the one to use. A cache check with the Origin server can be forced by setting max-age to 0 in the Cache-Control header. This technique can be used if multiple Responses are received via different caches and are being received in a different order from when they were sent by the Origin server.
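
The freshness test described above amounts to a short calculation, sketched here with all times expressed as plain numbers of seconds. The function names freshness_lifetime and is_fresh are invented for the example, and Heuristic Expiration is left out.

```python
def freshness_lifetime(max_age, expires, date):
    """Return the Freshness Lifetime in seconds, or None if it is unknown.

    max_age - the max-age directive value, or None if absent
    expires - the Expires: header as seconds since the epoch, or None
    date    - the Date: header as seconds since the epoch, or None
    """
    if max_age is not None:
        return max_age                  # max-age takes precedence
    if expires is not None and date is not None:
        return expires - date
    return None


def is_fresh(lifetime, age):
    """A Response is fresh while its lifetime exceeds its current age."""
    return lifetime is not None and lifetime > age


lifetime = freshness_lifetime(max_age=None, expires=5000, date=1400)
print(lifetime)                         # 3600
print(is_fresh(lifetime, age=72))       # True
print(is_fresh(freshness_lifetime(0, 5000, 1400), age=0))   # False
```

The last line shows why a max-age of 0 forces a check with the Origin server: the Response is never considered fresh.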

Validation

Validation is when a cache has a stale entry that it would like to use in a Response, and it checks with the Origin server or another cache to see if it can still use the stale entry.

The mechanism to do this involves the inclusion of Cache Validators in the Full Response sent by the Origin server. The Cache Validator is kept with the cached entry and, in the event of a stale entry, is sent by the cache or client to the Origin server in a Conditional Request. The server then checks that validator against the current validator for the entity and, if they match, it responds with status code 304 (Not Modified) and no entity-body. Otherwise, it returns a full response including the entity-body. This saves on bandwidth usage. A Cache Validator that is often used is the Last-Modified: header. If the 1 second granularity of the date format is not enough, or the date format conflicts with some functionality, then the ETag: header could be used instead. A Strong Validator is one that changes every time there is a change to the Entity, whether or not there is a semantic impact. A Weak Validator is one that is not updated when there are insignificant changes to the Entity, but is only updated when there are semantic-affecting changes.
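The server side of this exchange can be sketched as follows. This is a minimal illustration, not a full implementation: handle_conditional_get is a hypothetical name, the validator is assumed to be an ETag carried back to the server in the If-None-Match request header, and only a single exact match is checked.

```python
def handle_conditional_get(request_headers, current_etag, body):
    """Return (status, headers, entity_body) for a Conditional Request.

    `request_headers` is assumed to be a plain dict; `current_etag` is the
    validator the server currently holds for the entity.
    """
    if request_headers.get("If-None-Match") == current_etag:
        # Validators match: the cached entry is still usable, send no body.
        return 304, {"ETag": current_etag}, b""
    # Validators differ (or none was sent): return the full response.
    return 200, {"ETag": current_etag}, body
```

A cache holding a stale entry tagged "v1" would send that value in its request, and would receive either an empty 304 or a full 200 carrying the new validator.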

Response Construction

As well as using cached Responses to reply in full or in part to a Request, sometimes a Cache may have to construct a Response from a new Response received from the Origin server and parts of an old Response.

When defining the behavior of caches and non-caching proxies, HTTP headers are divided into two categories:

End-to-end headers - which are transmitted to the ultimate recipient of a request or response. End-to-end headers in responses MUST be stored as part of a cache entry and MUST be transmitted in any response formed from a cache entry.
Hop-by-hop headers - which are meaningful only for a single transport-level connection, and are not stored by caches or forwarded by proxies.

All headers defined by HTTP/1.1 are end-to-end headers apart from the following HTTP/1.1 headers:

Connection:
Keep-Alive:
Proxy-Authenticate:
Proxy-Authorization:
TE:
Trailers:
Transfer-Encoding:
Upgrade:
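A proxy or cache could apply this division with a simple filter, sketched here in Python. The function name end_to_end_headers is hypothetical and the headers are assumed to be held in a plain dictionary. Both "trailers" (as listed above) and "trailer" (the name of the header field itself in RFC 2616) are included to be safe.

```python
# Hop-by-hop header names, lower-cased for case-insensitive comparison.
HOP_BY_HOP = {
    "connection",
    "keep-alive",
    "proxy-authenticate",
    "proxy-authorization",
    "te",
    "trailers",
    "trailer",
    "transfer-encoding",
    "upgrade",
}


def end_to_end_headers(headers):
    """Return only the headers that may be stored in a cache or forwarded."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in HOP_BY_HOP
    }
```

This sketch does not handle extra hop-by-hop headers that a sender may nominate in the Connection: header.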

Because HTTP 1.1 can use authentication, certain headers must not be modified by Transparent Proxies. These are:

Content-Location: (Response and Request)
Content-MD5: (Response and Request)
ETag: (Response and Request)
Last-Modified: (Response and Request)
Expires: (Response)

When a cache makes a Validating Request to a server, and the server provides a 304 (Not Modified) Response or a 206 (Partial Content) Response, the cache then constructs a Response to send to the requesting client. If the status code is 304, the cache uses the entity-body stored in the cache entry as the entity-body of this outgoing Response. If the status code is 206 (Partial Content) and the ETag or Last-Modified headers match exactly, the cache can combine the contents stored in the cache entry with the new contents received in the Response and use the result as the entity-body of this outgoing Response.
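The two cases can be sketched as below. Everything here is an assumption made for illustration: the name construct_response, the cache entry layout (a dict with 'headers' and 'body'), and the offset parameter standing in for the start of the byte range returned with the 206 Response. The partial content is assumed to overlap or directly follow the stored bytes.

```python
def construct_response(cache_entry, status, headers, partial_body=b"", offset=0):
    """Build the headers and entity-body of the outgoing Response.

    `cache_entry` is a dict with 'headers' and 'body' (hypothetical layout).
    `offset` is the position at which `partial_body` starts (206 only).
    """
    # Headers from the new Response replace the stored ones of the same name.
    merged_headers = {**cache_entry["headers"], **headers}
    if status == 304:
        # Not Modified: reuse the stored entity-body unchanged.
        return merged_headers, cache_entry["body"]
    if status == 206:
        # Only combine when ETag or Last-Modified matches exactly.
        same_validator = any(
            name in headers and headers[name] == cache_entry["headers"].get(name)
            for name in ("ETag", "Last-Modified")
        )
        if not same_validator:
            raise ValueError("validators differ; parts cannot be combined")
        stored = cache_entry["body"]
        end = offset + len(partial_body)
        return merged_headers, stored[:offset] + partial_body + stored[end:]
    raise ValueError("status is neither 304 nor 206")
```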

N.B. History mechanisms employed by browsers are NOT the same as caching. This is because the Historical entity retrieved is exactly what the user saw previously. A cached entity that is retrieved may be a combination of a cached Response integrated with a Partial Response with new material.

Secure Sockets Layer (SSL)

Overview of SSL

HTTPS is HTTP running over Secure Sockets Layer (SSL), which was developed by Netscape. SSL (now up to version 3.0) is a tunnelling protocol that allows a proxy server to act as a tunnel between the client and the server. SSL runs above TCP and beneath the application protocol, and provides secure transaction of data, such as credit card details, between a client and an E-commerce server. SSL uses certificates, private/public key pairs and Diffie-Hellman key agreements to provide privacy (key exchange), authentication, and integrity by way of a Message Authentication Code (MAC). This information is known as a Cipher Suite and exists within a Public Key Infrastructure (PKI).

Three Elements of SSL

Confidentiality

Data can only be viewed by the intended user. This is achieved by way of symmetric keys. That is, each of the parties has knowledge of the key to be used. The keys can be known by one of two methods:

Key Exchange - One party generates a symmetric key and then encrypts and transmits it using an asymmetric encryption scheme, where each device has a private key and a public key that can be shared with all devices. Data encrypted using the public key can be decrypted using the private key, and the reverse is true. A well-known asymmetric key encryption scheme is Rivest Shamir Adleman (RSA). The private key is never shared and always remains secure.
Key Agreement - Both parties generate a shared symmetric key, usually using the Diffie-Hellman algorithm. Parameters used to generate the shared key are exchanged between the client and server.
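The Key Agreement method can be illustrated with a toy Diffie-Hellman exchange in Python. The prime and generator here are deliberately tiny so the arithmetic can be followed by hand; they are illustrative assumptions only and offer no security. The function names are hypothetical.

```python
import secrets

# Toy public parameters agreed by both parties (far too small for real use).
P = 23  # prime modulus
G = 5   # generator


def keypair():
    """Pick a random private value and derive the public value to exchange."""
    private = secrets.randbelow(P - 2) + 1  # in the range 1..P-2
    public = pow(G, private, P)
    return private, public


def shared_secret(own_private, peer_public):
    """Combine our private value with the peer's public value."""
    return pow(peer_public, own_private, P)
```

If the client picks 6 and the server picks 15 as private values, they exchange the public values 8 and 19, and both arrive at the shared secret 2 without it ever crossing the network.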

Authentication

Is the other party really who they say they are? This is confirmed by way of Digital Certificates. A Certificate Authority (CA) is a trusted authority that issues digital certificates via a PKI so that the certificates are not compromised. Typical CAs include the following:

VeriSign
Entrust
Netscape iPlanet
Windows 2000 Certificate Server
Thawte
Equifax
Genuity

The clients and servers must have certificates issued from the same CA or from a hierarchy of CAs that trust each other. The certificate contains details about the owner, details about the certificate issuer, the owner's public key, validity and expiration dates, and associated privileges. A certificate is verified when the client checks it with the CA using the CA's public key within the PKI. Once verified, the client can trust that the public key within the certificate belongs to the server to which it wants to connect.

Message Integrity

In order to ensure that a message has not been interfered with between the sender and recipient, a Message Digest (or Hash) is applied to the message and attached to it. The message digest is a fixed length value that cannot be easily reversed. The message digest is encrypted to form the Message Authentication Code (MAC), using the sender's private key, and then it is decrypted at the other end by using the sender's public key. The message digest can either be created using Message Digest 5 (MD5) or Secure Hash Algorithm (SHA).
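The fixed-length property of a message digest can be seen with Python's standard hashlib module. This sketch covers only the digest step, not the signing with a private key; the function name message_digest is hypothetical.

```python
import hashlib


def message_digest(message, algorithm="sha1"):
    """Return the digest of `message` (bytes) using MD5 or SHA-1."""
    return hashlib.new(algorithm, message).digest()


short = message_digest(b"pay 10 pounds")
long_ = message_digest(b"pay 10 pounds" * 1000)
tampered = message_digest(b"pay 90 pounds")
```

Here short and long_ are both 20 bytes long despite the very different input sizes, while tampered differs from short even though only one character of the message changed. With algorithm="md5" the digest is 16 bytes.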

The public/private keys used to form the MAC (sign it with a digital signature) could be from RSA (used also for key exchange as described earlier). There is a newer standard for signing which could be used instead of RSA. This is called the Digital Signature Algorithm (DSA), which is only used for digital signatures. It is considered a good idea to separate key exchange from signing. DSA is standardised in the Digital Signature Standard (DSS) designated FIPS-186. DSS uses Diffie-Hellman type algorithms and uses SHA-1 for the message digest.

Operation of SSL

The client initiates an HTTP request for an SSL tunnel either via a hook in HTTP or by calling HTTPS directly. The cache can then issue a CONNECT method (see earlier) using an https:// URL to tunnel SSL over HTTP.

By default, SSL uses a number of ports including 443, 643, 1443 and 2443. For encryption SSL uses RC4-128, Diffie-Hellman 1024, MD5 and Null. The encryption is carried out at layer 4 i.e. the socket layer.

SSL handshaking occurs as follows:

The client sends a 'hello' to the server as a request for a secure connection
The server sends a 'hello' to the client.
The server sends its authentication certificate and public key
The server sends a server_key_exchange
The server sends a certificate request
The server indicates that the server hello is complete.
Upon verification of the server certificate, the client sends its certificate
The client sends a client_key_exchange with a randomly generated key derived from the server key
The client sends a certificate verify message
The client sends a change_cipher_spec
The client indicates that it has finished
The server decodes the client key with its own private key
The server sends a change_cipher_spec
The server indicates that it has finished
The client and server exchange encrypted data
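From an application's point of view the whole handshake above is usually carried out by a library. The sketch below uses Python's standard ssl module; note that current versions of this module negotiate TLS, the successor to SSL, and normally have SSL 3.0 disabled, so it illustrates the same sequence of steps rather than SSL 3.0 itself. The function names and the choice of returning the server certificate are assumptions made for illustration.

```python
import socket
import ssl


def make_client_context():
    """Create a client context that verifies the server certificate."""
    # The default context requires a valid certificate chain and checks
    # that the certificate matches the hostname being connected to.
    return ssl.create_default_context()


def fetch_server_certificate(host, port=443):
    """Open a secure connection and return the server's certificate details."""
    context = make_client_context()
    with socket.create_connection((host, port)) as raw_socket:
        # wrap_socket performs the handshake: hellos, certificate exchange,
        # key exchange and the switch to the negotiated cipher.
        with context.wrap_socket(raw_socket, server_hostname=host) as secure:
            return secure.getpeercert()
```

Calling fetch_server_certificate with the name of an HTTPS server returns the verified certificate as a dictionary, or raises an error if verification fails.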

The SSL Record Protocol then takes the application data and splits it into fragments. Each fragment has the following operations carried out on it:

Compressed
Message Authentication Code (MAC) added
Encrypted
SSL Record header added to the front of the fragment
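The four operations can be sketched in Python as below. Several parts are stand-ins and are labelled as such: zlib is used as an example compression method, HMAC with SHA-1 stands in for the SSL 3.0 MAC construction, and placeholder_encrypt is a hypothetical function that performs no encryption at all (the real cipher is whatever the handshake negotiated). The 16384 byte fragment limit, the content type value of 23 for application data and the five byte header layout follow the SSL 3.0 record format.

```python
import hashlib
import hmac
import struct
import zlib

MAX_FRAGMENT = 2 ** 14   # largest plaintext fragment, in bytes
APPLICATION_DATA = 23    # record content type for application data
SSL3_VERSION = (3, 0)    # major and minor version numbers


def placeholder_encrypt(data):
    """Stand-in for the negotiated cipher; returns the data unchanged."""
    return data


def build_records(app_data, mac_key):
    """Split application data into fragments and wrap each one in a record."""
    records = []
    for start in range(0, len(app_data), MAX_FRAGMENT):
        fragment = app_data[start:start + MAX_FRAGMENT]
        compressed = zlib.compress(fragment)                        # 1. compress
        mac = hmac.new(mac_key, compressed, hashlib.sha1).digest()  # 2. add MAC
        payload = placeholder_encrypt(compressed + mac)             # 3. encrypt
        header = struct.pack(                                       # 4. header
            "!BBBH", APPLICATION_DATA, *SSL3_VERSION, len(payload)
        )
        records.append(header + payload)
    return records
```

For 40000 bytes of application data this yields three records, since the data is split into fragments of 16384, 16384 and 7232 bytes.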

SSL provides the following:

Client-to-server, end-to-end encrypted traffic (including basic authentication usernames/passwords, content of submitted forms, etc)
Strongly authenticated server credentials supplied to the browser (including hostname and name of site operator etc.)
Strongly authenticated user credentials supplied to the server (requires user to have a personal certificate)

The problems associated with SSL are:

It prevents caching.
Using SSL imposes greater overheads on the server and the client.
Some firewalls and/or web proxies may not allow SSL traffic.
There is a cost associated with gaining a Certificate.

References

RFC 1945 describes the original HTTP 1.0.

RFC 2616 describes HTTP 1.1 and supersedes RFC 2068 from 1997.

RFC 2617 describes authentication for HTTP 1.1 and supersedes RFC 2069.

RFC 2109 describes HTTP state management.

RFC 2145 describes HTTP version numbers.
