Cproxy in metaproxy
Notes for design


Questions:
 - Granularity of filters. Many small ones, or a few larger ones? Start with
   many small ones, combine later if that looks more reasonable.

The cproxy process divides naturally into three stages: preprocessing, fetching the
page from the backend, and postprocessing. Postprocessing splits naturally into
headers and the different types of content. So does preprocessing, although there
we only see content in POST requests.
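
A minimal sketch of the "many small filters" idea, to make the stage split concrete.
Everything here is hypothetical: the Exchange container, the filter signature and
the stand-in stages are illustration only, not the metaproxy filter API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Exchange:
    """Hypothetical container handed from one filter to the next."""
    url: str
    request_headers: Dict[str, str] = field(default_factory=dict)
    response_headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""
    log: List[str] = field(default_factory=list)


# A filter takes an Exchange and returns it, possibly modified.
Filter = Callable[[Exchange], Exchange]


def run_pipeline(exchange: Exchange, filters: List[Filter]) -> Exchange:
    """Run the filters in order; each one sees the output of the previous one."""
    for f in filters:
        exchange = f(exchange)
    return exchange


def preprocess(ex: Exchange) -> Exchange:
    ex.log.append("preprocess")
    return ex


def fetch(ex: Exchange) -> Exchange:
    # Stand-in for the real backend fetch; no network access here.
    ex.log.append("fetch")
    ex.body = b"<html></html>"
    return ex


def postprocess_headers(ex: Exchange) -> Exchange:
    ex.log.append("postprocess_headers")
    return ex


def postprocess_content(ex: Exchange) -> Exchange:
    ex.log.append("postprocess_content")
    return ex


STAGES: List[Filter] = [preprocess, fetch, postprocess_headers, postprocess_content]
```

Small filters can be combined later by wrapping several of them in one function with
the same signature, so starting small does not lock us in.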

Preprocessing:
 - Analyze the URL, extract session number and target URL
    - Check for session in the Referer header if not in the URL. 
    - Additionally check for a Cproxy-session-cookie (new feature).
    - Both of these extra checks should redirect to the proper URL with the session,
      so that we can continue as before. That way we also get the referer headers
      right on subsequent links.
 - Create session if it does not exist
    - If we have a "p-file" from a Z-request, invoke a content connector
    - Possibly create a session in some simpler way (if only proxyIp needs to be set)
 - Load the session data
    - also overrides for the proxying config
    - Small debug warning if we have an override that has no effect, possibly because
      the default config has been updated to handle the case already
 - Deproxify some headers
    - Debug warning if you see headers that look proxified but are not configured to
      be deproxified. Option to deproxify all unknown headers (with an option to leave
      some anyway).
    - Configuration (A lot can be overridden on a per-connector basis)
      - Headers to be deproxified
         - Can we have different forms of headers? Many just contain a URL, but
           should we consider space- or comma-separated lists of URLs? Or strings
           that contain URLs in quotes, tags, or in some other way? Probably not (yet).
      - Option to deproxify everything that looks proxified
      - Option to leave some unproxified anyway
      - Option to do arbitrary regexp replacements on given headers
      - Option to drop a header completely
      - Option to add a new header
 - Cookie processing
    - Take cookies from the request, and from the session, merge correctly
 - Content: In case of POST (PUT?, other?) requests, we may have content to preprocess.
    - Probably a separate module
    - Mostly deproxifying some inputs that look proxified
    - So far we have not needed this, but POST requests are on the wish list
    - POST forms contain name-value pairs, but we can also have POSTed XML,
      JSON, or other data types. Need to (prepare for) processing them differently.
    - configuration
      - Names of fields to be deproxified the usual way
      - Option to deproxify all that look proxified
      - Option to leave some fields untouched
      - Option to do general regexp replacing on some fields
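
A rough sketch of the URL analysis and header deproxifying steps above. The notes do
not define the proxified URL layout, so this assumes a hypothetical one:
http://proxy.example.com/<session>/<target-host>/<path>. The host name, the function
names and the config parameters are likewise made up for illustration.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlsplit

PROXY_HOST = "proxy.example.com"  # hypothetical proxy host

# Assumed path layout: /<session number>/<target host>/<rest of path>
_PROXIED_PATH = re.compile(r"^/(?P<session>\d+)/(?P<host>[^/]+)(?P<rest>/.*)?$")


def split_proxied_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (session, target URL), or None if the URL does not look proxified."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return None
    if parts.hostname != PROXY_HOST:
        return None
    m = _PROXIED_PATH.match(parts.path)
    if m is None:
        return None
    target = f"{parts.scheme}://{m.group('host')}{m.group('rest') or '/'}"
    if parts.query:
        target += "?" + parts.query
    return m.group("session"), target


def deproxify_headers(
    headers: Dict[str, str],
    to_deproxify: Iterable[str] = ("Referer",),
    leave: Iterable[str] = (),
) -> Tuple[Dict[str, str], List[str]]:
    """Deproxify configured headers; warn about proxified-looking ones we skip.

    Header names are matched case-sensitively here, which is a simplification.
    """
    wanted, untouched = set(to_deproxify), set(leave)
    out: Dict[str, str] = {}
    warnings: List[str] = []
    for name, value in headers.items():
        parsed = split_proxied_url(value)
        if parsed is None or name in untouched:
            out[name] = value
        elif name in wanted:
            out[name] = parsed[1]
        else:
            out[name] = value
            warnings.append(f"{name} looks proxified but is not configured")
    return out, warnings
```

The "deproxify everything that looks proxified" option would simply treat every
header with a non-None parse result as wanted, minus the leave list.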

Fetching:
 - Send the headers and receive the result.
 - Current http client filter?
 - Keep as a separate step in case we change threading, streaming, etc
 - Configuration
   - might be useful to be able to override some details for some connectors
   - timeouts?

Postprocessing headers:
 - Proxify some headers
 - Add a via header (?)
 - Process cookies, rewrite domains and paths on set-cookie headers
   - We may still need to keep a cookie jar on the proxy side, for some
     really messy cases. So far I have managed to avoid it, but I keep coming
     back to the possibility.
 - Content-length? Will change if we mess with the content!
 - Content-type. We may get several content-type headers with conflicting info.
   For mod_proxy_html I needed to consolidate those to one, here we can probably
   pass them all through and let the browser worry about things. It is better at
   it anyway, and the site has been tested against (some) browser(s).
 - Configuration
   - Options for cookie mangling
     - It could be useful to duplicate some cookies with different paths, to trick
       the browser into sending them to the proxy. (A site may set a cookie for
       *.site.com and expect it to come back to images.site.com. This cannot be
       done with the current URL rewriting scheme: the '*' would have to be in the
       middle of the cookie path, where browsers don't do wildcard matching.)
   - Headers to proxify
   - Overrides to do regexp substitutions on any headers (usually connector-specific)
   - Option to add constant headers, or to remove headers
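
A sketch of the Set-Cookie rewriting and the cookie duplication trick, under an
assumed (not confirmed) URL layout of /<session>/<target-host>/<path> on the proxy.
The function names are hypothetical. The Domain attribute is dropped so the cookie
becomes host-only for the proxy, and the target host moves into the Path.

```python
from typing import Iterable, List


def rewrite_set_cookie(value: str, session: str, target_host: str) -> str:
    """Rewrite one Set-Cookie value so the browser sends it back via the proxy."""
    parts = [p.strip() for p in value.split(";")]
    out = [parts[0]]  # the name=value pair comes first
    path = "/"
    for attr in parts[1:]:
        key, _, val = attr.partition("=")
        k = key.strip().lower()
        if k == "domain":
            continue  # the original domain is meaningless on the proxy host
        if k == "path":
            path = val.strip() or "/"
            continue
        if attr:
            out.append(attr)  # keep Expires, HttpOnly, etc. as they are
    if not path.startswith("/"):
        path = "/" + path
    out.append(f"Path=/{session}/{target_host}{path}")
    return "; ".join(out)


def duplicate_for_hosts(value: str, session: str, hosts: Iterable[str]) -> List[str]:
    """Emit one rewritten copy per target host that should see the cookie."""
    return [rewrite_set_cookie(value, session, host) for host in hosts]
```

For the *.site.com case, the list of hosts (www.site.com, images.site.com, ...) would
have to come from the connector-specific configuration, since the proxy cannot guess
which subdomains the site uses.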

Postprocessing content:
 - Depends on the content type
 - May change the content-length
 - Consider chunking and streaming, especially for contents that don't need
   any rewriting.
 - HTML content
    - Rewrite tags, attributes 
    - Rewrite javascript content (separate filter?)
    - Configuration (possibly overridden by connector-specifics)
       - What tags, what attributes to proxify
       - Option to proxify other tags/attributes that look like links
       - Option to do general regexp mangling of attributes
       - Option to do general regexp mangling of the content of some tags
 - Javascript content (also stuff inside SCRIPT tags in html?)
    - So far we have not done anything but debug warnings
    - Check at least for hard-coded URLs, give a warning
    - Option to detect quoted URLs and proxify
    - general regexp mangling
 - CSS content
    - So far we have not done anything
 - Other content: Pass through unmodified.
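
A sketch of dispatching on content type, with the "detect quoted URLs and proxify"
option as the only real handler. The proxy prefix, the handler table and the function
names are hypothetical. The regexp is deliberately naive: it only catches absolute
http(s) URLs in matching quotes, and it does not preserve the original scheme.

```python
import re
from typing import Callable, Dict

PROXY_HOST = "proxy.example.com"              # hypothetical
PROXY_PREFIX = f"http://{PROXY_HOST}/17/"     # hypothetical, session 17

# A quote, an absolute URL without quotes or whitespace, then the same quote.
_QUOTED_URL = re.compile(r"""(["'])https?://([^"'\s]+)\1""")


def proxify_quoted_urls(text: str) -> str:
    """Point quoted absolute URLs at the proxy; leave proxified ones alone."""
    def replace(m: "re.Match[str]") -> str:
        quote, rest = m.group(1), m.group(2)
        if rest.startswith(PROXY_HOST + "/"):
            return m.group(0)  # already proxified
        return f"{quote}{PROXY_PREFIX}{rest}{quote}"
    return _QUOTED_URL.sub(replace, text)


# Content types without an entry are passed through unmodified.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "text/javascript": proxify_quoted_urls,
    "application/javascript": proxify_quoted_urls,
}


def postprocess_content(content_type: str, body: str) -> str:
    """Pick a handler by media type, ignoring parameters such as charset."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    handler = HANDLERS.get(media_type)
    return handler(body) if handler else body
```

HTML and CSS handlers would be further entries in the same table. Content that needs
no rewriting never gets touched, which keeps the door open for streaming it.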

Debug mode(s)
 - I would like to keep the old debug flag. It can be triggered by adding /cproxydebug/
   in the URL, or via the configuration.
    - Adds lots of debug output to the page, explaining every rewrite etc
    - Changes content-type to text/plain, so it is readable in a browser
 - I would also like to keep the dump directory functionality
   - If there exists a directory that matches the session file name (e.g. cf.17
     and cf.17.dump), write debug output into a new file (one for each request in
     the session). Same kind of output as in the debug mode above.
   - New feature: Write a summary file in the dump directory as well, with all files
     listed together with key features of the request/response (cookies, redirects,
     errors)
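
A sketch of the dump directory check, including the proposed summary file. The file
names inside the dump directory (0001.txt, summary.txt) and the function signature
are assumptions, only the "<session file>.dump" convention comes from the notes above.

```python
from pathlib import Path
from typing import Optional


def dump_request(
    session_file: Path, request_no: int, debug_text: str, summary_line: str
) -> Optional[Path]:
    """Write one debug file per request if <session file>.dump exists.

    Returns the path of the new dump file, or None when dumping is off.
    """
    dump_dir = session_file.with_name(session_file.name + ".dump")
    if not dump_dir.is_dir():
        return None  # no dump directory, so no dumping for this session
    out = dump_dir / f"{request_no:04d}.txt"
    out.write_text(debug_text, encoding="utf-8")
    # Append one line per request to the summary file.
    with (dump_dir / "summary.txt").open("a", encoding="utf-8") as fh:
        fh.write(f"{out.name}\t{summary_line}\n")
    return out
```

Dumping is switched on for a session simply by creating the directory, e.g.
"mkdir cf.17.dump" next to cf.17, with no configuration change needed.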

Testing
 - Unit tests for the modules
 - Integration test against locally served files, so we can see actual proxying
   and verify results

