# cproxy filter

The cproxy filter (filter_cproxy) sets up a browser session the way the
cf-engine stored it, so that cookies, authentication, etc. are re-established
when the user clicks on a proxified link.

## Overview

The cproxy filter is intended to be used in a filter chain that also contains:

* http_rewrite filter for rewriting the HTTP request and response
* http_client filter for passing the request on to the actual website

Together these filters manage the processing of a proxified URL like

    http://ebsco2-cfproxy.indexdata.com:8888/ZZZ/ebsco2/17/www.indexdata.com/

so that the system fetches www.indexdata.com, and rewrites all links on
the page to point back to http://ebsco2-cfproxy.indexdata.com:8888/...
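
To make the URL layout concrete, here is a minimal Python sketch that splits a
proxified URL into its session number and the target URL. It is only an
illustration, not the filter's code: the function name *deproxify* is made up,
the prefix (here ZZZ/ebsco2) is assumed to be known in advance, the session is
assumed to be all digits, and the target is assumed to be fetched over plain
http.

```python
from urllib.parse import urlsplit


def deproxify(proxified_url, prefix):
    """Split a proxified URL into (session, target_url).

    prefix is the part between host:port and the session number,
    for example "ZZZ/ebsco2". Returns None if the URL has another shape.
    """
    parts = urlsplit(proxified_url)
    path = parts.path.lstrip("/")
    lead = prefix.strip("/") + "/"
    if not path.startswith(lead):
        return None
    # What follows the prefix is "<session>/<target host and path>".
    session, _, target = path[len(lead):].partition("/")
    if not session.isdigit() or not target:
        return None
    # Assumption: the target site is fetched over plain http.
    target_url = "http://" + target
    if parts.query:
        target_url += "?" + parts.query
    return int(session), target_url


print(deproxify(
    "http://ebsco2-cfproxy.indexdata.com:8888/ZZZ/ebsco2/17/www.indexdata.com/",
    "ZZZ/ebsco2",
))
```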


## Function

Simplifying a bit, the filter does three things:

 * Analyzes the request URL to extract the session number
 * Reads the session file
 * Adds cookies etc to the HTTP request

Analyzing the URL sounds easy, and it is, in the simple case when the URL
actually contains the necessary parts. But if it does not, there is a fallback
to look in the Referer header, and another fallback to look at a cookie. These
are needed because some sites go out of their way to produce links in a form
that we do not manage to rewrite (for example via JavaScript or a Flash
animation).
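
The fallback order described above can be sketched roughly as follows. This is
a simplified Python illustration under stated assumptions: the prefix
ZZZ/ebsco2 is taken from the example URL, the cookie name *cproxysession* is a
placeholder invented for this sketch (the real cookie name is not given here),
and headers are passed as a plain dictionary with exact-case keys.

```python
import re

# Placeholders for this sketch only; not taken from the filter itself.
PREFIX = "ZZZ/ebsco2"
FALLBACK_COOKIE = "cproxysession"

SESSION_RE = re.compile("/" + re.escape(PREFIX) + r"/(\d+)/")


def session_from_url(url):
    """Return the session number embedded in a proxified URL, or None."""
    match = SESSION_RE.search(url or "")
    return int(match.group(1)) if match else None


def find_session(request_url, headers):
    """Try the request URL first, then the Referer header, then a cookie.

    Returns (session, source), or (None, None) if nothing was found.
    """
    session = session_from_url(request_url)
    if session is not None:
        return session, "url"
    session = session_from_url(headers.get("Referer"))
    if session is not None:
        return session, "referer"
    for pair in headers.get("Cookie", "").split(";"):
        name, _, value = pair.strip().partition("=")
        if name == FALLBACK_COOKIE and value.isdigit():
            return int(value), "cookie"
    return None, None
```

Returning the source alongside the session number is a choice made for the
sketch; it makes it easy to see in debug output which fallback was used.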

There is also the special case that we may not have any session at all, but
have a parameter file (p-file), which contains information about a content
connector that can be used for creating the session as needed. This is done by
firing an SRU request to a cf-zserver, with parameters from the p-file, most
notably the content connector. The assumption is that the content connector
will log in to the website, or do whatever else it needs to, and will create a
session that we can use.

Provided that we found (or created) a session number, we can now read the
session file. We support two forms of cookie lines: an old format that the
current cf-engines produce, which only has the cookie name, value, and
domain, and a new format that will be standard in all newer cf-engines,
where we pass all cookie attributes, most notably the path. For the new cproxy,
the session file can also contain other parameters, like custom replacements.

Finally the filter modifies the HTTP request. It adds the HTTP authentication
and Referer headers, if those were specified in the session file. It also adds
an X-Metaproxy-Proxy header if the session file contains a proxyip. This header
is later used by the http_client filter that actually fetches the page.
Lastly the filter merges cookies from the session by going through the session
cookies, and checking that

 * The domain matches the domain in the (deproxified) request
 * The path matches the path in the (deproxified) request, if the cookie has
   a path
 * The request does not already contain a cookie with the same name

If all tests pass, the cookie is appended to the cookie line in the request.
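
The three checks can be illustrated with a short Python sketch. The class and
function names are invented for the example, and the domain and path matching
follows the usual cookie conventions (suffix match on the domain, prefix match
on the path); the filter's exact matching rules may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionCookie:
    name: str
    value: str
    domain: str
    path: Optional[str] = None  # old-format cookie lines carry no path


def domain_matches(cookie_domain, host):
    """Assumed rule: host equals the cookie domain or is a subdomain of it."""
    domain = cookie_domain.lstrip(".").lower()
    host = host.lower()
    return host == domain or host.endswith("." + domain)


def path_matches(cookie_path, request_path):
    """Assumed rule: a cookie without a path matches every request path."""
    if cookie_path is None or request_path == cookie_path:
        return True
    prefix = cookie_path if cookie_path.endswith("/") else cookie_path + "/"
    return request_path.startswith(prefix)


def merge_cookies(cookie_header, session_cookies, host, path):
    """Return the Cookie header with the matching session cookies appended."""
    pairs = [p.strip() for p in cookie_header.split(";") if p.strip()]
    present = {p.partition("=")[0] for p in pairs}
    for cookie in session_cookies:
        if not domain_matches(cookie.domain, host):
            continue
        if not path_matches(cookie.path, path):
            continue
        if cookie.name in present:
            continue  # the browser's own cookie wins
        pairs.append(f"{cookie.name}={cookie.value}")
        present.add(cookie.name)
    return "; ".join(pairs)
```

Here *host* and *path* are those of the deproxified request, so a session
cookie for indexdata.com is added when the target is www.indexdata.com, but
never replaces a cookie the browser already sent.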

The filter then passes the request on to the next step in the filter chain.

When the response comes back from the chain, the filter adds a cookie for the
session fallback, and possibly adds debug output to the content. If the session
file requires it, the filter can also do custom replacements on the content
itself.


## Configuration

The XML configuration is pretty simple. Here is one example:

      <filter type="cproxy">
        <debug>0</debug>
        <sessionmaxage>360</sessionmaxage> <!--minutes-->
        <cfconfig>/etc/cf-proxy/cproxy.cfg</cfconfig>
      </filter>

The *debug* flag sets the initial debug level. Normally this should be zero,
unless trying to debug the filter itself. See "Debugging" below for
ways to change it on the fly.

The *sessionmaxage* tells how long the proxy considers a session valid, in
minutes. This is no guarantee that the underlying website accepts a session
of that age. If we see a request for a session that is older than this number
of minutes, we reject it immediately. Also, on startup, the filter goes through
the session directory, and removes all files older than this.
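
As a rough illustration of the age check and the startup cleanup, the Python
sketch below measures a session's age by the modification time of its file.
That is an assumption, as is the restriction to files named cf.<number> (a
pattern inferred from the dump directory example in "Debugging"); the function
names are made up for the example.

```python
import os
import time


def is_expired(path, max_age_minutes, now=None):
    """True if the file was last modified more than max_age_minutes ago."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_age_minutes * 60


def purge_old_sessions(session_dir, max_age_minutes, now=None):
    """Remove expired session files and return their names, sorted.

    Assumption: only regular files whose names start with "cf." are
    treated as session files; everything else is left alone.
    """
    with os.scandir(session_dir) as it:
        entries = list(it)
    removed = []
    for entry in entries:
        if not entry.is_file() or not entry.name.startswith("cf."):
            continue
        if is_expired(entry.path, max_age_minutes, now):
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

With the example configuration, *purge_old_sessions("/tmp", 360)* would remove
session files that have not been touched for more than six hours.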

The *cfconfig* refers to the configuration file that the cf-engine uses when it
proxifies URLs, normally at /etc/cf-proxy/cproxy.cfg. This is so that we don't
have to repeat those configuration settings in two places, and risk that they
could go out of sync. The cproxy has to run on the same machine as the
cf-zserver that produces the proxified links, so this file should (almost)
always be there. The file is pretty simple too:

    proxyhostname: pxy.indexdata.com:9000/XXX/node102
    sessiondir: /tmp
    cfengine: localhost:9001

The *proxyhostname* is not a very good name for the setting, since it specifies
not only the hostname, but the whole beginning of a proxified URL. In some
early version it was only the hostname, and the setting name stuck. The part
after the hostname and port is a prefix that gets added to every URL. It can
be used for load balancing, etc. The session number will be the next component
of the URL.

The *sessiondir* setting tells where the session files live.

The *cfengine* setting tells which cf-engine to use when making a request for
the content connector, when that is needed.
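
To show how these settings fit together, here is a small Python sketch that
reads the "key: value" lines and builds a proxified URL from *proxyhostname*,
a session number, and a target. The function names are invented for the
example, and the http scheme is an assumption based on the example URL in the
overview.

```python
def parse_cfconfig(text):
    """Parse 'key: value' lines into a dictionary, skipping blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only; the value may contain a port.
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def proxify(settings, session, target):
    """Build proxyhostname + "/" + session + "/" + target."""
    base = settings["proxyhostname"].rstrip("/")
    return "http://{}/{}/{}".format(base, session, target)


EXAMPLE = """\
proxyhostname: pxy.indexdata.com:9000/XXX/node102
sessiondir: /tmp
cfengine: localhost:9001
"""

settings = parse_cfconfig(EXAMPLE)
print(proxify(settings, 17, "www.indexdata.com/"))
```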

The cproxy filter works together with the http_rewrite module, which also needs
to be configured correctly. That can be quite complex, but there is an example
config file in *cpxoryrewrite.xml*.


## Debugging

The cproxy filter has several debug options. Those can be triggered by the debug
setting in the config file, or by adding a special debug component in the URL,
just about anywhere. The component can be plain /cproxydebug/, or it can be
something like /cproxydebug-31/, where the number sets one or more of the debug
bits, or it can be something like /cproxydebug-verbose-dump/ where the different
words correspond to the debug bits. The possible values are:

    cproxydebug   1  Enable debug
    -verbose      2  More verbose output
    -nomove       4  Do not move the packet to the next filter in the chain,
                     just dump it at the point when the cproxy filter is done
                     with it
    -keepcontent  8  Do not force the output as text/plain, like the debug flag
                     usually does. This can make the debug output less readable,
                     but renders the page itself more like it should.
    -dump        16  Create a dump directory, and dump each file in the session
                     in there.
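
The mapping from a debug component to the bits in the table can be sketched in
Python as below. The function name is made up, and the sketch assumes that the
mere presence of the component sets bit 1, with numbers and words OR-ed on top
of it; unknown words are ignored.

```python
import re

DEBUG_BITS = {"verbose": 2, "nomove": 4, "keepcontent": 8, "dump": 16}

# Matches /cproxydebug/, /cproxydebug-31/, /cproxydebug-verbose-dump/, ...
DEBUG_RE = re.compile(r"/cproxydebug((?:-[a-z0-9]+)*)/")


def parse_debug(url):
    """Return the debug level requested in the URL, or 0 if there is none."""
    match = DEBUG_RE.search(url)
    if not match:
        return 0
    level = 1  # assumption: the bare component always enables debug
    for word in filter(None, match.group(1).split("-")):
        if word.isdigit():
            level |= int(word)
        else:
            level |= DEBUG_BITS.get(word, 0)
    return level


print(parse_debug("http://pxy.indexdata.com:9000/XXX/node102/17/cproxydebug-verbose-dump/www.indexdata.com/"))
```

Under these assumptions /cproxydebug-verbose-dump/ gives 1 + 2 + 16 = 19, and
/cproxydebug-31/ sets all five bits.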

It is also possible to create the dump directory manually: Append .dump to the
name of the session file (as in /tmp/cf.17.dump/). Make sure the metaproxy
process can write in there. The dump directory will have debug output for each
file that passes through the filter within this session. There is a symlink
with the name __start (two underscores to make sure it sorts first in ls
output), pointing to the first file ever dumped in the session. This is usually
where you would start debugging. Typically it is either the main page itself,
or the first of a (possibly long) chain of redirects.
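
For completeness, a tiny Python sketch of the manual route: create the dump
directory next to a session file, and later look up where the __start symlink
points. The helper names are invented for the example; the session file path
/tmp/cf.17 is taken from the example above.

```python
import os


def make_dump_dir(session_file):
    """Create '<session_file>.dump' if it is missing and return its path."""
    dump_dir = session_file + ".dump"
    os.makedirs(dump_dir, exist_ok=True)
    return dump_dir


def first_dumped(dump_dir):
    """Return the file the __start symlink points to, or None if absent."""
    link = os.path.join(dump_dir, "__start")
    if not os.path.islink(link):
        return None
    return os.path.realpath(link)
```

Calling *make_dump_dir("/tmp/cf.17")* as the user that runs metaproxy takes
care of the write permission mentioned above, since the directory is then
owned by that user.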



