The (new) cf-proxy


1) Intro

The cf-proxy is a content proxy for the CF. The CF can scrape sites that
require authentication. Scraping links from such sites would not be any good
for users, since they point to pages that also require authentication. So
the CF proxifies those links to point to the cf-proxy, and the cf-proxy
re-establishes the authenticated session and fetches those pages for the
user.

Technically, the cf-proxy is implemented as a metaproxy route that uses three
separate filters:
 - cproxy filter to handle the CF-specific parts of re-establishing sessions,
   managing cookies, etc
 - rewrite filter to rewrite requests to the target site, and responses so that
   they point back to the cf-proxy
 - http client filter that actually fetches the pages from the site.


2) Installation

The cf-proxy is packaged for CentOS and Debian. It requires a metaproxy
installation on the server. By default the cf-proxy configures metaproxy
to listen on port 80, which can cause conflicts if the same machine is
running a web server. The port can be changed, see cf-proxy.port.xml under
Configuration details below.


3) Upgrading

This version is a major rewrite of the cf-proxy; the whole thing has been
redesigned. The old cf-proxy relied on Apache to do most of its proxying,
but the new version does not use Apache at all. So, when upgrading, you
need to remove Apache from the server.

The installation scripts remove the old session files and dump directories
from /tmp. The old version created them as the apache user, so the clean-up
code in the new version would not be able to remove them.

There is no longer any need for a cron job to clean the session directory;
cf-proxy does that on every startup (for example, after log rotation).


4) Configuration details

There are four configuration files, installed in /etc/cf-proxy. Symlinks
are created in /etc/metaproxy for two of them:
  /etc/metaproxy/ports.d/cf-proxy.port.xml -> /etc/cf-proxy/cf-proxy.port.xml
  /etc/metaproxy/routes.d/cf-proxy.route.xml -> /etc/cf-proxy/cf-proxy.route.xml

The four configuration files are:

* cf-proxy.port.xml
is a simple XML fragment, typically only one line. It defines which port the
metaproxy should listen on, most often 80. It also names the route that
requests arriving at this port should take. This points to the route
defined in cf-proxy.route.xml.

* cf-proxy.route.xml
Defines the metaproxy route used for proxying. Apart from logging, its main
points are the three filters: cproxy, http_rewrite, and http_client. The cproxy
filter does all the session and cookie management, etc. It has two configuration
settings: cfconfig, which refers to the configuration file for the cf-engine,
for its proxying settings, and sessionmaxage, which defaults to 360 minutes
(six hours) - most sites have session timeouts much shorter than this.

The second filter, http_rewrite, is configured from its own file, included from
cf-proxy.route.xml. See cproxyrewrite.xml below.

Finally, the http_client filter requires no configuration at all.

* cproxyrewrite.xml
is the configuration for the http_rewrite filter. It has all the rules about
which attributes of which tags to proxify, how that is done, etc. It is quite
complex, but there should not be any need to edit it; it is tuned for
the needs of the cf-proxy.

* cproxy.cfg
is the configuration file for the cf-engine, and controls the way it generates
proxified links in the results. Its format has not changed in this release, but
the new cf-proxy reads the file too, for its own configuration. The important
lines are:
  proxyhostname: localhost:9000/XXX/node102
     Defines the name the cf-proxy host is known under, optionally a port, and
     the prefix to use in the proxified URLs
  sessiondir: /tmp
     Tells where the session files are kept.
  cfengine: localhost:9001
     Tells which cf-engine the cproxy will use in the case it needs to create
     new sessions.
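
As an illustration of the file format, here is a minimal sketch of reading
the "key: value" lines shown above and splitting proxyhostname into its three
parts. It is not the code the cf-proxy or the cf-engine uses; the function
names parse_cproxy_cfg and split_proxyhostname are made up for this example,
and the sketch assumes that lines without a colon can simply be ignored.

```python
def parse_cproxy_cfg(text):
    """Parse 'key: value' lines into a dict.

    Assumption: blank lines and lines without a colon are ignored.
    """
    settings = {}
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        settings[key.strip()] = value.strip()
    return settings


def split_proxyhostname(value):
    """Split 'host[:port][/prefix]' into (host, port, prefix).

    port is None and prefix is '' when they are not given.
    """
    hostport, _, prefix = value.partition("/")
    host, _, port = hostport.partition(":")
    return host, (int(port) if port else None), prefix


sample = """\
proxyhostname: localhost:9000/XXX/node102
sessiondir: /tmp
cfengine: localhost:9001
"""
cfg = parse_cproxy_cfg(sample)
print(split_proxyhostname(cfg["proxyhostname"]))
# prints ('localhost', 9000, 'XXX/node102')
```

Only the first colon on a line separates the key from the value, so the
host:port part of proxyhostname and cfengine survives intact.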

Note that the new cf-proxy has no use for Apache, nor any configuration for it.

* Session files
are in /tmp (by default, set with sessiondir in cproxy.cfg). At every startup
(log rotation, etc.) cf-proxy goes through the session directory and removes
files that are older than the sessionmaxage (in cf-proxy.route.xml).
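
The clean-up rule can be sketched as follows. This is only an illustration
of "remove files older than sessionmaxage minutes", not the actual C++ code
in the filter. The function name remove_stale_sessions and the name_filter
parameter are made up for this example; how the real cf-proxy recognises its
own session files is not described here, so the caller has to supply that
test.

```python
import os
import time


def remove_stale_sessions(session_dir, max_age_minutes=360,
                          name_filter=lambda name: True, now=None):
    """Delete regular files in session_dir that are older than
    max_age_minutes and accepted by name_filter. Returns the sorted
    list of removed file names.

    name_filter is a hypothetical hook: pass a function that returns
    True only for session files, so that nothing else gets deleted.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    removed = []
    for entry in os.scandir(session_dir):
        if not entry.is_file(follow_symlinks=False):
            continue  # skip directories and symlinks
        if not name_filter(entry.name):
            continue
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

The age is taken from the modification time of the file, which is an
assumption of this sketch. With the default of 360 minutes, a file last
modified seven hours ago is removed and one modified an hour ago is kept.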


5) Debugging troublesome sites

The cf-proxy can never be perfect: the world is full of sites that find new
tricks that we have to catch up with. Debugging those can be daunting. Here are
some quick guidelines:
 * Use FireBug to see the requests the page makes. Pay attention to redirects,
   failures, and hanging requests. Check that the URLs look properly proxified.
   If you see URLs that point to the cf-proxy host, but without the proxy prefix
   and session, you should investigate where they come from (view source may be
   helpful). Most likely they come from some JavaScript and/or a strangely
   formatted link that did not get proxified. Requests for such URLs should get
   a 302 redirect to the properly proxified URL.

 * Add debug flags to the URL you are looking at. You can add them at any /
   after the hostname. The most useful flags are:
    /cproxydebug/          Produce a simple debug output for the page
    /cproxydebug-verbose/  Produce a more verbose output
    /cproxydebug-dump/     Create a dump directory on the server. See below
    /cproxydebug-nomove/   Process only the request, do not pass on to rewrite etc
    /cproxydebug-keepcontent/ Do not force the output to text/plain, but show as is
    /cproxydebug-cookie/   Analyse cookies in the session. See below.
   The debug flags can be combined, as in /cproxydebug-dump-nomove/

 * Use dump directories. If invoked with -dump, the cf-proxy creates a so-called
   dump directory (/tmp/cf.999999.dump in the default configuration, with 999999
   replaced by the session number). For every request, it produces a dump file.
   Each file contains the same output as a /cproxydebug-verbose/ request would
   give, but the responses are passed on to the browser as usual, so the whole
   sequence of requests is recorded. These files have to be examined locally on
   the server. The dump directory can also be created by hand on the server; be
   sure to make it writable by the metaproxy user! The dump directory also
   contains a symlink __start that points to the first file seen, usually the
   right place to start reading. There is also a _cookietrace, which is a kind
   of summary file of all requests.

 * There is a special tool built into the cf-proxy for analysing cookies. It is
   a bit clumsy to use. First you need to create a dump directory. Then you need
   to make a regular request to the result page, in order to collect information
   in the dump directory. Finally you can make a request with
   /cproxydebug-cookie/. This produces a list of all requests, with all cookies
   that have gone through the system, with suspicious cookies flagged. The
   output can be quite large if a page consists of many requests, and there
   will be some false alarms.
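
To make the combined debug flags more concrete, here is a small sketch of
how a segment like /cproxydebug-dump-nomove/ can be picked out of a URL path
and split into its flags, and of how the default dump directory name is
formed from the session number. The function names split_debug_flags and
dump_dir_for are made up for this example, and the way the real filter
parses the URL may differ.

```python
def split_debug_flags(path):
    """Find the first cproxydebug segment in a URL path.

    Returns (flags, path_without_that_segment). flags is None when
    there is no debug segment, an empty set for plain /cproxydebug/,
    and otherwise the set of names after 'cproxydebug-'.
    """
    flags = None
    kept = []
    for segment in path.split("/"):
        is_debug = (segment == "cproxydebug"
                    or segment.startswith("cproxydebug-"))
        if flags is None and is_debug:
            flags = set(segment.split("-")[1:])
        else:
            kept.append(segment)
    return flags, "/".join(kept)


def dump_dir_for(session, base="/tmp"):
    """Name of the dump directory for a session, following the
    /tmp/cf.999999.dump pattern of the default configuration."""
    return "%s/cf.%s.dump" % (base, session)


flags, rest = split_debug_flags("/some/path/cproxydebug-dump-nomove/page.html")
print(sorted(flags), rest)
# prints ['dump', 'nomove'] /some/path/page.html
print(dump_dir_for(999999))
# prints /tmp/cf.999999.dump
```

The path /some/path/page.html is only a placeholder; it does not show the
real layout of a proxified URL.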


6) Working with the code

The code is in the src subdirectory of the git checkout. There is a simple
makefile and two source files: filter_cproxy.cpp and test_filter_cproxy.cpp.
The latter has a lot of unit tests, based on the test functionality of YAZ.
When working on a bug or a new feature, write tests for it first!

The filter_cproxy.cpp defines two major classes:

yf::CProxy::Rep, which is the internal representation of the filter, mostly
configuration stuff.

yf::CProxy::Handle, which is a handle for each request passing through the
filter.

The source code is divided into sections:
 - Declarations
 - Helpers, including the Cookie class and some error handling
 - Debug code: logging, dump files, cookie analysis, etc
 - Session management: Find out the current session from the request URL, or
   by various tricks (referer, cookie), or try to create a new session.
   Loading the session file is also here.
 - Modifying the request (passing credentials and cookies from the session, etc)
 - Postprocessing the response (cookies, content)
 - process() function that gets called from metaproxy.
 - Configuration of the filter
 - House keeping (cleaning old session files, etc)




