Internet Tools


Internet Tools is a library for processing web pages, designed to be easy to use.


Overview

The Internet Tools provide units to process X/HTML data and to download it over an HTTP or HTTPS connection.

The library is completely implemented in Pascal, thread-safe, GPLed, and does not depend on other libraries (except Synapse on Linux).

HTTP/S connections

The Internet Tools do not implement HTTP connections themselves, but provide wrappers around WinInet, Synapse and Apache HttpComponents. WinInet (Windows Internet API) is installed on all Windows/WINE systems and supports all URLs that Internet Explorer supports. Synapse is a platform-independent network library. Apache HttpComponents is the old standard Android network library.

The wrappers are implemented as classes derived from a common abstract class, so the application can easily switch between the backends. It is recommended to use the WinInet wrapper on Windows, the Synapse wrapper on Linux and the Apache wrapper on Android.

All data is uploaded and downloaded as strings, and the Internet Tools automatically handle checking for local SSL libraries, cookies, referrers and redirects.
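
A minimal sketch of selecting a backend explicitly; the TInternetAccess base class, the defaultInternetAccessClass variable and the TW32InternetAccess / TSynapseInternetAccess wrapper classes are assumed to come from the internetaccess, w32internetaccess and synapseinternetaccess units:

uses
  internetaccess,
  {$ifdef windows}w32internetaccess{$else}synapseinternetaccess{$endif};
var http: TInternetAccess;
begin
  //pick the recommended backend for the platform (class names as assumed above)
  {$ifdef windows}
  defaultInternetAccessClass := TW32InternetAccess;
  {$else}
  defaultInternetAccessClass := TSynapseInternetAccess;
  {$endif}
  //the wrapper returns the page as a string and handles cookies/redirects
  http := defaultInternetAccessClass.create;
  try
    writeln(http.get('http://example.com'));
  finally
    http.free;
  end;
end.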

X/HTML processing with XPath/XQuery

Processing X/HTML data is the main focus of the Internet Tools. The library provides an X/HTML parser, a tree representation, an XPath 2 / XQuery interpreter and a template matcher.

The X/HTML parser processes the X/HTML data, splits it into tags and content, and has a SAX-like interface.

The tree representation finds matching start and end tags and stores them in a linked list with additional links, which results in a DOM-like interface. This X/HTML parsing is not fully standards-compliant; however, it contains many heuristics to parse real-world websites, which are usually invalid anyway and could not be read by a strictly standards-compliant parser.
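
A minimal sketch of building and walking such a tree; the TTreeParser class with its parseTree method, the TTreeDocument/TTreeNode types and the tetOpen node type are assumptions based on the simplehtmltreeparser unit:

uses simplehtmltreeparser;
var
  parser: TTreeParser;
  tree: TTreeDocument;
  node: TTreeNode;
begin
  parser := TTreeParser.create;
  try
    //parseTree builds the linked tree, tolerating invalid real-world HTML
    tree := parser.parseTree('<html><body><p>hello</p></body></html>');
    node := tree.next;
    while node <> nil do begin
      if node.typ = tetOpen then //a start tag (assumed enum name)
        writeln('tag: ', node.value);
      node := node.next;
    end;
  finally
    parser.free;
  end;
end.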

The XPath 2 / XQuery layer implements the XPath 2 and XQuery languages, which can be used to extract values from X/HTML trees. It is standards-compliant, except for support of XML schemas and error codes (it passes only 97.8% of the XQuery Test Suite, because several tests require schemas). It also implements the JSONiq (pre-)standard for processing JSON.
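
As a sketch, a JSONiq expression can be evaluated with an empty input, just like the XQuery primes example below; the object-lookup syntax is standard JSONiq, although the exact behavior of process with JSONiq queries is an assumption here:

uses simpleinternet;
begin
  //("answer") looks up a key in the JSONiq object, printing 42
  writeln(process('', '({"hello": "world", "answer": 42})("answer")').toString);
end.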

It also supports CSS 3 Selectors by converting them into XPath expressions and evaluating those.
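
As a sketch of this conversion, a CSS selector can be evaluated through a css(...) call inside a query; the css() extension function is an assumption based on the query language of the Internet Tools:

uses xquery, simpleinternet;
var v: IXQValue;
begin
  //css("p.x") is translated to an equivalent XPath expression and evaluated
  for v in process('<html><body><p class="x">one</p><p>two</p></body></html>',
                   'css("p.x")') do
    writeln(v.toString);
end.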

The template matcher uses pattern-matching templates to extract several structured values from an HTML page. Such a template is like an HTML file that has been annotated at the interesting parts (just as a regular expression is a string annotated with capture groups); see the form-reading example below.

Examples

A few examples of the simple layer of the Internet Tools (which is not as powerful and customizable as using the classes directly).

Load a web page:

uses simpleinternet;
var str: string;
begin
  //retrieve downloads the given URL and returns the page as a string
  str := retrieve('http://www.google.de');
end.

Download a file:

uses bbutils, simpleinternet;
var TargetFileNameUTF8: string;
begin
  TargetFileNameUTF8 := 'google.html'; //any target file name
  strSaveToFileUTF8(TargetFileNameUTF8, retrieve('http://www.google.de'));
end.

Dealing with SourceForge HTTP download mirrors

uses sysutils, bbutils, simpleinternet, internetaccess;
var TargetFile, SourceForgeURL, Download: string;
begin
  SourceForgeURL := 'http://sourceforge.net/projects/base64decoder/files/base64decoder/version%202.0/b64util.zip/download';
  TargetFile := '/tmp/download.zip';

  //set a user agent (the download fails without one)
  defaultInternetConfiguration.userAgent := 'curl/7.21.0 (i686-pc-linux-gnu) libcurl/7.21.0 OpenSSL/0.9.8o zlib/1.2.3.4 libidn/1.18';

  Download := retrieve(SourceForgeURL);
  if strBeginsWith(Download, '<!doctype html>') then begin
    //we received the download page instead of the file, so follow the direct-download link
    //(this branch was never taken when I tested, but since the synapse example has it, it is kept here)
    SourceForgeURL := process(Download, '//a[@class="direct-download"]/@href').toString;
    Download := retrieve(SourceForgeURL);
    if strBeginsWith(Download, '<!doctype html>') then raise Exception.create('Multiple redirections');
  end;

  strSaveToFileUTF8(TargetFile, Download); //save the downloaded data to the target file
end.

Extract all links from a page

uses xquery, simpleinternet; //the xquery unit provides the IXQValue type
var link: IXQValue;
begin
  //process downloads the page and evaluates the XPath expression against it
  for link in process('http://www.google.de', '//a/@href') do
    writeln(link.toString);
end.

Use templates to read a form

uses simpleinternet;
//html file to process (which you would of course usually not include in the program)
const EXAMPLE_HTML: string =
  '<html><head><title>...</title></head>' +
  '<body>lorem ipsum' +
  '<form>lorem ipsum' +
  'foobar: <input type="text" name="abc" value="123"/>' +
  'foobar: <input type="text" name="def" value="456"/>' +
  'foobar: <input type="text" name="ghi" value="678"/>' +
  '</form>' +
  '</html>';

//template (as you can see, it is the HTML reduced to the relevant parts)
const EXAMPLE_TEMPLATE: string =
  '<form>' +
  '<input type="text" name="abc">{abc:=@value}</input>' +
  '<input type="text" name="def">{def:=@value}</input>' +
  '<input type="text" name="ghi">{ghi:=@value}</input>' +
  '</form>';

begin
  process(EXAMPLE_HTML, EXAMPLE_TEMPLATE);
  writeln(processedVariables.get('abc').toString);
  writeln(processedVariables.get('def').toString);
  writeln(processedVariables.get('ghi').toString);
end.

The newest version has a special "XPath" function form to read a form. E.g. form(//form[1]).url would return ?abc=123&def=456&ghi=678 in the above example. But you still need templates to do something complex that is not form-related.
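
As a sketch, reusing the EXAMPLE_HTML constant from the previous example (the dot notation for reading the url property is quoted from the description above and treated as an assumption):

uses simpleinternet;
//assumes the EXAMPLE_HTML constant defined in the previous example
begin
  //form(//form[1]) collects the fields of the first form;
  //.url yields the request URL, here ?abc=123&def=456&ghi=678
  writeln(process(EXAMPLE_HTML, 'form(//form[1]).url').toString);
end.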

Calculate primes with XQuery

uses xquery, simpleinternet;
var v: IXQValue;
begin
  for v in process('',
    'xquery version "1.0";'                                +
    'declare function local:isprime($p){'                  +
    '  every $i in 2 to $p - 1 satisfies ($p mod $i != 0)' +
    '};'                                                   +
    'for $i in 2 to 30 where local:isprime($i) return $i') do
    writeln(v.toString);
end.


External links