Revision as of 14:22, 27 September 2012
Internet Tools is a library to process web pages and is intended to be easily usable.
Other Web and Networking Articles
- Networking
- Secure Programming
- Sockets - TCP/IP Sockets components
- Synapse - Serial port and synchronous TCP/IP Library
- lNet - Lightweight Networking Components
- XML Tutorial - XML is often used in network communication
- FPC and Apache Modules
- fcl-web - Also known as fpWeb, this is a library to develop web applications which can be deployed as cgi, fastcgi or apache modules.
Overview
The Internet Tools provide units to process X/HTML data and to download it over an HTTP or HTTPS connection.
The library is completely implemented in Pascal, thread-safe, GPLed, and does not depend on other libraries (except Synapse on Linux).
HTTP/S connections
The Internet Tools do not implement HTTP connections on their own, but provide wrappers around wininet and synapse. Wininet (Windows Internet API) is installed on all Windows/WINE systems and supports all URLs that Internet Explorer supports. Synapse is a platform-independent network library.
The wrappers are implemented as classes derived from an abstract interface, so the application can easily switch between both backends. However, it is recommended to use the wininet wrapper on Windows and synapse on Linux.
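Switching backends then amounts to assigning the desired wrapper class before the first request. A minimal sketch, assuming the abstract base class lives in the internetaccess unit and the wrappers in w32internetaccess and synapseinternetaccess (check the unit and class names against your version of the library):

```pascal
uses
  internetaccess
  {$ifdef windows}, w32internetaccess{$else}, synapseinternetaccess{$endif};

begin
  //pick the recommended backend per platform;
  //defaultInternetAccessClass is the class used by the simple
  //functions such as retrieve()
  {$ifdef windows}
  defaultInternetAccessClass := TW32InternetAccess;
  {$else}
  defaultInternetAccessClass := TSynapseInternetAccess;
  {$endif}
end.
```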
All data is uploaded or downloaded as strings, and the Internet Tools automatically handle detecting local SSL libraries, cookies, referrers and redirections.
X/HTML processing
Processing of X/HTML data is the main focus of the Internet Tools. The library provides an X/HTML parser, a tree builder, an XPath 2 interpreter and a template matcher.
The X/HTML parser splits the X/HTML data into tags and content, and has a SAX-like interface.
The tree builder finds matching start and end tags and stores them in a linked list with additional links, which results in a DOM-like interface. This X/HTML parsing is not fully standard compliant, but it contains many heuristics to parse real-world websites, whose markup is usually invalid.
The XPath 2 layer implements the XPath 2 language, which can be used to extract values from x/html trees, and is almost fully standard compliant.
It also supports CSS 3 Selectors by converting them into XPath expressions and evaluating those.
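For example, a CSS selector and a hand-written XPath equivalent (the exact expression the converter generates may differ; note that the class test has to match whole whitespace-separated tokens, not substrings):

```
CSS:   div.note > a
XPath: //div[contains(concat(" ", normalize-space(@class), " "), " note ")]/a
```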
The template matcher uses pattern-matching templates to extract several structured values from an HTML page. Such a template is like an HTML file that has been annotated at the interesting parts (just as a regular expression is a string annotated with capture groups).
Examples
A few examples for the simple layer of the Internet Tools (which is not as powerful and customizable as using the classes directly).
Load a web page:
uses simpleinternet;
..
str := retrieve('http://www.google.de');
Download a file:
uses bbutils, simpleinternet;
..
strSaveToFileUTF8(TargetFileNameUTF8, retrieve('http://www.google.de'));
Dealing with Sourceforge HTTP download mirrors
uses bbutils, simpleinternet, internetaccess;
...
var TargetFile, SourceForgeURL, Download: string;
begin
SourceForgeURL := 'http://sourceforge.net/projects/base64decoder/files/base64decoder/version%202.0/b64util.zip/download';
TargetFile:='/tmp/download.zip';
//set user agent (fails without it)
defaultInternetConfiguration.userAgent:='curl/7.21.0 (i686-pc-linux-gnu) libcurl/7.21.0 OpenSSL/0.9.8o zlib/1.2.3.4 libidn/1.18';
Download := retrieve(SourceForgeURL);
if strBeginsWith(Download, '<!doctype html>') then begin
//Download page
//(this branch was never taken when I tested, but since the synapse example has it, I include it)
SourceForgeURL:=process(Download, '//a[@class="direct-download"]/@href');
Download := retrieve(SourceForgeURL);
if strBeginsWith(Download, '<!doctype html>') then raise Exception.create('Multiple redirections');
end;
strSaveToFileUTF8(TargetFile, Download);
end.
Extract all links from a page
uses simpleinternet;
...
sl := TStringList.Create;
//read all link targets as new-line separated list in a string and let the string list split it
sl.text := process('http://www.google.de', 'string-join(//a/@href,"'#13#10'")');
Use templates to read a form
uses simpleinternet;
...
//html file to process (which you would of course usually not include in the program)
const EXAMPLE_HTML: string =
'<html><head><title>...</title></head>' +
'<body>lorem ipsum' +
'<form>lorem ipsum' +
'foobar: <input type="text" name="abc" value="123"/>' +
'foobar: <input type="text" name="def" value="456"/>' +
'foobar: <input type="text" name="ghi" value="678"/>' +
'</form>' +
'</html>';
//template (as you can see, it is the HTML reduced to the relevant parts)
const EXAMPLE_TEMPLATE: string =
'<form>' +
'<input type="text" name="abc">{abc:=@value}</input>' +
'<input type="text" name="def">{def:=@value}</input>' +
'<input type="text" name="ghi">{ghi:=@value}</input>' +
'</form>';
begin
process(EXAMPLE_HTML, EXAMPLE_TEMPLATE);
writeln(processedVariables.getVariableValueString('abc'));
writeln(processedVariables.getVariableValueString('def'));
writeln(processedVariables.getVariableValueString('ghi'));
end.
(Update: the newest version has a special XPath function form to read a form. E.g. form(//form[1]).url would return ?abc=123&def=456&ghi=678 in the above example. But you still need the template to do something complex that is not form related.)