Internet Tools


Internet Tools is a library for processing web pages that is designed to be easy to use.

Overview

The Internet Tools provide units to process X/HTML data and to download it over an HTTP or HTTPS connection.

The library is completely implemented in Pascal, thread-safe, GPLed, and does not depend on other libraries (except Synapse on Linux).

HTTP/S connections

The Internet Tools do not implement HTTP connections on their own, but provide wrappers around Wininet, Synapse and Apache HttpComponents.

  • Wininet (Windows Internet API) is installed on all Windows/WINE systems and supports all URLs that Internet Explorer supports.
  • Synapse is a platform independent network library.
  • Apache HttpComponents is the old standard Android network library.

The wrappers are implemented as classes derived from an abstract interface, so the application can easily switch between backends. However, it is recommended to use the Wininet wrapper on Windows, the Synapse wrapper on Linux and the Apache wrapper on Android.
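
As a rough sketch of how a backend might be selected explicitly: the snippet below assumes that the internetaccess unit exposes a global defaultInternetAccessClass variable, and that the wrapper classes are named TW32InternetAccess (unit w32internetaccess) and TSynapseInternetAccess (unit synapseinternetaccess). These names are not confirmed by this page, so check them against the version of the library you use. The URL is only a placeholder.

```pascal
program selectbackend;
{$mode objfpc}{$H+}

uses
  simpleinternet, internetaccess
  {$ifdef MSWINDOWS}, w32internetaccess{$else}, synapseinternetaccess{$endif};

begin
  // Assumption: defaultInternetAccessClass decides which wrapper class
  // the simple layer instantiates for its requests.
  {$ifdef MSWINDOWS}
  defaultInternetAccessClass := TW32InternetAccess;      // Wininet wrapper (assumed name)
  {$else}
  defaultInternetAccessClass := TSynapseInternetAccess;  // Synapse wrapper (assumed name)
  {$endif}

  // The calling code stays the same regardless of the chosen backend.
  writeln(length(retrieve('http://example.org')), ' bytes downloaded');
end.
```

Because the choice is made once before the first request, the rest of the program does not need to know which wrapper is in use.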

All data is uploaded or downloaded as strings, and the Internet Tools automatically handle the detection of local SSL libraries, cookies, referrers, and redirections.

X/HTML processing with XPath/XQuery

Processing X/HTML data is the main focus of the Internet Tools. For this they provide an X/HTML parser, a tree builder, an XPath 2 / XQuery interpreter and a template matcher.

The X/HTML parser processes the X/HTML data, splits it into tags and contents, and has a SAX-like interface.

The tree builder finds matching start and end tags and stores them in a linked list with additional links, which results in a DOM-like interface. This X/HTML parsing is not fully standard compliant. Instead, it contains many heuristics for parsing real-world websites, which are usually malformed anyway and could not be read by a standard-compliant parser.

The XPath 2 / XQuery layer implements the XPath 2 and XQuery languages, which can be used to extract values from X/HTML trees. It is standard compliant except for XML Schema support and error codes; it passes only 97.8% of the XQuery Test Suite because several tests require schemas. It also implements the JSONiq (pre-)standard for processing JSON.

It also supports CSS 3 Selectors by converting them into XPath expressions and evaluating those.
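
A minimal sketch of a CSS selector query on an inline HTML string follows. It assumes the XPath layer offers an extension function css() that takes the selector as a string; that function name is an assumption not confirmed by this page, and the sample HTML is invented for illustration.

```pascal
program cssdemo;
{$mode objfpc}{$H+}

uses
  simpleinternet;

const
  // Hypothetical sample document with one link that carries a class.
  EXAMPLE_HTML =
    '<html><body>' +
    '<a class="ext" href="http://example.org/a">A</a>' +
    '<a href="/b">B</a>' +
    '</body></html>';

var
  link: IXQValue;
begin
  // Assumption: css("a.ext") selects the same nodes as the XPath
  // expression //a[contains-token(@class, "ext")] would; the trailing
  // /@href step is ordinary XPath applied to the selected elements.
  for link in process(EXAMPLE_HTML, 'css("a.ext")/@href') do
    writeln(link.toString);
end.
```

Since the selector is turned into an XPath expression, it can be combined with further XPath steps, as the /@href step does here.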

The template matcher uses pattern-matching templates to extract several structured values from an HTML page. Such a template is like an HTML file that has been annotated at the interesting parts (just as a regular expression is a string annotated with capture groups).

Examples

A few examples for the simple layer of the Internet Tools (which is not as powerful or customizable as using the classes directly).

Load a web page:

uses simpleinternet;
..
str := retrieve('http://www.google.de');

Download a file:

uses bbutils, simpleinternet;
..
strSaveToFileUTF8(TargetFileNameUTF8, retrieve('http://www.google.de'));

Dealing with Sourceforge HTTP download mirrors

uses sysutils, bbutils, simpleinternet, internetaccess; //sysutils provides Exception
...
var TargetFile, SourceForgeURL, Download: string;
begin
  SourceForgeURL := 'http://sourceforge.net/projects/base64decoder/files/base64decoder/version%202.0/b64util.zip/download';
  TargetFile:='/tmp/download.zip';

  //set user agent (fails without it)
  defaultInternetConfiguration.userAgent:='curl/7.21.0 (i686-pc-linux-gnu) libcurl/7.21.0 OpenSSL/0.9.8o zlib/1.2.3.4 libidn/1.18';

  Download := retrieve(SourceForgeURL);
  if strBeginsWith(Download, '<!doctype html>') then begin
    //Download page
    //(this branch was never taken, when I tested, but since the synapse example has it, I include it)
    SourceForgeURL:=process(Download, '//a[@class="direct-download"]/@href').toString;
    Download := retrieve(SourceForgeURL);
    if strBeginsWith(Download, '<!doctype html>') then raise Exception.create('Multiple redirections');
  end;

  //save the downloaded data (not the URL) to the target file
  strSaveToFileUTF8(TargetFile, Download);
end.

Extract all links from a page

uses simpleinternet;
...
var link: IXQValue;
...
for link in process('http://www.google.de', '//a/@href') do
  writeln(link.toString);

Get external IP address

uses simpleinternet;

writeln(process('http://checkip.dyndns.org', 'extract(//body, "[0-9.]+")').toString);

Using Google Translate

If you have purchased a Google Translate API key:

uses simpleinternet, internetaccess;
var YourKey, OriginalText, SourceLang, TargetLang, TranslatedText: string;
OriginalText := TInternetAccess.urlEncodeData(OriginalText);
TranslatedText := process('https://www.googleapis.com/language/translate/v2'
                             +'?key='+YourKey
                             +'&source='+SourceLang
                             +'&target='+TargetLang
                             +'&q=' + OriginalText, 
                         '$json//translatedText').toString;

If you do not have a key, you can use the web page instead:

uses sysutils, simpleinternet, internetaccess; //sysutils provides StringReplace
var OriginalText, SourceLang, TargetLang, TranslatedText: string;
  OriginalText := StringReplace(OriginalText, '''', '''''', [rfReplaceAll]);
  TranslatedText := process(httpRequest(process('https://translate.google.com',
                                               'form(//form, {"sl": "'+SourceLang+'", "tl": "'+TargetLang+'", "text": "'+OriginalText+'"})')),
                           '#result_box').toString;

Using Google Mail, other Google APIs, or OAUTH2

Send a mail through Gmail (see the comments for an explanation):

uses LCLIntf, simpleinternet, xquery, base64, strutils, sysutils;
var 
  ClientSecret, ClientId: String;
  Scope, UserSecret, accessToken: String;
  response: xquery.IXQValue;
  messageRFC2822: String;
begin
  //First you need to obtain a client id and secret. These values are constant and never change within an application.
  //For Google's services you get them by registering on https://console.developers.google.com/ as "Installed application"
  ClientId := ....;
  ClientSecret := ....;
  
  //------------------------------------

  //Next you need to obtain an access token.
  //This token is specific for a certain user and the function (scope) of the API we want to call. 
  //It will time out after a while, but it can be cached and reused several times (so the part surrounded by //--- should only be executed once, or again after the token times out).
  //To get it, the user must visit a web page that shows the user's secret, which we need in order to request the access token.
   
  //Request the user secret
  Scope    := 'https://mail.google.com/'; //the API function (scope) we want to call
  //This will open a browser window, in which the user will see her secret 
  OpenURL('https://accounts.google.com/o/oauth2/auth?scope='+Scope+'&redirect_uri=urn:ietf:wg:oauth:2.0:oob&response_type=code&client_id='+ClientId);

  //Ask the user for her secret (in a console program, use an edit box or input dialog in a gui application)
  readln(UserSecret); 

  //With this secret, we can request the access token:
  response := process(httpRequest('https://www.googleapis.com/oauth2/v3/token', 'code='+urlHexEncode(UserSecret) + '&client_id='+ClientId+'&client_secret='+ClientSecret+'&redirect_uri=urn:ietf:wg:oauth:2.0:oob&grant_type=authorization_code'), '$json');
  accessToken := response.getProperty('access_token').toString; 

  //The access token will time out.  
  //We can use  response.getProperty('refresh_token').toString to get the refresh token, which will last longer
  
  //------------------------------------

  //Any OAUTH 2 API can now be called by adding '?access_token=' + urlHexEncode(accessToken) after the URL.

  //For example sending a mail through GMail:
   

  //Construct a simple test mail
  messageRFC2822 :=
   'From: THE USER YOU WANT TO SEND A MAIL FROM@googlemail.com'#13#10+
   'Reply-To: THE USER YOU WANT TO SEND A MAIL FROM@googlemail.com'#13#10+
   'To: THE RECIPIENT YOU WANT TO SEND IT TO@example.org'#13#10+
   //'Date: Fri, 4 Jul 2015 0:50:06 +0200'#13#10+ a current date, optional
   'Subject: test mail'#13#10+
   'Content-Type: text/plain'#13#10+
   #13#10+
   'Message body....'#13#10;
   
  defaultInternet.additionalHeaders.text := 'Content-Type: message/rfc822'; //set headers for next request
  //Send the message.  The call looks simple, but it is extremely picky about the parameters
  //When the token has timed out, this will raise an EInternetException with error code 400
  httpRequest('https://www.googleapis.com/upload/gmail/v1/users/me/messages/send?uploadType=media&access_token=' + urlHexEncode(accessToken), messageRFC2822);
  defaultInternet.additionalHeaders.text := ''; //reset headers

end.

Use templates to read a form

uses simpleinternet;
...
//html file to process (which you would of course usually not include in the program)
const EXAMPLE_HTML: string =
  '<html><head><title>...</title></head>' +
  '<body>lorem ipsum' +
  '<form>lorem ipsum' +
  'foobar: <input type="text" name="abc" value="123"/>' +
  'foobar: <input type="text" name="def" value="456"/>' +
  'foobar: <input type="text" name="ghi" value="678"/>' +
  '</form>' +
  '</body></html>';

//template (as you can see, it is the html reduced to the relevant parts)
const EXAMPLE_TEMPLATE: string =
  '<form>' +
  '<input type="text" name="abc">{abc:=@value}</input>' +
  '<input type="text" name="def">{def:=@value}</input>' +
  '<input type="text" name="ghi">{ghi:=@value}</input>' +
  '</form>';

begin
  process(EXAMPLE_HTML, EXAMPLE_TEMPLATE);
  writeln(processedVariables.get('abc').toString);
  writeln(processedVariables.get('def').toString);
  writeln(processedVariables.get('ghi').toString);
end.

Newer versions have a special XPath function, form, for reading a form. For example, form(//form[1]).url would return ?abc=123&def=456&ghi=678 for the HTML above. Templates are still needed for anything complex that is not form related.
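
A short sketch of that form function, using the same process call as the other examples on this page, could look as follows. It repeats the example form inline so that it is self-contained. The query string is the one quoted above; whether the result is exactly the string given there depends on the library version, so treat the output as something to verify.

```pascal
program formdemo;
{$mode objfpc}{$H+}

uses
  simpleinternet;

const
  // Same form as in the template example, repeated here for completeness.
  EXAMPLE_HTML =
    '<html><body><form>' +
    '<input type="text" name="abc" value="123"/>' +
    '<input type="text" name="def" value="456"/>' +
    '<input type="text" name="ghi" value="678"/>' +
    '</form></body></html>';

begin
  // form(...) collects the input fields of the first form into a request;
  // .url reads the URL property of that request.
  writeln(process(EXAMPLE_HTML, 'form(//form[1]).url').toString);
end.
```

Compared with the template approach, this needs no separate template, but it only covers form fields.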

Calculate primes with XQuery

  uses simpleinternet;
 [...]
  var v: IXQValue;
 [...]
  for v in process('',
    'xquery version "1.0";'                                +
    'declare function local:isprime($p){'                  +
    '  every $i in 2 to $p - 1 satisfies ($p mod $i != 0)' +
    '};'                                                   +
    'for $i in 2 to 30 where local:isprime($i) return $i') do
    writeln(v.toString);


External links