Revision as of 14:22, 27 September 2012
Internet Tools is a library to process web pages and is intended to be easily usable.
Other Web and Networking Articles
- Networking
- Secure Programming
- Sockets - TCP/IP Sockets components
- Synapse - Serial port and synchronous TCP/IP Library
- lNet - Lightweight Networking Components
- XML Tutorial - XML is often used in network communication
- FPC and Apache Modules
- fcl-web - Also known as fpWeb, this is a library to develop web applications which can be deployed as cgi, fastcgi or apache modules.
Overview
The Internet Tools provide units to process X/HTML data and to download it over an HTTP or HTTPS connection.
The library is completely implemented in Pascal, thread-safe, GPLed, and does not depend on other libraries (except Synapse on Linux).
HTTP/S connections
The Internet Tools do not implement HTTP connections on their own, but provide wrappers around wininet and synapse. Wininet (Windows Internet API) is installed on all Windows/WINE systems and supports all URLs that Internet Explorer supports. Synapse is a platform-independent network library.
The wrappers are implemented as classes derived from an abstract interface, so the application can easily switch between both backends. However, it is recommended to use the wininet wrapper on Windows and synapse on Linux.
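Switching backends then amounts to assigning the desired wrapper class before the first request. A minimal sketch, assuming the abstract base class lives in the internetaccess unit and the wrappers in w32internetaccess and synapseinternetaccess (check the unit and class names against your version of the library):

```pascal
uses
  internetaccess
  {$ifdef windows}, w32internetaccess{$else}, synapseinternetaccess{$endif};

begin
  //pick the recommended backend per platform;
  //defaultInternetAccessClass is the class used by the simple
  //functions such as retrieve()
  {$ifdef windows}
  defaultInternetAccessClass := TW32InternetAccess;
  {$else}
  defaultInternetAccessClass := TSynapseInternetAccess;
  {$endif}
end.
```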
All data is uploaded or downloaded as strings, and the Internet Tools automatically handle detecting local SSL libraries, cookies, referrers and redirections.
X/HTML processing
Processing of X/HTML data is the main focus of the Internet Tools. The library provides an X/HTML parser, a tree builder, an XPath 2 interpreter and a template matcher.
The X/HTML parser splits the X/HTML data into tags and content, and has a SAX-like interface.
The tree builder finds matching start and end tags and stores them in a linked list with additional links, which results in a DOM-like interface. This X/HTML parsing is not fully standard compliant, but it contains many heuristics to parse real-world websites, whose markup is usually invalid.
The XPath 2 layer implements the XPath 2 language, which can be used to extract values from x/html trees, and is almost fully standard compliant.
It also supports CSS 3 Selectors by converting them into XPath expressions and evaluating those.
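For example, a CSS selector and a hand-written XPath equivalent (the exact expression the converter generates may differ; note that the class test has to match whole whitespace-separated tokens, not substrings):

```
CSS:   div.note > a
XPath: //div[contains(concat(" ", normalize-space(@class), " "), " note ")]/a
```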
The template matcher uses pattern-matching templates to extract several structured values from an HTML page. Such a template is like an HTML file that has been annotated at the interesting parts (just as a regular expression is a string annotated with capture groups).
Examples
A few examples for the simple layer of the Internet Tools (which is not as powerful and customizable as using the classes directly).
Load a web page:
uses simpleinternet;
..
str := retrieve('http://www.google.de');
Download a file:
uses bbutils, simpleinternet;
..
strSaveToFileUTF8(TargetFileNameUTF8, retrieve('http://www.google.de'));
Dealing with Sourceforge HTTP download mirrors
uses bbutils, simpleinternet, internetaccess;
...
var TargetFile, SourceForgeURL, Download: string;
begin
SourceForgeURL := 'http://sourceforge.net/projects/base64decoder/files/base64decoder/version%202.0/b64util.zip/download';
TargetFile:='/tmp/download.zip';
//set user agent (fails without it)
defaultInternetConfiguration.userAgent:='curl/7.21.0 (i686-pc-linux-gnu) libcurl/7.21.0 OpenSSL/0.9.8o zlib/1.2.3.4 libidn/1.18';
Download := retrieve(SourceForgeURL);
if strBeginsWith(Download, '<!doctype html>') then begin
//Download page
//(this branch was never taken when I tested, but since the synapse example has it, I include it)
SourceForgeURL:=process(Download, '//a[@class="direct-download"]/@href');
Download := retrieve(SourceForgeURL);
if strBeginsWith(Download, '<!doctype html>') then raise Exception.create('Multiple redirections');
end;
strSaveToFileUTF8(TargetFile, Download);
end.
Extract all links from a page
uses simpleinternet;
...
sl := TStringList.Create;
//read all link targets as new-line separated list in a string and let the string list split it
sl.text := process('http://www.google.de', 'string-join(//a/@href,"'#13#10'")');
Use templates to read a form
uses simpleinternet;
...
//html file to process (which you would of course usually not include in the program)
const EXAMPLE_HTML: string =
'<html><head><title>...</title></head>' +
'<body>lorem ipsum' +
'<form>lorem ipsum' +
'foobar: <input type="text" name="abc" value="123"/>' +
'foobar: <input type="text" name="def" value="456"/>' +
'foobar: <input type="text" name="ghi" value="678"/>' +
'</form>' +
'</html>';
//template (as you can see, it is the HTML reduced to the relevant parts)
const EXAMPLE_TEMPLATE: string =
'<form>' +
'<input type="text" name="abc">{abc:=@value}</input>' +
'<input type="text" name="def">{def:=@value}</input>' +
'<input type="text" name="ghi">{ghi:=@value}</input>' +
'</form>';
begin
process(EXAMPLE_HTML, EXAMPLE_TEMPLATE);
writeln(processedVariables.getVariableValueString('abc'));
writeln(processedVariables.getVariableValueString('def'));
writeln(processedVariables.getVariableValueString('ghi'));
end.
(Update: the newest version has a special XPath function form to read a form. E.g. form(//form[1]).url would return ?abc=123&def=456&ghi=678 in the above example. But you still need the template to do something complex that is not form related.)