[TAG] Tuppence Tip: URL scraper.
Thomas Adam
thomas at edulinux.homeunix.org
Sun Dec 11 02:10:23 MSK 2005
Hello --
This is really an on-going issue from my post regarding urlview and the
logging of URLs. I've since decided to take a different approach, and thus
far this method works quite nicely. I'm now using 'multi-gnome-terminal'
(MGT), 'multitail', 'gmrun', plus a helper script. The overall aim of all of
this was to be able to:
* Capture URLs from various #channels that I am in (done via irssi already).
* Open them up in a web browser.
You might wonder what's so hard about this -- the problem is that
X11-forwarding on my server takes forever -- it is only a poor P166 with 64MB
of RAM, after all. The irssi session resides on the server, so I needed a
way of pseudo-opening the URLs as though the request originated on my
workstation.
It's unfortunate that I have to use MGT, since it is a memory hog, but needs
must. I use it because it can automatically hotlink URLs, so that an action
can be assigned to them when they're clicked on. Based on this premise,
filling in the gaps was easy.
I mount my server's filesystem via 'shfs' -- which I'm now using as a
replacement for NFS. I really like it (it is a lot less buggy than lufs and
its ilk). This way, I can use multitail on my workstation to keep an eye on
the URL log file. The URL logging script that I use from irssi is
"url_log.pl" [1]. All of the logged entries are in the format:
``
Sat 10 Dec 2005 00:57:08 GMT nick #chan URL
''
... and I wanted multitail to colourise the output, as it does for other
files. That was easy -- just create a new colourscheme for it in
/etc/multitail.conf:
``
###URLlog
colorscheme:urllog
cs_re:green:^... .. ... ....
cs_re:magenta:..:..:.. ...
cs_re:red:.* \#fvwm (http|https|ftp)://.*$
cs_re:blue:.* \#bash (http|https|ftp)://.*$
cs_re:cyan:.* \#elinks (http|https|ftp)://.*$
cs_re:yellow:.* \#hug (http|https|ftp)://.*$
scheme:urllog:/mnt/home/n6tadam/.irssi/urls/url
''
So to break this down a bit, remember a typical entry from this file will look
like:
``
Sat 10 Dec 2005 00:57:08 GMT nick #chan http://myfoo.com
''
Hence "Sat 10 Dec 2005" will appear in green, "00:57:08 GMT" in magenta,
and the rest of the line will appear in whichever colour is matched by the
#channel the URL was quoted in. So, it looks pretty. :)
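If you want to check which channel rule a given line would hit before
editing multitail.conf, something like the following rough sketch will do.
It assumes the cs_re patterns behave like extended regexes as understood by
'grep -E', and it drops the backslash in front of '#', which I'm assuming is
only needed inside the config file. The colour_for function is made up for
this example; it is not part of multitail.

```shell
#!/bin/sh
# Sample entry in the url_log.pl format quoted above.
line='Sat 10 Dec 2005 00:57:08 GMT nick #fvwm http://myfoo.com'

# colour_for is a hypothetical helper: it prints the colour of the first
# channel rule (copied from the colourscheme above) that matches the line,
# or "none" if no rule matches.
colour_for() {
    for rule in 'red:#fvwm' 'blue:#bash' 'cyan:#elinks' 'yellow:#hug'; do
        colour=${rule%%:*}   # text before the first ':'
        chan=${rule#*:}      # text after the first ':'
        if printf '%s\n' "$1" | grep -Eq ".* $chan (http|https|ftp)://.*\$"; then
            echo "$colour"
            return 0
        fi
    done
    echo "none"
}

colour_for "$line"
```

For the sample line this prints "red". A line from a channel without a rule
(such as the "#chan" placeholder) prints "none", which is a reminder that
only the four channels listed in the colourscheme get a colour.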
``
scheme:urllog:/mnt/home/n6tadam/.irssi/urls/url
''
... should obviously be changed to match whichever file is going to hold the
URLs from the url_log.pl script.
The next stage was to determine what happens when I click on a URL (I say
click -- the shortcut for opening a URL via MGT is 'CTRL + middleclick'). I
didn't want everything to be sent to my browser. This is where the "gmrun"
utility comes in useful [2]. For those of you who have never used it, it's a
very handy and customisable tool. One of its features is the ability to
define prefixes for certain applications. So for instance, I could enter
into gmrun:
``
man:bash
''
... and depending on what I had told gmrun to do with the 'man' prefix, it
would open up the bash man page. Neat, eh? So I wanted separate handlers
for images and for other URLs (people quite often post links to
screenshots that I don't want to open in a browser, but would rather just
'see'). I needed a helper script to do this, as gmrun accepts no
command-line options. The trick I used (to make the URL appear directly in
the gmrun window, as though I had typed it) was to append it to gmrun's
history file -- if set correctly, gmrun will display the last entered
command. No biggie, here it is:
====== SNIP HERE: runvia.sh.txt ==========
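#!/bin/sh
# Prefix image URLs with "image:" and everything else with "elinks:", then
# append the result to gmrun's history so that it shows up pre-filled.
if printf '%s\n' "$1" | grep -Eiq '\.(gif|png|jpe?g|bmp)'; then
    echo "image:$1" >> "$HOME/.gmrun_history"
else
    echo "elinks:$1" >> "$HOME/.gmrun_history"
fi
exec gmrun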
===== SNIP HERE =========
(Saved as ~/bin/runvia.sh -- and chmod 700 ~/bin/runvia.sh)
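To see which prefix a URL would be given without touching gmrun's history
file or starting gmrun, the same test can be tried on its own. This is only
a sketch: prefix_for is a made-up function name, and the example.org URLs
are placeholders.

```shell
#!/bin/sh
# prefix_for is a hypothetical helper, not part of runvia.sh itself. It
# prints the line that would be appended to the history file for a URL.
prefix_for() {
    # Case-insensitive match on a common image extension anywhere in the URL.
    if printf '%s\n' "$1" | grep -Eiq '\.(gif|png|jpe?g|bmp)'; then
        echo "image:$1"
    else
        echo "elinks:$1"
    fi
}

prefix_for 'http://example.org/shot.PNG'
prefix_for 'http://example.org/index.html'
```

The first call prints "image:http://example.org/shot.PNG" and the second
prints "elinks:http://example.org/index.html". Note that the match is not
anchored to the end of the URL, so something like "foo.png.html" would also
count as an image; adding a '$' after the closing bracket of the pattern
would tighten that up, at the cost of missing image URLs that carry a query
string.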
So the script flags to gmrun whether the URL I clicked on is an image; if
it isn't, it is flagged to open in elinks (this is my primary browser --
although I wanted a specific handler for it). But in order for that script
to process the URL that we clicked on from MGT, we need to tell MGT to
perform that action. This is easier than you'd think, and involves editing
the file $HOME/.gnome/Gnome, such that:
``
default-show=runvia.sh "%s"
''
Going back to gmrun, we lastly need to tell it what action to take for the
'elinks:' and 'image:' prefixes. That information is stored in /etc/gmrunrc,
although personally I copy this to ~/.gmrunrc and edit it, so that for the
image handler:
``
URL_image = sh -c 'feh %s'
''
('feh' has the ability to read images via http).
And for the elinks handler:
``
URL_elinks = sh -c '${TermExec} elinks -remote "%s" && FvwmCommand \
"All (*ELinks*) FlipFocus"'
''
"${TermExec}" is a variable defined further up in the file that looks like
this:
``
Terminal = rxvt
TermExec = ${Terminal} +sb -ls -e
''
... and that's it. It seems to be working really well.
Since I use FVWM, I wanted the web browser to get the focus whenever I
clicked on a URL -- hence the reason why I'm using FvwmCommand. This is
optional, of course. On a similar theme, the style of the "gmrun" dialogue
window is set to the following:
``
Style Gmrun GrabFocus
''
... so that when it pops up, I can hit enter to execute whatever is inside
it, knowing that the Gmrun window will always have the focus.
You can see a screenshot [3] of the url-logger in action.
Hope someone finds this useful, or can derive other ideas from it.
-- Thomas Adam
[1] http://www.irssi.org/scripts/html/url_log.pl.html
[2] www.bazon.net/mishoo/gmrun.epl
[3] http://edulinux.homeunix.org/~n6tadam/fvwm/ss/url-logger.png
--
I'm brutal, honest, and afraid of you.