[TAG] need help with grep
Benjamin A. Okopnik
ben at linuxgazette.net
Sun Jun 18 19:09:47 MSD 2006
On Sat, 17 Jun 2006 at 14:27:10 -0700, David Martin wrote:
>
> Hi Gurus,
>
> I need urgent help with grep/awk ok please, I've spent over 17hrs
> pulling my hair out over this question. I hope you can help a Linux
> newbie please.
As Thomas said, we don't normally help with homework questions - but it
sounds like you actually have put some time into this one rather than
just dumping it in our laps, which I suppose deserves some
consideration. Again, like Thomas, I'm not going to give you a direct
answer - it is, after all, supposed to be _your_ homework, and you're
supposed to ask your instructor if you just get stuck on an assignment -
but I'll be happy to give you a hint.
> My lecturer has asked me to do the following with this file, but I
> cannot get this on one line.
I've read the specification (i.e., the question you were asked), and it
doesn't say "on one line"; it says "a single command line". Since Linux
(or, more precisely, the shell CLI) allows you to chain processes and
use statement separators, you could (theoretically) write a 100kB-long
program "on one line" - so it's not much of a problem.
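As a small illustration of what "a single command line" can hold, here is a sketch with made-up input (the words and the variable name `result` are assumptions for the example, not part of the assignment). Two pipelines are joined by a statement separator, all on one line:

```shell
# Two pipelines joined by ';' on a single command line.
# First pipeline: sort three invented words and keep the first one.
# Second pipeline: count the lines of another invented list.
result=$(printf 'pear\napple\nfig\n' | sort | head -n 1; printf 'fig\npear\n' | wc -l | tr -d ' ')

# Show what the two pipelines produced, one result per line.
printf '%s\n' "$result"
```

The `tr -d ' '` is only there because some versions of `wc` pad their output with spaces.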
> grep -o "[[:digit:]]*[.][[:digit:]]*[.][[:digit:]]*[.][[:digit:]]*"
> httpdaccess.log > ipaddresses.log
Is there a reason that you want to use character class names instead of
explicit character classes? '[0-9]' works just as well as '[[:digit:]]'
(barring some vague mutterings about the $LANG variable, which doesn't
apply in shell-based scenarios anyway.) As well, the above expression
isn't very useful; if you're trying to match an IP, then something like
``
egrep '\<([0-9]{1,3}\.){3}[0-9]{1,3}\>'
''
is probably much more useful. On the other hand, matching an IP has
nothing to do with the solution to the stated problem, which, I suppose,
is why I'm giving you a complete answer here. :)
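To see the difference between the two patterns, here is a minimal sketch on invented sample text (the sample lines and variable names are assumptions; it also assumes a grep that supports `-o` and the `\<` / `\>` word anchors, as GNU grep does). Because every digit run in the original pattern is optional, it will happily match a bare ellipsis:

```shell
# Made-up sample text: one address, one ellipsis, one version number.
sample='210.49.49.147 connected
loading... done
version 1.2.3'

# The original loose pattern: each digit run may be empty, so "..." matches too.
loose=$(printf '%s\n' "$sample" | grep -o '[[:digit:]]*[.][[:digit:]]*[.][[:digit:]]*[.][[:digit:]]*')

# The stricter pattern (grep -E is the current spelling of egrep).
strict=$(printf '%s\n' "$sample" | grep -E -o '\<([0-9]{1,3}\.){3}[0-9]{1,3}\>')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

Note that the stricter pattern only checks the shape of the address (four groups of one to three digits); it does not check that each group is 255 or less.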
> One thing I did not do was to remove duplicate entries from the
> output. I should have run this from the shell instead.
If you're trying to figure out the busiest dates, then removing
duplicate entries is definitely NOT what you want to do - at least not
initially.
What I'll do here is give you a general idea of how the task is done.
I'm assuming that you understand the available tools well enough to
implement a solution once you understand how to look at the problem (if
you don't, then you're beyond any help that I'm willing to provide.)
The task essentially comes down to creating a *frequency counter*. This
is a fairly standard programming methodology, used a lot in - ta-daa! -
log analysis. What you need is a list of unique dates, and a number of
hits for each of those dates - essentially a line count of anything that
matches them.
I've taken a look at your log (being one of the listadmins has its
privileges :), and it's nothing more than Apache's CLF (Common Log
Format) - i.e.
```
210.49.49.147 - - [18/Apr/2004:22:59:44 +1000] "GET /ASGAP/gif/forest2.gif HTTP/1.1" 200 1857
203.40.195.112 - - [18/Apr/2004:23:01:33 +1000] "GET /ASGAP/gif/bguide.gif HTTP/1.1" 200 288
134.115.68.21 - - [18/Apr/2004:23:03:42 +1000] "GET /ASGAP/gif/forest.gif HTTP/1.0" 304 -
150.214.167.133 - - [18/Apr/2004:23:04:54 +1000] "GET /AFVL/tagasaste.htm HTTP/1.0" 200 3266
203.40.195.112 - - [18/Apr/2004:23:06:03 +1000] "GET /ASGAP/jpg/styphels.jpg HTTP/1.1" 200 5318
'''
in which fields are defined as
``
IP identd user [dy/mon/year:hh:mm:ss zone] "request" status size
''
Matching the date is very easy: it consists of the six characters
following the opening square bracket (day and month; take eleven if you
need the year as well). You can isolate those - think of what tool
you need to do that, since that's the main "processing" you need to do!
- and get a unique list of them. Once you've got that unique list, you
can loop over it and simply count anything that matches a square bracket
followed by those characters, then sort the counted output. If you want
to get really fancy, you can report only the first line of the count,
which will give you the largest count - i.e., the busiest day.
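The steps described above can be sketched roughly as follows. This is one possible reading, not a definitive answer: the choice of `cut`, `grep -c` and `sort` is an assumption, the variable names are invented, and the third sample line has had its date changed so that there is more than one date to count.

```shell
# Sample lines in Common Log Format; the date on the third line is invented.
log='210.49.49.147 - - [18/Apr/2004:22:59:44 +1000] "GET /ASGAP/gif/forest2.gif HTTP/1.1" 200 1857
203.40.195.112 - - [18/Apr/2004:23:01:33 +1000] "GET /ASGAP/gif/bguide.gif HTTP/1.1" 200 288
134.115.68.21 - - [19/Apr/2004:00:03:42 +1000] "GET /ASGAP/gif/forest.gif HTTP/1.0" 304 -'

# Step 1: isolate the six characters after the opening bracket, then deduplicate.
dates=$(printf '%s\n' "$log" | cut -d '[' -f 2 | cut -c 1-6 | sort -u)

# Step 2: loop over the unique dates, counting the lines that contain "[<date>".
# Step 3: sort the counts numerically, largest first.
counts=$(for d in $dates; do
    printf '%s %s\n' "$(printf '%s\n' "$log" | grep -c -F "[$d")" "$d"
done | sort -rn)

printf '%s\n' "$counts"
```

Piping the sorted counts through `head -n 1` would leave only the busiest day; `head -n 10` would leave the top ten.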
There is at least one standard Unix program that allows you to do all
that in one pass; however, using it is probably a bit complex for where
you are at the moment. Implementing it as I described above should work
fine for you, and only requires relatively basic tool knowledge.
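For comparison, here is a sketch of a one-pass frequency counter. It assumes awk is one such program (the message does not say which tool is meant), and the sample data and names (`log`, `hits`, `counts`) are made up for the example. An associative array keeps one running count per date, so the log is read only once:

```shell
# Sample lines in Common Log Format; the date on the third line is invented.
log='210.49.49.147 - - [18/Apr/2004:22:59:44 +1000] "GET /ASGAP/gif/forest2.gif HTTP/1.1" 200 1857
203.40.195.112 - - [18/Apr/2004:23:01:33 +1000] "GET /ASGAP/gif/bguide.gif HTTP/1.1" 200 288
134.115.68.21 - - [19/Apr/2004:00:03:42 +1000] "GET /ASGAP/gif/forest.gif HTTP/1.0" 304 -'

# For each line, find the first "[" and use the next six characters as the key.
# hits[key] is incremented once per matching line; the END block prints the totals.
counts=$(printf '%s\n' "$log" | awk '
    { i = index($0, "["); if (i) hits[substr($0, i + 1, 6)]++ }
    END { for (d in hits) print hits[d], d }
' | sort -rn)

printf '%s\n' "$counts"
```

The final `sort -rn` is still needed because awk does not promise any particular order when looping over an array.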
> Using an httpd log (which will be provided on the subject forum)
> write a single command line using a pipeline of commands mentioned in
> The complete guide to Linux system administration: Chapter 5 to
> determine the top ten busiest dates (most objects accessed).
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://linuxgazette.net *