## Goal
Today I did a practical bash exercise: crawl a web page, extract all domains and resolve them to IP addresses with bash and common GNU/Linux tools.
## Script
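The full script is in the Gist linked at the end. Below is a minimal sketch reconstructed from the comments that follow; the URL argument, the temporary file created with `mktemp` and the exact loop structure are assumptions on my part, not necessarily how the original script is written.

```bash
#!/usr/bin/env bash
# Sketch reconstructed from the comments below; the original script is in the
# Gist linked at the end. URL argument and temp file name are assumptions.
url="$1"
out="$(mktemp)"

# Grab the page, keep lines containing URLs, take the 3rd '/'-separated field
# (the host on simple href lines), drop anything after a double quote and
# collapse consecutive duplicates.
domains=$(curl "$url" -s \
  | grep -E 'https?://[^"]*' \
  | cut -d '/' -f 3 \
  | cut -d '"' -f 1 \
  | uniq)

# Resolve each domain; appending to a file keeps one answer line per record.
for domain in $domains; do
  dig +noall +answer "$domain" >> "$out"
done

# Keep only A records, print the address column and de-duplicate.
awk '/\sA\s/ {print $5}' "$out" | sort -u
```

Each step is detailed in the comments below.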
## Comments
`curl $url -s`: I used `curl` rather than `wget` because we don't need to store the downloaded web page. Then I used the `-s` option (`--silent`), which prevents the progress bar and error messages from being displayed.
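To illustrate (with a made-up URL): as soon as curl's output is piped, a progress meter appears on stderr unless `-s` is given.

```bash
# Hypothetical URL; without -s a progress meter shows up on stderr when the
# output is piped, with -s the terminal stays clean.
curl 'https://example.org/' -s | wc -c
```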
`grep -E 'https?://[^"]*'`: `-E` (`--extended-regexp`) is required for patterns such as `s?`, which lets us match `http` as well as `https`. For the sake of simplicity and efficiency I won't consider domains without a protocol, URLs where the protocol is capitalized, etc. `[^"]*` matches every character except `"`, trying to stop the match before leftovers such as `http://example.org">`.
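On a made-up line of HTML, this step simply keeps the lines that contain a URL; without `-o`, grep prints the whole matching line, which is why the next two cut steps are needed:

```
$ echo '<a href="http://example.org">link</a>' | grep -E 'https?://[^"]*'
<a href="http://example.org">link</a>
```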
`cut -d '/' -f 3`: cuts the output using `/` as a delimiter and keeps only the 3rd column, i.e. what comes after `http://`.
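Continuing with the same made-up line, the 3rd `/`-separated field is the host, possibly with some trailing garbage:

```
$ echo '<a href="http://example.org">link</a>' | cut -d '/' -f 3
example.org">link<
```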
cut -d '"' -f 1
: will use "
as a delimiter and will keep the first column to help us clean the output avoiding ugly output such as http://example.org">
that [^"]*
couldn't avoid.
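Applied to the leftover from the previous step, the second cut keeps only what comes before the first double quote:

```
$ echo 'example.org">link<' | cut -d '"' -f 1
example.org
```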
`uniq`: keeps only unique values (strictly speaking, it removes consecutive duplicate lines).
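A tiny made-up example; uniq only collapses adjacent duplicates, which is enough here:

```
$ printf 'example.org\nexample.org\ncdn.example.net\n' | uniq
example.org
cdn.example.net
```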
`filename`: I stored the result of dig in a file because it was far easier to parse. In the loop, concatenating the output into a string would not keep the line feed characters, and a domain can resolve to several IP addresses, so even storing the result in a bash array would give unexpected results when dig answers with several IP addresses. Of course I could have simply displayed the result, but that would not have allowed me to sort and filter it.
`dig +noall +answer $domain`: I was nearly forced to use `dig` rather than the cleaner `drill` used on ArchLinux: in our case we are only interested in getting IP addresses, and parsing the `drill` output to keep only the answer lines would have been difficult, requiring complex regexps, because multi-line matching combined with group matching is hard. So the easier way is to use options that `dig` has and that `drill` doesn't. But since ArchLinux migrated from dnsutils to ldns (for good reasons), I had to install `dig` with `pacman -S bind-tools`. `+noall +answer` displays only the answer section of the DNS response.
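For a made-up domain resolving to two addresses (the addresses below are from the documentation range, not real results), the answer section looks like this, one record per line, which is exactly what gets appended to the file:

```
$ dig +noall +answer example.org
example.org.		3600	IN	A	203.0.113.10
example.org.		3600	IN	A	203.0.113.11
```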
`awk '/\sA\s/ {print $5}'`: Then we want only the A records (not CNAME for example), so I used `/\sA\s/` to match only those. At first I did this filtering with `grep -E '\sA\s'`, but as I was forced to use `awk` for the next pipe anyway, I moved the regexp directly into `awk`: `awk '/\sA\s/ {print $5}'` is cleaner than `grep -E '\sA\s' | awk '{print $5}'`. I said I was forced to use `awk` and not `cut` (which would have been easier) because cut's delimiter mechanism only works when the number of delimiters is constant, but depending on the domain length `dig` will sometimes output one tab and sometimes several, so the column we want to extract will vary. And `cut` has no way to drop empty columns or to treat consecutive delimiters as a single one. So I used `awk`, which handles this pretty well with `{print $5}`.
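On answer lines like the ones above, the 5th whitespace-separated field is the address, so with the made-up records from the previous example:

```
$ awk '/\sA\s/ {print $5}' filename
203.0.113.10
203.0.113.11
```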
`sort -u`: obviously sort will sort the results, and I used `-u` (`--unique`) to keep only unique results without having to pipe `sort` into `uniq`.
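One last made-up example, equivalent to piping `sort` into `uniq`:

```
$ printf '203.0.113.11\n203.0.113.10\n203.0.113.11\n' | sort -u
203.0.113.10
203.0.113.11
```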
## Demo
## Conclusion
The script is quite short but very dense. In general, bash scripts are all about piping commands, filtering results and finding the right options.
## Links
- Gist (script): https://gist.github.com/noraj/86df35096aa250ead4f7b3a6e6eb09de
- Asciinema (ASCII video demo): https://asciinema.org/a/243061