arcdump is a small utility that displays information about the members of an ARC file, including their metadata and offsets. It also allows you to extract the content of a member based on its URL. In addition, it serves as an example of the libarc API.

Usage: arcdump [OPTION]... ARC_FILE

    -u URL    display the contents of the member with URL
    -m MIME   limit reporting to specified MIME type
    -r        list the offsets and sizes for each URL
    -h        display this help and exit
    -v        output version information and exit
There are two "modes" available: Information Display and Member Content Display.
The -m option allows you to select which media types to display. By default, arcdump displays the MIME type, crawl date, IP address, and URL of each member:
    % arcdump -m image/jpeg BT20040528233019-1.arc.gz | head
    # */*       : 551
    # image/jpeg: 18
    # MIME_Type  Date        IP_Address    URL
    image/jpeg   2004-05-28  199.88.205.3  http://www.basistech.com/images/bluenew.jpg
    image/jpeg   2004-05-28  199.88.205.3  http://www.basistech.com/images/yellow-home1.jpg
    image/jpeg   2004-05-28  199.88.205.3  http://www.basistech.com/images/collage01.jpg
    image/jpeg   2004-05-28  199.88.205.3  http://www.basistech.com/images/yellow-home2.jpg
    image/jpeg   2004-05-28  199.88.205.3  http://www.basistech.com/images/software-ag.jpg
    image/jpeg   2004-05-28  199.88.205.3  http://www.basistech.com/ja/images/bluenew.jpg
The -r option displays the offset, sizes, and URL for each member:
    % arcdump -r -m image/jpeg BT20040528233019-1.arc.gz | head
    # */*       : 551
    # image/jpeg: 18
    # Start_Offset  Raw_Size  Size   URL
    181178          4947      5294   http://www.basistech.com/images/bluenew.jpg
    186143          9830      10103  http://www.basistech.com/images/yellow-home1.jpg
    195991          74100     74246  http://www.basistech.com/images/collage01.jpg
    275389          5584      6041   http://www.basistech.com/images/yellow-home2.jpg
    2273298         9732      11108  http://www.basistech.com/images/software-ag.jpg
    9123633         4953      5296   http://www.basistech.com/ja/images/bluenew.jpg
    9128604         9832      10106  http://www.basistech.com/ja/images/yellow-home1.jpg
The columns in each list are tab-separated, making the output easily processed by Unix text processing tools. This example displays the URL of every PDF file greater than 500 KB:
    % arcdump -r -m application/pdf BT20040528233019-1.arc.gz | \
        awk '!/^#/ { if ($3 > 512000) { print $4 } }'
    http://www.basistech.com/papers/unicode/big_dots_little_dots.pdf
    http://www.basistech.com/papers/chinese/iuc24-emerson-chinese.pdf
    http://www.basistech.com/papers/unicode/iuc24-emerson-fsa.pdf
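As another illustration of processing the tab-separated listing, the sketch below totals the Size column. It is not part of arcdump: the function name summarize_sizes is hypothetical, and it assumes the -r layout shown above (comment lines starting with #, followed by tab-separated Start_Offset, Raw_Size, Size, and URL columns).

```shell
# Hypothetical helper (not part of arcdump): reads an "arcdump -r" listing on
# stdin and prints the number of members and the sum of their Size column.
# Assumes tab-separated columns: Start_Offset, Raw_Size, Size, URL.
summarize_sizes() {
    awk -F '\t' '
        /^#/    { next }                       # skip summary and header lines
        NF >= 4 { count++; total += $3 }       # accumulate the Size column
        END     { printf "%d members, %d bytes\n", count, total }
    '
}
```

It would be used at the end of a pipeline, for example: arcdump -r -m image/jpeg BT20040528233019-1.arc.gz | summarize_sizes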
To display the contents of a member, specify its URL with the -u flag:
    % arcdump -u http://www.basistech.com/site/style.css BT20040528232937-0.arc.gz | head
    a:link {
        color: #006699;
        text-decoration: none;
    }
    a:visited {
        color: #993399;
        text-decoration: none;
    }
This option can be used with binary data as well:
    % arcdump -u http://www.basistech.com/images/collage01.jpg BT20040528232937-0.arc.gz > collage01.jpg
For example, here is a simple script that extracts all of the big PDF files from an ARC file, mirroring the original directory structure:
    #!/bin/sh

    # extract all of the PDF files that are over 500 KB
    pdfs=`arcdump -r -m application/pdf BT20040528233019-1.arc.gz | \
        awk '!/^#/ { if ($3 > 512000) { print $4 } }'`

    for p in $pdfs; do
        # remove the scheme from the URL
        path=`echo $p | sed 's|http://||'`

        # create the appropriate directories
        mkdir -p `dirname $path`

        echo Extracting $path

        # extract the data
        arcdump -u $p BT20040528233019-1.arc.gz > $path
    done
Obviously a "production" script would need to be smarter than this, but it shows what is possible.
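One direction such a script might take is sketched below. The helper names url_to_path and extract_members are hypothetical, and the index.html name used for URLs ending in "/" is an assumption. Only the -u option documented above is used. The sketch quotes its variables, handles schemes other than http, and stops on the first failure. It still does not sanitize ".." segments or query strings.

```shell
#!/bin/sh
# Sketch of a more defensive extraction loop. The helper names are
# hypothetical; only the arcdump -u option documented above is used.

# Map a URL to a relative file path: drop the scheme, and give URLs that
# end in "/" a file name so they do not collide with a directory.
url_to_path() {
    path=${1#*://}                      # strip "http://", "https://", etc.
    case $path in
        */) path=${path}index.html ;;   # assumption: name for directory URLs
    esac
    printf '%s\n' "$path"
}

# Extract every URL read from stdin (one per line) out of the ARC file $1.
extract_members() {
    arc=$1
    while IFS= read -r url; do
        path=$(url_to_path "$url")
        mkdir -p "$(dirname "$path")" || return 1
        echo "Extracting $path"
        arcdump -u "$url" "$arc" > "$path" || return 1
    done
}
```

The list of URLs would come from a listing pipeline, for example: arcdump -r -m application/pdf BT20040528233019-1.arc.gz | awk '!/^#/ { if ($3 > 512000) { print $4 } }' | extract_members BT20040528233019-1.arc.gz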
Generated on Tue Jun 8 21:30:14 2004