Using arcdump

Introduction

arcdump is a small utility that displays information about the members of an Arcfile, including their metadata and offsets. It can also extract the content of a member given its URL. In addition, it serves as an example of how to use the libarc API.

Usage

    Usage: arcdump [OPTION]... ARC_FILE
      -u URL    display the contents of the member with URL
      -m MIME   limit reporting to specified MIME type
      -r        list the offsets and sizes for each URL
      -h        display this help and exit
      -v        output version information and exit

There are two "modes" available: Information Display and Member Content Display.

Information Display

The -m option restricts the listing to members of the specified MIME type.

By default, arcdump displays the MIME type, crawl date, IP address, and URL of each member:

    % arcdump -m image/jpeg BT20040528233019-1.arc.gz | head
    # */* : 551
    # image/jpeg: 18
    # MIME_Type  Date        IP_Address    URL
    image/jpeg   2004-05-28  199.88.205.3  http://www.basistech.com/images/bluenew.jpg
    image/jpeg   2004-05-28  199.88.205.3  http://www.basistech.com/images/yellow-home1.jpg
    image/jpeg   2004-05-28  199.88.205.3  http://www.basistech.com/images/collage01.jpg
    image/jpeg   2004-05-28  199.88.205.3  http://www.basistech.com/images/yellow-home2.jpg
    image/jpeg   2004-05-28  199.88.205.3  http://www.basistech.com/images/software-ag.jpg
    image/jpeg   2004-05-28  199.88.205.3  http://www.basistech.com/ja/images/bluenew.jpg
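Because the listing is line-oriented, it can be summarized with standard tools. The following is a minimal sketch, not part of arcdump itself: the function name arc_count_by_ip is hypothetical, and it assumes the columns are tab-separated and that the IP address is the third column, as in the listing above.

```shell
#!/bin/sh
# arc_count_by_ip (hypothetical helper): read the default arcdump listing
# on stdin and print the number of members seen for each IP address.
# Assumes tab-separated columns: MIME type, date, IP address, URL.
arc_count_by_ip() {
    # Skip the '#' summary and header lines, count rows per IP (field 3),
    # then sort by descending count, breaking ties by IP address.
    awk -F '\t' '!/^#/ && NF >= 4 { count[$3]++ }
                 END { for (ip in count) printf "%d\t%s\n", count[ip], ip }' |
        sort -k1,1nr -k2,2
}
```

It would be used as a filter, for example: arcdump -m image/jpeg BT20040528233019-1.arc.gz | arc_count_by_ip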

The -r option displays the offset, sizes, and URL for each member:

    % arcdump -r -m image/jpeg BT20040528233019-1.arc.gz | head
    # */* : 551
    # image/jpeg: 18
    # Start_Offset  Raw_Size  Size   URL
    181178          4947      5294   http://www.basistech.com/images/bluenew.jpg
    186143          9830      10103  http://www.basistech.com/images/yellow-home1.jpg
    195991          74100     74246  http://www.basistech.com/images/collage01.jpg
    275389          5584      6041   http://www.basistech.com/images/yellow-home2.jpg
    2273298         9732      11108  http://www.basistech.com/images/software-ag.jpg
    9123633         4953      5296   http://www.basistech.com/ja/images/bluenew.jpg
    9128604         9832      10106  http://www.basistech.com/ja/images/yellow-home1.jpg

The columns in each list are tab-separated, which makes the output easy to process with Unix text-processing tools. This example prints the URL of every PDF file whose Size column (the third field) exceeds 500 KB (512000 bytes):

    % arcdump -r -m application/pdf BT20040528233019-1.arc.gz | \
        awk '!/^#/ { if ($3 > 512000) { print $4 } }'
    http://www.basistech.com/papers/unicode/big_dots_little_dots.pdf
    http://www.basistech.com/papers/chinese/iuc24-emerson-chinese.pdf
    http://www.basistech.com/papers/unicode/iuc24-emerson-fsa.pdf
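The same approach works for aggregate figures. As a rough sketch, the hypothetical function arc_total_size below reads the -r listing on stdin and reports how many members it contains and the sum of their Size column. It splits explicitly on tabs, on the assumption that the separator is a single tab character, so that a URL containing a space would stay in one field.

```shell
#!/bin/sh
# arc_total_size (hypothetical helper): read `arcdump -r` output on stdin
# and print the member count and the sum of the Size column (field 3).
arc_total_size() {
    # Lines starting with '#' are summary and header lines, not members.
    awk -F '\t' '!/^#/ && NF >= 4 { n++; total += $3 }
                 END { printf "%d members, %d bytes\n", n, total }'
}
```

For example: arcdump -r -m image/jpeg BT20040528233019-1.arc.gz | arc_total_size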

Member Content Display

You can extract the contents of an ARC member by specifying its URL with the -u flag:

    % arcdump -u http://www.basistech.com/site/style.css BT20040528232937-0.arc.gz | head
    a:link {
        color: #006699;
        text-decoration: none;
    }
    a:visited {
        color: #993399;
        text-decoration: none;
    }

This option can be used with binary data as well:

    % arcdump -u http://www.basistech.com/images/collage01.jpg BT20040528232937-0.arc.gz > collage01.jpg

Advanced arcdump

You can combine the information mode and the content mode in interesting ways.

For example, here is a simple script that extracts all of the big PDF files from an ARC file, mirroring the original directory structure:

    #!/bin/sh

    # extract all of the PDF files that are over 500 KB
    pdfs=$(arcdump -r -m application/pdf BT20040528233019-1.arc.gz | \
        awk '!/^#/ { if ($3 > 512000) { print $4 } }')

    for p in $pdfs; do
        # remove the scheme from the URL
        path=$(echo "$p" | sed 's|^http://||')

        # create the appropriate directories
        mkdir -p "$(dirname "$path")"

        echo "Extracting $path"

        # extract the data
        arcdump -u "$p" BT20040528233019-1.arc.gz > "$path"
    done

Obviously a "production" script would need to be smarter than this, but it shows what is possible.
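One possible direction for a more defensive version is sketched below. It is an illustration only: the helper names url_to_path and extract_large are hypothetical, the index.html name used for URLs ending in a slash is an arbitrary choice, and the script relies solely on the -r, -m, and -u options described on this page. It reads URLs line by line so that unusual characters survive, refuses paths containing ".." components, and removes the output file if an extraction fails.

```shell
#!/bin/sh
# Hypothetical helpers; only the arcdump options shown on this page are used.

# Map a URL to a relative path: drop the scheme and any leading slashes,
# and give URLs ending in "/" a file name (index.html is an arbitrary choice).
url_to_path() {
    printf '%s\n' "$1" |
        sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' -e 's|^/*||' -e 's|/$|/index.html|'
}

# extract_large ARC_FILE MIME MIN_BYTES
# Extract every member of the given MIME type whose Size exceeds MIN_BYTES,
# mirroring the URL structure under the current directory.
extract_large() {
    arc=$1
    mime=$2
    min=$3
    arcdump -r -m "$mime" "$arc" |
    awk -F '\t' -v min="$min" '!/^#/ && $3 + 0 > min + 0 { print $4 }' |
    while IFS= read -r url; do
        path=$(url_to_path "$url")
        # Refuse paths that could escape the current directory.
        case "/$path/" in
            */../*) echo "Skipping suspicious path: $url" >&2; continue ;;
        esac
        mkdir -p "$(dirname "$path")" || continue
        echo "Extracting $path"
        # Redirect stdin so arcdump cannot consume the URL list.
        if ! arcdump -u "$url" "$arc" < /dev/null > "$path"; then
            echo "Failed to extract $url" >&2
            rm -f "$path"
        fi
    done
}
```

The equivalent of the script above would then be: extract_large BT20040528233019-1.arc.gz application/pdf 512000. This sketch still does not handle every case; for instance, a URL with no path at all maps to a file named after the host, which would collide with the directory created for other URLs on that host.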


Generated on Tue Jun 8 21:30:14 2004 by doxygen