Using arcdump

Introduction

arcdump is a small utility that displays information about the members of an ARC file, including their metadata and offsets. It can also extract the content of a member given its URL, and it serves as an example of how to use the libarc API.

Usage


Usage: arcdump [OPTION]... ARC_FILE

  -u URL   display the contents of the member with URL 
  -m MIME  limit reporting to specified MIME type
  -r       list the offsets and sizes for each URL
  -h       display this help and exit
  -v       output version information and exit

There are two "modes" available: Information Display and Member Content Display.

Information Display

The -m option limits the listing to members of the specified MIME type.

By default arcdump displays the MIME type, crawl date, IP address, and URL of each member:


% arcdump -m image/jpeg BT20040528233019-1.arc.gz | head
# */* : 551
# image/jpeg: 18
# MIME_Type   Date   IP_Address   URL
image/jpeg      2004-05-28      199.88.205.3    http://www.basistech.com/images/bluenew.jpg
image/jpeg      2004-05-28      199.88.205.3    http://www.basistech.com/images/yellow-home1.jpg
image/jpeg      2004-05-28      199.88.205.3    http://www.basistech.com/images/collage01.jpg
image/jpeg      2004-05-28      199.88.205.3    http://www.basistech.com/images/yellow-home2.jpg
image/jpeg      2004-05-28      199.88.205.3    http://www.basistech.com/images/software-ag.jpg
image/jpeg      2004-05-28      199.88.205.3    http://www.basistech.com/ja/images/bluenew.jpg
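
Lines beginning with # are headers, so they are easy to filter out before further processing. As a minimal sketch, the following counts members per host from the default listing. It assumes the columns are tab-separated with the URL in the fourth field; count_by_host and sample_listing are hypothetical helper names, and the sample rows are copied from the output above rather than produced by running arcdump.

```shell
#!/bin/sh

# Count members per host from arcdump's default listing.
# Assumption: fields are tab-separated and the URL is field 4.
count_by_host() {
    awk -F'\t' '
        /^#/ { next }                       # skip the header lines
        {
            host = $4
            sub(/^[a-z]+:\/\//, "", host)   # strip the scheme
            sub(/\/.*$/, "", host)          # strip the path
            count[host]++
        }
        END { for (h in count) printf "%s\t%d\n", h, count[h] }
    ' | sort
}

# Stand-in for: arcdump -m image/jpeg ARC_FILE
sample_listing() {
    printf '# */* : 551\n'
    printf '# image/jpeg: 18\n'
    printf '# MIME_Type\tDate\tIP_Address\tURL\n'
    printf 'image/jpeg\t2004-05-28\t199.88.205.3\thttp://www.basistech.com/images/bluenew.jpg\n'
    printf 'image/jpeg\t2004-05-28\t199.88.205.3\thttp://www.basistech.com/ja/images/bluenew.jpg\n'
}

sample_listing | count_by_host
```

With a real ARC file, the arcdump command would be piped into count_by_host in place of sample_listing.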

The -r option instead displays the start offset, raw size, size, and URL of each member:


% arcdump -r -m image/jpeg BT20040528233019-1.arc.gz | head
# */* : 551
# image/jpeg: 18
# Start_Offset   Raw_Size   Size   URL
181178  4947    5294    http://www.basistech.com/images/bluenew.jpg
186143  9830    10103   http://www.basistech.com/images/yellow-home1.jpg
195991  74100   74246   http://www.basistech.com/images/collage01.jpg
275389  5584    6041    http://www.basistech.com/images/yellow-home2.jpg
2273298 9732    11108   http://www.basistech.com/images/software-ag.jpg
9123633 4953    5296    http://www.basistech.com/ja/images/bluenew.jpg
9128604 9832    10106   http://www.basistech.com/ja/images/yellow-home1.jpg

The columns in each listing are tab-separated, so the output is easy to process with standard Unix text tools. This example prints the URL of every PDF file larger than 500 KB (512000 bytes):


% arcdump -r -m application/pdf BT20040528233019-1.arc.gz | \
    awk '!/^#/ { if ($3 > 512000) { print $4 } }'
http://www.basistech.com/papers/unicode/big_dots_little_dots.pdf
http://www.basistech.com/papers/chinese/iuc24-emerson-chinese.pdf
http://www.basistech.com/papers/unicode/iuc24-emerson-fsa.pdf
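
The same approach works for aggregate figures. The sketch below sums the third column of the -r listing to report how many members were listed and their combined size. It assumes the third column is a byte count, as the 512000 threshold above suggests; size_summary and sample_listing are hypothetical names, and the sample rows are copied from the -r output shown earlier instead of being generated by arcdump.

```shell
#!/bin/sh

# Summarize an "arcdump -r" listing: member count and total of the Size column.
# Assumption: fields are tab-separated and Size (field 3) is in bytes.
size_summary() {
    awk -F'\t' '
        /^#/ { next }               # skip the header lines
        { n++; total += $3 }        # accumulate the Size column
        END { printf "%d members, %d bytes\n", n, total }
    '
}

# Stand-in for: arcdump -r -m image/jpeg ARC_FILE
sample_listing() {
    printf '# */* : 551\n'
    printf '# image/jpeg: 18\n'
    printf '# Start_Offset\tRaw_Size\tSize\tURL\n'
    printf '181178\t4947\t5294\thttp://www.basistech.com/images/bluenew.jpg\n'
    printf '186143\t9830\t10103\thttp://www.basistech.com/images/yellow-home1.jpg\n'
    printf '195991\t74100\t74246\thttp://www.basistech.com/images/collage01.jpg\n'
}

sample_listing | size_summary
```

Setting the field separator explicitly with -F'\t' keeps the column positions stable even if a URL were to contain a space.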

Member Content Display

You can extract the contents of an ARC member by specifying its URL with the -u flag:


% arcdump -u http://www.basistech.com/site/style.css BT20040528232937-0.arc.gz | head
a:link    {
color: #006699; 
text-decoration: none;
                }

a:visited {
color: #993399; 
text-decoration: none;
                }

This option can be used with binary data as well:


% arcdump -u http://www.basistech.com/images/collage01.jpg BT20040528232937-0.arc.gz > collage01.jpg
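
When extracting many members it can be convenient to derive the output file name from the URL. The following is a sketch only: extract_member is a hypothetical wrapper, the index.html fallback name is an arbitrary choice, and treating an empty output file as a failure is an assumption, since arcdump's behavior for a URL that is not in the ARC file is not described here.

```shell
#!/bin/sh

# extract_member URL ARC_FILE
# Write the member's content to a file named after the last path
# component of the URL, and print that file name.
extract_member() {
    url=$1
    arc=$2
    out=${url%%[?#]*}       # drop any query string or fragment
    out=${out##*/}          # keep the last path component
    [ -n "$out" ] || out=index.html   # hypothetical name for URLs ending in "/"

    arcdump -u "$url" "$arc" > "$out"

    # Assumption: an empty file means nothing was extracted.
    if [ ! -s "$out" ]; then
        echo "no content extracted for $url" >&2
        rm -f "$out"
        return 1
    fi
    printf '%s\n' "$out"
}
```

For example, extract_member http://www.basistech.com/images/collage01.jpg BT20040528232937-0.arc.gz would write collage01.jpg in the current directory.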

Advanced arcdump

You can combine the information display mode and the member content display mode in interesting ways.

For example, here is a simple script that extracts all of the big PDF files from an ARC file, mirroring the original directory structure:


#!/bin/sh

arc=BT20040528233019-1.arc.gz

# collect the URLs of all PDF files that are over 500 KB
pdfs=$(arcdump -r -m application/pdf "$arc" | \
         awk '!/^#/ { if ($3 > 512000) { print $4 } }')

for p in $pdfs; do
    # remove the scheme from the start of the URL
    path=$(echo "$p" | sed 's|^http://||')
    # create the appropriate directories
    mkdir -p "$(dirname "$path")"
    echo "Extracting $path"
    # extract the data
    arcdump -u "$p" "$arc" > "$path"
done

A "production" script would need to be more robust than this (for example, when URLs contain query strings or do not use the http scheme), but it shows what is possible.
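
One piece that a more careful script might want is a safer mapping from URL to local path. The sketch below is one possible approach, not part of arcdump: url_to_path is a hypothetical helper, and the index.html name used for directory-style URLs is an arbitrary assumption. It strips any scheme, drops query strings and fragments, and refuses paths that could escape the output directory.

```shell
#!/bin/sh

# url_to_path URL
# Print a relative file path that mirrors the URL's host and path.
url_to_path() {
    p=${1#*://}             # drop the scheme (http://, https://, ...)
    p=${p%%[?#]*}           # drop any query string or fragment

    # Refuse empty paths, absolute paths, and ".." components.
    case /$p/ in
        //*|*/../*)
            echo "refusing unsafe path: $1" >&2
            return 1 ;;
    esac

    case $p in
        */)  p=${p}index.html ;;    # directory URL: hypothetical file name
        */*) ;;                     # ordinary host/path URL
        *)   p=$p/index.html ;;     # host only, no path
    esac
    printf '%s\n' "$p"
}

url_to_path 'http://www.basistech.com/papers/unicode/iuc24-emerson-fsa.pdf'
```

In the loop above, path=$(url_to_path "$p") || continue could replace the sed call, skipping any URL the helper rejects.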

Generated on Tue Jun 8 21:30:14 2004