
Programming with libarc

Libarc Concepts

All libarc functionality is contained in the libarc namespace.

There are three primary class abstractions used by libarc:

  1. ARC_File: An archive (.arc.gz) file (or "ARC file") generated by Heritrix or another tool (such as the Nedlib to ARC converter). An ARC file is generally compressed using GZIP. Note: libarc only handles compressed ARC files.

  2. ARC_Member: A document in the ARC file, called a 'member' or 'record'. This represents a file from a crawled website, and could be an HTML document, JPEG image, PDF file, or anything else.

  3. Member_Iterator: The member iterator allows you to access the members in an ARC file. A number of utility functions are provided to facilitate this process, and you can restrict the iterator so that it only returns certain media types: just PDF files, for example.

There is one other important class, MediaType, that represents a media type as defined in Section 3.7 of the HTTP 1.1 specification. The library uses this class to provide content-type information for each member, and to constrain the types of members returned by the iterator.

Note:
libarc does not provide any ability for creating ARC files or for editing the contents of an existing ARC file.

Working with ARC_Files

Creating the ARC_File

To operate on the members of an ARC file you create an instance of ARC_File using ARC_File::Create:

    #include <libarc.h>

    using namespace libarc;

    // ...

    ARC_File* arc_file = 0;
    try {
        arc_file = ARC_File::Create("BT20040528232937-0.arc.gz");
    }
    catch (Unsupported_ARC_Format_Exception& e) {
        // Unknown ARC file format or the file is corrupt
    }
    catch (std::exception& e) {
        // Some other exception occurred; check e.what()
    }

The named ARC file is opened and its contents scanned to build the member index.

Getting Member Information

You can determine the number of members in the ARC file that have a particular media type using the ARC_File's GetMemberCount member function:

    // get the total number of members in the file
    // (could also have specified "*/*" as the media type)
    off_t total_members = arc_file->GetMemberCount();

    // get the number of HTML files in the archive
    off_t html_members = arc_file->GetMemberCount("text/html");

    // get the number of images in the archive
    off_t image_members = arc_file->GetMemberCount("image/*");

Note that you can specify media-type wild-cards to limit the type of member that is counted.
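To make the wild-card behaviour concrete, here is a minimal sketch of how a pattern such as "image/*" could be compared with a member's media type. The function media_type_matches and its helper are hypothetical names written for this illustration; they are not part of libarc, and the library's own MediaType class may behave differently (for example in how it treats parameters such as "; charset=...").

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper, not part of libarc: lower-case an ASCII string,
// because media type names are case-insensitive in HTTP.
static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical matcher, not part of libarc.
// Returns true if `type` (e.g. "image/png") is selected by `pattern`
// (e.g. "image/*", "*/*", or an exact type such as "text/html").
// Anything after a ';' in `type` (media-type parameters) is ignored.
bool media_type_matches(const std::string& pattern, const std::string& type) {
    const std::string p = to_lower(pattern);
    std::string t = to_lower(type.substr(0, type.find(';')));
    while (!t.empty() && t.back() == ' ') t.pop_back();  // trim before ';'

    const std::string::size_type ps = p.find('/');
    const std::string::size_type ts = t.find('/');
    if (ps == std::string::npos || ts == std::string::npos) return false;

    const std::string p_type = p.substr(0, ps), p_sub = p.substr(ps + 1);
    const std::string t_type = t.substr(0, ts), t_sub = t.substr(ts + 1);

    const bool type_ok = (p_type == "*") || (p_type == t_type);
    const bool sub_ok  = (p_sub == "*") || (p_sub == t_sub);
    return type_ok && sub_ok;
}
```

Under this sketch, "image/*" selects "image/png" and "image/jpeg" but not "text/html", and "*/*" selects everything that has the type/subtype form.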

Accessing ARC Members

You get access to the members of an ARC file by asking the ARC_File instance for a member iterator. You can specify a media-type to limit the type of member returned.

You call the iterator's Next() member function repeatedly until it returns 0. On each call it returns the next Archive member meeting the media-type criteria.

    // Get an iterator for all the members in the ARC file
    Member_Iterator* all_iter = arc_file->GetMemberIterator();

    // Get an iterator for the PDFs in the ARC file
    Member_Iterator* pdf_iter = arc_file->GetMemberIterator("application/pdf");

    const ARC_Member* m;
    while ((m = pdf_iter->Next()) != 0) {
        // process the member
    }

Accessing member information is covered in Working with ARC Members.

Finishing Up

When you are done with the iterator, you must call its Destroy() member function to delete any resources it is holding.

    pdf_iter->Destroy();

When you are done with the ARC_File instance you have to delete it using its Destroy member function:

    arc_file->Destroy();
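Because both objects are released through Destroy() rather than delete, it is easy to miss a call when an exception is thrown. One possible way to guard against that, assuming a C++11 compiler, is a std::unique_ptr with a custom deleter. The names Destroy_Deleter, Destroy_Ptr and Demo_Resource below are hypothetical and not part of libarc; Demo_Resource is only a stand-in so that the sketch compiles without the library.

```cpp
#include <memory>

// Deleter that calls Destroy() instead of `delete`. It works for any
// type that has a Destroy() member function.
struct Destroy_Deleter {
    template <typename T>
    void operator()(T* p) const {
        if (p) p->Destroy();
    }
};

// Hypothetical alias, not part of libarc.
template <typename T>
using Destroy_Ptr = std::unique_ptr<T, Destroy_Deleter>;

// Stand-in for a libarc object: it can only be released via Destroy().
class Demo_Resource {
public:
    explicit Demo_Resource(int* destroy_count) : m_Count(destroy_count) {}
    void Destroy() { ++*m_Count; delete this; }
private:
    ~Demo_Resource() {}
    int* m_Count;
};

// Returns how many times Destroy() ran once the owner left its scope.
int demo_scoped_destroy() {
    int destroyed = 0;
    {
        Destroy_Ptr<Demo_Resource> res(new Demo_Resource(&destroyed));
        // ... use res-> here ...
    }   // Destroy() runs here, also when the scope is left by an exception
    return destroyed;
}
```

With libarc this would presumably be written as Destroy_Ptr<ARC_File> arc_file(ARC_File::Create(path)); followed by Destroy_Ptr<Member_Iterator> iter(arc_file->GetMemberIterator());. Local objects are destroyed in reverse order of declaration, so the iterator would be released before the file.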

Working with ARC Members

Getting Metadata

The ARC_Member class contains three categories of accessors:

  1. Metadata
  2. Positional Data
  3. Content Data

The Metadata

The crawl date and the IP addresses are returned as raw data-types (time_t and in_addr_t respectively), allowing you to work with and display them however you want.
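As an illustration of working with these raw types, the following sketch turns a time_t and an in_addr_t into display strings using standard POSIX calls. The function names format_crawl_date and format_ip are hypothetical, and the sketch assumes that the in_addr_t value is in network byte order (as returned by inet_addr); the source does not state which byte order libarc uses.

```cpp
#include <arpa/inet.h>   // inet_ntop (POSIX)
#include <netinet/in.h>  // in_addr, in_addr_t, INET_ADDRSTRLEN
#include <sys/socket.h>  // AF_INET
#include <time.h>        // gmtime_r, strftime
#include <string>

// Format a crawl date as "YYYY-MM-DD hh:mm:ss" in UTC.
// Returns an empty string if the time cannot be converted.
std::string format_crawl_date(time_t when) {
    struct tm parts;
    if (gmtime_r(&when, &parts) == 0) return std::string();
    char buf[32];
    const size_t n = strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S", &parts);
    return std::string(buf, n);
}

// Format an IPv4 address as dotted-quad text.
// Assumption: `addr` is in network byte order.
std::string format_ip(in_addr_t addr) {
    struct in_addr in;
    in.s_addr = addr;
    char buf[INET_ADDRSTRLEN];
    if (inet_ntop(AF_INET, &in, buf, sizeof buf) == 0) return std::string();
    return std::string(buf);
}
```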

The metadata values are extracted from the URL Record associated with each ARC member. Only Version 1 URL records are supported (see the ARC File Format for details).
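For orientation, a Version 1 URL record, as described in the Internet Archive's ARC file format document, is a single line of five space-separated fields: URL, IP address, archive date (YYYYMMDDhhmmss), content type and length. The sketch below splits such a line; URL_Record_V1 and parse_url_record_v1 are hypothetical names, and this is not how libarc itself is known to parse the record.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Fields of a version 1 URL record, kept mostly as text for simplicity.
struct URL_Record_V1 {
    std::string url;
    std::string ip_address;    // dotted-quad text
    std::string archive_date;  // YYYYMMDDhhmmss
    std::string content_type;  // e.g. "text/html"
    long long   length;        // length field of the record, in bytes
};

// Hypothetical parser, not part of libarc.
// Returns true and fills `out` if `line` holds exactly five fields,
// the date is all digits, and the length is a non-negative number.
bool parse_url_record_v1(const std::string& line, URL_Record_V1& out) {
    std::istringstream in(line);
    URL_Record_V1 r{};
    if (!(in >> r.url >> r.ip_address >> r.archive_date
             >> r.content_type >> r.length)) {
        return false;
    }
    if (r.length < 0) return false;
    for (char c : r.archive_date) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
    }
    std::string extra;
    if (in >> extra) return false;  // more than five fields
    out = r;
    return true;
}
```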

About Member Lengths and Offsets

Compressed ARC files are composed of multiple individual GZIP-compressed ARC members concatenated together. The offset reported by the GetOffset member function is the offset to the start of each ARC member, accounting for the structural overhead of the GZIP format (the header and trailer).

There are two different lengths available: the first, returned by GetRawSize, is the number of bytes in the compressed ARC member, not including the GZIP header or trailer. The second, returned by GetSize, is the number of bytes in the uncompressed member.

For example:

    % arcdump -m text/html -r BT20040528233019-1.arc.gz
     150  4772  15961  http://connect.basistech.com/protected/s2t/cookies.html
    4940  2176   7846  http://demos.basistech.com/site/404.html?404;http://demos.basistech.com:80/s2t
    7134   280    434  http://demos.basistech.com/jla
    7432  2068   6677  http://www.basistech.com/clients/index.html

The first URL starts 150 bytes into the file, is 4,772 bytes long compressed, and expands to 15,961 bytes.

The next member starts 4,940 bytes in. This is 4,790 bytes beyond the start of the preceding member, and 18 bytes more than the compressed length of the first member. These 18 bytes are the 10-byte GZIP header and the 8-byte GZIP trailer (a CRC-32 checksum and the uncompressed length).
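The arithmetic can be written down as a small function. This is a sketch with hypothetical names (next_member_offset and the two constants), and it only holds when every GZIP member uses the minimal 10-byte header with no optional fields, as the members in the listing above appear to.

```cpp
#include <cassert>
#include <cstdint>

// Sizes of the GZIP framing around each compressed member, assuming a
// header without optional fields (no file name, comment or extra data).
const std::int64_t kGzipHeaderBytes  = 10;
const std::int64_t kGzipTrailerBytes = 8;

// Given the offset and raw (compressed) size of one member, return the
// offset at which the following member starts.
std::int64_t next_member_offset(std::int64_t offset, std::int64_t raw_size) {
    assert(offset >= 0 && raw_size >= 0);
    return offset + raw_size + kGzipHeaderBytes + kGzipTrailerBytes;
}
```

Applied to the listing, 150 + 4772 + 18 gives 4940, 4940 + 2176 + 18 gives 7134, and 7134 + 280 + 18 gives 7432, matching the reported offsets.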

Accessing ARC Members

You access the content data, response headers, and HTTP result code of a member by first calling the member's GetData() member function. In addition to returning a pointer to the content, it also decodes the HTTP response headers and returns the length of the uncompressed data (the same value returned by GetSize).

The HTTP result is part of the response headers so you cannot access this value until GetData() is called.

When you are finished with the member, you must call its ReleaseData member function to free up any storage used for it.

You can query the values of certain response headers by using the GetResponseHeader member function.

For example, you can display the Last-Modified value of every document that has one:

    const ARC_Member* member;
    while ((member = all_iter->Next()) != 0) {
        off_t real_size;

        // we do not care about the data itself, but need to call this
        // to extract the HTTP response headers.
        (void) member->GetData(real_size);

        // see if the header is there, and if it is, dump the info
        std::string last_mod = member->GetResponseHeader("Last-Modified");
        if (!last_mod.empty()) {
            std::cout << last_mod << "\t" << member->URL() << std::endl;
        }

        member->ReleaseData();
    }
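To show why the status code and headers only become available once the member has been uncompressed, here is a rough sketch of decoding the head of a stored HTTP response. It is not libarc's implementation; Http_Response_Head, parse_response_head and get_header are hypothetical names, and the sketch assumes the head is terminated by an empty CRLF line.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <sstream>
#include <string>

struct Http_Response_Head {
    int status;                                  // e.g. 200; 0 if not parsed
    std::map<std::string, std::string> headers;  // names stored lower-cased
    std::string::size_type body_offset;          // index of first content byte
};

static std::string lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return std::string();
    return s.substr(b, s.find_last_not_of(ws) - b + 1);
}

// Split the raw response into status code, headers and body offset.
Http_Response_Head parse_response_head(const std::string& raw) {
    Http_Response_Head head;
    head.status = 0;
    const std::string::size_type end = raw.find("\r\n\r\n");
    head.body_offset = (end == std::string::npos) ? raw.size() : end + 4;

    std::istringstream lines(raw.substr(0, end));
    std::string line;
    if (std::getline(lines, line)) {             // "HTTP/1.1 200 OK"
        std::istringstream status_line(line);
        std::string version;
        int code = 0;
        if (status_line >> version >> code) head.status = code;
    }
    while (std::getline(lines, line)) {          // "Name: value"
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        head.headers[lower_ascii(trim(line.substr(0, colon)))] =
            trim(line.substr(colon + 1));
    }
    return head;
}

// Case-insensitive lookup; returns an empty string if the header is absent.
std::string get_header(const Http_Response_Head& head, const std::string& name) {
    const auto it = head.headers.find(lower_ascii(name));
    return it == head.headers.end() ? std::string() : it->second;
}
```

Returning an empty string for a missing header mirrors the way the example above tests the result of GetResponseHeader with empty().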

A Complete Example

The following code displays the Last-Modified date and URL of all HTML files in an ARC file that were returned with a result code of 200 and have a Last-Modified header.

    #include <iostream>
    #include <string>

    #include <libarc.h>

    using namespace libarc;

    int main(int argc, char* argv[])
    {
        // the ARC file name is expected as the first argument
        if (argc < 2) {
            std::cerr << "Usage: <program> <file.arc.gz>" << std::endl;
            return 1;
        }

        ARC_File* arc_file = 0;
        try {
            arc_file = ARC_File::Create(argv[1]);

            // get an iterator for all HTML archive members
            Member_Iterator* iter = arc_file->GetMemberIterator("text/html");

            // iterate over them and only report non-error pages
            // that have a Last-Modified header
            const ARC_Member* m;
            while ((m = iter->Next()) != 0) {
                off_t length;
                (void) m->GetData(length);
                if (m->HTTPStatus() == 200) {
                    std::string lms = m->GetResponseHeader("Last-Modified");
                    if (!lms.empty()) {
                        std::cout << lms << "\t" << m->URL() << std::endl;
                    }
                }
                m->ReleaseData();
            }

            iter->Destroy();
            arc_file->Destroy();
        }
        catch (Unsupported_ARC_Format_Exception& e) {
            std::cerr << "Unknown ARC file format, or ARC file is corrupt." << std::endl;
            return 1;
        }
        catch (std::exception& e) {
            std::cerr << "Exception: " << e.what() << std::endl;
            return 1;
        }

        return 0;
    }

Helper Functions

There are several template functions defined in libarc.h that simplify operating on the members of an ARC file: for_each, for_each_if, find_if, and count_if.

The for_each function takes a Member_Iterator and a function or callable object, and applies it to every member accessible by the iterator. The function should take a single const ARC_Member* argument; for_each calls it once for each member:

    static void x_dump_last_modified(const ARC_Member* m)
    {
        off_t real_size;
        (void) m->GetData(real_size);

        std::string last_mod = m->GetResponseHeader("Last-Modified");
        if (!last_mod.empty()) {
            std::cout << last_mod << "\t" << m->URL() << std::endl;
        }

        m->ReleaseData();
    }

    // ...

    {
        Member_Iterator* iter = arc_file->GetMemberIterator();
        for_each(iter, x_dump_last_modified);
    }

The _if functions take a predicate, which takes a const ARC_Member* and returns a bool indicating whether the member should be processed or not.

In this example we display the URL and size of every PDF file that is 100 KB or larger:

    static void x_dump_document_info(const ARC_Member* m)
    {
        std::cout << m->URL() << "\t" << m->GetSize() << std::endl;
    }

    class x_CheckSizeValue
    {
    public:
        x_CheckSizeValue(int kb) : m_Size(kb * 1024) {}

        // true if the member is at least the configured size
        bool operator() (const ARC_Member* m) const
        {
            return m->GetSize() >= m_Size;
        }

    private:
        int m_Size;
    };

    // ...

    {
        Member_Iterator* iter = arc_file->GetMemberIterator("application/pdf");
        for_each_if(iter, x_CheckSizeValue(100), x_dump_document_info);
    }

Notice that the predicate in this example is implemented as a class, allowing it to be trivially reused for different sizes.
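For readers curious how such helpers can be built on top of Next(), the following is a minimal sketch and not the actual code in libarc.h. The _sketch suffix marks the templates as hypothetical, the argument order follows the for_each_if call shown above, and Int_Iterator, At_Least and Sum_Values are stand-ins (an iterator over plain integers, a predicate and an accumulating function) so that the example compiles without the library. Returning the function object from for_each_if_sketch is a choice modelled on std::for_each and may not match libarc.

```cpp
#include <cstddef>
#include <vector>

// Apply fn to every item for which pred returns true.
template <typename Iter, typename Pred, typename Func>
Func for_each_if_sketch(Iter* iter, Pred pred, Func fn) {
    while (const auto* m = iter->Next()) {
        if (pred(m)) fn(m);
    }
    return fn;
}

// Count the items for which pred returns true.
template <typename Iter, typename Pred>
std::size_t count_if_sketch(Iter* iter, Pred pred) {
    std::size_t n = 0;
    while (const auto* m = iter->Next()) {
        if (pred(m)) ++n;
    }
    return n;
}

// Stand-in for Member_Iterator: walks a vector and returns 0 at the end.
class Int_Iterator {
public:
    explicit Int_Iterator(const std::vector<int>& v) : m_Values(v), m_Pos(0) {}
    const int* Next() {
        return m_Pos < m_Values.size() ? &m_Values[m_Pos++] : 0;
    }
private:
    std::vector<int> m_Values;
    std::size_t m_Pos;
};

// Stand-in predicate, reusable for different limits like x_CheckSizeValue.
struct At_Least {
    explicit At_Least(int limit) : m_Limit(limit) {}
    bool operator()(const int* v) const { return *v >= m_Limit; }
    int m_Limit;
};

// Stand-in function object that accumulates the values it is given.
struct Sum_Values {
    Sum_Values() : sum(0) {}
    void operator()(const int* v) { sum += *v; }
    int sum;
};
```

Because the templates only require a Next() member that returns a pointer or 0, the same code would accept a Member_Iterator* together with predicates and functions taking const ARC_Member*.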


Generated on Tue Jun 8 21:30:14 2004 by doxygen