libarc functionality is contained in the libarc namespace.
There are three primary class abstractions used by libarc:
.arc.gz) file (or "ARC file") generated by Heritrix or another tool (such as the Nedlib to ARC converter). An Archive file is generally compressed using GZIP. Note: libarc only handles compressed ARC files.
There is one other important class, MediaType, that represents a media type as defined in Section 3.7 of the HTTP 1.1 specification. The library uses this class to provide content-type information for each member, and to constrain the types of members returned by the iterator.
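To make the wildcard idea concrete, here is a minimal sketch of how a pattern such as "image/*" could be matched against a member's content type. It is not libarc's implementation: `media_type_matches` and `lower_ascii` are hypothetical helpers written for this illustration, and the sketch assumes that parameters (such as a charset) are ignored and that comparison is case-insensitive, as HTTP specifies for media types.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of libarc): lower-case an ASCII string.
static std::string lower_ascii(std::string s) {
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
    }
    return s;
}

// Hypothetical helper (not part of libarc): returns true if `type`
// (e.g. "image/png") is covered by `pattern` (e.g. "image/*" or "*/*").
bool media_type_matches(const std::string& pattern, const std::string& type) {
    const std::string p = lower_ascii(pattern);

    // Drop any parameters, e.g. "text/html; charset=utf-8" -> "text/html".
    std::string t = lower_ascii(type.substr(0, type.find(';')));
    while (!t.empty() &&
           std::isspace(static_cast<unsigned char>(t[t.size() - 1]))) {
        t.erase(t.size() - 1);
    }

    const std::string::size_type ps = p.find('/');
    const std::string::size_type ts = t.find('/');
    if (ps == std::string::npos || ts == std::string::npos) {
        return false;  // not of the form type/subtype
    }

    const std::string p_type = p.substr(0, ps), p_sub = p.substr(ps + 1);
    const std::string t_type = t.substr(0, ts), t_sub = t.substr(ts + 1);

    // A "*" on either side of the slash matches anything on that side.
    return (p_type == "*" || p_type == t_type) &&
           (p_sub == "*" || p_sub == t_sub);
}

// Usage sketch: the patterns mentioned in this document.
void media_type_examples() {
    assert(media_type_matches("*/*", "application/pdf"));
    assert(media_type_matches("image/*", "image/png"));
    assert(media_type_matches("text/html", "Text/HTML; charset=iso-8859-1"));
    assert(!media_type_matches("image/*", "text/html"));
}
```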
libarc does not provide any ability for creating ARC files or for editing the contents of an existing ARC file.
#include <libarc.h>

using namespace libarc;

// ...

ARC_File* arc_file = 0;
try {
    arc_file = ARC_File::Create("BT20040528232937-0.arc.gz");
} catch (Unsupported_ARC_Format_Exception& e) {
    // Unknown ARC file format or the file is corrupt
} catch (std::exception& e) {
    // Some other exception appeared, check with e.what()
}
The named ARC file is opened and its contents scanned to build the member index.
// get the total number of members in the file
// (could also have specified "*/*" as the media type)
off_t total_members = arc_file->GetMemberCount();

// get the number of HTML files in the archive
off_t html_members = arc_file->GetMemberCount("text/html");

// get the number of images in the archive
off_t image_members = arc_file->GetMemberCount("image/*");
Note that you can specify media-type wild-cards to limit the type of member that is counted.
You call the iterator's Next() member function repeatedly until it returns 0. On each call it returns the next Archive member meeting the media-type criteria.
// Get an iterator for all the members in the ARC file
Member_Iterator* all_iter = arc_file->GetMemberIterator();

// Get an iterator for the PDFs in the ARC file
Member_Iterator* pdf_iter = arc_file->GetMemberIterator("application/pdf");

const ARC_Member* m;
while ((m = pdf_iter->Next()) != 0) {
    // process the member
}
Accessing member information is covered in Working with ARC Members.
pdf_iter->Destroy();
When you are done with the ARC_File instance you have to delete it using its Destroy member function:
arc_file->Destroy();
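Because both the iterator and the ARC_File must be released through Destroy(), it is easy to leak them on an early return or an exception. One possible way to guard against that, assuming a C++11 compiler, is a small scoped wrapper built on std::unique_ptr. The names `DestroyDeleter`, `scoped_handle` and `FakeHandle` below are hypothetical and not part of libarc; `FakeHandle` only stands in for a libarc type so the sketch compiles by itself.

```cpp
#include <cassert>
#include <memory>

// Deleter that calls Destroy() instead of `delete`. It works with any type
// that exposes a Destroy() member function.
struct DestroyDeleter {
    template <typename T>
    void operator()(T* p) const {
        p->Destroy();
    }
};

// Hypothetical convenience alias, e.g. scoped_handle<ARC_File>.
template <typename T>
using scoped_handle = std::unique_ptr<T, DestroyDeleter>;

// Stand-in type (assumption): counts how often Destroy() is called.
struct FakeHandle {
    explicit FakeHandle(int* counter) : destroyed(counter) {}
    void Destroy() {
        ++*destroyed;
        delete this;
    }
    int* destroyed;
};

// Usage sketch: Destroy() runs exactly once when the handle leaves scope.
int destroy_count_after_scope() {
    int count = 0;
    {
        scoped_handle<FakeHandle> handle(new FakeHandle(&count));
        assert(count == 0);  // still alive inside the scope
    }
    return count;
}
```

With libarc itself, the same alias could presumably wrap the pointers returned by ARC_File::Create and GetMemberIterator, provided Destroy() is the only cleanup they need.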
time_t and in_addr_t respectively), allowing you to work with and display them however you want. The metadata values are extracted from the URL Record associated with each ARC member. Only Version 1 URL records are supported (see the ARC File Format for details).
There are two different lengths available: the first, returned by GetRawSize, is the number of bytes in the compressed ARC member, not including the GZIP header or trailer. The second, returned by GetSize, is the number of bytes in the uncompressed member.
For example:
% arcdump -m text/html -r BT20040528233019-1.arc.gz
150   4772  15961  http://connect.basistech.com/protected/s2t/cookies.html
4940  2176  7846   http://demos.basistech.com/site/404.html?404;http://demos.basistech.com:80/s2t
7134  280   434    http://demos.basistech.com/jla
7432  2068  6677   http://www.basistech.com/clients/index.html
The first URL starts 150 bytes into the file, is 4,772 bytes long compressed, and expands to 15,961 bytes.
The next member starts 4,940 bytes in. This is 4,790 bytes beyond the start of the preceding document, and 18 bytes longer than the raw length of the first URL. These 18 bytes are composed of the 10-byte GZIP header and the 8-byte GZIP checksum footer.
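The arithmetic can be checked with a few lines of code. This sketch assumes every member carries exactly the 10-byte header and 8-byte footer described here, which holds for the listing above but would not hold if a GZIP header contained optional fields. `next_member_offset` is a hypothetical helper, not a libarc function.

```cpp
#include <cassert>

// Assumed fixed GZIP framing per member, taken from the figures above.
const long kGzipHeaderBytes = 10;
const long kGzipFooterBytes = 8;

// Hypothetical helper: given a member's file offset and raw (compressed)
// size, return the offset at which the next member should start.
long next_member_offset(long offset, long raw_size) {
    assert(offset >= 0 && raw_size >= 0);
    return offset + raw_size + kGzipHeaderBytes + kGzipFooterBytes;
}

// Usage sketch: true if the offsets in the listing above are consistent.
bool listing_is_consistent() {
    return next_member_offset(150, 4772) == 4940 &&
           next_member_offset(4940, 2176) == 7134 &&
           next_member_offset(7134, 280) == 7432;
}
```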
The HTTP result is part of the response headers so you cannot access this value until GetData() is called.
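The status code lives in the first line of the stored HTTP response, which is why the member data has to be read before it is known. The following is a rough sketch of that kind of parsing, assuming the response begins with a standard status line such as "HTTP/1.1 200 OK". `parse_http_status` is a hypothetical helper; it is not how libarc is known to do it.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of libarc): extract the three-digit status
// code from the first line of an HTTP response.
// Returns -1 if the text does not start with a recognizable status line.
int parse_http_status(const std::string& response) {
    // Only the first line is of interest.
    const std::string line = response.substr(0, response.find_first_of("\r\n"));
    if (line.compare(0, 5, "HTTP/") != 0) {
        return -1;
    }

    // The code is the three digits following the first space.
    const std::string::size_type sp = line.find(' ');
    if (sp == std::string::npos || line.size() < sp + 4) {
        return -1;
    }

    int code = 0;
    for (std::string::size_type i = sp + 1; i <= sp + 3; ++i) {
        if (!std::isdigit(static_cast<unsigned char>(line[i]))) {
            return -1;
        }
        code = code * 10 + (line[i] - '0');
    }
    return code;
}

// Usage sketch.
void status_examples() {
    assert(parse_http_status("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n") == 200);
    assert(parse_http_status("HTTP/1.0 404 Not Found\r\n") == 404);
    assert(parse_http_status("not an HTTP response") == -1);
}
```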
When you are finished with the member, you must call its ReleaseData member function to free up any storage used for it.
You can query the values of certain response headers by using the GetResponseHeader member function.
For example, you can display the Last-Modified date of all documents that have one:
const ARC_Member* member;
while ((member = all_iter->Next()) != 0) {
    off_t real_size;

    // we do not care about the data itself, but need to call this
    // to extract the HTTP response headers.
    (void) member->GetData(real_size);

    // see if the header is there, and if it is, dump the info
    std::string last_mod = member->GetResponseHeader("Last-Modified");
    if (!last_mod.empty()) {
        std::cout << last_mod << "\t" << member->URL() << std::endl;
    }
    member->ReleaseData();
}
#include <exception>
#include <iostream>
#include <string>

#include <libarc.h>

using namespace libarc;

int main(int argc, char* argv[])
{
    // the ARC file to read is expected as the only argument
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <arc-file>" << std::endl;
        return 1;
    }

    ARC_File* arc_file = 0;
    try {
        arc_file = ARC_File::Create(argv[1]);

        // get an iterator for all HTML archive members
        Member_Iterator* iter = arc_file->GetMemberIterator("text/html");

        // iterate over them and only return information on non-error
        // pages that have a Last-Modified header
        const ARC_Member* m;
        while ((m = iter->Next()) != 0) {
            off_t length;
            (void) m->GetData(length);
            if (m->HTTPStatus() == 200) {
                std::string lms = m->GetResponseHeader("Last-Modified");
                if (!lms.empty()) {
                    std::cout << lms << "\t" << m->URL() << std::endl;
                }
            }
            m->ReleaseData();
        }
        iter->Destroy();
        arc_file->Destroy();
    } catch (Unsupported_ARC_Format_Exception& e) {
        std::cerr << "Unknown ARC file format, or ARC file is corrupt." << std::endl;
        return 1;
    } catch (std::exception& e) {
        std::cerr << "Exception: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
The for_each function takes a Member_Iterator and a function or callable object and applies it to every member accessible by the iterator. The function should take a single const ARC_Member* argument; for_each calls it once for each member:
static void x_dump_last_modified(const ARC_Member* m)
{
    off_t real_size;

    // GetData must be called before the response headers are available
    (void) m->GetData(real_size);
    std::string last_mod = m->GetResponseHeader("Last-Modified");
    if (!last_mod.empty()) {
        std::cout << last_mod << "\t" << m->URL() << std::endl;
    }
    m->ReleaseData();
}

// ...

{
    Member_Iterator* iter = arc_file->GetMemberIterator();
    for_each(iter, x_dump_last_modified);
}
The _if functions take a predicate, which takes a const ARC_Member* and returns a bool indicating whether the member should be processed or not.
In this example we display the URL and size of every PDF file that is 100 KB or larger:
static void x_dump_document_info(const ARC_Member* m)
{
    std::cout << m->URL() << "\t" << m->GetSize() << std::endl;
}

// predicate: true for members of at least the given size in KB
class x_CheckSizeValue
{
public:
    x_CheckSizeValue(int kb) : m_Size(kb * 1024) {}
    bool operator() (const ARC_Member* m) const
    {
        return m->GetSize() >= m_Size;
    }
private:
    int m_Size;
};

// ...

{
    Member_Iterator* iter = arc_file->GetMemberIterator("application/pdf");
    for_each_if(iter, x_CheckSizeValue(100), x_dump_document_info);
}
Notice that the predicate in this example is implemented as a class, allowing it to be trivially reused for different sizes.
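To show how the predicate and the function cooperate, here is a hedged sketch of what a for_each_if-style helper might look like. It is not libarc's source: `for_each_if_sketch`, `FakeMember`, `FakeIterator`, `AtLeast` and `Counter` are all hypothetical names, the stand-in types exist only so the example compiles without the library, and a C++11 compiler is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for ARC_Member (assumption): only carries a size.
struct FakeMember {
    long size;
};

// Stand-in for Member_Iterator (assumption): Next() returns 0 at the end.
class FakeIterator {
public:
    explicit FakeIterator(const std::vector<FakeMember>& members)
        : m_Members(members), m_Pos(0) {}
    const FakeMember* Next() {
        return m_Pos < m_Members.size() ? &m_Members[m_Pos++] : 0;
    }
private:
    std::vector<FakeMember> m_Members;
    std::size_t m_Pos;
};

// Hypothetical helper: call `func` on every member for which `pred` is true.
// Returns the function object so that any state it gathered can be read.
template <typename Iterator, typename Predicate, typename Function>
Function for_each_if_sketch(Iterator* iter, Predicate pred, Function func) {
    assert(iter != 0);
    while (const auto* m = iter->Next()) {
        if (pred(m)) {
            func(m);
        }
    }
    return func;
}

// Predicate class, reusable for different thresholds.
struct AtLeast {
    explicit AtLeast(long bytes) : m_Bytes(bytes) {}
    bool operator()(const FakeMember* m) const { return m->size >= m_Bytes; }
    long m_Bytes;
};

// Function object that counts the members it is called with.
struct Counter {
    Counter() : count(0) {}
    void operator()(const FakeMember*) { ++count; }
    int count;
};

// Usage sketch: count the members that are at least `bytes` long.
int count_members_at_least(const std::vector<FakeMember>& members, long bytes) {
    FakeIterator iter(members);
    return for_each_if_sketch(&iter, AtLeast(bytes), Counter()).count;
}
```

Returning the function object mirrors std::for_each and lets a stateful callable, such as the counter here, report its result.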
Generated on Tue Jun 8 21:30:14 2004