Web archiving

From WhyNotWiki




The primary form of Web archiving that I care about is on-demand archiving.


http://en.wikipedia.org/wiki/Web_archiving

Web archiving is the process of collecting the Web or particular portions of the Web and ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the public. Due to the massive size of the Web, web archivists typically employ web crawlers for automated collection. The largest web archiving organization based on a crawling approach is the Internet Archive which strives to maintain an archive of the entire Web. [...]

Collecting the Web

Web archivists generally archive all types of web content including HTML web pages, style sheets, JavaScript, images, and video. They also archive metadata about the collected resources such as access time, MIME type, and content length. This metadata is useful in establishing authenticity and provenance of the archived collection.
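As a rough illustration of the metadata described above, here is a minimal sketch of a per-resource record. The function name `build_capture_record`, the field names, and the SHA-256 digest are my own assumptions for illustration; they are not taken from any particular archiving tool.

```python
import hashlib
from datetime import datetime, timezone


def build_capture_record(url, body, content_type, accessed_at=None):
    """Return a metadata record for one archived resource.

    `body` is the raw response bytes and `content_type` is the value of
    the Content-Type header as served. All field names are hypothetical.
    """
    if accessed_at is None:
        accessed_at = datetime.now(timezone.utc)
    # The MIME type is the part of Content-Type before any parameters.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return {
        "url": url,
        "access_time": accessed_at.isoformat(),
        "mime_type": mime_type,
        "content_length": len(body),
        # Assumption: a digest is added so that a later reader can check
        # that the stored bytes are unchanged.
        "sha256": hashlib.sha256(body).hexdigest(),
    }


record = build_capture_record(
    "http://example.com/",
    b"<html></html>",
    "text/html; charset=UTF-8",
)
print(record["mime_type"], record["content_length"])
```

Storing the record next to the captured bytes is one way to support the authenticity and provenance checks mentioned above.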

Methods of collection

On-demand

[...]

Database archiving

Database archiving refers to methods for archiving the underlying content of database-driven websites. It typically requires the extraction of the database content into a standard schema, often using XML. Once stored in that standard format, the archived content of multiple databases can then be made available using a single access system. This approach is exemplified by the DeepArc and Xinq tools developed by the Bibliothèque nationale de France and the National Library of Australia respectively. DeepArc enables the structure of a relational database to be mapped to an XML schema, and the content exported into an XML document. Xinq then allows that content to be delivered online. Although the original layout and behavior of the website cannot be preserved exactly, Xinq does allow the basic querying and retrieval functionality to be replicated.
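To make the idea of exporting relational content to XML concrete, here is a minimal sketch using Python's standard `sqlite3` and `xml.etree.ElementTree` modules. It is not how DeepArc works internally; the `articles` table, the `row` element, and the one-element-per-column mapping are assumptions chosen only for illustration, and column names are assumed to be valid XML element names.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_table_to_xml(conn, table):
    """Export every row of `table` as <table><row><col>value</col></row></table>."""
    # Table names cannot be bound as SQL parameters, so confirm the name
    # exists in the schema before interpolating it into the query.
    known = {r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table!r}")

    quoted = table.replace('"', '""')
    cursor = conn.execute(f'SELECT * FROM "{quoted}"')
    columns = [d[0] for d in cursor.description]

    root = ET.Element(table)
    for row in cursor:
        row_el = ET.SubElement(root, "row")
        for name, value in zip(columns, row):
            cell = ET.SubElement(row_el, name)
            # NULL values are left as empty elements.
            if value is not None:
                cell.text = str(value)
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO articles (title) VALUES ('Link rot')")
print(export_table_to_xml(conn, "articles"))
```

A separate access layer could then query the exported documents, which is roughly the role the text above assigns to Xinq.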

References

Brown, A. (2006). Archiving Websites: a practical guide for information management professionals. Facet Publishing. ISBN 1-85604-553-6. 

Day, M. (2003). "Preserving the Fabric of Our Lives: A Survey of Web Preservation Initiatives". Research and Advanced Technology for Digital Libraries: Proceedings of the 7th European Conference (ECDL): 461-472.

Eysenbach, G. and Trudel, M. (2005). "Going, going, still there: using the WebCite service to permanently archive cited web pages". Journal of Medical Internet Research: Volume 7: Issue 5.

Fitch, Kent (2003). "Web site archiving - an approach to recording every materially different response produced by a website". Ausweb 03.

Lyman, P. (2002). "Archiving the World Wide Web". Building a National Strategy for Preservation: Issues in Digital Media Archiving.

Masanès, J. (ed.) (2006). Web Archiving. Springer-Verlag. ISBN 3-540-23338-5. 

Existing solutions

http://en.wikipedia.org/wiki/Web_archiving

  • WebCite, a service specifically for scholarly authors, journal editors and publishers to permanently archive and retrieve cited Internet references (Eysenbach and Trudel, 2005).
  • Archive-It, a subscription service, allows institutions to build, manage and search their own web archive.
  • hanzo:web is a personal web archiving service created by Hanzo Archives that can archive a single web resource, a cluster of web resources, or an entire website, as a one-off collection, scheduled/repeated collection, an RSS/Atom feed collection or collect on-demand via Hanzo's open API.
  • Spurl.net is a free on-line bookmarking service and search engine that allows users to save important web resources.

I didn't realize that this problem had already been solved for the most part until I stumbled upon http://en.wikipedia.org/wiki/WebCite on 2007-02-04 21:19.

I wish I'd known about it sooner, but it apparently wasn't widely used until around 2005, so I haven't been missing out for too long.

Now that I know about it, I probably won't bother writing my own solution.

Copyright issues

http://en.wikipedia.org/wiki/Web_archiving

Not only must web archivists deal with the technical challenges of web archiving, they must also contend with intellectual property laws. Peter Lyman (2002) states that "although the Web is popularly regarded as a public domain resource, it is copyrighted; thus, archivists have no legal right to copy the Web." Some publicly accessible web archives, like WebCite's or the Internet Archive’s, allow content owners to hide or remove archived content that they do not want the public to have access to. Other web archives are only accessible from certain locations or have regulated usage. WebCite's FAQ also cites a recent lawsuit over Google's caching mechanism, which Google won.

http://www.webcitation.org/faq / http://en.wikipedia.org/wiki/Wikipedia_talk:Citing_sources#WebCite

Caching and archiving webpages is widely done (e.g. by Google, Internet Archive etc.), and is not considered a copyright infringement, as long as the copyright owner has the ability to remove the archived material and to opt out. WebCite® honors robot exclusion standards, as well as no-cache and no-archive tags. Please contact us if you are the copyright owner of an archived webpage which you want to have removed. A U.S. court has recently (Jan 19th, 2006) ruled that caching does not constitute a copyright violation, because of fair use and an implied license (Field vs Google, US District Court, District of Nevada, CV-S-04-0413-RCJ-LRL, see also news article on Government Technology).
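The FAQ text above does not say exactly how WebCite checks for opt-outs, so the following is only a sketch of how an archiver might honor robots.txt rules and opt-out meta tags. The user agent `example-archiver`, the helper names, and the choice of meta tags (`robots` with `noarchive`, and `pragma`/`cache-control` with `no-cache`) are my assumptions about common conventions.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class OptOutMetaParser(HTMLParser):
    """Set `opted_out` if the page carries a noarchive or no-cache meta tag."""

    def __init__(self):
        super().__init__()
        self.opted_out = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr = {name: (value or "").lower() for name, value in attrs}
        content = attr.get("content", "")
        if attr.get("name") == "robots" and "noarchive" in content:
            self.opted_out = True
        if (attr.get("http-equiv") in ("pragma", "cache-control")
                and "no-cache" in content):
            self.opted_out = True


def may_archive(url, robots_txt, html, user_agent="example-archiver"):
    """Return True only if neither robots.txt nor the page itself opts out."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        return False
    meta = OptOutMetaParser()
    meta.feed(html)
    return not meta.opted_out


robots = "User-agent: *\nDisallow: /private/\n"
print(may_archive("http://example.com/index.html", robots, "<html></html>"))
```

The robots.txt text and the page HTML are passed in as strings here so that the check can be tried without any network access.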

Field vs Google (http://www.webcitation.org/query?id=1138565968295754). Retrieved on 2007-02-04 22:03.

Based upon the papers submitted by the parties and the arguments of counsel, the Court finds that Google is entitled to judgment as a matter of law based on the undisputed facts. For the reasons set forth below, the Court will grant Google’s motion for summary judgment: (1) that it has not directly infringed the copyrighted works at issue; (2) that Google held an implied license to reproduce and distribute copies of the copyrighted works at issue; (3) that Field is estopped from asserting a copyright infringement claim against Google with respect to the works at issue in this action; and (4) that Google’s use of the works is a fair use under 17 U.S.C. § 107. The Court will further grant a partial summary judgment that Field’s claim for damages is precluded by operation of the “system cache” safe harbor of Section 512(b) of the Digital Millennium Copyright Act (“DMCA”). Finally, the Court will deny Field’s cross-motion for summary judgment seeking a finding of infringement and seeking to dismiss the Google defenses set forth above.

Will WebCite be around for a while?

http://en.wikipedia.org/wiki/Wikipedia_talk:Citing_sources#WebCite

Regarding the second point, concerns on WebCite being discontinued, WebCite is used and supported by over 200 academic journals, as well as permanent preservation partners, such as libraries - whose primary mission really is archiving and preserving material. U of T library, which backs this project, will certainly be around for the next hundreds of years. The academic journals who are members of the consortium are using WebCite to cite URLs, and they have a vested interest in keeping this service alive. Together, they act as guarantors and custodians for the service. Yes, you can wait 50 years to see if WebCite is still around, but then it is too late to preserve cited material. Besides, no harm is done in caching cited URLs prospectively beginning right now. If the format www.webcitation.org/?url=URL&date=DATE is used to link to snapshots (rather than the alternate short format www.webcitation.org/ID using the snapshot ID) then it can always be reverted to the original link should WebCite cease to exist. Thirdly, regarding the last criticism that authors sometimes want to link to the live version, it is up to the citing author to decide whether he wants to use WebCite or link to the "live" version instead. I would say in the majority of cases the author is more interested in having a stable link/snapshot. A future version of WebCite will be able to compare the cached version and the live version, displaying the changes, similar to Wiki's history feature.
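The point about the ?url=URL&date=DATE format being revertible can be sketched with the standard `urllib.parse` module. The function names and the `YYYY-MM-DD` date string are assumptions of mine; the quoted text does not specify what date format WebCite expects.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

WEBCITE_BASE = "http://www.webcitation.org/"


def webcite_link(original_url, date):
    """Build a snapshot link in the ?url=URL&date=DATE format."""
    return WEBCITE_BASE + "?" + urlencode({"url": original_url, "date": date})


def original_link(snapshot_link):
    """Recover the original URL from a ?url=URL&date=DATE snapshot link."""
    query = parse_qs(urlsplit(snapshot_link).query)
    if "url" not in query:
        # Short-ID links carry no copy of the original URL.
        raise ValueError("no url parameter; short-ID links cannot be reverted")
    return query["url"][0]


link = webcite_link("http://en.wikipedia.org/wiki/Web_archiving", "2007-02-04")
print(link)
print(original_link(link))
```

Because the original URL travels inside the link itself, a simple search-and-replace over a wiki could restore the live links if the service ever disappeared. The short www.webcitation.org/ID format offers no such fallback.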

Writing my own solution

For more information about my proposed solution to this problem: Cached resources database

Closely related to: Link rot
