Internet Archive Infrastructure (archive.org)
tkgally 1136 days ago [-]
This is a recent video presentation by Jonah Edwards, who runs the Core Infrastructure Team at the Internet Archive. He explains the IA’s server, storage, and networking infrastructure, and then takes questions from other people at the Archive.

I found it all interesting. But the main takeaway for me is his response to Brewster Kahle’s question, beginning at 13:06, about why the IA does everything in-house rather than having its storage and processing hosted by, for example, AWS. His answer: lower cost, greater control, and greater confidence that their users are not being tracked.

7800 1135 days ago [-]
I have ADD and typically eschew watching video if I can get the same quality content faster in text.

I loved this and watched it to the end.

To any that feel the need to hide or disparage this because it seems to promote doing things on your own vs in the cloud, this isn’t some tech ops Total Money Makeover where you read a book and you’re suddenly in some sort of anti-credit cult. This is hard shit, and it’s the basics of the hard shit that I grew up with as being the only shit.

Yes, you can serve your own data. No one should fault you for doing that if you want. It takes the humble intelligence of the core team and everyone at IA to pull that off at this scale. If you don't want to do the hard things, you could use the cloud. There are financial reasons also for one or the other, just as there are reasons people live with their family, rent, lease, and buy homes and office space (an imperfect analogy, of course).

I hope that some of those who could go on to work at the big guys, or who have been working there and want a challenge, consider applying to IA when there's an opening. They've done an incredible job, and I look forward to the cool things they accomplish in the future.

jonah-archive 1135 days ago [-]
Thank you for this! It was definitely geared towards an internal audience but it makes me very happy to know that it was enjoyed and appreciated more broadly.

I am going to get a transcript done and up soon as well -- I just gave the talk on Friday so haven't had time to do so yet.

ignoramous 1135 days ago [-]
Speaking of infrastructure, it is amazing that the initial set of Apache big data projects started at the Internet Archive [0], while Alexa Internet, a startup Brewster Kahle sold to Amazon in 1999, formed the basis of the Alexa Web Information Service, one of the first "AWS" products [1], which is still up: https://aws.amazon.com/awis/

[0] http://radar.oreilly.com/2015/04/coming-full-circle-with-big...

[1] https://web.stanford.edu/class/ee204/Publications/Amazon-EE3...

benaadams 1136 days ago [-]
I assume the 200PB of storage and 60Gbps egress bandwidth 24/365 they do would be _extremely_ pricy on AWS...
walrus01 1136 days ago [-]
On the scale of big hosting operations, 60Gbps outbound is not that much. If you're buying full-table IP transit from major carriers at IX points, I've seen 10GbE for $700-900/mo, and 100GbE circuits for under $7k/month. Of course you wouldn't want to have just one transit provider, but I'm fairly sure that if somebody said to me 'here's $20,000 a month to buy transit' on the west coast, it's within the realm of the possible.

Ideally of course they should be able to meet a fairly wide number of downstream eyeball ISPs at the major IX points in the bay area and offload a lot of traffic with settlement-free peering.

60Gbps outbound from AWS, Azure, or GCP would be astronomically expensive.

tutfbhuf 1135 days ago [-]
It seems that companies can be too big for the cloud and too small for the cloud (don't need k8s). I wonder where the sweet spot is.
867-5309 1135 days ago [-]
horizontal and vertical scaling are the latest push, but diagonal would be the sweet spot
jonah-archive 1135 days ago [-]
Exactly this. There are some logistical complexities (e.g. some of our bandwidth is funded by the E-Rate Universal Service Program for libraries, which runs on a July-June fiscal year, so rapid upgrades on that front aren't possible), but by and large egress bandwidth isn't our primary challenge. Intersite links, as I noted in the video, are the current big one, and that can and does involve occasional time-consuming construction -- but honestly, over the past year, a combination of a total blowout of my usual capacity planning (including equipment budgets) and the logistical complexities of lockdown has meant we haven't been able to upgrade as fast as we'd like.
rsync 1135 days ago [-]
Two words: hurricane electric.
dweekly 1135 days ago [-]
God bless HE Fremont. They are the unsung story of the Internet backbone. If one were to make a list of companies that at some point had a major fraction of their hosted physical infrastructure at HE I suspect it would make people's jaws drop.

It's been a huge blessing for a whole generation of startups to have a radically well-connected space that just about anyone can drop in equipment at multi-gig unmetered (albeit admittedly extremely constrained on power and cooling). It is honestly a part of what has made Silicon Valley great. Well, that and being able to cobble together a few replacement servers from Fry's components (RIP!) or schlep out to some ex-CTO's Sunnyvale garage in the middle of the night to offload some lightly used VA Linux 1U's...

Even today, you can get 15A and a 42U cabinet to call your own with unmetered gigabit for $400/mo - and probably less if you ask nicely.

owenmarshall 1135 days ago [-]
IME with cloud in the small and in the large: network prices are artificially high on the cloud providers and are very easy to get discounted if you are a big spender.
Daho0n 1135 days ago [-]
So the small pay for the large?
jeromegv 1135 days ago [-]
Yes, that's typically how it works for any industry. You get discount on bulk orders.
xhrpost 1135 days ago [-]
It seems they are regularly maxing out their network infrastructure. If it's so cheap, how come they don't just buy more? Is it the cost of the actual hardware? (I know they recently upgraded)
brewsterkahle1 1135 days ago [-]
We are upgrading again-- the pandemic has made us kind-of popular.

Because of budget we tend to use things up 100% before putting more in.

mike_d 1135 days ago [-]
They are maxing out the fiber links between their own datacenters, which is in the process of being addressed. If the bits can't get from the datacenter full of hard drives to the datacenter that connects to the internet, not much point in buying additional transit capacity.
charcircuit 1136 days ago [-]
I have no clue how people can afford what AWS charges for bandwidth. I did the math once for migrating a project to AWS, and the bandwidth alone cost 10x my entire current infrastructure for that project, which is something I run for free.
polote 1135 days ago [-]
Because people have nothing to compare their AWS cost to. They don't know how much it would cost them to host their service outside of AWS.

And it is not only a cost comparison. You need different kinds of people to manage in-house vs cloud; not that you need fewer or more, just different skills.

blacksmith_tb 1135 days ago [-]
Absolutely - AWS is this generation's IBM; no one was ever fired for buying services from them, so to speak.
johannes1234321 1135 days ago [-]
Not only that: It is really hard to predict AWS cost. So many variables go in. And starting with a small side project in AWS is easy, and then each additional step is a small step ...
dilyevsky 1135 days ago [-]
For lots of orgs I've seen, it creeps up slowly until you're paying 10-50x the cost of full transit without any peering, but by that point you're too locked in to do anything.
remus 1135 days ago [-]
Depends on whether bandwidth is important to what you're doing. In many applications it isn't, so even the inflated prices charged by AWS et al. don't really matter in the context of other expenses.
Sebb767 1135 days ago [-]
To be fair here, when you're pouring that much money into AWS you probably have a better contract and can negotiate the price down quite a bit. Additionally, you could use CloudFront to further reduce your bandwidth costs.

That's not to say that it wouldn't be incredibly expensive, but probably far less than what you see on the pricing page.

tutfbhuf 1135 days ago [-]
A hybrid solution might be possible. Put your core infra on AWS, but get a very cheap CDN (or custom solution) in front of it to handle the 60Gbps so that only a small fraction will hit your AWS infra. Do the same for storage, e.g. build your own ceph cluster on bare-metal instead of Amazon S3.
Ansil849 1135 days ago [-]
> and greater confidence that their users are not being tracked.

And that right there is why I continue to donate to IA. I am sick and tired of services offloading my data to destructive companies like Amazon.

samstave 1135 days ago [-]
Tracking should be STRICTLY illegal and ONLY acceptable with a verifiable OPT-IN and a transparency on to WHOM the data was sent/read-by/received. With STILL an option to selectively opt-out.
Shared404 1135 days ago [-]
I agree with the sentiment, but there needs to be some degree of leeway.

For example are server logs considered tracking? It seems unreasonable to require that logs not be kept.

Edit: On further thought, I don't even know that tracking should be banned. Instead, I would argue that advertising in the way enabled by tracking should be banned. That way the incentive is removed with less bureaucracy.

samstave 1135 days ago [-]
bureaucracy IS the incentive.

We need to kill bureaucratic interests in EVERY THING.

Politicians should be conscripted servants with no method for empowering or financing themselves.

They should run on policy ALONE.

Shared404 1135 days ago [-]
> bureaucracy IS the incentive

I don't agree there. There is no reason someone would have building bureaucracy as their goal. The goal is to get something out of it, which is accomplished via the bureaucracy.

> We need to kill bureaucratic interests in EVERY THING.

While I don't necessarily disagree, this is orthogonal to the original issue.

> Politicians should be conscripted servants with no method for empowering or financing themselves.

I don't think conscripted means what you think it means...

Also, that would be impossible to implement. Either you don't try to cut off every path, in which case you have the option of limiting bureaucracy, or you try to cut off every path, increasing bureaucracy.

> They should run on policy ALONE.

I agree. Also orthogonal to the topic at hand. Also, please let me know if you find a way of implementing this without requiring bureaucracy as a critical component.

GuB-42 1134 days ago [-]
> Tracking should be STRICTLY illegal

And yet, the opposite is happening. Government agencies are happy to tap into user data. And sometimes it is mandatory to keep data, depending on what you are doing and your country's legislation.

AFAIK the US is among the countries where you have the least such requirements, but there are still some sectors where logging is mandatory, like financial services. Many other first world countries (never mind dictatorships) require ISPs to keep data for a year or more.

janjim 1135 days ago [-]
Curious about the cost. Does that already include manpower and the various acquisition costs of constructing their internal network (hardware, fiber links between sites)?

I guess the biggest downside is the speed at which they can scale, as it is limited by how fast they can purchase and install new storage devices. But given the Internet Archive's use case, that shouldn't matter much.

Nextgrid 1135 days ago [-]
I've seen this argument a lot but I'm not sure how well it holds. The price for performance ratio on cloud providers is so poor that you can overprovision in advance (to mitigate the extra delay involved in adding extra hardware) and still come out ahead.

Also, bare-metal doesn't necessarily mean owning the hardware. You can rent it too. There are providers that offer bare-metal in one click, sometimes available within minutes.

Sebb767 1135 days ago [-]
> The price for performance ratio on cloud providers is so poor that you can overprovision in advance (to mitigate the extra delay involved in adding extra hardware) and still come out ahead.

It really depends on what scale you're talking about. When you're a startup and suddenly land on the front page of HN, you might need 100x or 1000x your current capacity - in which case AWS will be useful to no end.

If, on the other hand, you're an established name with quite a bit of traffic already and the maximum uptick you will reasonably experience is 2x-3x, the argument holds far less water.

ddorian43 1135 days ago [-]
Feel free to point to these cases where people scaled 1000x when they hit the HN frontpage. Especially their database.
ddorian43 1135 days ago [-]
He said their storage pricing is 2-5x cheaper than Google's archive tier. That is $1.2/TB/month divided by ~3x, or about $0.4/TB/month. Compare that against ~$20/TB/month for S3: roughly 50x lower cost. He can afford to overprovision.
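For anyone who wants to sanity-check that arithmetic, here is the same back-of-the-envelope calculation written out in Python (the per-TB figures are the ones quoted in this thread, not official price sheets):

    # Rough check of the comparison above; all figures are USD per TB per month
    # as quoted in this thread, not official pricing.
    google_archive = 1.2                 # Google's archive storage tier
    ia_estimate = google_archive / 3     # "2-5x cheaper", taking ~3x as a midpoint
    s3 = 20                              # the figure quoted for S3 above

    print(f"IA estimate: ${ia_estimate:.2f}/TB/month")             # ~$0.40
    print(f"S3 vs IA:    {s3 / ia_estimate:.0f}x more expensive")  # ~50x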
bnewbold 1135 days ago [-]
If you are interested in cost modeling for long-term digital preservation, check out this blog series: https://blog.dshr.org/2019/02/economic-models-of-long-term-s...
nodesocket 1135 days ago [-]
What happens when there is a major earthquake in the bay area though? Not being geographically dispersed seems like a large risk.
nikisweeting 1134 days ago [-]
They have backups in Canada and Egypt.
nikisweeting 1134 days ago [-]
partial* backups^
101008 1135 days ago [-]
When I was a kid, my dream was to work at Google, Microsoft, Apple, or any of these big companies. Now that I am reaching 30 (and becoming very nostalgic for the old web), I think the company I would be happiest to log in to every morning to get some work done would be the Internet Archive.
JKCalhoun 1135 days ago [-]
I love archive.org.

Hate their "front end". One of the most disorganized user-facing sites.

I would love to see an effort to address that.

Would it be possible for archive.org to offer an API to allow other sites to present archive.org using their own front-end? We could see lots of specialty sites that focus on the user experience for some slice of archive.org.

bnewbold 1135 days ago [-]
There are APIs, and it would be great if more people and organizations built on top of them, and specifically built content- or collection-specific interfaces.

Here is the entry point for API documentation: https://archive.org/services/docs/api/

Hot linking, CORS, and other things to support third-party integration are usually supported, though there are a lot of special cases for security or to prevent malicious use. If you run into technical problems, we are usually responsive via the main contact routes on the archive.org site.

The system is not designed to allow multi-party curation and editing of metadata, but there is nothing stopping folks from building third-party catalogs on top of the content stored (and served) from IA. That is sort of what openlibrary.org is for books. The same thing could be done for music, video, specific documents, etc.

Note: I work at IA but not on the APIs or archive.org collections

mycall 1135 days ago [-]
Besides using the CLI, how do you upload HTML URLs to be indexed?
fireattack 1135 days ago [-]
Probably a good place to ask: is there a reason why Wayback Machine's archives often take forever to load?

Most content on IA loads pretty fast, so the Wayback Machine is a notable exception.

npunt 1135 days ago [-]
I get the impression Wayback Machine data is stored on powered-down drives and they only spin them up when someone accesses the data. That would explain the several-second delay, and it'd make sense that an archive wouldn't need 95% of its data ready to go at a moment's notice, since that'd be a terrible waste of power.
bnewbold 1135 days ago [-]
The disks are spinning all the time, and most disks see fairly frequent reads to some content or another. A lot of content is very rarely accessed, but almost every disk has some content which gets accessed. If some spinning disks held only frequently-accessed content, they would be unable to keep up with the read rate or read throughput; as it is, things balance out reasonably well on average.

Wayback content is on the same disks as most other content, in the form of WARC files, with individual records fetched out of the middle of WARC files via HTTP range request.
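For the curious, a range request like the one described is plain HTTP. A minimal sketch in Python, with a made-up URL, offset, and length (in the real system those come out of the index lookup):

    import requests

    # Hypothetical WARC file location, byte offset, and record length;
    # in practice these values come from the index layer.
    warc_url = "https://example.org/crawl-data/example-20210101.warc.gz"
    offset, length = 123456, 7890

    resp = requests.get(
        warc_url,
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
        timeout=30,
    )
    resp.raise_for_status()
    assert resp.status_code == 206  # 206 Partial Content: the server honored the range
    record = resp.content           # one compressed WARC record, ready to decompress/parse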

Note: I work at IA but am not on core infrastructure team

npunt 1135 days ago [-]
Interesting, thanks for the insights! Then would the few-second delay be more a matter of the time it takes to decompress the contents, or that files are stored on disks which are being accessed a lot, or something else? Always been curious about it.
vermilingua 1135 days ago [-]
And disk life.
philjohn 1135 days ago [-]
Perversely, wouldn't spinning drives up and down impact drive life?
chemicalnovae 1135 days ago [-]
Probably yes, but only if it was happening a lot; I don't know where the cross-over point would be, though, past which you're better off just keeping the drives spinning...
kilroy123 1135 days ago [-]
This is my big gripe as well, it's so painfully slow.

Still, I strongly support the work they do and think it's very important work. I also think they do a good job for their size and resources.

beckman466 1135 days ago [-]
I'm surprised to read this and the OP's comment; for me it's always quite fast. Are there specific websites you request? Are they media-heavy?
coldpie 1135 days ago [-]
I picked a random article from my browser's history, linked below. Just loading the snapshot year pages took about 10 seconds, then the snapshot hover took another 10 seconds. Finally, fully loading the snapshot page took about 50 seconds. So that's roughly 90 seconds to go from entering a URL into the search bar to actually having the rendered page. Not unacceptable, but certainly slow by modern standards.

http://web.archive.org/web/20120801000000*/http://blogs.msdn...

carapace 1135 days ago [-]
90s is better than ∞ (infinity) which is what it would be if the Archive didn't exist, eh?
coldpie 1134 days ago [-]
The original comment was "the Archive is painfully slow", the reply to that was "for me it's always quite fast", so I gave some actual hard data showing that 90s is in fact quite slow, and maybe even painfully so, in 2021. I have absolutely no idea how you got from that discussion to a hypothetical where the Archive doesn't exist. Obviously 90s is better than it not existing. I even said in my comment that it was not unacceptable.

Why did you even make this comment?

carapace 1134 days ago [-]
> Why did you even make this comment?

I find it crass to gripe about the Archive being slow. It's still much better than nothing. If folks want it to be faster they can donate rather than gripe, eh?

coldpie 1133 days ago [-]
I wasn't griping, I was responding to someone saying they had no speed issues. I donate $5/mo and have done so for years, thanks for asking.
carapace 1132 days ago [-]
Not you, kilroy123.

Good for you. Seriously (no joke, no sarcasm.) Good for you, that's awesome. (And yeah, that was the obvious question: "Have you tried giving them money?" but I wasn't feeling quite that salty.)

Well met. Sorry for being a grouch.

beckman466 1134 days ago [-]
Completely agree. It's not like there are alternatives to Archive.org available.
shaunparker 1135 days ago [-]
They want you to experience what it was like to browse way back in the early 90s. Sorry, I couldn't resist the joke :)
bnewbold 1135 days ago [-]
Performance is fun!

One aspect is that our data centers are in California, with no CDN. If you are on the other side of the world, you will have higher round-trip latency on every request, for all services.

Another is layers of caching. Popular or recently requested Wayback content is more likely to be in either an explicit cache (eg, redis), or implicitly in kernel page caches across all layers of the request.

Every wayback replay request hits several layers of indexes (sorted by domain, path, and timestamp), which are huge and thus actually served from spinning disk over HTTP (!). This includes a timeline summary for the primary document, to display the banner. Then the actual raw records are fetched from another spinning disk over HTTP. This may result in one or more layers of internal redirect (additional fetches) if there was a "revisit" (identical HTTP body content, same URL, different timestamp). Then finally the record is re-written for replay (for HTML, CSS, Javascript, etc., unless the raw record was requested). Some pages will have many sub-resources, so this process is repeated many times, but that is the same as a normal page load and you can see which resources are slow or not.
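If you want to poke at the index layer yourself, the public Wayback CDX API exposes those per-URL capture lists. A minimal query sketch (field handling kept deliberately generic):

    import requests

    # Ask the public Wayback CDX server for a few captures of a URL.
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={"url": "example.com", "output": "json", "limit": 5},
        timeout=30,
    )
    resp.raise_for_status()

    rows = resp.json()
    header, captures = rows[0], rows[1:]   # with output=json, the first row names the fields
    for capture in captures:
        print(dict(zip(header, capture)))  # e.g. timestamp, original URL, status, digest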

As mentioned in the video, depending on where we are in the network hardware upgrade lifecycle, sometimes outbound bandwidth is tight also, which slows down transfer.

And of course most of these services operate without a ton of headroom, so if there is a spike in traffic everything will slow down a bit. There are a lot of multi-tenancy-like situations also, so if there is a very popular zip file or Flash game getting served from the same storage disk as the WARC file holding a wayback resource, the replay for that specific resource will be slow due to disk I/O contention.

If you are curious about why a specific HTML wayback replay was slow, you can look in the source code of the re-written document and see some timing numbers.

Several organizations run large web archives that operate similarly to web.archive.org, and have described cost/benefit trade-offs for different components. E.g., the National Library of Australia has an alternative CDX index called OutbackCDX, which uses RocksDB on SSDs. I believe other folks store WARC files in S3 or S3-like object storage systems. The Wayback Machine is somewhat unique in the amount of (read) traffic it gets, the heterogeneity of archived content (from several crawlers, in older ARC as well as WARC), the volume of live crawling ("Save Page Now" results show up pretty fast in the main site, which is black magic), running on "boring" general-purpose hardware, and deep integration with our general-purpose storage cluster.

Note: I work at IA but not on the Wayback system

ignoramous 1135 days ago [-]
The Wayback Machine started life at Amazon's search project. Maybe it never recovered from that :) https://archive.is/2B2ts
Bestia0728 1132 days ago [-]
why
jmiskovic 1135 days ago [-]
I see incredible value in IA's collection of books, videos, and software. OTOH I'm puzzled by the lack of organization.

Take for example this newer document: https://archive.org/details/manualzilla-id-5695071 The document has a horrible name and useless tags, and the content seems to be only section 2 of some software manual. How would I ever hope to find it if I needed that exact document?

Obviously such a huge archive cannot be categorized and annotated by a small team, so it would make sense to crowdsource the labeling process. Yet, as a registered user, I can only flag an item or write a review. Why doesn't IA let users label content and build their own curated collections of items?

JKCalhoun 1135 days ago [-]
Commented above before I saw yours. I agree, and wonder further if archive.org could play host instead to any number of spinoff sites that try to better organize/present the data (or a subset of the data) on archive.org.
ant6n 1135 days ago [-]
It would be nice if there were a good search engine for the Wayback Machine. Browsing the past is pretty cumbersome right now; you need to know the URLs, or at least the websites, that used to have the information you want.
niea_11 1135 days ago [-]
To be fair, the linked document was uploaded just today to a collection that seems to be considered a "waystation" collection, so the document will probably be moved to a permanent collection later.

And I think they have bots that process the uploaded documents to do OCR and create previews.

jmiskovic 1135 days ago [-]
Even for quite mature documents, allowing users to curate them would benefit everyone.

This mature archived item https://archive.org/details/whattodrawhowtod00lutz/ is well described, but it is missing tags and it's not part of any relevant collection that would help us discover other similar books.

niea_11 1135 days ago [-]
I agree that getting more people involved (by crowdsourcing or other means) would be beneficial to the project. But it'll depend on how the effort is organized.

My experience with the website is that the original uploader can edit some parts of the "metadata" (not all) of the uploaded document, like title, description, and topics/tags, but they can't move it to a different collection (initially the document is uploaded to the "community" collection).

If they want to put it in a different collection they have to contact archive.org's staff.

Sometimes the staff notice the new files and move them to the correct collection or even create a new one for them (the latter case happened to me).

caslon 1136 days ago [-]
The Internet Archive is incredibly commendable! It's impressive that such a small organization can do so much. I wish they would fix the Wayback Machine being broken in Firefox, though.
jolmg 1136 days ago [-]
> Wayback Machine being broken in Firefox

Are you getting "Fail with status: 498 No Reason Phrase"? You might have your Referer header disabled.

If that's the case, you can fix it by going to about:config and setting network.http.sendRefererHeader to 2 (or pressing the reset button to the right).

caslon 1136 days ago [-]
I am! Thank you! Why doesn't it work without REFERER headers? I haven't changed any settings from the Firefox default, and allowing it breaks links to jwz's site.
jolmg 1136 days ago [-]
> Why doesn't it work without REFERER headers?

Don't know. I just spotted that it's the only meaningful difference I had in my traffic between visiting with Firefox and with Chromium. IIRC, it's the fetching of the capture timestamps that fails when the Referer header is missing.

> I haven't changed any settings from the Firefox default

You might have forgotten like I did once.

> and allowing it breaks links to jwz's site.

Do you mean this one?

https://www.jwz.org/blog/

Seems to be working fine for me with the Referer header enabled. Maybe it was a temporary glitch?

jraph 1136 days ago [-]
It won't if you click a jwz link from Hacker News. It shows you a testicle in an egg cup. This is on purpose.
ExtraE 1135 days ago [-]
How does this work? I was under the impression that when I clicked a link in a webpage it just opened that link as if I'd typed it into my URL bar. How does jwz know where I'm clicking from?
jolmg 1135 days ago [-]
https://en.wikipedia.org/wiki/HTTP_referer

> The HTTP referer (a misspelling of referrer) is an optional HTTP header field that identifies the address of the webpage (i.e., the URI or IRI), which is linked to the resource being requested. By checking the referrer, the new webpage can see where the request originated.

https://tools.ietf.org/html/rfc7231#page-45

> The Referer header field allows servers to generate back-links to other resources for simple analytics, logging, optimized caching, etc. It also allows obsolete or mistyped links to be found for maintenance. Some servers use the Referer header field as a means of denying links from other sites (so-called "deep linking") or restricting cross-site request forgery (CSRF), but not all requests contain it.
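As an illustration of the server side of this (not jwz's actual setup, just a sketch of the mechanism), a tiny Flask handler can branch on the Referer header like so:

    from flask import Flask, abort, request

    app = Flask(__name__)

    BLOCKED_REFERRERS = ("news.ycombinator.com",)   # hypothetical deny-list

    @app.route("/blog/")
    def blog():
        referer = request.headers.get("Referer", "")
        # Refuse (or serve an alternate page to) visitors arriving from blocked sites.
        if any(host in referer for host in BLOCKED_REFERRERS):
            abort(403)
        return "Welcome to the blog."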

ExtraE 1134 days ago [-]
Thanks!
jolmg 1136 days ago [-]
Can confirm. :(
jasoncartwright 1135 days ago [-]
I didn't hear it mentioned, but is there a backup outside the Bay Area?
tkgally 1135 days ago [-]
According to an IA blog post in 2016 [1], they had partial backups in Egypt and the Netherlands then and were planning to establish a partial mirror in Canada as well. Perhaps someone at the IA can tell us what the current status of those mirrors is.

Brewster Kahle does mention in the video (around 26:50) that one reason they use paired storage is so that the two disks in a pair can be in different countries. He then goes on to praise “Linux and the wonder of open source”; he recently blogged about that, too [2].

[1] http://blog.archive.org/2016/12/03/faqs-about-the-internet-a...

[2] http://blog.archive.org/2021/02/04/thank-you-ubuntu-and-linu...

ahrs 1135 days ago [-]
If anyone from the IA reads this, are there any plans for IPv6 support? Archiving IPv4 for future generations is an important goal, so it's perhaps fitting that in 2021 the IA is still running a historical Internet Protocol, but it would be nice to have IPv6 support for those running IPv6-only networks.
chimbosonic 1135 days ago [-]
Does the IA have data sites that are not in SF? When he shows the map of sites they all seem very close to each other, and a natural disaster could wipe out a lot of the archive.
ajdude 1135 days ago [-]
They are mostly all in California, though not entirely in SF. They send some of their data to other parts of the world too, but I'm concerned that they don't have the redundancy needed. There was a project attempting to back up the Internet Archive, but it became unmaintained in 2019 and only about 200TB were actually being backed up: http://iabak.archiveteam.org/
vermilingua 1135 days ago [-]
Might be worth adding [video] to the title.
lprd 1135 days ago [-]
This is awesome! I love seeing companies run their own infrastructure. I wonder if they are using ZFS or just traditional RAID?
88 1135 days ago [-]
This is mentioned in the video, but they don’t use any form of RAID, just paired/mirrored drives in different physical locations.

This is preferred for its simplicity and performance, and if I were in their position I would do the same thing.

jonah-archive 1135 days ago [-]
Exactly correct. I suspect that in the not-too-distant future we will need to move to multi-disk filesystem-level clustering on the storage nodes for some of the reasons laid out in the talk but it's not at all unlikely that we retain the "disconnected mirror" abstraction for redundancy.

Additionally, as I alluded to in the video, we regularly end up going down into the details and when this happens being able to closely examine and understand a disk's contents as written directly at the LBA using fibmaps and similar tooling is invaluable. I have personally been involved in the discovery of multiple possible-data-loss hard drive firmware bugs, and our catalog system is very paranoid about integrity checking. Modern hard drives are computers unto themselves and layering additional complexity on top of that (particularly complexity which might obscure or silently correct errors in the underlying datastream) is something I approach very carefully.

lprd 1134 days ago [-]
Hey Jonah,

Perhaps you already covered this in the video and I missed it, but I was wondering how the team goes about hardware upgrades/disposal? I maintain just a few servers in my own homelab, so upgrading is pretty trivial for me, but I can't imagine what it's like at that scale. Also, which hypervisor are you using to manage all of those VMs?

Thanks again for the talk, it was very insightful and fun to watch!

jonah-archive 1134 days ago [-]
Thank you, so glad you enjoyed it!

We try to keep a regular upgrade cycle (to hold to our tight budget), typically with a tick-tock of adding new hardware (expanding within our existing footprint) and cycling out old hardware and drives as they reach the far end of the works/doesn't work spectrum. We have a local partner who takes care of some disposal for us, but we also have no shortage of physical storage space, so we will also accumulate things (sometimes intentionally -- nearly our entire "red box" deployment -- see "previous version" at https://archive.org/web/petabox.php -- is packed into a shipping container. We don't like to throw things away!).

For a hypervisor, we use Ganeti (running over KVM). Because our fleet is so heterogeneous, we need to be able to control a lot of VM parameters in order to efficiently pack our computational resources, and Ganeti is kind of in a sweet spot for us in terms of providing a lot more tooling than a bunch of virsh scripts while being much smaller than systems like OpenStack that are geared towards large, homogeneous deployments.

silicon2401 1135 days ago [-]
What's the difference between RAID and paired/mirrored drives? Is the latter better for NAS?
88 1135 days ago [-]
RAID is an abstraction layer on top of the physical disks.

RAID 1 (mirrored drives) is similar to what the IA have described, but it sounds like they are creating their mirrors using simple file system commands (or tools like rsync) rather than introducing the complexity and overhead of hardware/software RAID.

Other forms of RAID (e.g. RAID 5/6) where data is spread across an array of drives with parity, would provide the IA with additional redundancy but at the expense of significantly increased cost and complexity.
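Purely as an illustration of the "mirror with simple tools" idea (not IA's actual tooling, and the paths and host name below are hypothetical), a one-way pair sync could be as small as:

    import subprocess

    SOURCE = "/srv/items/"            # hypothetical local item directory
    MIRROR = "pair-node:/srv/items/"  # hypothetical paired storage node

    # Archive mode (-a) preserves permissions and timestamps; --checksum compares
    # file contents rather than trusting size/mtime, at the cost of extra reads.
    subprocess.run(["rsync", "-a", "--checksum", SOURCE, MIRROR], check=True)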

jgowdy 1134 days ago [-]
Honestly, I've lost a bit of confidence in the Internet Archive project's judgement and long-term stability, given the risk they took in distributing books during Covid-19. I feel that they risked the entire project with massive (almost certainly fatal) copyright violation fines in order to distribute extra copies of books. That lawsuit is still pending, and we don't know what its outcome will be.

If they're willing to risk everything they've accomplished to date in order to issue a few extra copies of books, I don't see them surviving long term, nor do I feel comfortable donating to the project.

https://www.npr.org/2020/06/03/868861704/publishers-sue-inte...

bestboy 1135 days ago [-]
If I understood correctly, they are using simple physical disk mirrors for redundancy. To me that seems like a huge waste of disk space. Parity-based redundancy schemes like RAID-Z3 are way more space-efficient. I do understand that parity-based schemes need more time to heal/rebuild on drive replacement, but that does not seem to outweigh the huge amount of wasted disk space IMHO.
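A rough way to see the space trade-off being debated here (the 12 data + 3 parity RAID-Z3 width is an assumed example, not anything IA uses):

    # Fraction of raw capacity that ends up usable.
    def usable_fraction(data_disks, parity_disks):
        return data_disks / (data_disks + parity_disks)

    print(f"mirrored pairs : {usable_fraction(1, 1):.0%}")    # 50% of raw capacity
    print(f"RAID-Z3 (12+3) : {usable_fraction(12, 3):.0%}")   # 80% of raw capacity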
88 1135 days ago [-]
The paired disks are in a different physical location so they also provide a degree of geographic redundancy.

Short of splitting a RAID array across two physical locations (a terrible idea), your proposal would require them to run mirrored RAID arrays in both locations.

This would give them greater redundancy, but would be a less efficient use of raw disk space than their current solution. It would also be more complex and difficult to maintain, and have performance impacts.

nwmcsween 1135 days ago [-]
There are tables for bandwidth and IOPS vs. differing RAID levels; mirroring is generally the best.
bestboy 1135 days ago [-]
True, but I do not consider nearline archiving to be a workload that requires lots of IOPS.
notacoward 1135 days ago [-]
Besides the cross-DC issue others have mentioned, erasure coding everything can also exacerbate CPU or memory bottlenecks. Not sure if this is an issue for IA, but on my last project data would be initially replicated and then transparently converted to erasure codes after some time. I believe that some other exabyte-scale storage systems work similarly.
bsmith0 1135 days ago [-]
It wasn't clear to me, but I might have just missed it. Is there a backup strategy beyond just duplicate data?

Have they lost data during a rebuild? I know he briefly talked about that risk with different HD sizes.

bobnarizes 1135 days ago [-]
I was wondering if they use some machine learning/artificial intelligence to prevent hard drives from failing, or to move the data around more efficiently, ...
acidburnNSA 1136 days ago [-]
I've been looking into Ceph a lot recently and was just wondering if they used it. Apparently not. Perhaps too abstract given their value of simplicity.
ddorian43 1136 days ago [-]
Check out this experiment they did: https://github.com/internetarchive/sandcrawler/blob/master/p... It was just a part of the infrastructure.

The point is, Ceph and friends have a lot of overhead. An example in Ceph: by default, a file in the S3 layer is split into 4MB chunks, and each of those chunks is replicated or erasure-coded. Using the same erasure coding as Wasabi or B2 Cloud, which is 16+4=20 (or 17+3=20), each of those 4MB chunks is split into 20 shards of ~200KB each. Each of those shards ends up having ~512B to 4KB of metadata.

So that's 10KB to 80KB of metadata for a single 4MB chunk.
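Writing out that arithmetic (same numbers as above):

    # Per-chunk metadata when a 4MB chunk is erasure-coded into 20 shards
    # (16 data + 4 parity), with ~512B to ~4KB of metadata per shard.
    shards = 16 + 4
    for per_shard in (512, 4 * 1024):           # bytes of metadata per shard
        total_kib = shards * per_shard / 1024
        print(f"{per_shard:>5} B/shard -> {total_kib:.0f} KiB of metadata per 4 MiB chunk")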

antongribok 1135 days ago [-]
You will always have overhead.

In the video they mention storing everything in regular files on the filesystem. A regular filesystem would have inode overhead as well. XFS by default has 512 byte inodes (it can be more if you format it with bigger inodes, like you would for Ceph's Filestore backend).

For a lot of workloads Ceph's default erasure coding scheme (and Bluestore) would still be a lot more efficient than mirroring a file on top of a regular filesystem.

ddorian43 1134 days ago [-]
> For a lot of workloads Ceph's default erasure coding scheme (and Bluestore) would still be a lot more efficient than mirroring a file on top of a regular filesystem.

Yes that's correct, it's why Bluestore was created in the first place.

ddorian43 1136 days ago [-]
The same 4MB chunks in SeaweedFS would have ~40B * 20 = ~800 bytes of metadata.

Note: SeaweedFS doesn't actually support 16+4; it's set to 10+4 in the source code. But the architecture makes the low overhead possible.

known 1135 days ago [-]
archive.today is hosted on a single server :)
sidpatil 1135 days ago [-]
But it only replicates the functionality of the Wayback Machine, which is just a part of the Internet Archive's offerings.
iDATATi 1135 days ago [-]
I'm looking for some badass engineers to join our team to create something that is going to take over the advertising model. YES, it's true and it has to be done....

Want to be part of changing the world?

coopreme 1135 days ago [-]
No thank you.