Update: Downloading all archive.org metadata

BermudaHighball@lemmy.dbzer0.com · edit-2 4 days ago

Yes, exactly why I wanted to start this project. It’s nice to have the Internet Archive but we cannot trust that content won’t be taken down eventually. Even just storage costs might become an issue in the future for data that gets maybe 30 total views over many years. But it is nice to hear some of the data you were looking at is coming back.

Long term, it would be nice for a community of users to create a decentralized index of Internet Archive metadata that cannot be taken down and that includes the torrent files for the content, so people can share it and help seed the content they care about. The Internet Archive might cooperate to make this easier, for example by using BitTorrent v2, which would help us detect duplicate files and avoid padding files, since all files are aligned to pieces in v2.
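To make the padding point concrete, here is a minimal sketch assuming a v1-style multi-file torrent, where files are laid out back to back in one byte stream and a pad file is inserted after each file so the next one starts on a piece boundary. The function names are made up for illustration and do not come from any torrent library.

```python
def pad_bytes_needed(file_size: int, piece_length: int) -> int:
    """Bytes of padding needed after a file so the next file starts on a piece boundary."""
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    # A file that already ends on a piece boundary needs no padding.
    return (-file_size) % piece_length


def total_padding(file_sizes: list[int], piece_length: int) -> int:
    """Total padding across a file list; the last file needs no pad after it."""
    return sum(pad_bytes_needed(size, piece_length) for size in file_sizes[:-1])


if __name__ == "__main__":
    # Hypothetical item with three files and a tiny 64-byte piece length.
    print(total_padding([100, 128, 10], 64))
```

With per-file piece alignment built into the format, as in v2, this bookkeeping is not needed.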

Currently there is little incentive for people to seed the Internet Archive content but no doubt it will become more important to do that in the future.

BermudaHighball@lemmy.dbzer0.com · 4 days ago

Update: Downloading all archive.org metadata

BermudaHighball@lemmy.dbzer0.com · 12 days ago

The link to the above release post has the wrong caption for me. Its title says “Ambulance hits Oregon cyclist, rushes him to hospital, then sticks him with $1,800 bill, lawsuit says - Divisions by zero”

BermudaHighball@lemmy.dbzer0.com · 13 days ago

Yes, I think so. I’ll definitely use the example for downloading some of the files (.torrent, metadata file) once I have some items. But first I need to find all the items ever uploaded.

BermudaHighball@lemmy.dbzer0.com · 13 days ago

Thank you for the tips. I am actually interested in enumerating metadata for every “item” ever uploaded, as defined by the API page. For example, one item = one ID:

Archive.org is made up of “items”. An item is a logical “thing” that we represent on one web page on archive.org. An item can be considered as a group of files that deserve their own metadata.

You did cause me to look at the API docs again, though, and I think I found something that does enumerate all item names, and as a bonus, it will keep you updated when changes are made: https://archive.org/developers/changes.html

We’ll see how much progress I can make. It might take a while to get through all the millions of them.

BermudaHighball@lemmy.dbzer0.com · edit-2 12 days ago

Downloading all archive.org metadata

BermudaHighball@lemmy.dbzer0.com · 11 months ago

Whatever happened to DNA-based storage research?

BermudaHighball@lemmy.dbzer0.com · 1 year ago

This was something I suggested for this instance, since there is even a guide for hosting an onion service: https://lemmy.dbzer0.com/post/135234

Maybe /u/db0 will have more time after the spam settles down, but it seems he’s got a lot on his plate at the moment between being an admin and doing AI stuff.

BermudaHighball@lemmy.dbzer0.com · 1 year ago

I often look for older or niche content, and even for that I still often have plenty of takers on public trackers. That my machine is port forwarded might have something to do with it. I’d say I have a “medium” amount of disk space and only stop seeding when I delete the files, but sometimes I limit the upload rate to keep some for other activities.

BermudaHighball@lemmy.dbzer0.com · 1 year ago

Prediction: AT-style decentralized hoarding of the web

BermudaHighball@lemmy.dbzer0.com · edit-2 1 year ago

Have OSes evolved enough that encrypted DNS is available? If so, would someone with enough technical knowledge link a guide on how to set it up within a popular OS?

I imagine that even if you plug in one of the suggested DNS provider IP addresses into your network settings, the OS is still going to make plaintext requests that your ISP can snoop on unless you require it to be encrypted somehow.
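For contrast with a plaintext lookup, this is a rough sketch of what an encrypted lookup looks like at the application level, using DNS over HTTPS. It assumes Cloudflare's public JSON endpoint at `https://cloudflare-dns.com/dns-query`; the helper names are hypothetical, and it does not change what the OS resolver itself does.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Cloudflare's public DNS-over-HTTPS JSON endpoint.
DOH_ENDPOINT = "https://cloudflare-dns.com/dns-query"


def build_doh_request(name: str, record_type: str = "A") -> urllib.request.Request:
    """Build (but do not send) an HTTPS request asking the resolver for a DNS record."""
    query = urllib.parse.urlencode({"name": name, "type": record_type})
    return urllib.request.Request(
        f"{DOH_ENDPOINT}?{query}",
        headers={"accept": "application/dns-json"},
    )


def resolve(name: str, record_type: str = "A") -> list[str]:
    """Send the query over TLS and return the data field of each answer record."""
    with urllib.request.urlopen(build_doh_request(name, record_type), timeout=10) as resp:
        payload = json.load(resp)
    return [answer["data"] for answer in payload.get("Answer", [])]


if __name__ == "__main__":
    print(resolve("example.com"))
```

The question and answer travel inside HTTPS here, so an on-path observer sees a connection to the resolver rather than the name being looked up.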

BermudaHighball@lemmy.dbzer0.com · 1 year ago

Depending on the content, 10 or 20 comes quick

BermudaHighball@lemmy.dbzer0.com · 1 year ago

Note that H.264 and H.265 are the video compression standards, while x264 and x265 are FOSS video encoders for them: x264 is developed by VideoLAN, and x265 is developed by MulticoreWare.

BermudaHighball@lemmy.dbzer0.com · 1 year ago

I agree, and with FOSS you have the opportunity to contribute back to the software. One time I was using commercial software and reached out to the company about how to decode a special file format for use in a script, and the response was that it was “proprietary”. If it had been FOSS, or even if they had just given me the information, I would have contributed to growing the ecosystem.

BermudaHighball@lemmy.dbzer0.com · 1 year ago

Software could have trojans. But why not music?

BermudaHighball@lemmy.dbzer0.com · 1 year ago

It must be a bug. For me, I didn’t see the subscribe button at all yesterday, just a plaintext “Subscribe” that I couldn’t click. When visiting one of the posts, the button finally appeared today.

BermudaHighball@lemmy.dbzer0.com · 1 year ago

New account created today, yeah that’s fishy.

Torrents use cryptographic hashes to verify the torrent content, so if he seeds it to you, then your torrent client will validate data he gives you. If the data doesn’t verify or if he wants you to do anything else like clicking a link, avoid and report.
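The check a client performs can be sketched roughly as follows, assuming a BitTorrent v1 torrent, where the `pieces` field of the info dictionary is a concatenation of 20-byte SHA-1 digests, one per piece. The function names are illustrative and not taken from any particular client.

```python
import hashlib

SHA1_LEN = 20  # length of one SHA-1 digest in bytes


def split_piece_hashes(pieces: bytes) -> list[bytes]:
    """Split the concatenated v1 'pieces' field into individual 20-byte digests."""
    if len(pieces) % SHA1_LEN != 0:
        raise ValueError("pieces field length must be a multiple of 20")
    return [pieces[i:i + SHA1_LEN] for i in range(0, len(pieces), SHA1_LEN)]


def verify_piece(index: int, data: bytes, pieces: bytes) -> bool:
    """Return True if the received data hashes to the digest listed for that piece."""
    expected = split_piece_hashes(pieces)[index]
    return hashlib.sha1(data).digest() == expected


if __name__ == "__main__":
    # Build a toy 'pieces' field from two known chunks, then check good and bad data.
    chunks = [b"first piece", b"second piece"]
    pieces_field = b"".join(hashlib.sha1(c).digest() for c in chunks)
    print(verify_piece(0, b"first piece", pieces_field))
    print(verify_piece(1, b"tampered data", pieces_field))
```

A piece that fails this check is discarded and requested again, so a seeder cannot slip different content into an existing torrent.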

It’s sometimes possible to find the same files on other download sites, but “retrieving dead torrents” in general isn’t possible without having the same data.

BermudaHighball@lemmy.dbzer0.com · 1 year ago

Warez: Do you pirate software or just use FOSS?

BermudaHighball@lemmy.dbzer0.com · 1 year ago

This was data from Pushshift before Reddit nuked it in March. You can find this torrent (called “Reddit comments/submissions 2005-06 to 2022-12”) and others, including 2023-01 and 2023-02, on https://academictorrents.com, uploaded by user Watchful1.

BermudaHighball@lemmy.dbzer0.com · edit-2 1 year ago

Thanks! For anyone curious, links to the Academic Torrents version of the Reddit archives are available on /r/datahoarder and probably on their lemmy.ml community too.

BermudaHighball@lemmy.dbzer0.com · edit-2 1 year ago

Note that Mozilla VPN uses Mullvad’s network under the hood. Also, depending on your device you should be able to block connections that don’t use the VPN. On Android, the “kill switch” can be found in the settings as described here: https://mullvad.net/en/help/using-mullvad-vpn-on-android/#block-without-vpn

BermudaHighball@lemmy.dbzer0.com · 1 year ago

Pushshift is down now? Is there a data hoarder who has a backup of all the historical Reddit data that we can seed?

BermudaHighball@lemmy.dbzer0.com · 1 year ago

Use Tor Browser if you need anonymity, which isn’t offered by private browsing mode or most other extensions. In case you don’t want to route through the Tor network, Mullvad Browser offers the same fingerprinting resistance techniques as Tor Browser.

BermudaHighball@lemmy.dbzer0.com · edit-2 1 year ago

Proton is a good service, but their years of reluctance to include more anonymous payment methods such as Monero and the inability to register an account from an anonymous IP address without a phone number make me question the relative benefit of using them as a VPN.

These do not by themselves result in a compromise of anonymity if Proton is trustworthy and the Swiss laws still enable them to disassociate your identity (given via payments) from your account usage, but regulation and governments tend to become stricter rather than looser over time, and I would demand more from a service you are entrusting with all your internet traffic.

BermudaHighball@lemmy.dbzer0.com · edit-2 1 year ago

If you want to learn Python, the tutorial in the official documentation is an excellent starting point. Reading the documentation (the most up-to-date, deliberate content) will make you far more of a Python wizard than Codecademy ever could.