Over a decade ago, I pointed out that as Google kept trying to worm its way deeper into our lives, a key Achilles’ heel was its basically non-existent customer service and unwillingness to ev…
Just some advice to anyone who finds themselves in this specific situation, since I found myself in almost the exact same situation:
If you really, really want to keep the data, and you can afford to spend the money (big if), move it to AWS. I had to move almost 4.5PB of data out of Google Drive around Christmas of last year. I spun up 60 EC2 instances, set up rclone on each one, and created a Google account for each instance. Google caps downloads at 10TB per account per day, but the EC2 instances I used were rate limited to 60MB/s, so I never hit the cap. I gave each EC2 instance a segment of the data, separating on file size. After transferring to AWS, verifying the data synced properly, and building a database to find files, I dropped it all to Glacier Deep Archive. I averaged just over 3.62GB/s for 14 days straight to move everything. Using a similar method, this poor guy’s data could be moved in a few hours, but it would cost a couple thousand dollars at least.
Bad practice is bad practice, but you can get away with it for a while, just not forever. If you’re in this situation, because you made it, or because you’re cleaning up someone else’s mess, you’re going to have to spend money to fix it. If you’re not in this situation, be kind, but thank god you don’t have to deal with it.
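For anyone wanting to sanity-check those numbers, here is a rough back-of-the-envelope sketch. It assumes decimal units (1TB = 1,000,000MB) and that every instance runs flat out at its 60MB/s limit the whole time, which is an idealization; the variable names are just for illustration.

```python
# Rough check of the figures quoted in the comment above.
# Assumptions: decimal units, every instance saturates its rate limit 24/7.
INSTANCES = 60            # EC2 instances, one Google account each
PER_INSTANCE_MB_S = 60    # per-instance rate limit in MB/s
DAILY_CAP_TB = 10         # Google's per-account daily download cap
TOTAL_PB = 4.5            # total data to move
SECONDS_PER_DAY = 86_400

# How much one instance can pull per day at its rate limit.
per_instance_tb_day = PER_INSTANCE_MB_S * SECONDS_PER_DAY / 1_000_000

# Combined throughput of the whole fleet, in GB/s.
aggregate_gb_s = INSTANCES * PER_INSTANCE_MB_S / 1_000

# Days needed to move everything at that combined rate.
days_needed = TOTAL_PB * 1_000_000 / (aggregate_gb_s * SECONDS_PER_DAY)

print(f"per instance: {per_instance_tb_day:.3f} TB/day (cap is {DAILY_CAP_TB} TB/day)")
print(f"fleet: {aggregate_gb_s:.1f} GB/s, about {days_needed:.1f} days for {TOTAL_PB} PB")
```

Each instance tops out at roughly half the daily cap, which is why the cap never came into play, and the fleet-wide rate lands close to the quoted average over about two weeks.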
4.5PB holy shit. You need to stop using UTF2e32 for your text files.
I’d be paranoid about file integrity. Even a 0.000000000022% (sic) chance of a single bitflip somewhere along the chain, like a gentle muon tickling the server’s drive bus during the read, could affect you. Did you have a way of checking integrity? Or were tiny errors tolerable (eg video files)?
They were using rclone, so all of the transfers would be hash checked. Whether the file integrity on either side of the transfer can be relied upon is in some ways a matter of faith, but there are a lot of people relying on it.
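To illustrate what "hash checked" means in practice, here is a minimal sketch of the general idea: hash the bytes you received and compare against the hash the source reports. This is not rclone's actual code, and `md5_of_file` and `matches` are hypothetical helper names. MD5 is used here because Google Drive exposes MD5 checksums for stored files.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches(path, expected_hex):
    """Compare a local file's MD5 with the checksum reported by the source."""
    return md5_of_file(path) == expected_hex.lower()
```

A check like this catches corruption in transit. It cannot tell you whether the file was already damaged at rest before the source computed its hash, which is the "matter of faith" part.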
You don’t even need an EC2 instance if all you’re doing is moving the data to Amazon S3. rclone can do direct cloud-to-cloud transfers, so the data won’t hit the computer where rclone is running, and it should be very fast. You’re going to have an eye-watering S3 bill though. Once the data is in an S3 bucket, you can copy it to Glacier later.
Server side copies will only be attempted if the remote names are the same
It sounds like that’s only for storage systems that support move/rename operations within themselves, and isn’t able to transfer between different storage providers.
Wow. That’s a lot of “homework”.
I’m just curious how someone even gets to 4 Petabytes of data. It’s taking me years to fill up just 8 TB. And that’s with TV and movies.
You’re right. Server-side copy is only done when syncing within Google Drive.
AWS is very expensive. There are other compatible storage options, like Backblaze B2 and Wasabi, that are better for this use case.
Seems a few thousand is worth it for your life’s work
Jesus
How much do you pay for that in AWS?