How I Back Up Google Photos
Using ZFS and Photoprism
- Prerequisites and Setup
- Step 1 - Create Takeout Archives
- Step 2 - Downloading Takeout Archives
- Step 3 - Extracting Takeout Archives
- Step 4 - Deduplicating Media Files
- Cleanup
Google Photos is wonderful. It backs up my photos and videos, tags them, and makes them available to share with my family.
But I also don’t trust that Google will never accidentally “lose” my photos. This guide covers the process I use to back up my Google Photos content, in its original quality and with sidecar metadata, to a self-hosted instance of Photoprism.
Prerequisites and Setup
My Photoprism instance runs on a TrueNAS Scale server as a Kubernetes app. While yours doesn’t need to match this setup exactly, it’s worth describing. Photoprism has three primary storage folders: “Originals”, “Import”, and a data path for sidecar files, multimedia caches, and the database file (if you’re using the built-in SQLite storage, though I’d strongly caution you to set up an external MariaDB instance if you have a large photo library). Photoprism treats “Originals” as an essentially read-only store of media - anything you put in here will be indexed by Photoprism in place, as-is. Meanwhile, any media put in “Import” will be ingested by Photoprism and moved into the Originals folder, but you will have no control over how the imported media are organized. For our purposes, I strongly recommend putting your Google Photos media in “Originals” to preserve the folder structure Takeout outputs.
In my TrueNAS setup the Originals and Import folders are mounted onto ZFS datasets on the host, which allows me to access and administer the media files while Photoprism isn’t running. Specifically, “Originals” is mounted at /mnt/Bulk Storage/Photoprism/Originals, which we’ll just call $ORIGINALS from here on.
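To make that concrete, here’s roughly how the three storage folders map onto host paths in my setup. Only $ORIGINALS is used by the commands in this guide; the Import and storage paths are assumptions based on my layout, so adjust everything to match your own pool and dataset names.
# Host paths backing Photoprism's storage folders in my TrueNAS setup.
# Only ORIGINALS is used below; the other two paths are illustrative.
ORIGINALS="/mnt/Bulk Storage/Photoprism/Originals"  # read-only media library
IMPORT="/mnt/Bulk Storage/Photoprism/Import"        # ingested and reorganized by Photoprism
STORAGE="/mnt/Bulk Storage/Photoprism/Storage"      # sidecars, caches, database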
First and most importantly, we will use the immutable (+i) file attribute to ensure that any files already in our “Originals” folder won’t be overwritten when we extract the Takeout GZip archives. To do this, we can run the following command:
# Find all regular files in our Originals folder and mark them as +i (immutable).
find "$ORIGINALS" -type f -print0 \
| sudo xargs -0 -n 1 chattr +i
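If you want to double-check that the flag took effect, you can inspect a file with lsattr and try to modify it; the path below is hypothetical, so substitute a file that actually exists in your library.
# Sanity check (optional): the 'i' flag should show up in lsattr output,
# and attempts to modify the file should fail with "Operation not permitted".
lsattr "$ORIGINALS/2023/example.jpg"       # hypothetical path - use a real file
sudo touch "$ORIGINALS/2023/example.jpg"   # should be refused while immutable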
We’ll also be using a few tools not distributed with TrueNAS Scale (or most Linux distros, for that matter). Most can be installed via package managers; the rest can be built from source:
- pigz - parallel GZip implementation (homepage).
- progress - the Coreutils progress viewer (not necessary - just nice; homepage).
- jdupes - Jody Bruchon’s enhanced fdupes clone (homepage).
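On a Debian- or Ubuntu-based admin machine all three are usually available straight from the package repositories; on TrueNAS Scale itself you may prefer to build them from source or run them from a container. The package names below are the common Debian ones and may differ on your distribution.
# Install the helper tools on a Debian/Ubuntu system (package names may vary).
sudo apt install pigz progress jdupes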
Step 1 - Create Takeout Archives
- Go to Google Takeout.
- Deselect all products (everything is selected by default).
- Scroll down to “Google Photos” and select its checkbox.
- Scroll to the bottom and continue to the next step.
- For “File type” select .tgz, and for “File size” select 50GB.
- Create the export.
Step 2 - Downloading Takeout Archives
Once your Google Takeout files are ready you will get an email (you can also check the Takeout site for completed archives). Depending on the size of your Google Photos library you may have multiple ~50GB files; I usually have around 10-15. I prefer to download one file at a time (depending on free disk space). Clicking a file will start a browser download, but I like to pause that download, right-click it and “copy download URL”, and then hand that URL to a download tool like Aria2.
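For reference, here’s a sketch of how I hand the copied URL to aria2 - the output file name is just an example, and the copied URL is long and signed, so keep it quoted:
# Resume the Takeout download outside the browser (example file name; paste your own URL).
aria2c --continue=true --out=takeout-photos-001.tgz "PASTE_COPIED_DOWNLOAD_URL_HERE"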
Step 3 - Extracting Takeout Archives
Once I have some Takeout archive files downloaded (you don’t have to download them all at once, or even in order!) we’ll start extracting them into our “Originals” folder. Your Photoprism instance shouldn’t index Originals unless you tell it to, but I usually shut Photoprism down while performing these steps just to be safe.
Here's the script I use. It uses pigz to decompress the specified archive in parallel (for speed) and then extracts the Google Photos files into our “Originals” folder:
#!/usr/bin/env sh
# Usage:
# sudo ./extract-to-originals.sh takeout-archive.tgz
if [ "$(id --user)" -ne 0 ]; then
echo "Please run with sudo!" 1>&2;
exit 1;
fi
# Decompress in parallel with pigz and extract into $ORIGINALS in the
# background, while monitoring pigz's progress in the foreground.
pigz --decompress <"$1" \
| tar --extract --strip-components=2 --skip-old-files --directory="$ORIGINALS" \
& progress --wait --monitor --command pigz;
# tar preserves the archived ownership, so hand the files back to Photoprism.
chown -R photoprism:photoprism "$ORIGINALS";
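A typical invocation looks something like this (the archive name is hypothetical). Because the script reads $ORIGINALS from the environment, make sure sudo passes it through - depending on your sudoers policy you may need to adjust how the variable is handed over:
# Example invocation; export ORIGINALS first so the script can see it.
export ORIGINALS="/mnt/Bulk Storage/Photoprism/Originals"
sudo --preserve-env=ORIGINALS ./extract-to-originals.sh takeout-20240101T000000Z-001.tgz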
Let’s break down what’s happening here:
- if [ "$(id --user)" -ne 0 ] - checks that the script was run as root (or with sudo).
- pigz --decompress <"$1" - decompresses the GZip file given as the first argument to the script.
- tar --extract … - extracts files from the tar archive that pigz decompressed. There are a few extra options:
  - --strip-components=2: remove some unnecessary path components that Takeout adds.
  - --skip-old-files: skip files that already exist on disk.
  - --directory: extract files into our $ORIGINALS folder.
- progress --wait --monitor --command pigz - monitors the progress of pigz as it decompresses the input file.
- chown -R photoprism:photoprism "$ORIGINALS" - because tar attempts to preserve the ownership of the files it extracts, we need to correct this after extraction. Obviously update the user:group to suit your own needs.
Because the files already in our “Originals” folder are marked immutable, tar can’t overwrite them - and we also give tar the --skip-old-files option to avoid a lot of error output about not being able to extract existing files. We do this so that the next time you back up your Google Photos, any files that were already extracted during a previous backup aren’t re-written, and we can safely skip extracting them from the archive. This will save a ton of time.
Step 4 - Deduplicating Media Files
Repeat step 3 for all of the Takeout archives - you can delete each one as you go once it’s been extracted. Once you’re done, we can see how many new files we have by counting the files that are not immutable.
Here's a little script to do just that.
#!/usr/bin/env sh
# Count the new (non-immutable) files in $ORIGINALS.
if [ "$(id -u)" -ne 0 ]; then
echo "Please run with sudo!" 1>&2;
exit 1;
fi
find "$ORIGINALS" -type f -print0 \
| xargs --null lsattr \
| grep -- "----------------------" \
| wc --lines;
Breaking it down:
- if [ "$(id -u)" -ne 0 ] - again, checks that we’re running as root or with sudo.
- find "$ORIGINALS" -type f -print0 - finds the files in our Originals folder. We use -print0 to safely handle filenames with weird characters in them.
- xargs --null lsattr - runs lsattr on each file that find outputs, which lists the attributes set on that file.
- grep -- "----------------------" - of the lines (files) output by lsattr, keep only those with no attributes set (files with immutable set would show as ----i----------------- instead).
- wc --lines - count the number of lines (files).
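If you’d rather see which files are new instead of just how many, a small variation on the same pipeline prints the paths; this is just a sketch, assuming lsattr’s usual “flags, a space, then the path” output format:
# List the new (non-immutable) files rather than counting them.
find "$ORIGINALS" -type f -print0 \
| xargs --null lsattr \
| grep -- "----------------------" \
| cut --delimiter=' ' --fields=2-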
Neat! Now let’s deduplicate all of these files. Takeout includes multiple copies of some media, since a single photo or video can appear in multiple albums. To deduplicate these copies and save a bunch of hard drive space we’ll use jdupes. This tool searches for files that are identical and (in our case) turns them into filesystem hardlinks - which means that the same data on disk can be pointed to by multiple file names.
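If hardlinks are new to you, here’s a tiny standalone illustration (not part of the backup workflow) of two file names pointing at the same data:
# Create a file, add a second hardlink to it, and show that both names
# share the same inode with a link count of 2.
echo "fake image data" > original.jpg
ln original.jpg copy.jpg
stat --format='%i %h %n' original.jpg copy.jpg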
First we need to remove the immutable flag from every file, which can be done quickly with sudo chattr -R -i "$ORIGINALS". Then we run jdupes, which has a lot of options, but the ones we want are:
sudo jdupes --link-hard --recurse "$ORIGINALS"
That is:
- --link-hard - hard-link all identical files together.
- --recurse - consider all files within subdirectories of $ORIGINALS recursively.
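If you want to see what jdupes would touch before letting it modify anything, you can run it without the --link-hard action first; with no action flag it simply prints the sets of duplicate files it finds.
# Preview the duplicate sets without changing anything on disk.
sudo jdupes --recurse "$ORIGINALS" | less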
When this is done we’ll have potentially saved a lot of disk space!
Want to find out how much?
Here’s a short command that will tell you exactly how much space has been saved:
find "$ORIGINALS" -type f -links +1 -printf '%i %s\n' \
| awk 'a[$1]++{sum+=$2}END{print sum}' \
| numfmt --to=iec-i --suffix=B
Here’s the gist of what’s happening:
- find "$ORIGINALS" -type f … - find all files in $ORIGINALS;
  - … -links +1 … - that have more than one hardlink;
  - … -printf '%i %s\n' - and output each file’s inode number and file size (in bytes).
- awk 'a[$1]++{sum+=$2}END{print sum}' - this line is a bit heavy, but here’s what the awk script is doing:
  - a[$1]++ increments the value stored in a[$1] (where $1 is the inode number from each line). The inode identifies the unique data on disk - multiple file paths hardlinked to the same data will have the same inode number.
  - {sum+=$2} is an action guarded by the expression before it: it only runs when a[$1]++ evaluates to something “truthy”. The first time a unique inode is encountered, a[$1]++ post-increments to 1 but returns 0, which is falsy, so the action doesn’t run. Each time that inode is seen again, the returned value will be 1 or more, which is truthy, so the action runs. All the action does is add $2 (the file size in bytes) to sum (which starts at 0).
  - END{print sum} - when the script ends, print the value in sum.
- numfmt --to=iec-i --suffix=B - this converts the sum output by awk into an IEC size (e.g., 1024 would be converted to 1.0KiB).
So as an example, say our “Originals” folder only contains three files:
- 🖼️ a.jpg
- 🖼️ b.jpeg
- 📄 sidecar.json
And a.jpg and b.jpeg are both hardlinks to the same 1MiB of data, while sidecar.json is a 1KiB standalone file. When we run our command, the find portion skips sidecar.json (it only has one link, so -links +1 filters it out) and outputs two lines:
12345 1048576
12345 1048576
When awk gets these lines, it performs the following:
- Line 1: post-increment a[12345] (from 0 to 1), returning 0. 0 is falsy, so {sum+=1048576} doesn’t run.
- Line 2: post-increment a[12345] (from 1 to 2), returning 1. 1 is truthy, so {sum+=1048576} runs - sum is now 1048576.
- End of input: END{print sum} runs, outputting 1048576.
Finally, numfmt converts 1048576 to 1.0MiB, and we see our space savings is 1MiB. This is correct: if b.jpeg weren’t hardlinked to a.jpg it would take up an additional 1MiB.
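As a cross-check, GNU du can give a similar figure: by default it counts each hardlinked inode only once, while --count-links counts every name separately, so the difference between the two numbers is roughly the space the hardlinks are saving.
# Compare deduplicated vs. naive totals; the gap is the space saved by hardlinking.
du --summarize --human-readable "$ORIGINALS"
du --summarize --human-readable --count-links "$ORIGINALS"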
Cleanup
Finally, we can re-mark our files as immutable:
find "$ORIGINALS" -type f -print0 \
| sudo xargs -0 -n 1 chattr +i
Now we turn Photoprism back on, and tell it to re-index new files in the Originals folder. It will even intelligently use the sidecar JSON files that Takeout exports to enrich each media item with additional metadata.
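If you prefer a terminal over the web UI, the re-index can also be started with Photoprism’s own CLI from inside its container; exactly how you get a shell there depends on your deployment (mine is a TrueNAS app, while the example below assumes a Docker Compose setup).
# Re-index the Originals folder from inside the Photoprism container
# (Docker Compose example; adapt for your own deployment).
docker compose exec photoprism photoprism index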