How do you feel about the legality and morality of “borrowing” copyrighted data for machine learning purposes?
This has become something of a hot topic these days, largely due to mass exploitation of human cultural output by companies which train and host large language models. As such, I’m wary of writing anything with nuance because of how heated the conversation around it has become, particularly due to concerns over the impact on labor and employment1. However, I’m not here to discuss generative AI or LLMs, not directly at least. Rather, I’d like to try to shine a different light on the broader issues of data mining, through an example rooted in a popular technology which has enabled an explosion of creative approaches in the landscape of online streaming.
One day, some years ago, I decided to look at the data used to train OpenSeeFace. OpenSeeFace is the most popular open source face tracking solution for virtual YouTubers. It is supported by both open source and commercial model rendering tools; in particular, VTube Studio allows using it as an option for webcam tracking.
Step 1: OpenSeeFace
Helpfully, OpenSeeFace provides a list of data used to train its models in the README:
- The primary facial landmark detection model uses a privately modified version of data derived from two data sets: LS3D-W and WFLW.
- The gaze and blink detection model uses the MPIIGaze data set, combined with synthetic data from UnityEyes.
- The face detection model uses “random 224x224 crops from the WIDER FACE dataset”.
Step 1.1: LS3D-W
Data often gets created from other data, and LS3D-W is a great example. Created by Adrian Bulat and Georgios Tzimiropoulos, it provides synthetic face annotations created on top of 230,000 images from four data sets: AFLW, 300VW, 300W and FDDB.
Step 1.1.1: AFLW
AFLW, or Annotated Facial Landmarks in the Wild, is a database of about 25,000 face photos with roughly 380,000 manually annotated landmark points, developed at the Graz University of Technology. The source of these photos was the popular website Flickr.
Step 1.1.2: 300VW, 300W
The following two data sets were created by Imperial College London:
- 300-VW, or 300 Videos in the Wild, and
- 300-W, or 300 Faces in the Wild, built by augmenting data from LFPW, AFW, and HELEN sets, as well as 135 additional images provided by the creators of this set.
Both were created to be used in competitions for building improved facial landmark detection algorithms.
Step 1.1.2.1: LFPW
LFPW, or Labeled Face Parts in the Wild, is a data set created by Kriegman-Belhumeur Vision Technologies, LLC, which consists of annotations on photos sourced from Google, Flickr and Yahoo. The annotations were created and verified by employing labour through Amazon’s Mechanical Turk.
Step 1.1.2.2: AFW
AFW, or Annotated Faces in the Wild, is a data set of face images and annotations created as part of face detection research at UC Irvine.
etc.
I could continue an exhaustive iteration through the list of data sets, but I feel that just by browsing the links provided above, some observations already make themselves apparent:
- Multiple data sets listed include clear licensing restrictions, such as:
- “The […] database is available for non-commercial research purposes only.”
- “Commercial use (i.e., use in training commercial algorithms) is not allowed.”
- Multiple data sets listed were themselves built on material which the authors did not have permission to redistribute or exploit2:
- “Due to copyright issues, we cannot distribute image files in any format to anyone.”
- “Any use of the images must be negociated with the respective picture owners […] In particular, you agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion of the images and any portion of derived data.”
In a world where copyright and licensing restrictions propagate transitively through machine learning, as many hope to be the case, the resulting answer would be clear: OpenSeeFace’s facial tracking models, and possibly other models, are built on a matryoshka doll of borrowed data. This data was often scraped without permission, with many parties along the way explicitly disclaiming commercial use. As face tracking is crucial to many forms of “avatar puppeting”, such a conclusion would put a massive dent in the idea of open source VTubing, at least today.
Takeaway
I don’t want this to be an attack on OpenSeeFace. It is just one example of a particularly laissez-faire approach to training data set usage pervasive throughout the entire space of machine learning. This approach has simply become normalized.
I also don’t think there is an obvious stance to take here. The conclusion one draws is, in my view, highly dependent on one’s value system. Some options that come to mind:
- Apple had its own data set, right? I’d assume - hope? - that the origin of data used to train ARKit on iPhones is better vetted than the current “open source” solutions. Maybe you care about data ethics more than software freedom?
- You could decide that this isn’t really a problem, of course. Maybe data mining for non-generative purposes is more palatable. Maybe your problem with machine learning has anti-capitalist or ecological roots, not ones borrowing from copyright law. Maybe you just like VTubing and hate OpenAI, so it’s fine?
- One could try to build a face tracking/annotation data set with proper open provenance, of course. Attempts to do so exist throughout the machine learning space, after all. However, this requires labor, volunteers, and - probably - capital. Maybe you’re willing to be the change you’d like to see?
Other conclusions proposed by readers include:
- Annotating a face, while laborious, is not by itself a creative task with original human input. As such, reproducing facial landmarks should not be subject to copyright, unlike training on and deriving from creative works like writing, art or music (or translation?).
In the end, my only takeaway is that so much of what we love about modern technology has been built on some form of data exploitation. We don’t give it much thought until it’s something we hate that is built with the same approaches instead. Perhaps it’d be good for our growth as a culture and society to interrogate that once in a while.
1. Personally, I see a lot of anti-AI backlash as a labor dispute, with usage of generative AI being the de facto picket line, and those who end up relying on it being de facto scabs in the eyes of those protesting it. This, then, powers the vitriol: who likes a scab? There’s also an important element of broader frustration with Big Tech and their vision for our world that should not go unmentioned; it’s been brewing for a while, and now there’s a technology which amplifies key issues of that vision. But I digress: this is not a post about generative AI. ↩︎
2. This raises not just a copyright issue, for whatever it’s worth, but also one of privacy and consent: what if your face was used as a puzzle piece in the broader VTuber industry, and this post is how you found out? ↩︎