Written by Stephan Erdmann
the Chairman of the Stuttgart District Association of the “Pirate Party” Germany
What are upload filters?
The term comes from the IT world, which is why many don’t really have an idea of what it means and exactly what consequences an upload filter has. This should be an attempt to explain the terms and relationships without much prior knowledge.
First of all, the two terms filter and upload need to be explained separately.
The term upload comes from network technology and describes the process of transferring files from a client to a server. IT terms again …
Okay let’s start again:
A computer is a calculating machine. More precisely, it is a so-called binary calculating machine.
That means it’s pretty stupid, because it only knows two possible answers: 1 and 0, yes or no.
The calculations possible today by computers are based on the fact that every task that a computer is supposed to perform has been broken down into an enormous sequence of queries that can always be answered completely with the answers yes or no, zero or one.
Let’s do an example:
When you see an image on your monitor, it is made up of a lot of pixels, dots, if you will, in different colors.
Each of these colored dots is in turn the result of a very long series of queries as to whether a particular color value is mixed in or not.
You see, the computer is very stupid.
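To make the yes-or-no idea tangible, here is a small sketch. It assumes the common 24-bit RGB encoding (8 bits each for red, green and blue), which the text above does not specify, and the function name pixel_to_bits is purely illustrative.

```python
def pixel_to_bits(red: int, green: int, blue: int) -> str:
    """Return the 24 yes/no answers (bits) describing one pixel.

    Assumption: 24-bit RGB, i.e. 8 bits per color channel.
    """
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("each color channel must be between 0 and 255")
    # format(value, "08b") writes the number as eight binary digits
    return " ".join(format(value, "08b") for value in (red, green, blue))


orange = pixel_to_bits(255, 165, 0)
print(orange)  # 11111111 10100101 00000000
```

A single orange dot is therefore already 24 answers of "yes" or "no", and a picture consists of millions of such dots.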
Even after so many years, it still can’t answer a question that has more options than yes or no, zero or one. It can be programmed, i.e. a sequence of answerable questions can be set up that together map a more complex task.
Ultimately, however, that always means that it has to go through all of these questions: it needs time to answer them, it needs computing power to answer as many of them as quickly as possible, and it needs memory to remember all the answers until it can tell us the overall result.
Computers have grown enormously in all of these areas over the decades and yet the performance capabilities of computers are reaching their limits – at least if they are to remain affordable.
So that computing remains affordable, and so that a device can be as small as a smartphone or tablet, the so-called client-server model was developed.
The basic idea here is that the client, the end device that the user uses – a PC, a laptop, a tablet or a smartphone – does not have to calculate nearly as much if it only has to ask the question, while another computer, the server, calculates the answer and serves it.
This kind of shared work is the basic principle of every computer network and ultimately also of the Internet, the largest network of all.
You all know this from everyday life.
You write an e-mail, which your mail server transmits to another mail server, which in turn delivers it to another person to view on their device.
So what is the upload now?
English is the lingua franca in computer science.
The term upload comes from network technology and describes the process of transferring files from a client to a server.
It’s about the transfer of data, requests, and files from a client to the server: the moment when you transfer something to the Internet, for example.
If you have taken a closer look at your internet connection, you will have come across the terms downstream and upstream. These indicate the speed at which you can receive data from the network (download) or send data to it (upload).
What is a filter?
It gets easier here. Filters are known from coffee machines or vacuum cleaners.
A filter is a mechanism that lets some things through and not others.
You also know this from your computers or smartphones.
For example, if you call up a search engine of your choice and search for websites, then you are actually looking at a catalog of websites that this search engine knows.
The search term then filters out the websites that are associated, even marked, with this term.
Another type of filter is your virus scanner.
This checks all files on your computer and filters out the files that contain program code that is identical or sufficiently similar to known computer viruses.
So what is an upload filter?
Well, it is not a formally defined term but a portmanteau word that describes software which prevents files that meet specified criteria from being uploaded to a server and then offered by this server.
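As a minimal sketch of that description: the class below is entirely hypothetical (its name and methods are made up for illustration) and uses the simplest possible criterion, namely "this exact file has been reported before", checked with a SHA-256 hash from Python's standard library.

```python
import hashlib


class UploadFilter:
    """Hypothetical exact-match filter: rejects files whose hash is on a blocklist."""

    def __init__(self):
        self.blocked_hashes = set()

    @staticmethod
    def fingerprint(data: bytes) -> str:
        # Identical bytes always give the identical hash value
        return hashlib.sha256(data).hexdigest()

    def block(self, data: bytes) -> None:
        """Remember a reported file so it is refused from now on."""
        self.blocked_hashes.add(self.fingerprint(data))

    def allows(self, data: bytes) -> bool:
        """The yes/no question asked on every upload."""
        return self.fingerprint(data) not in self.blocked_hashes


upload_filter = UploadFilter()
upload_filter.block(b"protected song")
print(upload_filter.allows(b"protected song"))   # False
print(upload_filter.allows(b"protected song!"))  # True
```

Note that a single changed byte is enough to get past this kind of filter, which is why real systems do not stop at exact comparison.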
The term has developed into a hot political issue in the course of the debate about the EU Copyright Directive from 2019 and currently in connection with the upcoming TERREG regulation.
So hot that politicians who advocate it use all sorts of paraphrases for the term. The term itself is only a description of this process.
It does not matter whether this process is referred to as an upload filter, a technical measure or, as is currently the case in the leaked draft from the German Federal Ministry of Justice, as “future blocking”. What matters is the process being described.
Where is the problem?
What can you even filter?
Let’s remember. The computer is stupid!
Make a note of this sentence, because it is the most important insight you can ever gain about computers! Intelligent computers are part of science fiction and not part of reality, no matter what politicians and businessmen try to sell you!
The computer can only answer questions that can be answered with yes or no, zero or one.
A filter can only be a comparison of whether something is identical or at least identical in large enough parts!
We can ask a computer whether a text, an image or a video is identical or whether parts of texts, videos or images are identical.
We can even ask the computer whether minor changes have been made, such as a different file format or a different font, or whether a video is running as a picture-in-picture inside another video.
For this purpose, the file is broken down into individual pieces of information that belong together, to which we individually ask “Are you identical to …?”
The computer can do that, because that’s stupid querying.
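This piece-by-piece questioning can be sketched as follows. The fixed piece size of four bytes and the function names are assumptions chosen for illustration; real filters use far more elaborate fingerprints, but the principle of asking each piece "Are you identical to …?" stays the same.

```python
def chunks(data: bytes, size: int = 4) -> list:
    """Break a file into pieces of `size` bytes (piece size is an assumption)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def identical_share(upload: bytes, reference: bytes, size: int = 4) -> float:
    """Share of the upload's pieces that also occur in the reference file."""
    reference_pieces = set(chunks(reference, size))
    upload_pieces = chunks(upload, size)
    if not upload_pieces:
        return 0.0
    # One yes/no question per piece: "Are you identical to a known piece?"
    matches = sum(1 for piece in upload_pieces if piece in reference_pieces)
    return matches / len(upload_pieces)


reference = b"AAAABBBBCCCCDDDD"
print(identical_share(b"AAAABBBBCCCCDDDD", reference))  # 1.0
print(identical_share(b"AAAAxxxxCCCCDDDD", reference))  # 0.75
print(identical_share(b"zzzzyyyyxxxxwwww", reference))  # 0.0
```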
What can it not do?
Well, it cannot judge whether something is similar enough that it has to be filtered!
The computer is stupid!
If you misspell a search term, it has no chance of finding out what you are really looking for.
Hey, wait a minute, you will be thinking!
I keep making mistakes and still Google shows me the pages I’m looking for! How does it work when the computer is so stupid?
Here a few more terms come into play that we need to explain:
Machine learning, artificial intelligence, big data.
These terms have been used quite often in recent years and have just as often been misunderstood.
Basically, it is about guessing the correct answer using statistics.
As I said, the computer is stupid!
It is not able to carry out even the simplest creative thought process, not even a line of thought as simple as realizing that the word “Timate” actually means “tomato”.
Then how does the spell checker correct it?
It tries until no one corrects it anymore!
The easiest way to explain this is probably by rolling a die.
We Homo sapiens, we thinking people, are instinctively able to foresee things creatively.
When we roll a die, we instinctively know that in the long run the numbers rolled will be evenly distributed over all possible outcomes.
We see that a die has exactly 6 possible outcomes and that each one has the same probability.
We also instinctively recognize that this distribution changes if we roll 2 dice, since the totals 2 and 12 are each possible with only a single combination, whereas 7 can result from many combinations.
We don’t instinctively know that 2 and 12 each have a probability of just under 3%, whereas 7 has a probability of about 17%. We don’t have to, because we are creative and can work with inaccuracies.
We appreciate and live with the fact that we are not exactly right.
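For anyone who does want the exact numbers, the two-dice distribution can simply be counted out. The sketch below lists all 36 combinations of two fair six-sided dice; the function name is made up for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_distribution() -> dict:
    """Exact probability of every total when rolling two fair six-sided dice."""
    # Count how many of the 36 equally likely combinations give each total
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(n, 36) for total, n in sorted(counts.items())}


distribution = two_dice_distribution()
for total, probability in distribution.items():
    print(f"{total:2d}: {probability} = {float(probability):.1%}")
```

The totals 2 and 12 each come out at 1/36 (about 2.8%), while 7 comes out at 6/36 = 1/6 (about 16.7%).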
The computer, on the other hand, is not creative, it is stupid and it cannot find a promising strategy!
We humans do this all the time!
We don’t know exactly to the nanosecond what time it is or at what exact second the S-Bahn leaves, and yet we catch it on time every day.
We humans can work creatively with inaccuracies and derive sufficiently correct behavior from them.
The computer can’t! It tries things out and notes the result as the preliminary probability, as wrong as it may be.
It corrects the preliminary probability on the next try, no matter how wrong it may be.
The more often it repeats this trial-and-error procedure, the closer this preliminary probability gets to the actual probability, and the sooner it will be able to make a statement that approximates what an average person is capable of.
This process, this feeding of data to improve accuracy, is called training.
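The note-taking described above can be sketched with a simulated die. This is a deliberately simplified picture of training, reduced to counting frequencies; the function name and the fixed random seed are assumptions made so that the example is repeatable.

```python
import random


def estimate_six_probability(rolls: int, seed: int = 0) -> list:
    """Roll a simulated die and note the preliminary estimate after every roll."""
    rng = random.Random(seed)  # fixed seed: an assumption for repeatability
    sixes = 0
    estimates = []
    for attempt in range(1, rolls + 1):
        if rng.randint(1, 6) == 6:
            sixes += 1
        # Preliminary probability: sixes seen so far divided by rolls so far
        estimates.append(sixes / attempt)
    return estimates


estimates = estimate_six_probability(100_000)
for n in (10, 100, 1_000, 100_000):
    print(f"after {n:>7} rolls: estimated P(six) = {estimates[n - 1]:.4f}")
print(f"true value: {1 / 6:.4f}")
```

After a handful of rolls the estimate can be far off; only with a very large number of rolls does it settle near 1/6.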
Artificial intelligence, machine learning and big data are all based on the fact that a computer has been fed with enough data that its predictions are almost as accurate as when a person simply guesses.
If we play Ludo for the first time in our lives and still have no piece on the board after the third round, we have high hopes that we will soon finally roll a six.
And the computer, for which no one has programmed a series of yes-or-no questions that leads to this prediction?
The computer, which has by now taken notes nine times, has learned nine times over that when you roll a die, you don’t get a six.
The computer stays stupid!
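The Ludo example can be put into numbers. The short calculation below assumes a fair six-sided die and nine rolls without a six, as in the story above; the variable names are illustrative.

```python
from fractions import Fraction

rolls_without_six = 9

# What pure note-taking yields: zero sixes seen in nine rolls
naive_estimate = Fraction(0, rolls_without_six)

# How likely such a streak is with a fair die: (5/6) to the power of 9
streak_probability = Fraction(5, 6) ** rolls_without_six

print(naive_estimate)                      # 0
print(f"{float(streak_probability):.1%}")  # 19.4%
```

So roughly one in five beginners would "teach" the computer that a six never comes, although the true probability is 1/6 on every single roll.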
So what can an upload filter actually filter?
It can filter something that is identical or at least partially identical, and it can learn to guess almost as well as a human if you have fed it with enough data.
It should be noted here, of course, that we are talking about statistics, probability theory and queueing theory. These are disciplines of mathematics where it is assumed that an identical – identical, not just similar – random experiment has been carried out millions of times.
It is not enough to have typed any search term in some wrong way to be able to predict which actual term was meant.
You have to repeat the same typo for the same search term, in the same context, enough times to be able to guess this reasonably correctly as a computer.
Because the computer is stupid!
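A toy version of such a guessing machine might look like this. The class, its method names and the threshold of three observations are all hypothetical; the point is only that a suggestion appears once the same typo has been corrected to the same word often enough.

```python
from collections import Counter, defaultdict


class CorrectionLearner:
    """Hypothetical spell-guessing by counting: no understanding involved."""

    def __init__(self, min_observations: int = 3):
        # How often the same correction must be seen (an assumed value)
        self.min_observations = min_observations
        self.observed = defaultdict(Counter)

    def record(self, typed: str, meant: str) -> None:
        """Note that a user typed `typed` and then corrected it to `meant`."""
        self.observed[typed][meant] += 1

    def suggest(self, typed: str):
        """Return the most frequent correction, or None if data is too thin."""
        if typed not in self.observed:
            return None
        meant, count = self.observed[typed].most_common(1)[0]
        return meant if count >= self.min_observations else None


learner = CorrectionLearner()
for _ in range(3):
    learner.record("tomatoe", "tomato")

print(learner.suggest("tomatoe"))  # tomato
print(learner.suggest("Timate"))   # None
```

The never-before-seen typo "Timate" gets no suggestion at all, however obvious it is to a human reader.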
What does this mean for the topic of upload filters?
Most readers will be familiar with the term in the context of the Article 13 debate of 2019. For those who did not understand why hundreds of thousands of people took to the streets to protest against the EU Copyright Directive, here is a little background knowledge:
Copyright has been around for a long time and has only been slightly modernized over time.
Basically, it comes from a time when for creative, artistic work such as that of a painter, a writer, a photographer or a director and his actors, it was essential that someone published this art, these creative works.
A writer could write Goethe’s Faust and still not earn a cent if no publisher printed this work.
Those days are long gone, because the computer age and especially the internet have given everyone the opportunity to reach their own audience.
In the past you could only publish an article in a newspaper; today a blog is sufficient.
Copyright, and in particular the right of exploitation derived from it, was no longer acceptable in its existing form.
The publishing industry, which has developed into profitable corporations over the decades, has been losing revenue for a long time because its business model is no longer up to date and its activities no longer have the significance and value they once had.
In addition, the Internet as a network of networks has no natural borders and a provider in the network was confronted with a patchwork of different regulations regarding copyright.
The Copyright Directive was meant to address this by specifying EU-wide how the member states have to adapt their copyright laws. This must be done by 2021.
The copyright directive was accompanied by a large lobby-supported campaign.
In addition to collecting societies such as GEMA or VG Wort, the large newspaper publishers and media groups had also shown a considerable interest in creating a directive that would secure their position on the market.
The resulting directive goes beyond the existing practice of the notice-and-take-down principle, under which an established copyright infringement is reported and then has to be removed. In its place, a more extensive notice-and-stay-down principle was created.
This means that not only must reported copyright violations be removed, but also that they must not be posted again.
From a technical point of view, only an upload filter can achieve this, because even small forums, where images or texts protected by copyright can be posted at any time, receive uploads in such large quantities that manual checking is impossible.
But now we are faced with the problem that the computer can at best guess roughly as well as a human, and that it will never be possible to program every necessary comparison for all works that exist or will ever be created.
Blocking is mandatory and associated with high penalties; the inadmissible blocking of legal content, on the other hand …
The platforms can write a clause in their general terms and conditions at any time – and most have had this already for a long time – that there is no entitlement to fulfillment or that a blocked post can be released again after an objection and tedious manual check.
Looking at this simple truth, it is obvious that the platforms will and will have to protect themselves by blocking too much rather than too little.
But what does too much mean? What is to be expected?
Let’s go back to our examples:
You use the search engine of your choice, and the search term is “imprint”.
You are designing a website, have heard that there is an obligation to publish an imprint in Germany, and want to find out what actually has to be done. Among the websites that are displayed to you, you will certainly also find pages that describe this very well.
But among the results you will also find the actual imprint pages of countless websites that are out there.
These false positives, results that are incorrectly displayed or, in the case of an upload filter, uploads that are incorrectly blocked, basically outnumber the copies actually sought many times over, because there are always many more possible matches than actual copies.
But it gets worse.
The directive provides that there are exceptions.
The rights to quotation, caricature and pastiche are explicitly granted.
It is fairly obvious to a human when it is a copy and when it is a quote.
But the computer?
We remember: the computer is stupid and cannot understand such a context.
It does not understand whether a frame around a video is an attempt to get this copy through or whether the frame is used to caricature or quote the video.
The computer stays stupid!
So not only do we get tons of false positives, wrongly blocked uploads, because they are simply technically too similar to content that needs to be blocked; no, we also get tons of false positives, tons of wrongly blocked uploads, because the computer is not able to interpret the necessary context of a piece of content.
The obligation to use upload filters on so-called user-generated content platforms, however, gives rise to a whole host of other, bigger problems. In IT we have been talking about Web 2.0 for decades: websites that allow users not only to consume content but also to create and present content themselves.
- Upload filters are expensive
The directive stipulates that state-of-the-art technology is used. So it is not enough to just take the cheapest filter you can find; it has to be competitive.
Alphabet, the parent company of Google, YouTube and a whole range of other services, operates the most technically sophisticated system of its kind, called Content ID.
The development of this filter system cost around €60 million. To assume that small European providers would be able to invest comparable amounts is absolutely illusory.
Even the cheapest providers of upload filters as a bookable service charge six-figure sums.
This purchase is expected of every platform that has outgrown a very short start-up phase.
- Upload filters are prone to errors
As we have already seen in the previous explanations, filters basically have to filter content with a certain degree of fuzziness. Files can be technically changed very easily.
Images or videos can be converted into another format, and texts can easily be changed automatically, without any noticeable difference for the viewer.
However, this fuzziness, which is what keeps the filter from overlooking an image with a small deviation, inevitably ensures that other images that are merely technically too similar are blocked as well. And there are always many times more similar files than identical ones.
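The trade-off can be shown with a few lines of text comparison. The sketch uses SequenceMatcher from Python's standard difflib module as a stand-in for whatever fingerprint comparison a real filter performs; the function name, the example sentences and the threshold values are assumptions for illustration.

```python
from difflib import SequenceMatcher


def is_blocked(upload: str, protected: str, threshold: float) -> bool:
    """Block the upload if its similarity to the protected work reaches the threshold."""
    similarity = SequenceMatcher(None, upload, protected).ratio()
    return similarity >= threshold


protected = "the quick brown fox jumps over the lazy dog"
slightly_altered = "the quick brown fox jumps over the lazy dog!"  # a disguised copy
independent = "the quick brown cat jumps over the lazy dog"        # a different sentence

# Strict filter (only identical content): the disguised copy slips through
print(is_blocked(slightly_altered, protected, threshold=1.0))  # False

# Fuzzy filter: the copy is caught, but so is the independent sentence
print(is_blocked(slightly_altered, protected, threshold=0.9))  # True
print(is_blocked(independent, protected, threshold=0.9))       # True
```

Whichever threshold is chosen, the filter either misses altered copies or blocks content that is only similar.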
Uploading makes things even more difficult.
The upload is limited in time, which means that we have a finite period of time in which we can check. Even an intelligent person makes mistakes under time pressure, and the stupid computer does so all the more.
- Upload filters are a censorship tool
We have to make it very clear that upload filters are not only capable of filtering out important and legal content, but that this has already happened frequently.
Whether or not a piece of content is filtered out ultimately depends only on what the database of content to be filtered has been fed with, and on whether the content is technically similar enough.
A very famous example occurred during the Notre-Dame fire.
YouTube’s Content ID filtered out a video of this fire because it was technically too similar to a video to which an American television station claimed the rights. That video was a recording of the attacks on the World Trade Center.
From a purely technical point of view, the upload filter actually worked perfectly here.
It was a shot of a burning building collapsing. All of the context that the filter could use was interpreted correctly.
Better than that is not to be expected, because Content ID had the advantage over other upload filters that it did not have to recognize this during the upload.
The filter had a lot more time to calculate, and much more was invested in its development – both in money and in the required know-how – than other providers can provide.
The future results can only be worse than that.
One thing is also noticeable here, which is of the greatest importance on the Internet.
Timeliness is important! Information loses value if it is distributed too late.
For example, a video that discusses a film that is currently out in the cinema is much more interesting than the same video released a week after the film has left theaters.
The latter, however, is the norm, because the video is often initially blocked when the upload filter mistakes quoted content for a copy, and is only released after an objection and a manual review.
Especially the potential as a censorship instrument must be taken seriously in the context of further developments when it comes to upload filters.
Now, with 2021 approaching, the year in which the member states are obliged to incorporate the directive into national law, we see the attempt to further expand upload filters.
The EU is currently negotiating the TERREG regulation.
In contrast to a directive, which requires the member states to pass laws that implement it, a regulation would represent directly applicable law in all member states.
The TERREG regulation is about preventing the spread of terrorist content on the Internet. It not only includes immediate cross-border deletion orders, but also upload filters.
So we see that upload filters are not even applicable law yet and are already to be expanded.
How the filter is supposed to learn what is terrorism and what is reporting about it remains open.
Because the computer stays stupid.