Last fall, after a lower court ruled that hiQ was within its rights to circumvent LinkedIn’s API program by “scraping” LinkedIn’s Web pages, a US Court of Appeals upheld the ruling and paved the way for irreparable harm to the API economy. What this means for the future of the API economy remains to be seen. But, in upholding the lower court’s decision, the US Court of Appeals not only made it exceedingly difficult for companies to justify (through monetization) the provision of public APIs, but it may also have stifled organizational interest in making important data available to the public.
Web scraping is a practice where software (often in the form of a bot) running on one system uses web browser technology to access a web site as though it were a human. Then, once the software’s browser has opened that page, it siphons any data it finds back to its own systems. The most sophisticated web scrapers use machine learning to maintain an understanding of a web page’s structure, thereby easily identifying the various fields of data that might be found on a page. As a very rudimentary example, if a product page on a retailer’s web site has a thousand characters on it, a web scraper designed to scrape that page might know that the product name starts at the 350th character and the price can be found at the 475th character.
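To make the mechanics concrete, here is a minimal sketch in Python. Everything in it is invented for illustration (the page markup, the class names, the fields); rather than counting characters, a typical scraper walks the page’s HTML and pulls fields out by their markup:

```python
from html.parser import HTMLParser

# Hypothetical page structure: the product name and price are each wrapped
# in a <span> with a distinguishing class attribute.
SAMPLE_PAGE = """
<html><body>
  <span class="product-name">Widget Deluxe</span>
  <span class="price">$19.99</span>
</body></html>
"""

class ProductScraper(HTMLParser):
    """Captures the text of elements whose class matches a field of interest."""
    FIELDS = {"product-name": "name", "price": "price"}

    def __init__(self):
        super().__init__()
        self.data = {}
        self._current = None  # which field we're capturing, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        self._current = self.FIELDS.get(cls)

    def handle_data(self, text):
        if self._current and text.strip():
            self.data[self._current] = text.strip()
            self._current = None

scraper = ProductScraper()
scraper.feed(SAMPLE_PAGE)
print(scraper.data)  # {'name': 'Widget Deluxe', 'price': '$19.99'}
```

Point the same logic at thousands of URLs and you have the bulk-extraction behavior described above.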
The ruling didn’t just declare open season on web scraping. It also disallowed technical countermeasures designed to prevent web scrapers from accomplishing their tasks. For example, while it’s physically impossible for a human to visit 100 web pages in a minute, that sort of scale is child’s play for a good scraper. To prevent scraping, web site operators have been known to set thresholds that limit bots to a minimal quantity of page views, akin to what a human might normally consume. Or, the web site operator might block certain inbound IP addresses once they’ve been discovered to be the source of a scraper.
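Those countermeasures are straightforward to sketch. Below is a hypothetical sliding-window rate limiter in Python; the threshold of 20 page views per minute is an invented, roughly human-scale number, not anyone’s real policy. Once a client IP exceeds the limit within the window, further requests are refused:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Refuses a client once it exceeds `max_requests` page views
    per `window` seconds -- a threshold tuned to human browsing speed."""

    def __init__(self, max_requests=20, window=60.0):
        self.max_requests = max_requests
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] > self.window:
            q.popleft()  # discard requests that fell out of the window
        if len(q) >= self.max_requests:
            return False  # likely a bot: no human views this many pages a minute
        q.append(now)
        return True

limiter = RateLimiter(max_requests=20, window=60.0)
# A scraper firing 100 requests over one simulated minute:
# the first 20 pass, the remaining 80 are refused.
results = [limiter.allow("203.0.113.9", now=i * 0.5) for i in range(100)]
print(results.count(True))  # 20
```

IP blocking is the blunter cousin of the same idea: instead of refusing requests temporarily, the operator drops the offending address entirely.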
Although the ruling seemed to limit the precedent to “information [already] available to the General public,” there’s a lot of ambiguity around what data is technically public and what is not. Ergo, what counts as hacking and what does not?
The potential damage to the API economy and the harm to the efficacy of APIs in general is not to be underestimated. Apart from their potential to be a source of revenue, APIs also represented a technical solution to a very thorny problem. Just as shopping-related traffic tends to spike at many e-commerce sites around Christmas time, a bot working at scale can easily load down a web site as though hundreds or even thousands of humans all showed up at once. Unlike with Christmas, however, where web site operators can anticipate the traffic spike and adjust their system capacities accordingly, any number of scrapers can show up at any time to scrape a web site.
Maybe your site can take the load of one or two scrapers. But what if ten scrapers showed up simultaneously? Chances are your site would crash under the load, not only causing an expensive fire drill for you, but also very likely denying your regular human users access while that fire drill is taking place. The net result is the same as a deliberate denial-of-service attack, where the attackers overwhelm your systems to the point that they’re no longer available to normal users.
APIs, on the other hand, offered a better route to the same data that scrapers were after. It’s like the difference between the grocery store’s front door (designed to accommodate average human traffic) and the same store’s loading dock, which is designed for trucks to load and unload in bulk (something the front entrance simply isn’t capable of handling). As with the grocery store’s two entry points, however, good governance means that the bulk entry point (the API) isn’t for everybody the way the front door is.
You need some idea of who is showing up and what they’re taking or leaving behind. Good governance typically means the deployment of an API management system that’s capable of issuing easily recognized identification credentials, authenticating users when they arrive, keeping track of what they take or deliver, and automatically scaling for the type of traffic that’s typical for APIs. One of the best parts of providing this second entry point is that you can limit or even revoke access based on the user’s identity.
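That governance can be reduced to a small sketch. The class below is a hypothetical, in-memory stand-in for an API management layer (real products add authentication protocols, quotas, billing, and autoscaling); it shows the three essentials named above: issuing credentials, metering usage per identity, and revoking access:

```python
# Illustrative only: a toy API gateway. All names are invented.
class ApiGateway:
    def __init__(self):
        self.keys = {}    # api_key -> client name (the issued credential)
        self.usage = {}   # api_key -> number of calls served (metering)

    def issue_key(self, client, api_key):
        self.keys[api_key] = client
        self.usage[api_key] = 0

    def revoke(self, api_key):
        # Access can be cut off per identity, without touching anyone else.
        self.keys.pop(api_key, None)

    def handle(self, api_key, resource):
        if api_key not in self.keys:
            return 401, "unknown or revoked credential"
        self.usage[api_key] += 1  # who took what, and how often
        return 200, f"data for {resource}"

gw = ApiGateway()
gw.issue_key("acme-corp", "key-123")
print(gw.handle("key-123", "/profiles"))  # (200, 'data for /profiles')
gw.revoke("key-123")
print(gw.handle("key-123", "/profiles"))  # (401, 'unknown or revoked credential')
```

None of this is possible at the front door: an anonymous scraper carries no credential to authenticate, meter, or revoke.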
Seems reasonable, right?
Well, not anymore. Whereas before, organizations like LinkedIn could reroute the truck drivers (think truckloads of data) to the loading dock (the API), now the US Court of Appeals has told LinkedIn and all other web site operators that they have to let the truck drivers come through the front door, even if that door isn’t designed for that sort of bulk access. Furthermore, the trucks can show up at any time they like, 24/7/365, and need not identify themselves, thereby circumventing all that good governance (Who is coming? How often? What are they taking? etc.).
When a scraper shows up at your web site, you have no idea what data it’s really interested in, or what will be done with that data once the scraper has exfiltrated it back to Siberia.
Of course, if you’re a software developer who’s accustomed to using APIs the way truck drivers like to use loading docks, you’re probably hailing this decision because now you don’t have to identify yourself, state your intentions, or, most importantly, pay for your transactions. You can now legally circumvent the API, go through the front door, and take what you want in bulk for free, regardless of the impact on the web site operator or the other customers.
One reason the API economy is called “The API Economy” is that API providers like to monetize their APIs. Running a great API isn’t cheap. When you do a good job monetizing your APIs, you not only get to recoup the cost of operating them (mainly for the benefit of the “truck drivers” who want easy access), in some situations, you might even earn a profit.
But now, with this ruling, that may no longer be the case. This will inevitably force API providers back to the drawing board, the same way it must be forcing LinkedIn to rethink its entire business. The vast majority of LinkedIn’s unique selling proposition depends on the data it collects and how it organizes that data for access by others. LinkedIn, a subsidiary of Microsoft, invests millions of dollars to run the web site and recoups that cost by monetizing access to what it has built. Now, however, with last fall’s decision by the US Court of Appeals, virtually anyone looking to compete with LinkedIn is free to come and take advantage of that investment at no cost.
It’s hard to know what LinkedIn will do next. I reached out over LinkedIn to Microsoft president Brad Smith (to whom I’m connected) for comment. Smith and I go back to the old days (2003), when he and I were guests on the Charlie Rose Show to talk about spam. At the time, I was a journalist for CNET and the founder of an industry initiative called JamSpam. Smith was the General Counsel for Microsoft, and for many of the years since, he has been its Chief Legal Officer. However, I have not heard back. I will report back if I do.
Meanwhile, it is very difficult to know how this will play out for the API economy. Fortunately, the efficacy of APIs is not limited to the one use case where they are offered publicly and monetized accordingly. Beyond public offerings, APIs are the primary enablers of legacy modernization, digital transformation, and game-changing customer experiences. Organizations that have successfully offered and/or monetized public APIs are vastly outnumbered by those that use APIs for internal purposes or for partnering with other organizations.
Author: <a href="https://www.programmableweb.com/user/%5Buid%5D">david_berlind</a>