Web Crawling for Data-Driven Decision-Makers


The following guest article was written by Dallán Ryan, Content & Advisory Lead, Data Strategy at Eagle Alpha, an alternative data aggregation platform that also provides supporting advisory services for data buyers and vendors.


Alternative data has become more accessible to data buyers of all shapes and sizes, and the most accessible avenue into it is web crawling, also known as web scraping. At Eagle Alpha, we define web crawling as a means of aggregating data via a computer program that requests information from public URLs. The data can be collected in-house or by companies specializing in customized data collection.

Web crawling has been utilized by hedge funds, private markets, and corporates for several years now and is considered by many to be the “gateway” into the alternative data ecosystem. This is because web crawling is a ‘low-risk, high return’ data source, offers perfect conditions for buy-in and resourcing, and can be implemented with a small team of analysts and engineers.

For several years, hedge funds have used web crawling to analyze web data from various sectors, including autos, retailers, e-commerce sites, real estate listings, and job boards. Now this data source is becoming increasingly popular with corporates and private equity firms, which are using alternative data to gain additional insight into revenue, operations, and expansion projects, among many other areas. Web crawling is also widely implemented by data providers across other alternative data categories such as social media data, employment data, store location data, pricing data, and ratings & reviews data.

The Growing Adoption

Ninety-two percent of investment organizations—including hedge funds, private equity, and venture capital—that use alternative data to inform their decision-making do so either to a moderate or significant extent. That’s according to the December 2021 survey by New York law firm Lowenstein Sandler (figure 1), which revealed that investment leaders expect larger budgets and even higher use of alternative data going into 2022.

Figure 1: What are your sources of alternative data? (Source: Lowenstein Sandler)

A June 2021 white paper published by Oxylabs titled “The Growing Importance of Alternative Data in the Finance Industry” found that of 252 senior data decision-makers surveyed, 63% relied on web crawling activities and other alternative data sources to improve their decision-making process. Web crawling is considered one of the most popular categories of alternative data because of its breadth of application, ease of access, and low price points.

It is interesting to note the various methods that asset managers use to collect their data. In the Oxylabs survey, integrations with third-party databases came out on top (59.5%), followed by automated ETL processes (52.4%) and manual data collection (51.6%). The study saw an equal split between those who outsource (36%) and those who have in-house scraping teams (36%), with 27% doing both.

While there are various benefits to implementing web-crawled data into the decision-making process, several challenges must also be understood (figure 2). Teams engaging in web crawling projects must accept that historical data may not be available for specific sites. This is a particular challenge for firms looking at e-commerce sites, where historical pricing and product availability are essential KPIs to track.

Figure 2: What are the challenges you face in your web scraping activities, if any? (Source: Oxylabs.io)

Legal and Compliance Considerations

In our experience, the most significant challenge associated with web crawling is the legal and compliance scrutiny that must be considered when pursuing a project. It is also one of the most frequently recurring subjects in the monthly legal workshops we run as part of our Data Strategy advisory platform, as our clients are most interested in keeping up to date with the nuances surrounding web-crawled data.

Best Practices

As a first step, firms are generally advised to create a one-page document that outlines their web scraping policies. This can then be presented in the event of an investigation by the SEC or other similar bodies.

The most widely accepted rule is only to crawl data from the public sections of a website and to avoid crawling any information from areas that require a login. One reason is that users are generally asked to indicate their acceptance of a site’s T&Cs when logging in. The generally accepted best practices for web scraping are as follows:  

  • Create a one-pager outlining your crawling policies 
  • Do not crawl password-protected information  
  • Crawl only necessary information, avoiding copyrighted materials and personally identifiable information (PII) 
  • Try to minimize the impact on site performance  
  • Abide by cease & desist orders  
  • Carefully negotiate agreements with third-party scrapers and crawlers 
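Some of the practices above can be written into the crawler itself, so that the one-pager and the code stay in step. The following is a minimal sketch, not a compliance tool: the class name `CrawlPolicy`, the example path prefixes, and the example hosts are all hypothetical, and a real policy would be agreed with legal and compliance teams.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlPolicy:
    """Hypothetical crawling rules expressed as data (all values illustrative)."""

    # Path prefixes assumed to sit behind a login; never requested.
    login_prefixes: tuple = ("/login", "/account", "/members")
    # Hosts that have sent a cease & desist; crawling them stops entirely.
    cease_and_desist_hosts: set = field(default_factory=set)

    def allows(self, url: str) -> bool:
        """Return True only if the URL is public under this policy."""
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if parts.scheme not in ("http", "https") or not host:
            return False
        if host in self.cease_and_desist_hosts:
            return False
        # Credentials embedded in the URL imply authenticated access.
        if parts.username or parts.password:
            return False
        path = parts.path or "/"
        return not any(
            path == prefix or path.startswith(prefix + "/")
            for prefix in self.login_prefixes
        )


policy = CrawlPolicy(cease_and_desist_hosts={"blocked.example"})
print(policy.allows("https://shop.example/products/123"))
print(policy.allows("https://shop.example/account/orders"))
```

Keeping the rules as data means the same object can be shown to a compliance reviewer and enforced before every request.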

Another consideration where there is no clear legal guidance is circumventing the target websites’ technical measures to block access, for example, avoiding CAPTCHAs or masking your identity.

ACA Group’s whitepaper titled “Does the Use of Alternative Data Require Alternative Compliance Measures?” outlined the SEC’s Rule 206(4)-7. It states that the rule “requires registered investment advisers to adopt and implement written policies and procedures reasonably designed to prevent violations by the adviser and its supervised persons of the Investment Advisers Act and the rules thereunder.” Also known as “The Compliance Rule,” it requires firms to review their policies and procedures annually to determine their adequacy and effectiveness.

In response to the SEC’s April Risk Alert highlighting “notable deficiencies” in handling material non-public information (MNPI) by investment advisers, investors, and other market participants, Eagle Alpha hosted a webinar exploring the topic. In this session, Kelly Koscuiszka from Schulte Roth & Zabel noted: “If we have vendors that aren’t familiar with the latest rulings (e.g., the hiQ Vs. LinkedIn case), don’t think about being behind a paywall or a login if they are not monitoring their scraping activities; if they don’t have a thoughtful response on bypassing CAPTCHA – those are all the things that will make us pause.”

Some critical considerations for remaining compliant while collecting web data include:

  • Is your identity masked? The site owner must be able to identify and contact the entity crawling the information on a site. The firm’s name can be added to the crawler’s requests. Where requests are routed through proxies, any vendor renting those proxies must have up-to-date contact information to deal with complaints.
  • Are fake accounts being used? If data is accessed from behind a login using a client’s login information, it is also essential to monitor the associated email inbox in case of complaints from the site owner.
  • Volume considerations: There must be measurable and auditable methodologies for crawling a site. For example, one data provider monitors the site’s traffic and average daily page loads, and a web crawler may aim to stay within 1% of this number.
  • Copyrightable information: Consider whether there is copyrightable information on the site. Certain information, such as price and product attributes, is considered factual and is not copyrightable.
  • CAPTCHAs: For certain institutions, such as SEC-regulated organizations, CAPTCHAs are a consideration, as many compliance teams may view automating CAPTCHA solving as risking exposure to material non-public information.
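The identity and volume considerations above lend themselves to a small worked example. The sketch below is illustrative only: the function names, the bot name, and the contact address are hypothetical, and "within 1%" is read here as keeping the crawler's daily requests at or below 1% of the site's average daily page loads, which is an assumption about the data provider's method.

```python
def build_headers(bot_name: str, contact_email: str) -> dict:
    """Return request headers that name the crawler and give a contact address."""
    if not bot_name.strip() or "@" not in contact_email:
        raise ValueError("a bot name and a contact email are required")
    return {
        # Lets the site owner see who is crawling and how to reach them.
        "User-Agent": f"{bot_name}/1.0 (contact: {contact_email})",
        # Standard HTTP header carrying an email address for the requester.
        "From": contact_email,
    }


def daily_request_budget(avg_daily_page_loads: int, share_percent: int = 1) -> int:
    """Maximum requests per day at `share_percent` of the site's daily page loads."""
    if avg_daily_page_loads < 0 or not 0 < share_percent <= 100:
        raise ValueError("page loads must be >= 0 and share_percent in 1..100")
    # Integer arithmetic keeps the figure exact and easy to audit.
    return avg_daily_page_loads * share_percent // 100


def min_delay_seconds(budget: int) -> float:
    """Spacing between requests that spreads the budget evenly over 24 hours."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return 86_400 / budget


headers = build_headers("ExampleCapitalBot", "data-compliance@example.com")
budget = daily_request_budget(200_000)  # hypothetical traffic figure
print(headers["User-Agent"], budget, min_delay_seconds(budget))
```

For a site averaging 200,000 page loads a day, a 1% ceiling gives 2,000 requests per day, or roughly one request every 43 seconds when spread evenly. Logging the traffic estimate alongside the resulting budget is one way to make the methodology measurable and auditable.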

A Review of LinkedIn vs. hiQ Labs

hiQ Labs’ entire business operation focused on working with organizations to provide insights into human capital and job moves. LinkedIn member profiles were the primary source of hiQ’s data, with the firm crawling the entirety of LinkedIn’s database. After receiving a cease-and-desist letter from LinkedIn in May 2017, hiQ complied and started to scrape only publicly available data. LinkedIn then decided to prevent any scraping, including of public data, and put technological barriers in place.

First Ruling and LinkedIn’s Appeal

In June 2017, hiQ commenced an action for an injunction to allow it to continue to scrape public profiles. The United States District Court for the Northern District of California agreed with hiQ. LinkedIn appealed this decision to the United States Court of Appeals for the Ninth Circuit.

On September 9th, 2019, the Ninth Circuit rejected LinkedIn’s effort to stop hiQ from using information crawled from LinkedIn’s website. The resulting decision stated that selectively banning potential competitors from accessing and using publicly available data can be considered a breach of Californian law. An appeal of the ruling then progressed to the US Supreme Court.

Petition and Response

After hiQ relayed an intent not to file any opposition to LinkedIn’s March 2020 petition, the court requested a response, which hiQ filed on June 25th, 2020. Here, hiQ stressed its belief that the original ruling correctly interpreted the critical aspect of the case, “unauthorized access.” In the response, hiQ stated: “The interpretation of the phrase ‘without authorization’ to exclude viewing and gathering public information—access to which requires no permission—flows naturally from the plain meaning of the phrase.”

LinkedIn argued that the Ninth Circuit’s ruling ran counter to a 2003 First Circuit precedent. hiQ responded that LinkedIn’s interpretation was misleading, arguing that the 2003 EF Cultural Travel vs. Zefer Corporation web crawling case did not address the same question of “unauthorized access” to publicly available data on a public-facing website.

hiQ contended that LinkedIn’s arguments detailing hiQ’s violation of personal privacy issues were “disingenuous” since LinkedIn had previously allowed hiQ to scrape user data. hiQ also argued that privacy issues are complex and should be decided by legislation at the state and federal levels.

Case Sent Back to the Ninth Circuit

In June 2021, following the decision in the Van Buren vs. United States case, which further clarified the definition of unauthorized access, the hiQ vs. LinkedIn case was sent back to the Ninth Circuit for further consideration.

In April 2022, the U.S. Court of Appeals for the Ninth Circuit affirmed its original decision in hiQ vs. LinkedIn. Ultimately, the court decided that profiles were not LinkedIn’s property as users elected to put them on the platform. Essentially, there was no expectation of privacy, and that information was, therefore, in the public domain. That is why web crawling can be considered legal, as the information is in the public domain, and a password is not required to view the information.

Final Thoughts

In a recent conversation with a chief legal and compliance officer, they stated that the hiQ vs. LinkedIn case is far from being concluded as LinkedIn would likely start the injunction process. They also highlighted another case that is being closely followed – Southwest Airlines vs. Kiwi.com. The latest case update is that the federal judge in Dallas barred Kiwi.com from scraping information from Southwest Airlines’ website.

Web crawling activities will continue to be an important driver for the adoption of alternative data sources for firms new to the space, but it is not without its challenges and legal considerations.

For deep dives into global regulations for web crawling, details on how specific firms use web crawling as part of their decision-making process, and case studies covering eCommerce, employment, autos, travel, real estate, and more, you can email dallan.ryan@eaglealpha.com for a copy of our complete white paper titled “Web Crawling for Data-Driven Decision-Makers.”


About Author

Mike Mayhew is one of the leading experts on the investment research industry. In addition to founding Integrity Research, Mike is on the board of directors of Investorside Research Association, the non-profit trade association for the independent research industry, and a frequent speaker on research industry trends and developments. Mike has over thirty years of research industry experience. Email: Michael.Mayhew@integrity-research.com
