The Investment Data Standards Organization (IDSO), a non-profit consortium of alternative data industry practitioners, has released draft web crawling best practice standards for industry review. Meanwhile, a seminal court case that may dramatically impact web harvesting procedures is still pending.
IDSO’s 17-page report, which is available to non-members, reviews how to identify, categorize, and minimize the risks of the online data collection practices known as web crawling, web harvesting, or web scraping. The report focuses specifically on web crawling techniques used within the alternative data industry, an information economy built on mining investment insights from big data.
“We see IDSO’s alternative data best practices as a necessary addition to the industry,” said Justin Zhen, co-founder of Thinknum, a leading provider of web-based alternative data. “The standards will help shed light on the compliance-related gray areas of leveraging powerful web harvested data such as ours to generate unique alpha.”
IDSO also released a checklist of web crawling compliance practices and a members-only document for assessing web crawling data risks.
“Web crawling risk management is one of the most complex and fragmented aspects of alternative data compliance,” said Gene Ekster, a principal at IDSO member company Alternative Data Group who previously worked with alternative data at Point72. “IDSO standards are the best way for compliance teams across the industry to converge on common risk mitigation best practices.”
The Investment Data Standards Organization (IDSO) is a non-profit launched in January 2018 to develop standards, frameworks, and best practices for the use of alternative data by the investment community. Membership is open to all industry constituents, including originators, intermediaries, and data consumers such as hedge funds, mutual funds and other asset management organizations. Activities are primarily organized around working groups tasked with specific compliance objectives, which are prioritized by the community of members.
Initial versions of the IDSO’s compliance best practice standards for Personally Identifiable Information (PII) and web harvesting were released in January 2018. The organization formalized its PII standards in May 2018 and is currently working on defining compliance levels for datasets that contain sensitive information (SI) in the alternative data industry. The IDSO standards are best practices compiled through working groups consisting of over 15 buy-side organizations and multiple data vendors. More information about IDSO can be found at https://www.investmentdata.org/.
Industry surveys, such as the latest from Greenwich Associates, consistently show web data to be one of the most popular forms of alternative data. Web-scraped data is a bunny-slope entry point for alternative data, given that the web is accessible to anyone and bots are plentiful. Firms like Import.io, which recently acquired competitor Connotate, offer off-the-shelf web scraping tools that are relatively cheap and easy to use. Historical web-scraped data, however, must be purchased from firms like Thinknum or YipitData, which have been collecting web data for some years.
However, web scraping is not without risk, as we outlined in our white paper “Mitigating Legal Risks Associated with Alternative Data”. Case law pertaining to web scraping is sparse and sometimes contradictory. Although a D.C. district court has ruled that some forms of web crawling are protected by the First Amendment, an influential case involving LinkedIn’s attempt to restrict hiQ Labs from scraping users’ public profiles is still pending, even though arguments were heard over a year ago.
As alternative data usage grows, so do the attendant compliance risks. Web scraping, one of the most common procedures for obtaining alternative data, is a case in point, which is why broadly recognized and adopted compliance best practices are vital.