Doing the web crawl

The ins and outs of open data

These days, the answer to a question is just a few clicks away. Search engines like Google make this possible by discovering, indexing, and ranking web pages using algorithm-driven virtual spiders. Without these ‘web crawlers,’ navigating the billions of pages that make up the World Wide Web would be a daunting task.

Answering broader questions about the trends and connections within web crawl data itself is another matter: it requires data storage and computing power that until recently were available only to Google. Lisa Green, the director of the non-profit open data initiative Common Crawl, spoke at the RPM Startup Centre in Griffintown last week about how her organization is simplifying the process of data analysis for all kinds of ‘curious coders.’

The talk, organized by Montreal Girl Geeks, focused on the philosophy of open data and its utility to small-scale researchers, educators and entrepreneurs.

Gil Elbaz, a Silicon Valley database engineer and co-founder of Applied Semantics, a company later acquired by Google, founded Common Crawl in 2008 with the mission of democratizing access to the web. According to the organization’s website, Common Crawl “produc[es] and maintain[s] an open repository of web crawl data that is universally accessible.” The corpus currently holds approximately 300 terabytes of data covering 8 billion web pages, all stored on Amazon’s S3 cloud storage service. In keeping with the objective of a freer web, the crawler’s code is also publicly available on GitHub, a platform where coders publish, store, and share code.

The Common Crawl Foundation has facilitated many success stories. In 2012, Matthew Berk of Zyxt Labs, Inc. analyzed around 1.3 billion URLs from crawled web data. After discovering that almost a fifth of the websites contained references to Facebook URLs, he founded a social media start-up called Lucky Oyster, which allows users to make recommendations to friends based on information from networking websites.
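For curious coders, a measurement of this kind can be sketched in a few lines of Python. The example below is only an illustration, not Berk's actual method: the function names, the simple `href` pattern, and the five sample pages are all assumptions made here, and a real analysis would read pages from the Common Crawl corpus and use a proper HTML parser.

```python
import re
from urllib.parse import urlparse

# Hypothetical helpers for illustration only; not Common Crawl's or Berk's code.
# A simple pattern that pulls the target out of href="..." attributes.
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)


def links_to_domain(html, domain):
    """Return True if any href in the page points at `domain` or a subdomain."""
    for url in HREF_RE.findall(html):
        host = urlparse(url).hostname or ""  # relative links have no hostname
        if host == domain or host.endswith("." + domain):
            return True
    return False


def share_linking_to(pages, domain="facebook.com"):
    """Return the fraction of pages with at least one link to `domain`."""
    pages = list(pages)
    if not pages:
        return 0.0
    hits = sum(links_to_domain(page, domain) for page in pages)
    return hits / len(pages)


# Made-up sample pages standing in for crawled HTML.
sample_pages = [
    '<a href="https://www.facebook.com/somepage">Like us</a>',
    '<a href="https://example.org/about">About</a>',
    '<a href="http://facebook.com/x">fb</a> <a href="/local">home</a>',
    '<p>No links here</p>',
    '<a href="https://notfacebook.com/">decoy</a>',
]

print(share_linking_to(sample_pages))
```

Matching on the link's hostname rather than searching the raw text for "facebook" keeps look-alike domains, such as the decoy in the last sample page, out of the count.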

In the same year, Common Crawl hosted a code contest that showcased the breadth of crawl-data applications in different fields. Data Publica, a Paris-based open data directory, mapped the key players in the world of French open data and their connections to each other in the virtual sphere. Another group mapped the probable definition of a word based on its appearance in Wikipedia entries. The possibilities are truly staggering.

Green acknowledges the appeal of Common Crawl to businesses and startups, but is more inspired by the social implications of an openly accessible data repository. Individuals can now seek data-based, computational solutions for the greater good. Next month’s écoHACK Montréal, for example, pairs experts in urban sustainability with tech-savvy coders to work on sustainability projects in the city. Easier access to knowledge will also provide useful tools “for the two guys in the basement with a good idea,” Green added.

Opening up databases can even precipitate unexpected windfalls for taxpayers. When the National Health Service (NHS) in the UK opened its prescription data to the public last year, third parties analyzing the data discovered that doctors were spending an average of £27 million per month prescribing proprietary (i.e. patented) cholesterol-lowering statins to patients, when generically available drugs were equally effective. A switch to cheaper drugs would save the NHS £200 million a year.

Changing the status quo would also make open data an appealing alternative to the fastidiously guarded copyrights of the printing age. Creative Commons, where Green was formerly chief of staff, is a non-profit organization that offers copyright licenses for creative and academic material. It has reshaped the possibilities of copyright protection on the internet for large-scale collaborative organizations like Wikipedia and independent artists alike. Admittedly, there is not yet enough case law regarding data to make a Creative Commons approach to open data feasible.

The ‘open’ movement extends well beyond data into open education, global access licensing for medicines, and open access to research. The movement has also gained traction at McGill, where clubs such as Universities Allied for Essential Medicines are advocating for the University’s adoption of global access policies, which would ensure generic production of all McGill-affiliated medical innovations.

As the information available on the internet rapidly expands, open data is becoming an increasingly important tool for the computer-literate generation. Leann Brown, the organizer of the open data event, is passionate about spreading the ‘open’ message to people in the technological world: “That’s what Montreal Girl Geeks is about – encouraging you to teach and enable yourself and share that knowledge in the community.”