Last updated: April 2020, by Musa Kurhula Baloyi
Data mining is a key business process that feeds into decision making, and nearly every company and industry has a use for it. It is normally the precursor to more advanced analytics tasks. One form of data mining is web scraping.
There are alternatives to web scraping, such as manually copying and downloading data, using crowdsourcing platforms such as Amazon Mechanical Turk, or hiring data capturers. If you are a coder, or can pay for one, web scraping is your best bet. The best part is that once you build the script, it can be re-run many times. Only when a website's structure (not its data) changes will you need to update your pipeline or script. Rare events, such as a change in the API (method signatures) of the library you use, may also call for a rewrite.
To scrape the web, first you have to crawl it. Web crawling means navigating the Internet and finding the URLs that contain the data you require. Search engines do this constantly. Sometimes web crawling is not necessary, especially when you already know the source(s) that contain your data. Even then, you will still need to collect the URLs and/or pages that contain the actual data.
To retrieve an identified page, you can use a library or tool such as requests or cURL, or anything else that can make HTTP calls, whether from a browser, the command line, or a third-party application such as SoapUI or Postman. Once you have the page, you want to inspect it to understand its structure and how the data is represented, devise a way to grab it, and guard against errors. This is done using a combination of a general-purpose programming language like Python and a parser like BeautifulSoup. An understanding of CSS classes and IDs becomes critical here. Grabbing data via its XPath is also an option.
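To make this concrete, here is a minimal sketch using requests and BeautifulSoup. The URL and the CSS selectors (div.product, h2.product-name, span.price) are placeholders I have made up for illustration; inspect the real page to find the right ones.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical catalogue page; replace with a URL you identified while crawling.
url = "https://example.com/laptops"

response = requests.get(url, timeout=10)
response.raise_for_status()  # guard against HTTP errors (404, 500, ...)

soup = BeautifulSoup(response.text, "html.parser")

# Suppose each product sits in a <div class="product"> with nested name and price
# elements. These class names are assumptions; your page will differ.
products = []
for item in soup.select("div.product"):
    name = item.select_one("h2.product-name")
    price = item.select_one("span.price")
    if name and price:  # skip malformed entries rather than crashing
        products.append({"name": name.get_text(strip=True),
                         "price": price.get_text(strip=True)})

print(products)
```

The same pattern (fetch, parse, select, extract) applies whether you are pulling links for a crawler or the data itself.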
Your data will likely be spread across more than one place within the website. For example, men's and women's clothing may appear on different pages, laptops and desktops may each have their own page, and each user on Twitter or Facebook has their own timeline. For this reason, one of your post-processing steps may be deduplication, since the same item can appear on multiple pages depending on the categories the web developer has used and the commonalities among the different data items.
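As a rough illustration, assuming your scraped records look like the dictionaries collected above, one simple way to deduplicate is to key each record on the fields that identify it. The field names and values here are made up.

```python
# Deduplicate scraped records by the fields that identify an item.
# Here (name, price) is assumed to identify a product; use whichever
# combination of fields is unique in your own data.
records = [
    {"name": "ThinkPad X1", "price": "R18,999"},
    {"name": "ThinkPad X1", "price": "R18,999"},  # same laptop, found on a second page
    {"name": "MacBook Air", "price": "R21,499"},
]

seen = set()
unique_records = []
for record in records:
    key = (record["name"], record["price"])
    if key not in seen:
        seen.add(key)
        unique_records.append(record)

print(unique_records)  # the duplicate ThinkPad entry is dropped
```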
In practice, it turns out that some content is stored in PDF files, so a traditional web scraping tool like BeautifulSoup is no longer sufficient. BeautifulSoup parses an HTML page by exploiting its tag structure; PDFs cannot be parsed this way. This is where tools like PDFMiner come in.
PDFMiner's pdf2txt tool can be run from the command line, but you can also invoke it from inside a script if your programming language can call out to the shell.
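For example, a minimal sketch that shells out to pdf2txt from Python and captures the extracted text (report.pdf is a placeholder filename, and the exact command name can vary between installations, e.g. pdf2txt.py vs pdf2txt):

```python
import subprocess

# Run PDFMiner's pdf2txt.py on a (hypothetical) report.pdf and capture the output.
# By default pdf2txt.py writes the extracted text to stdout.
result = subprocess.run(
    ["pdf2txt.py", "report.pdf"],
    capture_output=True,
    text=True,
    check=True,  # raise an error if extraction fails
)

text = result.stdout  # the extracted text, ready for further parsing
print(text[:500])
```

Newer releases of pdfminer.six also expose a high-level Python function for text extraction, if you prefer not to shell out at all.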
You might have seen that some PDFs do not follow the Adobe standard. There are image files stored as PDFs, and others with the .ps (PostScript) extension. This is a harder problem to solve and requires venturing into the realm of image recognition, specifically optical character recognition (OCR). OCR is still an active area of research; in fact, Kaggle has a nice starter tutorial on it. There are many tools for OCR, e.g. OpenCV, Tesseract and AWS Textract. Machine learning can also be employed if you want to do something more novel, but existing libraries often give a satisfactory result.
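If you go the Tesseract route, a minimal sketch using the pytesseract wrapper and Pillow might look like the following. It assumes Tesseract itself is installed on the system and that the scanned page has already been exported as an image; the filename is a placeholder.

```python
from PIL import Image
import pytesseract

# OCR a single scanned page. 'scanned_page.png' is a placeholder; a scanned PDF
# would first need each page exported as an image (e.g. with a tool like pdf2image).
image = Image.open("scanned_page.png")
text = pytesseract.image_to_string(image)

print(text)
```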
Of course, it does not end there. The scraped data needs to be stored in some format at some location. This, again, depends on your use case: whether you are a lone developer or work for a multinational company, and how the data will be accessed and used. You could store it as a CSV on your local disk, or in a NoSQL database in the cloud. The options are endless.
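For the simplest case, a sketch that writes the deduplicated records from earlier to a local CSV using Python's standard library (file name and field names are the illustrative ones used above):

```python
import csv

# Persist the deduplicated records to a local CSV file.
# Swap this out for a database client if your use case calls for one.
unique_records = [
    {"name": "ThinkPad X1", "price": "R18,999"},
    {"name": "MacBook Air", "price": "R21,499"},
]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(unique_records)
```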
You, a member of your team, or your client may then use that data, wherever it is stored, for further processing: transformations, merging, analysis, visualisation, or even as input to machine learning algorithms.