2b: Getting data

Expectations

You're not training to be a software engineer. The way I teach you is different from how I teach computer science students.

Still, my technical background might bias me.

You might find the material challenging, or possibly easy.

Note my biases and calibrate your expectations to what you need!

JavaScript

When I code, my language of choice is JavaScript, because it can be used very simply without any tooling, and it is native to working with web data.

Other languages like Python or R usually need more tooling setup, and their typical use cases are different (data science, machine learning).

Besides coding, we can also achieve outcomes with low code / no code tools, though these will often not be as powerful or flexible.


Comprehensive Resource: javascript.info

Editor

For an editor I will often be using Notepad++. I am old-school and like simple editors.

But I can suggest Visual Studio Code (by Microsoft), because it works well with tools like GitHub, which Microsoft has acquired.

Any text editor you already have (Atom, Emacs, vi, etc.) is sufficient.

Do download one if you do not have an editor installed.

ParseHub

Please download the free version of ParseHub, a freemium scraping tool.

It takes time to download, and we'll be using it later.

Data formats and APIs

CSV

CSV (comma-separated values) is a text-based file format.

Compared to the binary XLS/XLSX (Excel) file formats, CSV is just text and can be opened with any text editor.

Example of CSV file
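
For instance, a tiny hypothetical CSV file, with a header row and one record per line, might look like this:

    station,date,pm25
    Clementi,2023-01-01,12
    Bedok,2023-01-01,15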

JSON

JSON (JavaScript Object Notation) is also a text-based file format.

It is structured data in JavaScript notation, and it can encapsulate hierarchical data in a tree-like structure rather than just a flat table.

Example of JSON file

JSON, continued

JSON comprises key-value pairs, combining objects and arrays to describe a data structure.

  • Objects are denoted by curly braces { }
  • Arrays by square brackets [ ]
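
For instance, here is a tiny hypothetical JSON snippet that combines both: an object whose "readings" key holds an array of objects.

    {
      "station": "Clementi",
      "readings": [
        { "date": "2023-01-01", "pm25": 12 },
        { "date": "2023-01-02", "pm25": 15 }
      ]
    }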

Use an online JSON validator to double-check the structure.

APIs

APIs are application programming interfaces. They are basically how software systems "talk" to each other.

A large percentage of web APIs use JSON as the data format for exchanging information.

For example, the data.gov.sg API on hourly PM2.5 readings (Schema)

Getting API data (non code)

Some APIs are more complicated and require you to send headers or authenticate.

There are tons and tons of tools out there that can do this; we'll discuss them later.

Using Curl

One of the easiest methods to test an API is curl, a command-line tool available on Windows, macOS, and Linux.

Let's try this on SG's 2-hr weather forecast
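
As a minimal sketch, assuming data.gov.sg's v1 endpoint path for this forecast (the second call is a purely hypothetical example of sending a header):

    # Fetch the 2-hour weather forecast as JSON
    curl https://api.data.gov.sg/v1/environment/2-hour-weather-forecast

    # If an API requires headers or a key, pass them with -H (hypothetical example)
    curl -H "api-key: YOUR_KEY" https://api.example.com/endpoint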

You can of course also access the API using code, or cloud-based tools like Swagger and Postman.

HTML

A very simple, basic HTML template:


<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Hello World</title>
  <style></style>
</head>
<body>
  <h1>Hello World!</h1>
  <script></script>
</body>
</html>
                    

Hopefully you have seen this before.

JS: Fetch

A simple code snippet using the JavaScript Fetch API:


    // Request carpark availability data, parse the response body as JSON,
    // then log the resulting object to the console
    fetch('https://api.data.gov.sg/v1/transport/carpark-availability')
      .then(response => response.json())
      .then(data => console.log(data))
                        

Once you get the data you can of course do more with it, e.g. show all car parks with no lots available.


    // Request the data, then filter for car parks with zero lots available
    fetch('https://api.data.gov.sg/v1/transport/carpark-availability')
      .then(response => response.json())
      .then(data => {
        // Array of car park records from the first item in the response
        let rows = data.items[0].carpark_data;
        // lots_available comes back as a string, so parse it before comparing
        let results = rows.filter(d => parseInt(d.carpark_info[0].lots_available, 10) === 0);

        console.log(rows);
        console.log(results);
      })
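
The filter above assumes a response shaped roughly like this (field names are taken from the code above; the value is illustrative):

    {
      "items": [
        {
          "carpark_data": [
            {
              "carpark_info": [
                { "lots_available": "0" }
              ]
            }
          ]
        }
      ]
    }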
                        

Data: Scraping

Why scrape?

Because we often need the underlying data to do an exploratory data analysis or visualization project.

If I asked you to collect data from a site would you know how?

Example 1: NEA Dengue website

Data.gov.sg data vs. actual NEA site

Example 2: Property Listings

Let's start with simpler cases.

If it were a simple table, collecting the data would be easy. You could in fact copy and paste the text into an editor, and use the editor to clean up the formatting.

But if I asked you to collect data from a more complicated example, would you know how to do so?

Property Lim Brothers

Scraper: ParseHub

ParseHub is only one of many cloud-based low code / no code tools out there that you can use to scrape websites.

Please download the free version of ParseHub.

Scraper: ParseHub 2

ParseHub has an extremely well-thought-out tutorial (and a beautiful user interface) that teaches you how to use the tool by scraping their mock movie listing site.

Please go through this.

Putting it together

Now we'll go back to the previous examples and scrape those sites.

Let's start with Example 2, then do Example 1.

Code vs low code/no code

  • Coding is far more powerful, but it requires you to understand the code structure of the site (see the sketch after this list)
  • The more JavaScript the webpage has, the more likely low code tools will run into issues
  • Server setup and scheduling matter: you want your scraper's requests to pass as typical browsing traffic
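
As a minimal sketch of the coding approach: the URL and the '.listing-title' selector below are hypothetical, so you would need to inspect the real site's HTML to find the right elements (and the page must allow cross-origin requests if you run this in a browser console):

    // Fetch a page's HTML, parse it into a DOM, and pull out
    // the text of every element matching a (hypothetical) selector
    fetch('https://example.com/listings')
      .then(response => response.text())
      .then(html => {
        let doc = new DOMParser().parseFromString(html, 'text/html');
        let titles = [...doc.querySelectorAll('.listing-title')]
          .map(el => el.textContent.trim());
        console.log(titles);
      })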

Data scraping: discussion

  • Confidence in getting data
  • Blacklists
  • Ethics

Questions?

Chi-Loong | V/R