Data collection

Bootcamp day 1

Please follow along at https://chi-loong.github.io/HASS02.526/

An email was sent to you last week with the bootcamp content link. Thanks!

Hello

My name is Chi-Loong and I'll be your instructor for this 3-day bootcamp.

Just call me Chi-Loong!

Experience

I am a hybrid with 20+ years of work experience in tech, business development and storytelling (journalism and PR), across various roles in start-ups, SMBs, GLCs and MNCs.

I run my own consulting-based visualization / code studio at V/R, and have done so for the last decade.

Teaching

I teach SUTD's MUSPP interactive data visualization module as an adjunct lecturer, and this is my 4th year teaching this module.

I have also taught at other universities like SIT and at code schools like General Assembly, mostly around user experience topics, e.g. frontend development or visualization.

Tech background

I have taken on technical roles like head of product / engineering, heading up engineers and designers to help shape and build products.

Personally though, I prefer the more human side of tech, like user experience and product development.

Tech philosophy

Tech is an enabler, and it should not be an arcane gatekept domain only for engineers.

With the advent of AI LLMs, the more important questions will increasingly become how and why instead of what.

Setting the stage

And thus, especially for your masters (MUSPP) which is not technically focused, the way I look at teaching tech is from a holistic perspective.

From low code / no code cloud platforms to programming languages / frameworks, what is important is exposure and a feel of what can be done.

Survey time

Please complete this survey if you have not done so.

Survey results

Coding

It will take weeks to go through programming fundamentals.

  • Conditionals, loops, variables
  • Objects, arrays
  • Functions and scoping
  • Debugging

That is not the scope of this course.

Expectations

You're not training to be a software engineer. The way I teach you is different from how I teach computer science students.

Still, my technical background might bias me.

You might find the material challenging. Or possibly easy.

Note my biases and calibrate your expectations to what you need!

JavaScript

I will only very lightly touch on coding, focusing more on low code / no code tools.

When I code, my language of choice is JavaScript, because it can be used very simply without any tooling, and it is native to working with web data.

Other languages like Python or R typically need more tooling setup, and the use case is different (data science, machine learning).

Comprehensive Resource: javascript.info

Editor

For an editor I will often be using Notepad++. I am old-school and like simple editors.

But I can suggest Visual Studio Code (by Microsoft), because it works well with things like GitHub, which Microsoft has acquired.

As long as you have any text editor (Atom, Emacs, vi, etc.), that is sufficient.

Do download one if you do not have an editor installed.

ParseHub

Please download a free version of ParseHub, a freemium scraping tool.

It takes time to download and we'll be using this later.

Data formats and APIs

CSV

CSV (comma separated values) is a text-based file format.

Compared to binary Excel file formats (XLS/XLSX), CSV is just text and can be opened with any text editor.

Example of CSV file
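
For instance, a small made-up CSV of people might look like this, where the first row is the header and each subsequent row is one record:

    name,age,city
    Alice,34,Singapore
    Bob,28,Jakarta
    Chandra,41,Bangkok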

JSON

JSON (JavaScript Object Notation) is also a text-based file format.

It is structured data in JavaScript notation, and it can encapsulate hierarchical data in a tree-like structure rather than just a flat table.

Example of JSON file
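
As a quick made-up illustration, a small JSON file describing one person, with nested data, might look like this:

    {
      "name": "Alice",
      "age": 34,
      "city": "Singapore",
      "hobbies": ["cycling", "photography"],
      "contact": {
        "email": "alice@example.com",
        "phone": "+65 1234 5678"
      }
    }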

JSON, continued

JSON comprises key-value pairs, and uses a combination of objects and arrays to describe a data structure.

  • Objects are denoted by curly braces { }
  • Arrays by square brackets [ ]

Use online JSON validators to double-check the structure.

APIs

APIs are application programming interfaces. They are basically how software systems "talk" to each other.

A large percentage of the web uses JSON as the data format for exchanging information.

For example, data.gov.sg API on hourly PM2.5 readings (Schema)

Getting API data (non code)

Some APIs are more complicated and require you to send headers or authenticate.

There are tons of tools out there that can do this, and we'll look at some of them later.

HTML

A very, very simple, basic HTML template:


<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Hello World</title>
  <style></style>
</head>
<body>
  <h1>Hello World!</h1>
  <script></script>
</body>
</html>
                    

Hopefully you have seen this before.

JS: Fetch

A simple code snippet using the JavaScript Fetch API:


    fetch('https://api.data.gov.sg/v1/transport/carpark-availability')
      .then(response => response.json())
      .then(data => console.log(data))
                        

We will then traverse the JSON to pull out the fields we want.
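
As a rough sketch, that traversal might look like the snippet below. The field names used here (items, carpark_data, carpark_info, lots_available) are based on the carpark availability response as documented on data.gov.sg, so do verify them against the actual response and schema:

    fetch('https://api.data.gov.sg/v1/transport/carpark-availability')
      .then(response => response.json())
      .then(data => {
        // Assumes the response contains an items array,
        // with each item holding a carpark_data array
        const carparks = data.items[0].carpark_data;
        carparks.forEach(carpark => {
          // carpark_info lists the lot types; take the first entry
          const info = carpark.carpark_info[0];
          console.log(carpark.carpark_number, info.lots_available);
        });
      })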

Python: requests

A simple code snippet using Python's requests library:


    import requests

    r = requests.get("https://api.data.gov.sg/v1/transport/carpark-availability")
    print(r.json())
                        
                        

We are not going to go through Python's tooling setup in this course.

Example: YouTube API

Often APIs are a lot more complicated, e.g. YouTube's API.

One of the easiest methods to test an API is curl, a command-line tool available on Windows, macOS and Linux.

You can of course access the API using code, or with cloud-based tools like Swagger and Postman (geared more towards API testing).
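
As a rough sketch, a Fetch call to the YouTube Data API's search endpoint might look like this. The key value is a placeholder (you would create your own API key in the Google Cloud console), and the full list of parameters is in Google's API reference:

    // Placeholder key - replace with your own YouTube Data API key
    const API_KEY = 'YOUR_API_KEY';

    fetch(`https://www.googleapis.com/youtube/v3/search?part=snippet&q=dengue&key=${API_KEY}`)
      .then(response => response.json())
      .then(data => console.log(data.items))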

Data: Scraping

Why scrape?

Because we often need the underlying data to do an exploratory data analysis or a visualization project.

If I asked you to collect data from a site would you know how?

Example 1: NEA Dengue website

Data.gov.sg data vs. actual NEA site

Example 2: Property Listings

Let's start with simpler examples.

If it were a simple table, collecting it would be easy. You could in fact cut and paste the text into an editor, and use the editor to format it.

But if I asked you to collect data from a more complicated example, would you know how to do so?

Property Lim Brothers

Scraper: ParseHub

ParseHub is only one of many cloud-based low code/no code scraping tools out there that you can use to scrape websites.

Please download the free version of ParseHub.

Scraper: ParseHub 2

ParseHub has an extremely well-thought-out tutorial (and a beautiful user interface) on how to use their tool to scrape their mock movie listing site.

Please go through this.

Putting it together

And now we'll go back to the earlier examples and scrape those sites.

Code vs low code/no code

  • Coding is way more powerful, but it requires you to understand the code structure of the site (see the sketch after this list)
  • The more JavaScript the webpage has, the more likely low code/no code tools will run into issues
  • Server setup and scheduling matter - you want your scraper to pass off as typical browsing traffic
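
As a very rough sketch of what the code route looks like, the snippet below uses Node.js (version 18+, which has fetch built in) with the cheerio library. The URL and the CSS selector (.listing .title) are made up for illustration; you would inspect the real site with your browser's dev tools and substitute selectors that match its actual HTML structure.

    // npm install cheerio
    // Save as scrape.mjs and run with: node scrape.mjs
    import * as cheerio from 'cheerio';

    // Hypothetical listings page - replace with the site you want to scrape
    const response = await fetch('https://example.com/listings');
    const html = await response.text();

    // Load the HTML and query it with CSS selectors, jQuery-style
    const $ = cheerio.load(html);

    // '.listing .title' is a made-up selector for illustration only
    $('.listing .title').each((i, el) => {
      console.log($(el).text().trim());
    });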

Questions?

Chi-Loong | V/R