• Martin Thoma
  • Home
  • Categories
  • Tags
  • Archives
  • Support me

jq

Contents

  • jq
    • Installation
    • Usage Example
    • Workflows
    • See also

jq is a command line tool for processing JSON data. It is useful when executing small ad-hoc tasks. In this sense, it is similar to sed, awk and grep. It is very good for the first step in exploratory data analysis: Getting an idea what you actually have.

Installation

# Ubuntu
$ sudo apt-get install jq

Check if it worked:

$ jq --version
jq-1.5-1-a5b5cbe

Usage Example

There are tons of JSON dataset examples. Let's use American movies scraped from Wikipedia.

# Download the data
$ wget https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json

# What do we have here?
$ wc movies.json
      0  144969 3386161 movies.json

# Everything is in one super long line. Is it a dict or a list?
$ head -c 80 movies.json
[{"title":"After Dark in Central Park","year":1900,"cast":[],"genres":[]},{"titl%

# List. How many movies do we have?
$ cat movies.json | jq '. | length'
28795

# Let's look at the first 5 examples to see which data fields we have:
$ cat movies.json | jq '.[:5]'
[
  {
    "title": "After Dark in Central Park",
    "year": 1900,
    "cast": [],
    "genres": []
  },
  {
    "title": "Boarding School Girls' Pajama Parade",
    "year": 1900,
    "cast": [],
    "genres": []
  },
  {
    "title": "Buffalo Bill's Wild West Parad",
    "year": 1900,
    "cast": [],
    "genres": []
  },
  {
    "title": "Caught",
    "year": 1900,
    "cast": [],
    "genres": []
  },
  {
    "title": "Clowns Spinning Hats",
    "year": 1900,
    "cast": [],
    "genres": []
  }
]

At this point, I would usually switch to Pandas. However, let's see how far we get with jq.

I want to see the titles. To do so, I first extract each list element. Then I extract the title field from each of them. To allow me scolling through it, I apply less:

$ cat movies.json | jq '.[] | .title' | less
"After Dark in Central Park"
"Boarding School Girls' Pajama Parade"
"Buffalo Bill's Wild West Parad"
"Caught"
"Clowns Spinning Hats"
"Capture of Boer Battery by British"
"The Enchanted Drawing"
"Feeding Sea Lions"
"How to Make a Fat Wife Out of Two Lean Ones"
"New Life Rescue"
"New Morning Bath"
...

Now I want to find errors in the data:

$ cat movies.json | jq '.[] | .title | select(length<=2)'
"It"
"M"
"W"
"FM"
"10"
"Q"
"Da"
"It"
"Ed"
"54"
"Pi"
"Go"
"CQ"
"O"
"RV"
"P2"
"21"
"W."
"9"
"Up"
"42"
"XX"
"It"

$ cat movies.json | jq '[.[] | .title | length] | max'
110
$ cat movies.json | jq '.[] | .title | select(length>=100)'
"Cornell-Columbia-University of Pennsylvania Boat Race at Ithaca, N.Y., Showing Lehigh Valley Observation Train"
"The Green Goods Man; or, Josiah and Samanthy's Experience with the Original 'American Confidence Game'"

# Alternatively:
$ cat movies.json | jq 'max_by(.title | length).title'
"Cornell-Columbia-University of Pennsylvania Boat Race at Ithaca, N.Y., Showing Lehigh Valley Observation Train"

WTF, The Green Goods Man... and Cornell-Columbia-University ... are actually movies! Also 54, RV, M and W are actually titles!

Ok, let's see which value range the year covers:

$ cat movies.json | jq '[.[] | .year] | min'
1900
$ cat movies.json | jq '[.[] | .year] | max'
2018

Let's see which ones were published in 1900:

$ cat movies.json | jq '.[] | select(.year == 1900) | .title'
"After Dark in Central Park"
"Boarding School Girls' Pajama Parade"
"Buffalo Bill's Wild West Parad"
"Caught"
"Clowns Spinning Hats"
"Capture of Boer Battery by British"
"The Enchanted Drawing"
"Feeding Sea Lions"
"How to Make a Fat Wife Out of Two Lean Ones"
"New Life Rescue"
"New Morning Bath"
"Searching Ruins on Broadway, Galveston, for Dead Bodies"
"The Tribulations of an Amateur Photographer"
"Trouble in Hogan's Alley"
"Two Old Sparks"
"The Wonder, Ching Ling Foo"
"Watermelon Contest"

Now, what about the cast field? How often is there at least one value? What is the maximum number of cast members we have?

$ cat movies.json | jq '[.[] | .cast | length] | max'
35

$ cat movies.json | jq '[.[] | .cast | select(length>=1)] | length'
27871
$ cat movies.json | jq '[.[] | .cast | select(length>=2)] | length'
26145
$ cat movies.json | jq '[.[] | .cast | select(length>=3)] | length'
13755
$ cat movies.json | jq '[.[] | .cast | select(length>=10)] | length'
394
$ cat movies.json | jq '[.[] | .cast | select(length>=20)] | length'
23
$ cat movies.json | jq '[.[] | .cast | select(length>=30)] | length'
3

# Get the titles for which we have at least 30 actors:
$ cat movies.json | jq '.[] | select(.cast|length>=30) | .title'
"Cars 2"
"The Twilight Saga: Breaking Dawn - Part 2"
"Anchorman 2: The Legend Continues"

Now I would like to see for which actors we have most titles. Also, I would like to get a list of all actors in the dataset. However, I have no clue how to do this with jq.

Workflows

When I see json in the internet, I want my browser to display it in a nice way. JSONView is good for that, but fails if the JSON becomes huge.

Also, usually I start with opening it in Sublime Text. JSON Reindent helps me to see the structure.

See also

  • Official Website and git repository

Published

Nov 3, 2019
by Martin Thoma

Category

Code

Tags

  • Data analysis 2
  • EDA 3
  • JSON 4
  • JSONL 1

Contact

  • Martin Thoma - A blog about Code, the Web and Cyberculture
  • E-mail subscription
  • RSS-Feed
  • Privacy/Datenschutzerklärung
  • Impressum
  • Powered by Pelican. Theme: Elegant by Talha Mansoor