jq
is a command line tool for processing JSON data. It is useful when
executing small ad-hoc tasks. In this sense, it is similar to sed
, awk
and
grep
. It is very good for the first step in exploratory data analysis:
Getting an idea what you actually have.
Installation
# Ubuntu
$ sudo apt-get install jq
Check if it worked:
$ jq --version
jq-1.5-1-a5b5cbe
Usage Example
There are tons of JSON dataset examples. Let's use American movies scraped from Wikipedia.
# Download the data
$ wget https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json
# What do we have here?
$ wc movies.json
0 144969 3386161 movies.json
# Everything is in one super long line. Is it a dict or a list?
$ head -c 80 movies.json
[{"title":"After Dark in Central Park","year":1900,"cast":[],"genres":[]},{"titl%
# List. How many movies do we have?
$ cat movies.json | jq '. | length'
28795
# Let's look at the first 5 examples to see which data fields we have:
$ cat movies.json | jq '.[:5]'
[
{
"title": "After Dark in Central Park",
"year": 1900,
"cast": [],
"genres": []
},
{
"title": "Boarding School Girls' Pajama Parade",
"year": 1900,
"cast": [],
"genres": []
},
{
"title": "Buffalo Bill's Wild West Parad",
"year": 1900,
"cast": [],
"genres": []
},
{
"title": "Caught",
"year": 1900,
"cast": [],
"genres": []
},
{
"title": "Clowns Spinning Hats",
"year": 1900,
"cast": [],
"genres": []
}
]
At this point, I would usually switch to Pandas. However, let's see how far
we get with jq
.
I want to see the titles. To do so, I first extract each list element. Then
I extract the title
field from each of them. To allow me scolling through it,
I apply less
:
$ cat movies.json | jq '.[] | .title' | less
"After Dark in Central Park"
"Boarding School Girls' Pajama Parade"
"Buffalo Bill's Wild West Parad"
"Caught"
"Clowns Spinning Hats"
"Capture of Boer Battery by British"
"The Enchanted Drawing"
"Feeding Sea Lions"
"How to Make a Fat Wife Out of Two Lean Ones"
"New Life Rescue"
"New Morning Bath"
...
Now I want to find errors in the data:
$ cat movies.json | jq '.[] | .title | select(length<=2)'
"It"
"M"
"W"
"FM"
"10"
"Q"
"Da"
"It"
"Ed"
"54"
"Pi"
"Go"
"CQ"
"O"
"RV"
"P2"
"21"
"W."
"9"
"Up"
"42"
"XX"
"It"
$ cat movies.json | jq '[.[] | .title | length] | max'
110
$ cat movies.json | jq '.[] | .title | select(length>=100)'
"Cornell-Columbia-University of Pennsylvania Boat Race at Ithaca, N.Y., Showing Lehigh Valley Observation Train"
"The Green Goods Man; or, Josiah and Samanthy's Experience with the Original 'American Confidence Game'"
# Alternatively:
$ cat movies.json | jq 'max_by(.title | length).title'
"Cornell-Columbia-University of Pennsylvania Boat Race at Ithaca, N.Y., Showing Lehigh Valley Observation Train"
WTF, The Green Goods Man... and Cornell-Columbia-University ... are actually movies! Also 54, RV, M and W are actually titles!
Ok, let's see which value range the year covers:
$ cat movies.json | jq '[.[] | .year] | min'
1900
$ cat movies.json | jq '[.[] | .year] | max'
2018
Let's see which ones were published in 1900:
$ cat movies.json | jq '.[] | select(.year == 1900) | .title'
"After Dark in Central Park"
"Boarding School Girls' Pajama Parade"
"Buffalo Bill's Wild West Parad"
"Caught"
"Clowns Spinning Hats"
"Capture of Boer Battery by British"
"The Enchanted Drawing"
"Feeding Sea Lions"
"How to Make a Fat Wife Out of Two Lean Ones"
"New Life Rescue"
"New Morning Bath"
"Searching Ruins on Broadway, Galveston, for Dead Bodies"
"The Tribulations of an Amateur Photographer"
"Trouble in Hogan's Alley"
"Two Old Sparks"
"The Wonder, Ching Ling Foo"
"Watermelon Contest"
Now, what about the cast
field? How often is there at least one value? What
is the maximum number of cast members we have?
$ cat movies.json | jq '[.[] | .cast | length] | max'
35
$ cat movies.json | jq '[.[] | .cast | select(length>=1)] | length'
27871
$ cat movies.json | jq '[.[] | .cast | select(length>=2)] | length'
26145
$ cat movies.json | jq '[.[] | .cast | select(length>=3)] | length'
13755
$ cat movies.json | jq '[.[] | .cast | select(length>=10)] | length'
394
$ cat movies.json | jq '[.[] | .cast | select(length>=20)] | length'
23
$ cat movies.json | jq '[.[] | .cast | select(length>=30)] | length'
3
# Get the titles for which we have at least 30 actors:
$ cat movies.json | jq '.[] | select(.cast|length>=30) | .title'
"Cars 2"
"The Twilight Saga: Breaking Dawn - Part 2"
"Anchorman 2: The Legend Continues"
Now I would like to see for which actors we have most titles. Also, I would
like to get a list of all actors in the dataset. However, I have no clue how to
do this with jq
.
Workflows
When I see json in the internet, I want my browser to display it in a nice way. JSONView is good for that, but fails if the JSON becomes huge.
Also, usually I start with opening it in Sublime Text. JSON Reindent helps me to see the structure.