A JSON column is a convenient way to store structured data (like a ruby hash or JavaScript object), in your database. For example, you may want to store some metadata from an API call just in case you need it later, but you would rather not waste time creating a database schema to match your API call if you don't need it.
But what happens when 6 months from know your boss starts asking for reports based on data you haphazardly shoved in a JSON column? Do you need to extract the data from that column into a proper schema, or could you just query it as is? This post will explore some edge cases around PostgreSQL's JSON data type to answer these questions.
This post covers JSON support in PostgreSQL 9.3, which is the latest version of postgres at the time of this writing.
First we need to create a table in our database.
create table sales_reports (
store varchar,
report json
);
We created a table with two columns, one for a store name, and a JSON column to store a sales report.
Now we need some sample data to work with.
insert into sales_reports (store, report) values
('Gunter Guitars', '{"date": "2013-12-01", "total":"322"}'),
('Gunter Guitars', '{"date": "2013-12-02", "total":345}'),
('Gunter Guitars', '{"date": "2013-12-03", "total":"367"}'),
('Gunter Guitars', '{"date": "2013-12-04", "total":"328"}'),
('Gunter Guitars', '{"date": "2013-12-05", "total":414.0}'),
('Gunter Guitars', '{"date": "2013-12-06", "total":465}'),
('Gunter Guitars', '{"date": "2013-12-07", "total":"288"}');
If you look carefully, you will notice that I included some strings disguised as numbers in the total field of the report column. For our purposes, let's assume that Gunter himself enters the sales data and is sloppy with his data entry.
Conditions
Selling more than $400 worth of merchandise in a single day is a good day for Gunter. What are those good days?
select * from sales_reports where (report->>'total')::numeric > 400;
store | report
----------------+---------------------------------------
Gunter Guitars | {"date": "2013-12-05", "total":414.0}
Gunter Guitars | {"date": "2013-12-06", "total":465}
(2 rows)
We use the ->>
operator to return the report column's total field as text. Then we need to cast it to the numeric data type so we can check whether or not it's greater than 400. I chose the numeric data type because we have a mix of integer and decimal values, and the ::numeric
handles both. Attempting to cast to ::int
or some other data type would result in an error.
Here is the reference for all of the PostgreSQL JSON functions and operators.
Aggregations
Using JSON operators in where
clause works great, so now let's try them in other parts of the query. For example we could find Gunter's total sales by taking the sum of the report->>'total'
field.
select sum((report->>'total')::numeric) from sales_reports where store = 'Gunter Guitars';
sum
------
2529.0
(1 row)
The sum works as expected, but there are some edge cases we need to consider. Let's see how PostgreSQL handles the various edge cases we may encounter. We will now add some more rows to produce these edge cases.
edge case #1 - the total is missing
insert into sales_reports (report, store) values ('{"date": "2013-12-08"}', 'Gunter Guitars');
edge case #2 - the total is null
insert into sales_reports (report, store) values ('{"total": null, "date": "2013-12-09"}', 'Gunter Guitars');
edge case #3 - the total is not numeric
insert into sales_reports (report, store) values ('{"total": "n/a", "date": "2013-12-10"}', 'Gunter Guitars');
Now what happens when we try to query this data.
select sum((report->>'total')::numeric) from sales_reports where store = 'Gunter Guitars';
ERROR: invalid input syntax for type numeric: "n/a"
We get an error because postgres can't cast "n/a"
to numeric. Let's clean up the data in this row.
delete from sales_reports where report->>'total' = 'n/a';
insert into sales_reports (report, store) values ('{"total": 0, "date": "2013-12-10"}', 'Gunter Guitars');
I couldn't find an obvious way to update a field within a JSON column, so I just deleted and re-inserted the row.
select * from sales_reports where (report->>'total')::numeric > 400;
store | report
----------------+---------------------------------------
Gunter Guitars | {"date": "2013-12-05", "total":414.0}
Gunter Guitars | {"date": "2013-12-06", "total":465}
With clean data, we can execute this query successfully.
select sum((report->>'total')::numeric) from sales_reports where store = 'Gunter Guitars';
sum
--------
2529.0
Indexes
As long as our data is relatively clean, we can confidently query on fields within a JSON column. But if we have lots of rows, we will need an index. Can we create indexes on fields within a JSON column?
First, we need a bunch of test data. The next query will randomly generate one million rows with a sales total between 0 and 1,000. It will take a few seconds to run.
insert into sales_reports
select 'Gunter Guitars', ('{"total": "' || (random() * 1000) || '"}')::json from
(select * from generate_series(1,1000000)) as tmp;
Now let's query this data. We will use explain analyze to view the execution plan.
explain analyze select * from sales_reports where (report->>'total')::numeric > 900.0;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Seq Scan on sales_reports (cost=0.00..9348.00 rows=33 width=47) (actual time=0.026..1953.162 rows=100114 loops=1)
Filter: (((report ->> 'total'::text))::numeric > 900.0)
Rows Removed by Filter: 899896
Total runtime: 1962.110 ms
Almost 2 seconds. That's a slow query! You can see that postgres is doing a sequential scan over all the rows in our table to find the desired rows. Let's create an index to speed things up.
create index int_sales_total_idx on sales_reports (cast(report->>'total' as numeric));
One of the great features of PostgreSQL is that you can create indexes on arbitrary expressions. Let's see if our query is any faster now.
explain analyze select * from sales_reports where (report->>'total')::numeric > 900;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on sales_reports (cost=7723.79..23736.53 rows=333337 width=47) (actual time=41.987..76.161 rows=100114 loops=1)
Recheck Cond: (((report ->> 'total'::text))::numeric > 900::numeric)
-> Bitmap Index Scan on int_sales_total_idx (cost=0.00..7640.46 rows=333337 width=0) (actual time=39.528..39.528 rows=100114 loops=1)
Index Cond: (((report ->> 'total'::text))::numeric > 900::numeric)
Total runtime: 81.336 ms
While ~80ms is not the fastest query ever, it's 24 times faster than without an index. If you look at the explain plan, you can see that postgres is now using the index we created.
Joins
It's time to get crazy! Can we join on fields within a JSON column. I wouldn't do this in a real application. I just want to know if it works.
First, we need a new table to join with. Let's create some categories
create table categories ( id int, name varchar );
insert into categories (id, name) values (1, 'Guitars'), (2, 'Amps'), (3, 'Accessories');
Now we have 3 categories. Let's create some new categorized sales_reports.
delete from sales_reports;
insert into sales_reports (store, report) values
('Gunter Guitars', '{"category": 99, "total": 322}'),
('Gunter Guitars', '{"category": 1, "total": 0}'),
('Gunter Guitars', '{"category": 2, "total": 544}'),
('Gunter Guitars', '{"category": 2, "total": 0}'),
('Gunter Guitars', '{"category": 1, "total": 1289}'),
('Gunter Guitars', '{"category": 3, "total": 455}'),
('Gunter Guitars', '{"total": 0}'),
('Gunter Guitars', '{"category": 1, "total": 523}'),
('Gunter Guitars', '{"category": 3, "total": 464}'),
('Gunter Guitars', '{"category": 2, "total": 0}');
INSERT 0 10
I threw in an unknown category and a missing category to see how postgres handles these edge cases. Let's query for the sales totals for each category, which requires a join.
select name, sum((report->>'total')::numeric) from sales_reports
inner join categories on categories.id = (report->>'category')::int
group by categories.id, categories.name;
name | sum
-------------+------
Accessories | 919
Amps | 544
Guitars | 1812
We're still relying on our type casting and our data being reasonably clean, but it can be done.