I was recently looking for a quick way to delete duplicate rows from a large table (around half a billion rows) in PostgreSQL. In my case the table didn’t have a primary key – it was the result of importing a bunch of CSV files using the COPY command. After some googling it turned out that the quickest way is also the easiest to understand. So here is how it’s done:
-- Create a new table with only unique rows:
create table temp as select distinct on (lon, lat, label) * from buildings;

-- Check the before and after counts:
select count(*) as before from buildings;
select count(*) as after from temp;

-- Now drop the original table:
drop table buildings;

-- Rename the temporary one:
alter table temp rename to buildings;
Essentially, a new table is created with the duplicate rows removed, then the original table is dropped and the new table is renamed. Keep in mind that without an ORDER BY, DISTINCT ON makes no guarantee about which row of each duplicate group is kept, so be careful if that is important to you.
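If you do need control over which row survives, DISTINCT ON keeps the first row of each group according to an ORDER BY that starts with the same columns. As a sketch, assuming a hypothetical imported_at timestamp column that records when each row was loaded:

-- Keep the most recently imported row per (lon, lat, label) group.
-- imported_at is a hypothetical column; substitute whatever lets you rank duplicates.
create table temp as
select distinct on (lon, lat, label) *
from buildings
order by lon, lat, label, imported_at desc;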
In this case I only needed unique values in the columns lon, lat and label. You can replace those with whatever fits your case. If you want to compare full rows (i.e. your imported CSV files contain fully duplicated rows) you can use * instead, like this:
create table temp as select distinct * from buildings;
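The create-distinct-drop-rename pattern isn’t PostgreSQL-specific. As a self-contained illustration, here is the full-row DISTINCT * variant run against SQLite through Python’s sqlite3 module (note SQLite has no DISTINCT ON, and the new table is named deduped here, playing the role of temp above):

```python
import sqlite3

# In-memory database standing in for the real PostgreSQL instance.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A table with no primary key, as if loaded straight from CSV files.
cur.execute("create table buildings (lon real, lat real, label text)")
cur.executemany(
    "insert into buildings values (?, ?, ?)",
    [(1.0, 2.0, "a"), (1.0, 2.0, "a"), (3.0, 4.0, "b")],  # one full duplicate
)

# Create a new table with only the unique rows.
cur.execute("create table deduped as select distinct * from buildings")

# Check the before and after counts.
before = cur.execute("select count(*) from buildings").fetchone()[0]
after = cur.execute("select count(*) from deduped").fetchone()[0]
print(before, after)  # 3 2

# Drop the original table and rename the deduplicated one into its place.
cur.execute("drop table buildings")
cur.execute("alter table deduped rename to buildings")
conn.commit()
```

As in the PostgreSQL version, the dedupe happens in one table scan, and the swap at the end is just a rename, so no rows are rewritten a second time.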