database Archives - Kyriakos's Den

I was recently looking for a quick way to delete duplicate rows from large table (around half a billion rows) in PostgreSQL. In my case the table didn’t have a primary key – it was the result of importing a bunch of CSV files using the copy command. After some googling it turned out that the quickest way is also the easiest to understand. So here is how its done:

-- Create a new table with only unique rows:
create table temp as
    select distinct on (lon,lat,label) * from buildings;

-- Check the before and after counts:
select count(*) as before from buildings;
select count(*) as after from temp;

-- Now drop the original table
drop table buildings;
-- Rename the temporary one.
alter table temp rename to buildings;

Essentially, a new table is created with the duplicated rows removed then the original table is dropped and the new table is renamed. Keep in mind that this method doesn’t specify which row is kept so be careful if that is important to you.

In this case I only needed to have unique values in columns lon, lat and label. You can replace those with what fits you case. If you want to do a full table compare (i.e. your imported CSV files have full duplicated rows) you can use * instead like this:

create table temp as
    select distinct * from buildings;

In order to run PHPUnit tests I use a boostrap file which uses mysqldump to export the ever-changing database schema without any data, and import it into the testing database. After using this method for a while I stumbled on a problem with the auto-increment values for the table primary keys. There is no easy way to get mysqldump to avoid including the next auto-increment value in the dump. The problem this causes is that if your tests depend on specific primary key values things will go wrong the second time you run your test suite. There is a way to fix this by piping mysqldump’s output into sed. Here’s the actual command:

mysqldump -d [devDB] | sed 's/AUTO_INCREMENT=[0-9]*//' >

This will work under Linux but sadly not on Windows which doesn’t have sed but there’s a solution for that. Read More

Tag: database

Remove duplicate rows from table in PostgreSQL

How to mysqldump without auto increment values in Windows