Tools for merging similar datasets continuously

2017-09-14 10:14:05

How do I merge similar datasets, not just once but on an ongoing basis? This seems like a reasonably common problem, so I'd be interested to know whether there is a tool out there aimed at it as a guided workflow, or whether any general-purpose data tools might suit.

The exact problem I'm trying to solve is to curate a dataset of UK public bodies that is based on a number of existing datasets that change over time. The sorts of issues are:

The names of the bodies vary subtly between the datasets

The datasets all get updates which I'd like to add in regularly.

There are thousands of bodies, so plenty of automation is required to match them, yet manual matching is essential too.

Once it is established that an item needs copying into my dataset, the columns of the source datasets (name, abbreviation, website, etc.) will need to be mapped to my dataset's columns. I might want to do some manual tidying of the values, and not have this overwritten the next time I 'run' the tool.

  • OpenRefine (formerly Google Refine) offers nice tools for data cleansing, e.g. correcting slight spelling variations. You can also script all transformations on the data and re-apply them later for updated datasets.
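
    To give a feel for the kind of cleansing it does: below is a minimal Python sketch of the key-collision ("fingerprint") clustering idea OpenRefine uses to spot slight spelling variations. It is only an illustration of the technique, not OpenRefine's own code, and the body names are made up.

        import re
        import unicodedata
        from collections import defaultdict

        def fingerprint(name):
            """Normalise a name so trivially different spellings collide on one key."""
            s = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
            s = re.sub(r"[^\w\s]", " ", s.lower())   # drop accents, case, punctuation
            return " ".join(sorted(set(s.split())))  # word order no longer matters

        names = [
            "Department of Health and Social Care",
            "Health and Social Care, Department of",
            "DEPARTMENT OF HEALTH AND SOCIAL CARE ",
        ]

        clusters = defaultdict(list)
        for n in names:
            clusters[fingerprint(n)].append(n)

        for members in clusters.values():
            if len(members) > 1:
                print("probable duplicates:", members)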

    Is this what you're looking for?

    2017-09-14 10:24:59
  • Topic Maps is a relationship technology -- standardized in ISO 13250 -- that expressly deals with the issue of needing to repeatedly merge disparate data sources.

    When I say "expressly" I mean it has excellent structures for solving the identification problem we all recognize from interacting with the web and merging two topics that are about the same thing.

    Each of the specific issues you cite is handled directly in Topic Maps:

    Identity is based on URIs used as public subject identifiers, not just names (a short sketch of the merging idea follows this list)

    Deduping is automatically handled by the Topic Map engine

    Unlike with database-oriented solutions, constantly tweaking the data model by hand and identifying disparate sources as being about the same thing are natural actions when working with Topic Maps

    Any piece of information can itself become a first-class member of the Topic Map, with any amount of attendant metadata
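
    A minimal sketch of that merging idea, purely illustrative (the subject-identifier URIs and field names below are invented, and this is not a real Topic Map engine):

        from collections import defaultdict

        # two sources describe the same body under one public subject identifier
        source_a = [{"subject_id": "http://example.org/id/environment-agency",
                     "name": "Environment Agency", "abbreviation": "EA"}]
        source_b = [{"subject_id": "http://example.org/id/environment-agency",
                     "name": "The Environment Agency", "website": "https://www.gov.uk/ea"}]

        merged = defaultdict(lambda: defaultdict(set))
        for record in source_a + source_b:
            topic = merged[record["subject_id"]]
            for field, value in record.items():
                if field != "subject_id":
                    topic[field].add(value)   # union the properties; nothing is lost

        for uri, props in merged.items():
            print(uri, {k: sorted(v) for k, v in props.items()})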

    Ontopia is an open-source implementation of Topic Maps, which was previously sold commercially as the Ontopia Knowledge Suite.

    2017-09-14 10:36:59
  • Duplicate reply to a previous question, but you might like this as it's all in Python and very easy to automate.

    I'd like to recommend an alternative I've found that I prefer to OpenRefine.

    https://github.com/datamade/dedupe

    It is a very easy-to-use Python program (I recommend using it on Linux) that provides many custom options for merging your datasets. You can decide how each field is compared. I like this feature because I have a field where I only want matches to occur when the values are not equal: that field is the source, with values of doc1 or doc2. It handles lat/lon, string, and custom comparisons. The author is very quick to reply to any questions you might have.

    The thing I like most is that it lets you help train the program: it takes samples of your data and asks whether each pair is a match (Yes, No, Unsure, or Finished).

    So even if you don't want to get into the code and write your own custom comparators, you're able to quickly train the program on what you consider a match.
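
    For what it's worth, a rough sketch of that workflow, based on the dict-style field definitions of dedupe 2.x (the API has shifted between releases, so check the project's docs; the records here are invented):

        import dedupe

        # records keyed by id; real use needs far more data than this
        data = {
            1: {"name": "Environment Agency", "website": "gov.uk/ea", "source": "doc1"},
            2: {"name": "The Environment Agency", "website": "gov.uk/ea", "source": "doc2"},
            3: {"name": "Forestry Commission", "website": "gov.uk/fc", "source": "doc1"},
        }

        # you decide how each field is compared
        fields = [
            {"field": "name", "type": "String"},
            {"field": "website", "type": "String", "has missing": True},
            {"field": "source", "type": "Exact"},
        ]

        deduper = dedupe.Dedupe(fields)
        deduper.prepare_training(data)

        # interactive active learning: dedupe shows candidate pairs and asks
        # "is this a match?" (yes / no / unsure / finished)
        dedupe.console_label(deduper)

        deduper.train()
        for cluster_ids, scores in deduper.partition(data, threshold=0.5):
            print(cluster_ids, scores)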

    2017-09-14 10:48:12
  • I am currently working on a similar project, and we went with Talend Open Studio with no regrets so far (after one year on the project).

    Talend offers:

    tons of connectors to get your data from any source

    nice data cleaning and transformation components to massage your data

    fuzzy-match options for reconciliation, using Levenshtein and Metaphone (a small illustration of the idea follows this list)

    jobs can be exported as shell scripts and included in a cron job.
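
    Not Talend itself, but a small Python illustration of what a Levenshtein-based fuzzy match does when reconciling names between datasets (the names and the edit-distance tolerance are made up):

        def levenshtein(a, b):
            """Classic dynamic-programming edit distance."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, start=1):
                curr = [i]
                for j, cb in enumerate(b, start=1):
                    curr.append(min(prev[j] + 1,                 # deletion
                                    curr[j - 1] + 1,             # insertion
                                    prev[j - 1] + (ca != cb)))   # substitution
                prev = curr
            return prev[-1]

        reference = ["Environment Agency", "Forestry Commission"]
        incoming = "Enviroment Agency"   # misspelt name arriving in an update

        best = min(reference, key=lambda r: levenshtein(incoming.lower(), r.lower()))
        if levenshtein(incoming.lower(), best.lower()) <= 2:   # tolerate two edits
            print(f"'{incoming}' reconciled to '{best}'")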

    2017-09-14 10:51:28
  • There was a talk at TransparencyCamp on cleaning open data that touched on merging datasets. You can see the notes for the talk here. The speaker recommends: deDupe, superfastmatch, saucebrush, name-cleaver. There's also a long list of tutorials for using OpenRefine.

    2017-09-14 11:08:10
  • csv2rdf4lod is a tool designed to integrate multiple versions of multiple datasets from multiple source organizations. It produces RDF and provides a hierarchical URI design that preserves where data elements came from. It uses a unix/shell environment (no GUI) and is focused on server-based automation and replication.

    Even if you don't use it, flip through the hierarchical URI design convention. Each dataset "update" that you receive is modeled as a new version and it aligns with the previous versions in One Big RDF Graph depending on how you model the RDF creation.
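
    As a rough illustration of that idea (using rdflib rather than csv2rdf4lod, and with an invented URI layout rather than the tool's real convention), each update can be loaded as a new version into one graph:

        from rdflib import Graph, Literal, Namespace
        from rdflib.namespace import RDFS

        BASE = Namespace("http://example.org/source/cabinet-office/"
                         "dataset/public-bodies/")

        g = Graph()
        g.bind("pb", BASE)

        # each update you receive becomes a new version under its own URI
        for version, name in [("version/2017-08/thing/environment-agency",
                               "Environment Agency"),
                              ("version/2017-09/thing/environment-agency",
                               "The Environment Agency")]:
            g.add((BASE[version], RDFS.label, Literal(name)))

        # everything sits in one graph, so later versions can be linked back
        # to earlier ones without losing where each value came from
        print(g.serialize(format="turtle"))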

    Best of luck,

    Tim

    p.s. As suggested, I note that I'm the author of the tool.

    2017-09-14 11:27:07
  • Throwing my weight into this topic: I work on an open-data project that produces dataset compilations from multiple source datasets. We tackle issues like:

    language (we handle the six languages used by the UN: en, fr, es, ru, ar, zh)

    language script

    character encodings and error detection

    formal, short, variant and abbreviated name matching

    ISO, US (FIPS/USGS/NGA/CBSA) and Canadian (SGC) geographic identifiers

    unit handling

    data representation, etc.

    Below is our ongoing proposed spec (CUDE):

    http://www.opengeocode.org/cude1.1/csv1.1.php

    Below are the governmental datasets (500+) we have processed and released using the spec:

    http://www.opengeocode.org/cude1.1/index.php

    We haven't released the code yet, just the datasets.

    2017-09-14 11:44:15
  • Google Cloud DataPrep is a great data preparation tool, now in beta. Pros:

    no size limit

    GCP integration

    great UX

    Serverless

    OpenRefine offers nice tools for data cleansing, but it's ancient and limiting

    2017-09-14 11:47:34