Tools for merging similar datasets continuously
How do I merge similar datasets, not just once but on an ongoing basis? This seems like a reasonably common problem, so I'd be interested to know whether there is a tool aimed at it, ideally as a guided workflow, or whether any general-purpose data tools might suit.
The exact problem I'm trying to solve is to curate a dataset of UK public bodies that is based on a number of existing datasets that change over time. The sorts of issues are:
The names of the bodies vary subtly between the datasets
The datasets all get updates which I'd like to add in regularly.
There are thousands of bodies, so plenty of automation is required to match them, yet manual matching is essential too.
Once it is established that an item needs copying into my dataset, the columns of the source datasets (name, abbreviation, website, etc.) will need to be mapped to my dataset's columns. I might want to do some manual tidying of the values, and not have this overwritten the next time I 'run' the tool.
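A minimal sketch of the automatic-plus-manual matching described above, using only the standard library's difflib. The thresholds, body names, and the three-way split into auto-matches, a manual-review queue, and new items are all illustrative assumptions, not any particular tool's behaviour:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two lightly normalised names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_names(curated, incoming, auto=0.9, review=0.7):
    """Split incoming names into automatic matches, a manual-review
    queue, and genuinely new items, based on two thresholds."""
    matched, to_review, unmatched = {}, [], []
    for name in incoming:
        best = max(curated, key=lambda c: similarity(name, c))
        score = similarity(name, best)
        if score >= auto:
            matched[name] = best          # confident: map automatically
        elif score >= review:
            to_review.append((name, best, round(score, 2)))  # human decides
        else:
            unmatched.append(name)        # likely a new body
    return matched, to_review, unmatched

curated = ["Environment Agency", "Department for Transport"]
incoming = ["The Environment Agency", "Dept for Transport", "Ofgem"]
matched, to_review, unmatched = match_names(curated, incoming)
```

Pairs falling between the two thresholds land in the review queue, so manual decisions stay separate from the automated pass and are never silently overwritten on the next run.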
OpenRefine (formerly Google Refine) offers nice tools for data cleansing, e.g. correcting slight spelling variations. You can also script all transformations on the data and re-apply them later for updated datasets.
Is this what you're looking for?
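For illustration, OpenRefine's key-collision clustering (its simplest method for catching slight spelling variations) can be approximated in a few lines; this fingerprinting is a simplified sketch of the idea, not OpenRefine's exact implementation:

```python
import re
from collections import defaultdict

def fingerprint(name: str) -> str:
    """Key-collision fingerprint: lowercase, drop punctuation, then
    sort and deduplicate the tokens."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(sorted(set(tokens)))

def clusters(names):
    """Group names whose fingerprints collide; singletons are dropped."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [g for g in groups.values() if len(g) > 1]

found = clusters(["Arts Council England", "arts council, england", "Ofcom"])
```

Names that collide on the same fingerprint are candidates to be merged into one canonical spelling.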
Topic Maps is a relationship technology, standardized as ISO/IEC 13250, that expressly deals with the issue of needing to repeatedly merge disparate data sources.
When I say "expressly" I mean it has excellent structures for solving the identification problem we all recognize from interacting with the web and merging two topics that are about the same thing.
Each of the specific issues you cite is handled directly in Topic Maps:
Identity is based on URIs as public subject identifiers, not just names
Deduping is automatically handled by the Topic Map engine
Unlike with database-oriented solutions, constantly tweaking the data model and identifying disparate sources as being about the same thing are natural actions when working with Topic Maps
Any piece of information can itself become a first-class member of the Topic Map, with any amount of attendant metadata
Ontopia is an open-source implementation of Topic Maps, which was previously sold commercially as the Ontopia Knowledge Suite.
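A rough sketch of the merge-by-subject-identifier idea, assuming a deliberately simplified topic record holding only identifier URIs and names (a real Topic Map engine tracks far more, and does this merging for you):

```python
def merge_topics(topics):
    """Merge topic records that share at least one subject-identifier
    URI; identifiers and names of merged topics are unioned."""
    merged = []
    for topic in topics:
        for existing in merged:
            if existing["identifiers"] & topic["identifiers"]:
                existing["identifiers"] |= topic["identifiers"]
                existing["names"] |= topic["names"]
                break
        else:  # no shared identifier: a genuinely distinct topic
            merged.append({"identifiers": set(topic["identifiers"]),
                           "names": set(topic["names"])})
    return merged

# Two sources describe the same body under different names but agree
# on a public subject identifier (URI here is illustrative).
a = {"identifiers": {"http://example.org/id/environment-agency"},
     "names": {"Environment Agency"}}
b = {"identifiers": {"http://example.org/id/environment-agency"},
     "names": {"The Environment Agency"}}
result = merge_topics([a, b])
```

Because identity lives in the URI rather than the name, the name variation between sources stops being the merge key.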
This duplicates my reply to a previous question, but you might like this as it's all in Python and very easy to automate.
I would like to recommend an alternative that I have found and prefer to OpenRefine.
It is a very easy-to-use Python program (I recommend running it on Linux) that provides many custom options for merging your datasets. You can decide how each field is compared. I like this feature because I have some fields where I only want matches to occur when the values are not equal: that field is source, and its values are doc1 or doc2. It handles lat/lon, string, and custom comparisons. The author is very quick to reply to any questions you might have.
The thing I like most is that it lets you help train the program by taking samples of your data and asking you: is this a match? Yes, No, Unsure, or Finished.
So even if you don't want to get into the code and write your own custom comparators, you're able to quickly train the program.
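The per-field comparison idea, including a field that should only match when the values are not equal, could be sketched like this; the field names, scoring, and veto rule are my own illustrative assumptions, not that program's actual API:

```python
from difflib import SequenceMatcher

# One comparator per field, each returning a score in [0, 1].
comparators = {
    "name": lambda a, b: SequenceMatcher(None, a.lower(), b.lower()).ratio(),
    # Score 1.0 only when the sources differ, so records are only ever
    # paired across the two input files (doc1 vs doc2), never within one.
    "source": lambda a, b: 1.0 if a != b else 0.0,
}

def pair_score(rec_a, rec_b):
    """Average the per-field scores; a zero on any field vetoes the pair."""
    scores = [fn(rec_a[f], rec_b[f]) for f, fn in comparators.items()]
    return 0.0 if 0.0 in scores else sum(scores) / len(scores)

r1 = {"name": "Environment Agency", "source": "doc1"}
r2 = {"name": "The Environment Agency", "source": "doc2"}
r3 = {"name": "Environment Agency", "source": "doc1"}
```

High-scoring pairs would then be the candidates shown to you for the Yes / No / Unsure decision.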
I am currently working on a similar project, and we went with Talend Open Studio with no regrets so far (after one year on the project).
tons of connectors to get your data from any source
nice data cleaning and transformation components to massage your data
fuzzy-match option (using Levenshtein and Metaphone) for reconciliation
jobs can be exported as a shell script and included in a cron job.
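For reference, the Levenshtein distance that the fuzzy-match option relies on is a short dynamic-programming routine; this is a generic textbook implementation, not Talend's code:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # delete from a
                            curr[j - 1] + 1,              # insert into a
                            prev[j - 1] + (ca != cb)))    # substitute
        prev = curr
    return prev[-1]

d = levenshtein("Dept for Transport", "Department for Transport")  # -> 6
```

A small distance relative to the string length is what a reconciliation step treats as "probably the same body".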
There was a talk at TransparencyCamp on cleaning open data that touched on merging datasets. You can see the notes for the talk here. The speaker recommends: deDupe, superfastmatch, saucebrush, name-cleaver. There's also a long list of tutorials for using OpenRefine.
csv2rdf4lod is a tool designed to integrate multiple versions of multiple datasets from multiple source organizations. It produces RDF and provides a hierarchical URI design that preserves where data elements came from. It uses a unix/shell environment (no GUI) and is focused on server-based automation and replication.
Even if you don't use it, flip through the hierarchical URI design convention. Each dataset "update" that you receive is modeled as a new version and it aligns with the previous versions in One Big RDF Graph depending on how you model the RDF creation.
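A sketch of a hierarchical source/dataset/version URI in the spirit of that convention; the base URI, identifiers, and path segments here are illustrative, so flip through the actual convention for the exact layout:

```python
from datetime import date

def dataset_version_uri(base, source_id, dataset_id, version_id=None):
    """Build a hierarchical URI that records which source organization a
    dataset came from and which received version an element belongs to."""
    version_id = version_id or date.today().isoformat()
    return f"{base}/source/{source_id}/dataset/{dataset_id}/version/{version_id}"

uri = dataset_version_uri("http://example.org", "cabinet-office",
                          "public-bodies", "2017-Sep-14")
```

Because each update mints a new version segment, earlier versions are preserved rather than overwritten, which is what lets the versions align in one graph.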
Best of luck,
p.s. As suggested, I note that I'm the author of the tool.
Throwing my weight into this topic: I work on an open-data project that produces dataset compilations from multiple source datasets. We tackle issues like:
language (we handle 6 languages used by UN: en, fr, es, ru, ar, zh)
character encodings and error detection
formal, short, variant and abbreviated name matching
ISO, US (FIPS/USGS/NGA/CBSA) and Canadian (SGC) geographic identifiers
data representation, etc.
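The formal/short/variant/abbreviated name matching can be approximated with a simple alias index; the record fields and the GB-FCO identifier here are illustrative assumptions, not this project's actual schema:

```python
def build_alias_index(records):
    """Map every known form of a body's name (formal, short, abbreviated),
    casefolded, back to one canonical identifier."""
    index = {}
    for rec in records:
        for form in (rec["formal"], rec["short"], rec["abbrev"]):
            index[form.casefold()] = rec["id"]
    return index

records = [{"id": "GB-FCO",
            "formal": "Foreign and Commonwealth Office",
            "short": "Foreign Office",
            "abbrev": "FCO"}]
index = build_alias_index(records)
```

Incoming names are casefolded and looked up in the index, so any recorded variant resolves to the same canonical entry.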
Below is our ongoing proposed spec (CUDE):
Below are the governmental datasets (500+) we have processed and released using the spec:
We haven't released the code yet, just the datasets.
Google Cloud DataPrep is a great data preparation tool, now in beta. Pros:
no size limit
OpenRefine offers nice tools for data cleansing, but it's ancient and limiting