Tools for merging similar datasets continuously
How do I merge similar datasets, not just once but on an ongoing basis? This seems like a reasonably common problem, so I'd be interested to know whether there is a tool out there aimed at it as a guided workflow, or whether any general-purpose data tools might suit.
The exact problem I'm trying to solve is to curate a dataset of UK public bodies that is based on a number of existing datasets that change over time. The sorts of issues are:
The names of the bodies vary subtly between the datasets
The datasets all get updates which I'd like to add in regularly.
There are thousands of bodies, so plenty of automation is required to match them, yet manual matching is essential too.
Once it is established that an item needs copying into my dataset, the columns of the source datasets (name, abbreviation, website, etc.) will need to be mapped to my dataset's columns. I might want to do some manual tidying of the values, and not have this overwritten the next time I 'run' the tool.
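The match-then-map workflow described in the question, including keeping manual fixes from being clobbered on the next run, can be sketched in a few lines of Python. Everything here (the record shapes, `merge_update`, `match_name`, the 0.85 cutoff) is a hypothetical illustration using only the standard library, not a real tool:

```python
import difflib

def match_name(name, canonical_names, cutoff=0.85):
    """Return the closest canonical name, or None if nothing is close enough."""
    hits = difflib.get_close_matches(name, canonical_names, n=1, cutoff=cutoff)
    return hits[0] if hits else None

def merge_update(curated, incoming, overrides):
    """Fold an updated source dataset into the curated one.

    `curated` and `incoming` map body name -> record dict; `overrides`
    holds manual corrections that must survive every re-run.
    """
    for name, record in incoming.items():
        key = match_name(name, list(curated)) or name   # fuzzy match, else new entry
        merged = dict(curated.get(key, {}))
        merged.update(record)                           # fresh source values win...
        merged.update(overrides.get(key, {}))           # ...except manually curated ones
        curated[key] = merged
    return curated
```

Automated matches below the cutoff would fall through to new entries, which is where a manual-matching step would slot in.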
OpenRefine (formerly Google Refine) offers nice tools for data cleansing, e.g. correcting slight spelling variations. You can also script all transformations on the data and re-apply them later for updated datasets.
Is this what you're looking for?
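For the slight-spelling-variation case, the key-collision clustering that OpenRefine offers can be approximated in plain Python. This is a simplified sketch of the idea (the real fingerprint keyer also strips accents and handles more edge cases), with `fingerprint` and `cluster` as made-up names:

```python
import string

def fingerprint(name):
    """Key-collision fingerprint in the style of OpenRefine's clustering:
    lowercase, strip punctuation, then sort and de-duplicate the tokens."""
    cleaned = name.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

def cluster(names):
    """Group names whose fingerprints collide; singletons are dropped."""
    groups = {}
    for name in names:
        groups.setdefault(fingerprint(name), []).append(name)
    return [g for g in groups.values() if len(g) > 1]
```

Names like "Dept. of Health" and "Health, Dept of" collide on the same key and end up in one cluster for review.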
Topic Maps is a relationship technology, standardized in ISO 13250, that expressly deals with the issue of needing to repeatedly merge disparate data sources.
When I say "expressly" I mean it has excellent structures for solving the identification problem we all recognize from interacting with the web, and for merging two topics that are about the same thing.
Each of the specific issues you cite is handled directly in Topic Maps:
Identity is based on URIs as public subject identifiers, not just names
Deduping is automatically handled by the Topic Map engine
Unlike database-oriented solutions, constant manual tweaking of the data model and identifying disparate sources as being about the same thing are natural actions when working with Topic Maps
Any piece of information can itself become a first-class member of the Topic Map, with any amount of attendant metadata
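The merge rule at the heart of this (two topics sharing a public subject identifier denote the same subject, so they collapse into one) can be illustrated with a toy Python sketch. This is not a real Topic Maps engine, the record shape is invented, and it skips transitive merges:

```python
def merge_topics(topics):
    """Merge topic records that share at least one subject identifier (URI).

    Toy illustration of the Topic Maps merge rule: topics with a common
    public subject identifier are about the same subject, so their names
    and identifiers are unioned into a single topic.
    """
    merged = []
    for topic in topics:
        ids = set(topic["identifiers"])
        for existing in merged:
            if ids & existing["identifiers"]:          # shared URI => same subject
                existing["identifiers"] |= ids
                existing["names"] |= set(topic["names"])
                break
        else:
            merged.append({"identifiers": ids, "names": set(topic["names"])})
    return merged
```

Because identity hangs off URIs rather than names, "Environment Agency" and "EA" merge cleanly as long as both records carry the same subject identifier.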
Ontopia is an open-source implementation of Topic Maps which was previously sold commercially as the Ontopia Knowledge Suite.
Duplicate reply to a previous question, but you might like this as it's all in Python and very easy to automate.
I would like to recommend an alternative that I have found and prefer to OpenRefine.
It is a very easy-to-use Python program (I recommend using it on Linux) that provides many custom options for merging your datasets. You are able to decide how each field is compared. I like this feature because I have some fields where I only want matches to occur when they are not equal: this field being source, and the values being doc1 or doc2. It handles lat/lon, string, and custom compares. The author is very quick to reply to any questions you might have.
What I like most is that it allows you to help train the program by taking samples of your data and asking you: is this a match? Yes, No, Unsure, or Finished.
So even if you don't want to get into the code and write your own custom comparators, you're able to quickly train the program on what you consider a match.
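That train-by-example loop can be mimicked in a few lines. To be clear, this is not the actual program's API; it's a hypothetical sketch of the idea (label candidate pairs interactively, then derive a similarity cutoff from the answers), with `label_pairs` and `learn_threshold` invented for illustration:

```python
import difflib

def similarity(a, b):
    """Simple string similarity in [0, 1] via difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def label_pairs(pairs, ask):
    """Active-learning style loop: show each candidate pair and collect a
    verdict from `ask`, which returns 'y', 'n', 'u' (unsure) or 'f' (finished)."""
    labels = {}
    for pair in pairs:
        verdict = ask(pair)
        if verdict == "f":
            break
        if verdict in ("y", "n"):
            labels[pair] = (verdict == "y")
    return labels

def learn_threshold(labels):
    """Pick a similarity cutoff between the labelled matches and non-matches."""
    matches = [similarity(a, b) for (a, b), ok in labels.items() if ok]
    rejects = [similarity(a, b) for (a, b), ok in labels.items() if not ok]
    hi = min(matches) if matches else 1.0
    lo = max(rejects) if rejects else 0.0
    return (hi + lo) / 2
```

In an interactive session `ask` would prompt on the console; here it can be any callable, which also makes the loop easy to test.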
I am currently working on a similar project and we went with Talend Open Studio with no regrets so far (after 1 year on the project).
tons of connectors to get your data from any source
nice data cleaning and transformation components to massage your data
a fuzzy-match option (using Levenshtein and Metaphone) for reconciliation
jobs can be exported as shell scripts and included in a cron job.
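For reference, the Levenshtein edit distance that this kind of fuzzy matching relies on is a short dynamic-programming routine. A standard-library-only sketch, with `is_fuzzy_match` and its `max_edits=2` cutoff as illustrative choices:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def is_fuzzy_match(a, b, max_edits=2):
    """Treat two names as the same body if they are within a few edits."""
    return levenshtein(a.lower(), b.lower()) <= max_edits
```

Metaphone, the other algorithm mentioned, instead matches on phonetic keys, which catches misspellings that sound alike but are many edits apart.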
There was a talk at TransparencyCamp on cleaning open data that touched on merging datasets. You can see the notes for the talk here. The speaker recommends: deDupe, superfastmatch, saucebrush, name-cleaver. There's also a long list of tutorials for using OpenRefine.
csv2rdf4lod is a tool designed to integrate multiple versions of multiple datasets from multiple source organizations. It produces RDF and provides a hierarchical URI design that preserves where data elements came from. It uses a unix/shell environment (no GUI) and is focused on server-based automation and replication.
Even if you don't use it, flip through the hierarchical URI design convention. Each dataset "update" that you receive is modeled as a new version and it aligns with the previous versions in One Big RDF Graph depending on how you model the RDF creation.
Best of luck,
P.S. As suggested, I note that I'm the author of the tool.
Throwing my weight in on this topic. I work on an open-data project that solves the problem of producing dataset compilations from multiple dataset sources. We tackle issues like:
language (we handle the 6 languages used by the UN: en, fr, es, ru, ar, zh)
character encodings and error detection
formal, short, variant and abbreviated name matching
ISO, US (FIPS/USGS/NGA/CBSA) and Canadian (SGC) geographic identifiers
data representation, etc.
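Several of the issues in that list (encodings, variant and abbreviated names) come down to normalizing names before matching. A hedged sketch of that step, where `normalize_name` and the tiny abbreviation table are hypothetical and a real table would be far larger:

```python
import unicodedata

# Hypothetical expansion table; a production one would be much larger.
ABBREVIATIONS = {"dept": "department", "govt": "government", "natl": "national"}

def normalize_name(name):
    """Normalize a body name for matching across sources: Unicode-normalize,
    strip accents, drop punctuation, casefold, expand known abbreviations."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(c for c in text if not unicodedata.combining(c))   # strip accents
    text = "".join(c if c.isalnum() or c.isspace() else " " for c in text)
    tokens = text.casefold().split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)
```

After this pass, formal, short, and abbreviated variants of the same name collapse to a single comparable string.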
Below is our ongoing proposed spec (CUDE):
Below are the governmental datasets (500+) we have processed and released using the spec:
We haven't released the code yet, just the datasets.
Google Cloud DataPrep is a great data preparation tool, now in beta. Pros:
no size limit
OpenRefine offers nice tools for data cleansing, but it's ancient and limiting.