Tools for merging similar datasets continuously
How do I merge similar datasets, not just once but on an ongoing basis? This seems like a reasonably common problem, so I'd be interested to know whether there is a tool aimed at it, ideally as a guided workflow, or whether any general-purpose data tools might suit.
The exact problem I'm trying to solve is to curate a dataset of UK public bodies that is based on a number of existing datasets that change over time. The sorts of issues are:
- The names of the bodies vary subtly between the datasets.
- The datasets all get updates, which I'd like to pull in regularly.
- There are thousands of bodies, so plenty of automation is required to match them, yet manual matching is essential too.
- Once it is established that an item needs copying into my dataset, the columns of the source datasets (name, abbreviation, website, etc.) will need to be mapped to my dataset's columns. I might want to do some manual tidying of the values, and not have this overwritten the next time I 'run' the tool.
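A first pass at the name-variation problem can be sketched with Python's standard library alone: normalise the names, score candidates with `difflib`, auto-accept above a cutoff and queue the rest for manual review. The body names and the 0.85 threshold below are illustrative assumptions, not from any of the tools discussed here.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lower-case and strip punctuation so trivial variations don't block a match."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()

def best_match(name, candidates, threshold=0.85):
    """Return the candidate most similar to `name`, or None if nothing clears
    the threshold (those fall through to the manual-matching queue)."""
    norm = normalize(name)
    scored = ((SequenceMatcher(None, norm, normalize(c)).ratio(), c) for c in candidates)
    score, match = max(scored)
    return match if score >= threshold else None

bodies = ["Environment Agency", "Forestry Commission", "HM Revenue & Customs"]
print(best_match("The Environment Agency ", bodies))  # Environment Agency
print(best_match("Dept for Transport", bodies))       # None -> manual queue
```

In practice the threshold would be tuned against a labelled sample of real name pairs.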
OpenRefine (formerly Google Refine) offers nice tools for data cleansing, e.g. correcting slight spelling variations. You can also script all transformations on the data and re-apply them later for updated datasets.
Is this what you're looking for?
2017-09-14 10:24:59
Topic Maps is a relationship technology -- standardized in ISO/IEC 13250 -- that expressly deals with the need to repeatedly merge disparate data sources.
When I say "expressly" I mean it has excellent structures for solving the identification problem we all recognize from interacting with the web, and for merging two topics that are about the same thing.
Each of the specific issues you cite is handled directly in Topic Maps:
- Identity is based on URIs as published subject identifiers, not just names.
- Deduping is handled automatically by the Topic Map engine.
- Unlike database-oriented solutions, constant manual tweaking of the data model and identifying disparate sources as being about the same thing are natural actions when working with Topic Maps.
- Any piece of information can itself become a first-class member of the Topic Map, with any amount of attendant metadata.
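The core merge-by-identifier idea can be sketched in a few lines (this is my own toy illustration of the principle, not a Topic Maps engine): records are keyed on a shared subject URI, so two sources describing the same body collapse into one topic, and their properties are unioned rather than overwritten. All record values below are made up.

```python
from collections import defaultdict

def merge_topics(*sources):
    """Merge records from several sources keyed on a shared subject identifier (a URI).
    Records about the same subject collapse into one topic; names and other
    properties accumulate as sets instead of overwriting each other."""
    topics = defaultdict(lambda: defaultdict(set))
    for source in sources:
        for record in source:
            topic = topics[record["id"]]  # the URI is the identity, not the name
            for key, value in record.items():
                if key != "id":
                    topic[key].add(value)
    return topics

a = [{"id": "http://example.org/body/ea", "name": "Environment Agency"}]
b = [{"id": "http://example.org/body/ea", "name": "The Environment Agency",
      "website": "https://example.org/ea"}]
merged = merge_topics(a, b)  # one topic, two name variants, one website
```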
Ontopia is an open-source implementation of Topic Maps which was previously sold commercially as the Ontopia Knowledge Suite.
2017-09-14 10:36:59
This is a duplicate of my reply to a previous question, but you might like this tool, as it's all in Python and very easy to automate.
I would like to recommend an alternative that I have found and that I prefer to OpenRefine.
It is a very easy-to-use Python program (I recommend using it on Linux) that provides many custom options for merging your datasets. You decide how each field is compared. I like this feature because I have some fields where I only want matches to occur when the values are not equal: a 'source' field whose values are doc1 or doc2. It handles lat/lon, string, and custom comparisons. The author is very quick to reply to any questions you might have.
The thing I like most is that it lets you help train the program: it takes samples of your data and asks you, "Is this a match? Yes, No, Unsure, or Finished."
So even if you don't want to get into the code and write your own custom comparators, you can quickly train the program on what you consider a match.
2017-09-14 10:48:12
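The train-by-example loop described above can be sketched with the standard library (a toy illustration of the idea, not the actual tool's code): a human labels candidate pairs, and the labels are turned into a similarity cutoff. The `y`/`n`/`u`/`f` answers and the threshold rule are my own assumptions.

```python
from difflib import SequenceMatcher

def label_pairs(pairs, ask):
    """Present candidate pairs to a human and collect labels.
    `ask` is any callable returning 'y', 'n', 'u' (unsure) or 'f' (finished) --
    input() in a real session, a canned sequence in a test."""
    labels = {}
    for pair in pairs:
        answer = ask(pair)
        if answer == "f":
            break
        if answer in ("y", "n"):
            labels[pair] = (answer == "y")
    return labels

def learn_threshold(labels):
    """Pick a similarity cutoff separating the labelled matches from non-matches:
    halfway between the weakest 'yes' and the strongest 'no'."""
    score = lambda p: SequenceMatcher(None, *p).ratio()
    yes = [score(p) for p, match in labels.items() if match]
    no = [score(p) for p, match in labels.items() if not match]
    return (min(yes) + max(no)) / 2
```

A real active-learning matcher (like the dedupe library mentioned elsewhere in this thread) would instead pick the most informative pairs to show next, but the interaction pattern is the same.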
I am currently working on a similar project, and we went with Talend Open Studio with no regrets so far (after 1 year on the project):
- tons of connectors to get your data from any source
- nice data cleaning and transformation components to massage your data
- a fuzzy-match option (using Levenshtein and Metaphone) for reconciliation
- jobs can be exported as a shell script and included in a cron job.
2017-09-14 10:51:28
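The Levenshtein distance behind fuzzy-match options like Talend's is compact enough to write out; this minimal sketch uses the standard two-row dynamic-programming formulation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Metaphone, the other matcher mentioned, works differently: it compares phonetic encodings rather than edit operations, so it catches sound-alike spellings that edit distance scores poorly.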
There was a talk at TransparencyCamp on cleaning open data that touched on merging datasets. You can see the notes for the talk here. The speaker recommends: deDupe, superfastmatch, saucebrush, name-cleaver. There's also a long list of tutorials for using OpenRefine.
2017-09-14 11:08:10
csv2rdf4lod is a tool designed to integrate multiple versions of multiple datasets from multiple source organizations. It produces RDF and provides a hierarchical URI design that preserves where data elements came from. It uses a unix/shell environment (no GUI) and is focused on server-based automation and replication.
Even if you don't use it, flip through the hierarchical URI design convention. Each dataset "update" that you receive is modeled as a new version, and depending on how you model the RDF creation, it aligns with the previous versions in One Big RDF Graph.
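The general shape of such a provenance-preserving URI scheme can be sketched as follows. This is a hypothetical illustration of the idea, not csv2rdf4lod's actual convention: the base URL, path segments, and example values are all my own.

```python
def dataset_uri(base: str, source_org: str, dataset_id: str, version: str) -> str:
    """Build a hierarchical URI that records which organization, dataset and
    version an element came from, so provenance survives the merge.
    (Illustrative only -- not csv2rdf4lod's actual layout.)"""
    return f"{base}/source/{source_org}/dataset/{dataset_id}/version/{version}"

uri = dataset_uri("http://example.org", "ons", "public-bodies", "2017-09-14")
# every received update mints a new /version/... segment under the same dataset
```

Because each update gets its own version segment under a stable dataset path, later versions can coexist with (and be compared against) earlier ones instead of silently replacing them.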
Best of luck,
p.s. As suggested, I note that I'm the author of the tool.
2017-09-14 11:27:07
Throwing my weight in on this topic: I work on an open-data project that produces dataset compilations from multiple dataset sources. We tackle issues like:
- language (we handle the 6 languages used by the UN: en, fr, es, ru, ar, zh)
- character encodings and error detection
- formal, short, variant and abbreviated name matching
- ISO, US (FIPS/USGS/NGA/CBSA) and Canadian (SGC) geographic identifiers
- data representation, etc.
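Two of the issues listed above, character encodings and variant-name matching, share a common first step: Unicode normalisation. A minimal sketch with the standard library (my own illustration, not the project's code):

```python
import unicodedata

def fold(name: str) -> str:
    """Strip accents and case so variant spellings compare equal:
    decompose to NFKD, drop combining marks, then casefold."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()

print(fold("Côte d'Ivoire") == fold("Cote d'Ivoire"))  # True
```

Non-Latin scripts (ru, ar, zh) need transliteration on top of this, which normalisation alone cannot provide.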
Below is our on-going proposed spec (CUDE):
Below are the governmental datasets (500+) we have processed and released using the spec:
We haven't released the code yet, just the datasets.
2017-09-14 11:44:15
Google Cloud DataPrep is a great data preparation tool, now in beta. Pros:
- no size limit
- OpenRefine offers nice tools for data cleansing, but it's ancient and limiting by comparison
2017-09-14 11:47:34