Wikidata:Edit groups/QSv2/22062


Edit group QSv2/22062

Summary: {{{summary}}}
Author: Bargioni
Number of edits: 24,784 (more statistics)
Example edit: Q6978420

Discussion

@Bargioni: This batch seems to have been prepared too coarsely. I saw you added VIAF IDs for geographical names (towns etc.) to *any* item that matched it by name, or name and country. This made your batch add a VIAF ID to hospitals in Morocco that are named the same as the town they're in, for example. I can only assume there are hundreds of other such mistakes in the batch. I would suggest considering undoing the whole batch and preparing a more careful one. Asaf Bartov (talk) 13:06, 22 November 2019 (UTC)

  •  Oppose reverting the whole batch. This is just a synchronisation with VIAF: when a VIAF cluster links to Wikidata, a link from Wikidata to the VIAF cluster is added. Ignoring the VIAF clusters doesn't solve the problem. The best solution, in my opinion, is the following: after the synchronisation, we should find the wrong VIAFs through dedicated queries and not only remove them, but also report the errors to VIAF (I created a dedicated page, where about 150 cases have already been listed: Wikidata:WikiProject Authority control/VIAF errors). Otherwise, if the errors are kept in VIAF, there is the risk of reinserting them through future imports. Moreover, only VIAFs in some areas are inaccurate (e.g. geographic ones are particularly problematic), while other areas are substantially accurate, so the best approach is to find the wrong ones through queries. --Epìdosis 13:41, 22 November 2019 (UTC)
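A minimal sketch of the query-based review described above, run against the Wikidata Query Service from Python. P214 (VIAF ID), P31, P279 and the WDQS endpoint are real; the class audited here (hospital, Q16917, matching the Morocco example above) is only an illustrative choice, and the script is hypothetical.

```python
# A sketch: list hospitals carrying a VIAF ID (P214), so a human can review
# whether each cluster really describes the hospital or an identically named
# town. Assumes the public Wikidata Query Service endpoint.
import requests

QUERY = """
SELECT ?item ?itemLabel ?viaf WHERE {
  ?item wdt:P31/wdt:P279* wd:Q16917 ;  # instance of (a subclass of) hospital
        wdt:P214 ?viaf .               # VIAF ID
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 200
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "viaf-audit-sketch/0.1 (example)"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"], row["itemLabel"]["value"], row["viaf"]["value"])
```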
  •  Support reverting the whole batch. We should fix the errors in VIAF before importing them, instead of lowering the quality of Wikidata's information. Pietro (talk) 16:01, 22 November 2019 (UTC)
@Pietro: Why the whole batch? E.g. the example edit (Orville Coast (Q6978420)) is correct. If the problem is geographical items, it is possible to remove only those VIAFs. If we wait for VIAF to become perfect, we will lose a lot of good VIAF clusters. --Epìdosis 16:24, 22 November 2019 (UTC)
@Epìdosis: Data cleansing is a mandatory phase for any import, and in this case it is missing. VIAF is a standard reference, so if an ID is not used in Wikidata, it is probably controversial or already verified as wrong. If we are not able to go through a cleaning phase before a massive import, a Mix'n'match approach should be used to allow users to check the quality of the data at import time. Pietro (talk) 12:54, 23 November 2019 (UTC)
They are not so controversial: when those edits appeared in my watchlist they were all reasonable in my fields, so this looks related to specific clusters. We don't have VIAF IDs because we have a huge backlog. I know it takes me 30 to 60 seconds to find them and add them manually when I end up on an item, so much time that I stopped doing it, hoping some automatic import could be set up (and this activity's goal seems to be to set it up in a reliable way in the future). So we have thousands of standard IDs missing; if there is an area where we need a careful import, it is this one (not saying this one is the most careful, but I have seen worse, or as bad, during routine clean-up). I am happy to clean up these problems with the time such an import spares me, and so far it has been a lot of time. A linear monthly import to check would be good. To me (I did not look too much into it), it looks more like a problem with the selection phase than with the cleaning phase: there are reliable areas that should certainly have been imported, while other ones were not OK. --Alexmar983 (talk) 15:13, 23 November 2019 (UTC)
  •  Oppose reverting the whole batch. @Pietro, Epìdosis, Ijon: Please, Pietro, let us know if you have a good contact at OCLC to ask them to correct VIAF clusters, or even their imports of items from Wikidata. This is something I (and others) tried to obtain before this update. Anyway, in my opinion the Wikidata community is better placed than anyone else to solve clusterization errors (which, by the way, were present in Wikidata even before this massive import). This means, IMO again, that having clusterization errors in Wikidata is better than having them only in VIAF, a top source for libraries and museums. Please also consider how many correct VIAF ID (P214)s were added during this week. --Bargioni (talk) 16:41, 22 November 2019 (UTC)
@Bargioni: Quality is the value of a dataset, not its size. Lowering the quality of the Wikidata dataset by introducing links to wrong information means weakening its reliability, and this should be avoided with the same care we give to information literacy in Wikipedia. Pietro (talk) 13:04, 23 November 2019 (UTC)
I sometimes teach information literacy through clean-up. We have had so many poor imports over the years that I do it in some classes. When I teach cultural heritage IDs, one of the exercises is specifically to fix a previous bad import; I am not joking. Of all possible ID-related activities, this one concerns such widely used identifiers that it is actually easier. --Alexmar983 (talk) 15:20, 23 November 2019 (UTC)
  •  Oppose reverting the whole batch. Yes, there are problems -- for example, some typical mix-ups between items for modern UK administrative areas, i.e. county council area (Q21272231), versus historic county of England (Q1138494). (These re-occur like weeds, and may now have already been fixed.) But it is better, I think, for us to be able to see and identify these, rather than for the same issue to be hidden away on VIAF's servers.
The question is how best to mark errors, so that they can (i) be picked up, and (ii) not be repeated. Best practice might be to leave them in place, but deprecated, with reason for deprecated rank (P2241) = incorrect identifier value (Q54975531), until this has been picked up by VIAF and they have changed their link. I think that is the optimum approach (though most of our standard tools aren't able to change the rank of a statement to deprecated). In practice, though, people will just remove bad links -- so it may be worth regularly tracking which VIAFs are on items, in order to identify which have been removed; if these persist on VIAF (or reappear there), they should perhaps be reported to VIAF for checking, and perhaps not be re-added here (even deprecated). VIAF entries include data for cross-matches that have been considered but rejected: so where a VIAF has actively been removed here, VIAF should be able to record that fact on their system (and probably would want to). JhealdBatch (talk) 17:40, 22 November 2019 (UTC)
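A minimal sketch of the deprecation approach described above, using pywikibot, which (unlike most batch tools) can change statement ranks. P214, P2241 and Q54975531 are the real property/item IDs cited in the comment; the target item (the Wikidata sandbox, Q4115189) and the VIAF value are hypothetical placeholders.

```python
# Deprecate a known-wrong VIAF ID (P214) instead of deleting it, adding
# reason for deprecated rank (P2241) = incorrect identifier value (Q54975531),
# so the bad match stays visible and is not silently re-imported.
import pywikibot

repo = pywikibot.Site("wikidata", "wikidata").data_repository()
item = pywikibot.ItemPage(repo, "Q4115189")  # placeholder: sandbox item
item.get()

BAD_VIAF = "000000000"  # hypothetical wrong cluster ID

for claim in item.claims.get("P214", []):
    if claim.getTarget() == BAD_VIAF:
        claim.changeRank("deprecated")
        reason = pywikibot.Claim(repo, "P2241")  # reason for deprecated rank
        reason.setTarget(pywikibot.ItemPage(repo, "Q54975531"))  # incorrect identifier value
        claim.addQualifier(reason)
```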
BTW, I am impressed that some items I created only six weeks ago have already been matched on/by VIAF, and those matches are now being imported here. JhealdBatch (talk) 17:43, 22 November 2019 (UTC)

The problem is not just bad VIAF matches, but that good-faith editors or bots then add further data based on those matches. This causes serious BLP problems, such as dates of death being added for people who are still living. Example: [1]. A programmatic follow-up edit that reverts the VIAF addition will leave that other data in place. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:37, 6 December 2019 (UTC)
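This point suggests auditing not only P214 statements but also data derived from them. Below is a sketch of a query that surfaces date-of-death statements (P570) referenced to VIAF (Q54919) for human review; whether the problematic follow-up edits used exactly this reference shape is an assumption.

```python
# Find date-of-death statements whose reference is "stated in: VIAF", as
# candidates for review once a bad VIAF match has been reverted.
import requests

QUERY = """
SELECT ?item ?dod WHERE {
  ?item p:P570 ?stmt .             # date of death statement
  ?stmt ps:P570 ?dod ;
        prov:wasDerivedFrom ?ref .
  ?ref pr:P248 wd:Q54919 .         # stated in: VIAF
}
LIMIT 100
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "viaf-followup-sketch/0.1 (example)"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"], row["dod"]["value"])
```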

Clearly we should find a way to ensure that wrong VIAF IDs are not re-introduced by batch imports again. --Hannes Röst (talk) 14:18, 29 June 2020 (UTC)
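One way to do this, sketched below under stated assumptions: keep a blocklist of (item, VIAF) pairs that were deliberately removed or deprecated, and filter future QuickStatements batches against it before upload. The file names and blocklist format are hypothetical; the batch is assumed to be in the tab-separated V1 command format (item, property, quoted value) that QuickStatements accepts.

```python
# Filter a QuickStatements batch so it never re-adds a VIAF ID (P214)
# that the community has already rejected for that item.
import pathlib

# removed_viaf_pairs.tsv: two columns, e.g.  Q123456<TAB>45678901
blocked = set()
for line in pathlib.Path("removed_viaf_pairs.tsv").read_text().splitlines():
    item, viaf = line.split("\t")[:2]
    blocked.add((item, viaf))

kept = []
for line in pathlib.Path("batch.tsv").read_text().splitlines():
    cols = line.split("\t")
    if len(cols) >= 3 and cols[1] == "P214":
        item, viaf = cols[0], cols[2].strip('"')
        if (item, viaf) in blocked:
            continue  # skip: this VIAF was deliberately rejected before
    kept.append(line)

pathlib.Path("batch.filtered.tsv").write_text("\n".join(kept) + "\n")
```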