Wikidata talk:SPARQL query service/WDQS graph split/WDQS Split Refinement


property triples maybe unnecessary?

After more thought (and looking at your example) I don't see a whole lot of use for the property triples on the scholarly subgraph; I think we can leave them out. ArthurPSmith (talk) 14:41, 17 April 2024 (UTC)

Inconsistency with relation to theses. But also - what does this mean to me as a user?

I run the NZThesisProject. I naturally looked over the spreadsheet with that in mind and was surprised to find that you have included some types of thesis and not others. I presume this is an oversight as it makes no logical sense for a Doctor of Medicine thesis to be in one graph and a Doctor of Philosophy thesis in the other, for example.

Could I also make a plea for a plain-language (i.e. less technical) explanation of what this whole split means to me as a Wikidata user? (tip: I have no idea what a truthy graph is, and I'm going to need some excellent tutorials if I'm going to understand how those example federated queries work). I can understand the description of what you propose to do, more or less, but it is extremely unclear to me what it means for me as a user in practical terms, other than that every SPARQL query I have (and most of the tools I use) is going to need to be rewritten. How will I link authors, theses and their publications when they are in two different graphs? Is there anything I can do now that I will no longer be able to do under federation? Will tools like OpenRefine reconcile authors and publications the same way as now or will that change too? DrThneed (talk) 20:51, 17 April 2024 (UTC)

Yes, that was definitely unintentional. I think I can edit the list; I'll take a look.
Only SPARQL queries that need data from both graphs would need rewriting. If you have some sample SPARQL queries that you are using now, it might be helpful to share them here so that they can be assessed. The hope is that most users will only need data from the "main" subgraph, and so will need no SPARQL rewriting. The queries that need rewriting are those that fetch data both about articles and about things that are not articles. Does OpenRefine use SPARQL? Any application that is just talking directly to the regular (non-SPARQL) Wikidata APIs will not require any change. ArthurPSmith (talk) 12:56, 18 April 2024 (UTC)
Thank you, that's really helpful. Re the queries, I have quite a few, including Scholias and Histropedia timelines (https://www.wikidata.org/wiki/Wikidata:WikiProject_NZThesisProject/Dashboards_and_queries), although I think a lot of those might be alright if I've understood correctly. So a query like this:
Theses in the project that have a main subject that is an instance of a person https://w.wiki/6HyD
would not need rewriting, but if I wanted more information on the people items that are the subjects, e.g. their English description or sitelink, then I would need to adjust it?
And a query like this:
Thesis where the author is linked but not in the thesis project https://w.wiki/9pt5 would need rewriting? DrThneed (talk) 08:27, 21 April 2024 (UTC)
Thanks for the examples - I believe unfortunately both would need rewriting since they mix in the same query items from the scholarly graph and from the main graph. @DCausse (WMF) do we have theses included in the scholarly graph yet? Or perhaps just an illustration of how one would do this with the current split? ArthurPSmith (talk) 19:03, 22 April 2024 (UTC)
@DrThneed Thank you for your feedback. Proper documentation is indeed going to be key, and thank you for calling out that the current language lacks clear, less technical explanations. We will try to improve this.
@ArthurPSmith Regarding the queries, I have added them to Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples. The current experimental endpoint does not yet separate these publications out of the main graph, so they can't really be tested, but they seem simple enough that I don't anticipate any problems with them.
@DrThneed In the coming weeks we will put in place a page Wikidata:Request a query rewrite (in the same vein as Wikidata:Request_a_query but for requesting a rewrite). In my experience, after rewriting a dozen queries, the patterns and techniques used for rewriting are generally the same.
On OpenRefine, I'm not very knowledgeable, but this project does seem very generic and highly configurable, and I would be surprised if it had any predefined SPARQL queries in its source code. If it has SPARQL capabilities, I am pretty sure that the SPARQL query would have to be given by the user when setting up a project. DCausse (WMF) (talk) 08:49, 23 April 2024 (UTC)
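Illustrative note on the rewriting pattern discussed above: a minimal sketch, not the actual rewrite of either linked query. It assumes the query is sent to the main endpoint and that the scholarly subgraph is reachable through a SPARQL SERVICE clause. The service IRI below is a hypothetical placeholder, and the helper names wrap_in_service and build_federated_query are made up for this example. The patterns use scholarly article (Q13442814), main subject (P921) and human (Q5).

```python
# Hypothetical endpoint IRI for the scholarly subgraph (an assumption,
# not something confirmed in this discussion).
SCHOLARLY_SERVICE = "https://query-scholarly.example.org/sparql"


def wrap_in_service(pattern: str, service_iri: str) -> str:
    """Wrap a SPARQL graph pattern in a SERVICE block so it is evaluated remotely."""
    indented = "\n".join("    " + line for line in pattern.strip().splitlines())
    return f"  SERVICE <{service_iri}> {{\n{indented}\n  }}"


def build_federated_query(scholarly_pattern: str, main_pattern: str,
                          service_iri: str = SCHOLARLY_SERVICE) -> str:
    """Combine a remote (scholarly) pattern with a local (main graph) pattern."""
    main_lines = "\n".join("  " + line for line in main_pattern.strip().splitlines())
    return (
        "SELECT ?work ?subject WHERE {\n"
        f"{wrap_in_service(scholarly_pattern, service_iri)}\n"
        f"{main_lines}\n"
        "}"
    )


query = build_federated_query(
    # Works and their main subject (P921) are fetched from the scholarly subgraph.
    scholarly_pattern="?work wdt:P31 wd:Q13442814 .\n?work wdt:P921 ?subject .",
    # Whether the subject is a human (Q5) is answered by the main subgraph.
    main_pattern="?subject wdt:P31 wd:Q5 .",
)
print(query)
```

The point of the sketch is only that the part of a query touching publications moves inside the SERVICE block, while the part touching everything else stays as it was.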

Choice of subclasses to include

My two cents, not being very knowledgeable on the technical matters behind the split: I am not quite convinced by the way scholarly articles and theses will be separated from the rest of the written production. From an academic perspective, notably in the humanities, a book is still a very valid way to publish new knowledge. IMO a more robust and consistent solution, if harder to implement given the mess that the subclasses of publication (Q732577) and article (Q191067) are, would be to put every manifestation (in the FRBR sense) of a written document in this separate graph. That would include books, encyclopedia entries, news articles, basically everything that you can read from a printed source (including "digital" print, but excluding everything handwritten such as manuscripts, inscriptions, letters, etc.). I think this would allow for more straightforward querying: should one need something published, they should include the subgraph, no matter if it's for the list of Shakespeare's editions (editions, not works, mind you) or the latest papers on high energy physics. This would also allow a more consistent approach to retrieving references, because one can't know beforehand whether the reference to a statement might be a book or an article. --Jahl de Vautban (talk) 08:04, 20 April 2024 (UTC)

Thank you for the feedback. I understand where you are coming from. This proposal would move quite a lot more Items that are widely used to the second graph. We are trying to avoid impacting too many reusers of the main graph so I fear that is not a good way forward overall. Lydia Pintscher (WMDE) (talk) 15:05, 24 April 2024 (UTC)
Given the discussion above, it looks like it impacts almost every tool I use and every query I have. So I'm not a fan! I would very much favour keeping theses and dissertations (which are regarded as unpublished items anyway, from an academic perspective) out of the split. DrThneed (talk) 23:38, 3 May 2024 (UTC)
@DrThneed I would like to reassure you that there will be a transition period (at least 6 months) during which the impacted use-cases will have time to adapt. We will provide as much support as we can to help the transition (mainly through the Wikidata:Request_a_query_rewrite page). DCausse (WMF) (talk) 08:16, 17 May 2024 (UTC)

Keep entities with sitelinks

In accordance with Wikidata:Notability point 1 ("It contains at least one valid sitelink"), I think a better split expression might be something like ?entity wdt:P31/wdt:P279* wd:Q13442814 . FILTER NOT EXISTS { ?article schema:about ?entity}. That would also start to cover the point 3 ("It fulfils a structural need"), even if that point might need more fine-tuning. Maxlath (talk) 14:50, 16 May 2024 (UTC)
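Illustrative note: the split expression above can be read as a two-part test, sketched below in Python on in-memory data. This is only a reading of the proposed rule, not the splitter's actual implementation. The function names, the subclass_of mapping and the placeholder ID "Q_HYPOTHETICAL_SUBCLASS" are assumptions made up for the example; only Q13442814 (scholarly article) and Q5 (human) are real identifiers.

```python
SCHOLARLY_ARTICLE = "Q13442814"


def is_instance_of(types, target, subclass_of):
    """True if any type equals target or reaches it via subclass-of (P279) links."""
    seen = set()
    stack = list(types)
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue  # guard against cycles in the subclass hierarchy
        seen.add(current)
        stack.extend(subclass_of.get(current, ()))
    return False


def goes_to_scholarly_subgraph(types, sitelink_count, subclass_of):
    """Proposed rule: scholarly article (or a subclass of it) AND no sitelinks."""
    return (
        is_instance_of(types, SCHOLARLY_ARTICLE, subclass_of)
        and sitelink_count == 0
    )


# Toy hierarchy: a made-up class declared as a subclass of scholarly article.
hierarchy = {"Q_HYPOTHETICAL_SUBCLASS": ["Q13442814"]}

print(goes_to_scholarly_subgraph({"Q13442814"}, 0, hierarchy))                # no sitelink: moves
print(goes_to_scholarly_subgraph({"Q13442814"}, 1, hierarchy))                # has a sitelink: stays
print(goes_to_scholarly_subgraph({"Q_HYPOTHETICAL_SUBCLASS"}, 0, hierarchy))  # subclass: moves
```

The first half mirrors the wdt:P31/wdt:P279* path and the second half mirrors FILTER NOT EXISTS on schema:about.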

@Maxlath Thanks for raising this. We briefly discussed using sitelinks to inform the nature of the split, but for some reason it did not make it to the list of suggested improvements, and I'm glad that you raise it so that we can have a conversation about it. From a technical standpoint I don't have objections to this idea:
  • A rule like [] schema:about ?entity can be implemented
  • It would not change the size of the splits much, with roughly 43,000 papers (only 0.09% of the 44,000,000 papers) moving from the scholarly subgraph to the main subgraph.
So I believe the discussion to have is regarding the overall usability. I'm probably not knowledgeable enough to make an informed judgement on this but here are the points I can think of:
  • Use-cases relying on scientific publications will always have to UNION both subgraphs
  • Use-cases relying on sitelinks (regardless of their nature) could continue to query solely the main subgraph
DCausse (WMF) (talk) 08:04, 17 May 2024 (UTC)
Yes this is a good point. Probably scholarly works will always need to do the UNION thing anyway just in case something has an unexpected instance of (P31) value. So I don't think this is so concerning, and if it's helpful for other use cases to keep some of them in the main graph that seems ok to me. ArthurPSmith (talk) 21:12, 17 May 2024 (UTC)
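Illustrative note on "the UNION thing": a minimal sketch of a query that looks for the same pattern both locally and in the other subgraph. The service IRI is a hypothetical placeholder and the helper name union_over_subgraphs is made up for this example. The pattern uses scholarly article (Q13442814) and main subject (P921).

```python
def union_over_subgraphs(pattern: str, remote_service_iri: str) -> str:
    """Return a SPARQL group that matches `pattern` locally OR on a remote service."""
    body = pattern.strip()
    return (
        "  {\n"
        f"    {body}\n"
        "  }\n"
        "  UNION\n"
        "  {\n"
        f"    SERVICE <{remote_service_iri}> {{ {body} }}\n"
        "  }"
    )


# Hypothetical IRI standing in for the other subgraph's endpoint (an assumption).
REMOTE = "https://query-scholarly.example.org/sparql"

paper_query = (
    "SELECT ?paper ?topic WHERE {\n"
    + union_over_subgraphs("?paper wdt:P31 wd:Q13442814 ; wdt:P921 ?topic .", REMOTE)
    + "\n}"
)
print(paper_query)
```

With the sitelink rule in place, papers that have a sitelink would be found by the local branch and the rest by the SERVICE branch, which is why publication-centred queries would need both.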

Clinical trials

Around 390K items on Wikidata are instances of clinical trial (Q30612). Some of these are also instances of scholarly article (Q13442814) and some are not. Now, a clinical trial is not in itself a publication, but a report on it probably is: there is an ambiguity here. In some similar cases, there are two items, one of which is a publication type.

Apart from the ambiguity, splitting the clinical trial items into two parts could be awkward. I find just 284 are also scholarly article items, so the simple approach would be a fix for those. The property ClinicalTrials.gov ID (P3098) doesn't currently apply to any scholarly article items; and there are currently just two hits, perhaps anomalous, for ClinicalTrials.gov ID (P3098) and DOI (P356).

That would come down to saying that data about the clinical trial itself is not bibliographical data. That makes ontological sense. A typical and important potential application of WikiCite data, however, would be the automated compilation of a corpus that could serve as the basis for a systematic review. Science librarians get involved in that process, which tends to involve trawling multiple databases by topic and nature of the trials. Before simply deciding that clinical trial items stay in Wikidata while related publications are split out, it would be good to consider the requirements of federated queries in this area. Charles Matthews (talk) 08:08, 19 May 2024 (UTC)