Wikidata talk:SPARQL query service/WDQS backend update/WDQS backend alternatives

Scalability[edit]

I'm not sure I understand the scalability criteria or questions here. Over 7 years ago Oracle ran a trillion-triple graph described here - but obviously it required a pretty high-powered system to do it at the time. Shouldn't it be possible to scale 'any' graph db up by using more and more powerful computing hardware (without clustering)? But I don't see any discussion of the underlying hardware specs/restrictions in this document. I realize WMF probably isn't going to buy million-dollar systems to run WDQS, so is there some implicit restriction on the underlying hardware involved here?

I do realize we have to get off Blazegraph, but I'm just wondering why scalability/graph size seems to be such a limiting issue. ArthurPSmith (talk) 17:45, 29 March 2022 (UTC)[reply]

You are correct that more powerful systems address scalability issues. However, our goal is to continue running on the hardware that we currently have in our infrastructure, although that will be upgraded over time. In evaluating the candidates, the goal remains to support Wikidata query on hefty (but still commodity) hardware. AWesterinen (talk) 15:11, 30 March 2022 (UTC)[reply]
@AWesterinen: 128 GB RAM and 1600 GB disk seem kind of small for public production systems. I know AWS has m6i general purpose servers for example that can have up to 512 GB RAM. But cost could certainly be an issue. I guess my feeling is just that computing power generally is scaling faster than Wikidata is growing, so we ought to be able to keep up and in fact improve performance over time by upgrading the hardware. Optimal hardware configuration for any new graph DB will likely need to be somewhat different anyway - the RAM/CPU/IO/disk balance, networking, etc - just because the software presumably uses its resources differently than Blazegraph does. So it seems like different configurations should at least be explored in all this. For example, Apache Jena seems ideal on almost every ground except the scaling issue, so could scaling be addressed by a slightly different hardware configuration? Does 256 GB RAM allow 25 billion triples, and is that affordable?
On the sharding/federation solution - that really seems like a stopgap - doesn't that mean you would have to double the hardware cost anyway, by having two servers for every 1 we have now? Some sort of cost/benefit needs to be done here I think! ArthurPSmith (talk) 17:45, 30 March 2022 (UTC)[reply]
@ArthurPSmith You are correct in stating that a larger study needs to be done that considers cost, user impact, overall capabilities and performance. That is why we first narrowed down the list of possible alternatives to 4 (which was the goal of the paper). The next step is to do detailed evaluations, which is why we have started work to create a Wikidata test definition. Scaling up vs scaling out vs maintaining the current system configurations are all options, but the entire execution environment needs to be considered and the complexities and costs justified. Please stay tuned, as this work is just beginning. AWesterinen (talk) 14:03, 31 March 2022 (UTC)[reply]
+1 ! ArthurPSmith (talk) 14:52, 31 March 2022 (UTC)[reply]
+1 So9q (talk) 04:26, 1 April 2022 (UTC)[reply]
@ArthurPSmith,
Scalability is an issue, and if it weren't a challenge for Oracle (and others) you would have seen a number of publicly accessible SPARQL endpoints based on Oracle (or other DBMS engines).
The reason this isn't the case comes down to the following realities, i.e., unique DBMS challenges introduced by the World Wide Web (Web). Here are the issues:
1. Unpredictable Query Complexity
2. Unpredictable Query Solution Size
3. Any combination of the above originating from an unpredictable number of user agents (acting on behalf of one or more actual users)
4. Any combination of the above, occurring 24/7, and 365 days a year
The reason why Virtuoso dominates the nodes that comprise the LOD Cloud Pictorial boils down to it being engineered specifically with the challenges above in mind, as far back as 1998. It took the emergence of the DBpedia project for us to be able to showcase this issue (and solution) publicly in a manner that's easy to evaluate objectively.
The solution is called "Anytime Query" and it isn't generally understood due to its sophistication and sleight-of-hand re conventional DBMS operations (i.e., it's still a fundamentally missing feature elsewhere).
"Anytime Query" uses a configurable timeout to enable a DBMS instance administrator coerce user-agents into desired behavior (re DBpedia this is its "Fair Use" policy) e.g., setting a rate-limit for wholesale export attempts (which is the basic instinct of most external DBMS apps) that encourages the use of cursors when dealing with large solution sizes etc.. Naturally, you can also disable this functionality if you want a hard timeout to kick in etc..
To fully understand this feature in action, simply look at our live faceted-browsing instances which scale massively as we've demonstrated since 2007, for which I've added a live demo link below.
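For the curious, here is a minimal sketch of what opting into that behavior can look like from a client. It assumes the endpoint honours a `timeout` request parameter in milliseconds, as the public demo endpoints above appear to; treat it as an illustration rather than official API documentation.
<syntaxhighlight lang="python">
# Minimal sketch: asking a Virtuoso endpoint for "anytime" (partial) results.
# Assumes the endpoint honours a `timeout` request parameter in milliseconds,
# as the public demo endpoints appear to; adjust for your own instance.
import requests

ENDPOINT = "https://wikidata.demo.openlinksw.com/sparql"
QUERY = "SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }"  # deliberately expensive

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "timeout": 5000},  # give up (gracefully) after ~5 s
    headers={"Accept": "application/sparql-results+json"},
    timeout=60,
)
response.raise_for_status()
row = response.json()["results"]["bindings"][0]

# Under an anytime timeout the figure below may be a lower bound rather than
# the complete answer, so client code has to treat it accordingly.
print("triple count (possibly partial):", row["count"]["value"])
</syntaxhighlight>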
Related
1. Twitter Thread about Virtuoso's Anytime Query Feature
2. Text Search Query using the pattern New York against our Live Wikidata Instance Kidehen (talk) 17:09, 1 April 2022 (UTC)[reply]
@Kidehen: That's nice that Virtuoso has a different approach to timeouts - but obviously it means our users would need to adapt any code querying the service to check for missing data, so that's an added complexity too. I also don't understand the thread you pointed to on "quad" vs "triple" queries - doesn't SPARQL limit itself to the default graph if you don't include the GRAPH keyword? WDQS certainly doesn't require specifying a GRAPH for (most? all?) queries. Anyway, the scalability issue that this white paper seems to have looked at is not the query burden on the system, but just being able to load the graph. It points to a successful loading of the full Wikidata in Jena on a 128 GB RAM system with somewhere around 2 TB of disk, which took about 100 hours, while the person doing it suggested it could be faster if better optimized. And yet somehow the paper concludes that Jena can't scale to 25 billion triples - I actually don't understand that either. Presumably some more testing will help sort this all out. ArthurPSmith (talk) 17:45, 1 April 2022 (UTC)[reply]
Note currently Blazegraph does return partial results, though in an incorrect way - see phab:T169666. GZWDer (talk) 17:57, 1 April 2022 (UTC)[reply]
@Kidehen For Wikidata, "anytime query" will not be enabled since it is non-deterministic. Also, the environment will require a mix of read/write queries, which is different than other LOD sites with large datasets (such as UniProt). AWesterinen (talk) 21:03, 3 April 2022 (UTC)[reply]

I read the report and the thread on Jena and I was also confused by the assertion that it could not scale to 25B+ - I didn't see where that conclusion came from in the references. Even with loading, there appears to be some low-hanging fruit in gzip handling, showing an I/O bottleneck which may not have been previously detected. It'd also be interesting to see tests on server hardware, with a realistic memory size for a project of this scale rather than a developer laptop. While appreciating the desire to be thrifty, I echo the thoughts above that maybe it is better to scale up a bit with hardware if that would help with the issue in the medium term (which could hopefully be tested on e.g. leased bare metal dedicated servers). Wikidata is, if not unique, certainly out of the ordinary and it's not unreasonable to spend a little more on hardware than the average database server. GreenReaper (talk) 22:41, 20 April 2022 (UTC)[reply]

Thank you[edit]

Thank you for keeping us posted. I like the way you are keeping the process transparent. Vojtěch Dostál (talk) 19:18, 29 March 2022 (UTC)[reply]

+1 So9q (talk) 04:23, 1 April 2022 (UTC)[reply]

Thanks for the paper[edit]

Thanks for writing the very readable paper; it seems the team did an excellent job researching all possible solutions within the given set of restrictions. It also seems that only QLever and Virtuoso are real candidates as Blazegraph alternatives, given that they can scale to 25B+ triples without needing to resort to splitting up the graph. Husky (talk) 00:28, 30 March 2022 (UTC)[reply]

That is correct, but a simple split (for example, of the scholarly articles) would not be unreasonable. The net impact would be to require federation of the scholarly article data, if needed in a larger query. But, if running on RDF4J (for example), that could be made transparent to a user. AWesterinen (talk) 15:14, 30 March 2022 (UTC)[reply]

Comment[edit]

  1. It is a very bad idea to split Wikidata into multiple graphs - once a part becomes larger and larger, it will eventually not scale. Currently Wikidata has 37.6 million items for scientific articles but Microsoft Academic (now a stable dataset) has 209 million; if one endpoint can hold 50 million, we will eventually need to split the scientific article graph into at least five (and how?) once all of them are imported. This is a problem wherever scientific articles are stored (i.e. the issue will remain even if we move them out of Wikidata to another Wikibase instance).
  2. Virtuoso's SPARQL implementation is very buggy (as I stated in phab:T303263): even a query like ?x rdfs:subClassOf* ex:abc may result in an error. I do not recommend Virtuoso. See also this article: "Fuseki was the only store that executed every benchmark query without errors and returned complete and correct result sets".
  3. Though Jena has the most correct SPARQL 1.1 implementation, Jena's default configuration has very bad performance for some queries. As I commented in phab:T206560#7775800, some queries take only 10ms in Virtuoso but more than 10 hours in Jena.

--GZWDer (talk) 19:50, 31 March 2022 (UTC)[reply]

@GZWDer Your points are very valid, but we are operating within the constraints of requirements and budget. The purpose of the document was to find any possible alternative that could address the query issues for the next 5-10 years. Every software release and every design has advantages and problems. Our next step (test definition and execution as noted in phab:T303263) will provide insights to further winnow the list of candidates, or highlight the need to change the constraints. This is why the phab ticket was tagged as high priority and moved into the "scaling" state in mid-March. My current work on defining tests and query/update loads can be tracked at WDQS Testing. AWesterinen (talk) 16:15, 1 April 2022 (UTC)[reply]
@GZWDer,
You claim:
"Virtuoso's SPARQL implementation is very buggy (as I stated in phab:T303263): Even a query like ?x rdfs:subClassOf* ex:abc may result in error. I does not recommend Virtuoso. You can see also this article: "Fuseki was the only store that executed every benchmark query without errors and returned complete and correct result sets".
What does that mean, specifically?
Remember, you can easily demonstrate what these issues are by using any of the publicly accessible Virtuoso instances, starting with our hosted edition of Wikidata at: https://wikidata.demo.openlinksw.com/sparql . Ditto other instances, such as https://dbpedia.org/sparql etc. Simply enter your problematic queries and then share the SPARQL results page URL.
That's an easy way to demonstrate your position. Kidehen (talk) 16:47, 1 April 2022 (UTC)[reply]
@Kidehen: For example, see the article "BeSEPPI: Semantic-Based Benchmarking of Property Path Implementations". I also reproduced the issue using a modified version of LUBM dataset.--GZWDer (talk) 17:10, 1 April 2022 (UTC)[reply]
"Though Jena has the most correct SPARQL 1.1 implementation, Jena's default configuration have a very bad performance for some queries. As I commented in phab:T206560#7775800, some queries takes only 10ms in Virtuoso but more than 10 hours in Jena."
Let's say Jena has a perfect SPARQL 1.1 implementation, but it cannot scale.
Let's say Virtuoso has an imperfect SPARQL 1.1 implementation, but it scales massively -- as demonstrated by its preeminence in the massive LOD Cloud Knowledge Graph.
Wouldn't it be logical to look to the solution that already has no issue with scalability, is open source, and where Wikidata usage would simply lead to more contributions from across the community?
Virtuoso is primarily maintained by us, despite its massive use across the LOD Cloud and enterprises.
Related.
1. LOD Cloud SPARQL Endpoints Analysis Spreadsheet
2. Configuration File Analysis Spreadsheet for a sampling of live Virtuoso Instances Kidehen (talk) 17:15, 1 April 2022 (UTC)[reply]
I don't think Wikimedia's teams have enough experience to maintain a not well documented C/C++ codebase. GZWDer (talk) 17:31, 1 April 2022 (UTC)[reply]

Long term scalability[edit]

The year 2022 will see 100 million items in Wikidata; see https://www.wikidata.org/wiki/Wikidata:Statistics.

The growth rate of the past years (see https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Growth) has flattened somewhat from the initial rate, which looked more like exponential growth, into a more "linear" pattern. The difference between exponential and linear is IMHO quite important here, since the cost/effort of maintaining the platform is somewhat proportional to the amount of data to be managed. The resources of the WMF are IMHO too limited to cover exponential growth over a course of 10-30 years, which might e.g. end up in 1000x more data than is managed these days. Even if it currently looks like there is no need for a scalability factor of 1000 in the upcoming 5-10 years, I recommend thinking about how such a >1000x bigger scenario could be covered in the long term. I assume the problems will be more on the financing side of things than on the technical side. The relevance discussions of the future might end up having such a financial component - is it affordable to keep the data and history of an item in the long term, e.g. for 30/50 or 100+ years?

The scholarly data subgraph is certainly a candidate for this kind of discussion. Grants such as the one to OurResearch (see https://blog.ourresearch.org/arcadia-2021-grant/) are currently keeping OpenAlex running, and such financing needs might show up for Wikidata in the long term; in the very long run the needs might be greater than the current financing for all Wikipedias together.

So scalability IMHO is not just a technical issue. --Seppl2013 (talk) 11:01, 5 May 2022 (UTC)[reply]

Candidate Design - HDT persistence decomposition with distributed/daemon-less query[edit]

The following candidate approach can be contrasted with traditional triple store databases and may provide points for considering triple storage and query architectures.

Background[edit]

Persistence[edit]

Storing all the triples in W3C standard text-based encodings is inefficient for queries. Keeping all the triples in a binary proprietary triple-store format limits the use of the triples to that specific triple store. This creates a gap between standardized persistence and query. The gap is typically closed via a bulk load operation. Queries require the setup and execution of a persistent triple store database daemon process. An alternative is to store triples in HDT format. This format includes a complete index to support direct queries. The format was submitted to W3C in 2011 but did not move forward in the standards process. Details on HDT are below in the References section.

Persistence Limitations[edit]

HDT files are compute-intensive to create. HDT files are read-only. They can be appropriate for archival data that does not change frequently but are not helpful for transactional modification workloads.

Distribution[edit]

HDT files can be stored on traditional filesystems. They are self-contained and can be downloaded in bulk and then queried directly.
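As a rough illustration, assuming the Python rdflib-hdt plugin mentioned in the Conclusions below and a hypothetical local dump file, querying such a file directly might look like this:
<syntaxhighlight lang="python">
# Minimal sketch: querying a downloaded HDT file directly, with no database daemon.
# Assumes the rdflib-hdt plugin (pip install rdflib-hdt) and a hypothetical local
# dump file "wikidata.hdt"; the file's built-in index serves the lookups.
from rdflib import Graph, URIRef
from rdflib_hdt import HDTStore

graph = Graph(store=HDTStore("wikidata.hdt"))

# Triple-pattern lookup: every statement about a given entity (None = wildcard).
douglas_adams = URIRef("http://www.wikidata.org/entity/Q42")
for s, p, o in graph.triples((douglas_adams, None, None)):
    print(p, o)

# Simple SPARQL also works through rdflib's own query engine.
for row in graph.query(
    "SELECT ?o WHERE { <http://www.wikidata.org/entity/Q42> ?p ?o } LIMIT 10"
):
    print(row.o)
</syntaxhighlight>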

Candidate Design[edit]

Create HDT files for Wikidata and manage them separately from the query service. The HDT files could be made available for bulk download. This would allow users to work with large collections of Wikidata for analytics cases without bulk loading. Time would be used as one dimension for organizing triples, such that new HDT files would contain updates and new triples while old files contain the historical record. Scaling would be a function of new and modified record activity growing the collection of files. At some frequency, the archive could be recomputed into snapshots to be combined with new/modified records for a complete history.

A hybrid architecture could combine a triple store database persistent daemon for transactional triple creation and modification. The live transactional system would be combined with the archival read-only indexed HDT files to respond to queries. Queries would federate over the archival files and the transactional daemon. At some frequency, the transactional system would export to HDT and flush its state to remain small.
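A conceptual sketch of the hybrid read path described above, again assuming rdflib/rdflib-hdt and hypothetical file names; a real implementation would also need to handle deletions and much larger result sets:
<syntaxhighlight lang="python">
# Conceptual sketch of the hybrid read path: the same SPARQL query is run
# against the read-only HDT archive and the small live store, and the results
# are merged. All file names here are hypothetical.
from rdflib import Graph
from rdflib_hdt import HDTStore

QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?p ?o WHERE { wd:Q42 ?p ?o }
"""

# Archival, read-only snapshots (one HDT file per time slice).
archive_graphs = [
    Graph(store=HDTStore(path))
    for path in ["wikidata-2021.hdt", "wikidata-2022.hdt"]
]

# Small transactional store holding only recent additions/modifications.
live = Graph()
live.parse("recent-changes.ttl", format="turtle")

# Naive federation: union of the per-source results. A real implementation
# would also have to apply deletions recorded by the transactional system.
results = set()
for g in archive_graphs + [live]:
    results.update(tuple(row) for row in g.query(QUERY))

print(f"{len(results)} combined rows")
</syntaxhighlight>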

Advantages[edit]

Users working with large copies of Wikidata would be working with the same data files Wikidata uses in operations. This would allow analysis techniques and optimizations invented by users to be transitioned over to Wikidata. The load on a persistent transactional daemon would be limited to a small subset of the most recent records. A query service could be scaled to federate the archive with the daemon on a per-request basis. Each query would run the SPARQL engine separately from the transactional daemon.

Disadvantages[edit]

The efficient query of HDT is currently limited to simple graph patterns. Connecting an optimized SPARQL front-end remains a work-in-progress.

Conclusions[edit]

Tools for efficient SPARQL query of HDT are still limited, preventing adoption of this design today. However, aspects of decomposing triples into read-only files for efficient query, separate from a traditional persistent triple store database service, could be helpful to consider. Apache Jena might be one option for implementing HDT plus transactional triples, per "HDT Integration with Jena"; the Python rdflib-hdt plugin could be another. I have a custom triple store under development that takes this approach but is not yet ready for production use.

References[edit]

Diefenbach, Dennis, and José M. Giménez-García. “HDTCat: Let’s Make HDT Generation Scale.” In Lecture Notes in Computer Science, Vol. 12507. Springer, 2020. https://doi.org/10.1007/978-3-030-62466-8_2.

Fernandez, Javier D., Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias. “Binary RDF Representation for Publication and Exchange (HDT).” Journal of Web Semantics, June 24, 2018. https://doi.org/10.2139/ssrn.3198999.

Giménez-García, José M., Javier D. Fernandez, and Miguel A. Martínez-Prieto. “HDT-MR: A Scalable Solution for RDF Compression with HDT and MapReduce.” In Lecture Notes in Computer Science, 9088:253–68. Springer, 2015. https://doi.org/10.1007/978-3-319-18818-8_16.

Curé, Olivier, Guillaume Blin, Dominique Revuz, and David Célestin Faye. “WaterFowl: A Compact, Self-Indexed and Inference-Enabled Immutable RDF Store.” In Lecture Notes in Computer Science, 8465:302–16. Springer, 2014. https://doi.org/10.1007/978-3-319-07443-6_21. Encyclopediaofdata (talk) 18:16, 20 July 2022 (UTC)[reply]

Priority of Evaluation Criteria? Net Score of Evaluated Engines?[edit]

I'm struggling to understand the overall results of the comparative evaluation here, especially as I know there has been some significant conversation (post-scoring) regarding some of the areas where Virtuoso was graded poorly due to evaluator misunderstanding of Virtuoso administration, features, or other details.

Of particular importance, I'd like to understand whether "Scalability to 10B+ triples" is considered more/less/equally important as "Scalability to 25B+ triples", "Query plan tuning/hints within SPARQL statement", "Data store reload in 2-3 days (worst case)", etc. This is necessary to apply weights to the scores for individual criteria, and to combine them into a total score for each engine under evaluation.
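To make the question concrete, here is a tiny sketch of what such a weighted combination would look like; every weight and rating below is invented purely for illustration:
<syntaxhighlight lang="python">
# Illustration only: how per-criterion weights would turn published ratings
# into a single comparable score per engine. All numbers here are made up.
weights = {
    "scale_10B": 3.0,        # is this worth more or less than...
    "scale_25B": 5.0,        # ...this?
    "query_hints": 1.0,
    "reload_2_3_days": 2.0,
}

ratings = {
    "Engine A": {"scale_10B": 4, "scale_25B": 4, "query_hints": 2, "reload_2_3_days": 1},
    "Engine B": {"scale_10B": 4, "scale_25B": 2, "query_hints": 4, "reload_2_3_days": 4},
}

for engine, scores in ratings.items():
    total = sum(weights[criterion] * scores[criterion] for criterion in weights)
    print(engine, total)
</syntaxhighlight>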

I'd also like to know how to get ratings that I believe are inaccurate or unreasonable updated to a more accurate rating — such as that last, above, where the total time we took to clean and load the (admittedly somewhat stale) data dumps from circa March 2020 into Virtuoso Open Source was 22 hours, but your load time must have been much greater, given our assigned rating of "1" for this criterion... So what did you do that we didn't, or vice versa? TallTed (talk) 21:01, 15 August 2022 (UTC)[reply]

Community[edit]

I think it's important to point out that Apache is a Delaware nonprofit, Eclipse is a Delaware and Belgium nonprofit (bizarre, but cool), and Qlever is a German university project (also cool). OpenLink is a Delaware and British company established for the sole purpose of making its owners money. I think that matters, forward-thinking. Int21h (talk) 06:23, 30 August 2022 (UTC) Int21h (talk) 06:31, 30 August 2022 (UTC)[reply]

The license and community of users seems more important to me than the corporate status of the maintainer. Ideally we would use an open platform that we can maintain ourselves, with a community of potential devs / maintainers, among which is at least a few trusted people or orgs that have worked on projects of similar scale (to call on in a pinch). Sj (talk) 19:24, 24 February 2023 (UTC)[reply]
Revisiting this after getting to spend more time with the [academic] dbpedia community: OpenLink is a productive and helpful part of that community and has spent as much time as Qlever and significantly more than the others customizing and testing their tools for Wikidata use cases specifically. So on balance I would put those two in a similar community tier, with pros and cons, and ahead of the others. Sj (talk) 11:40, 17 August 2023 (UTC)[reply]

TerminusDB[edit]

https://terminusdb.com/products/terminusdb/ has not been evaluated as an option? Mitar (talk) 19:42, 19 November 2022 (UTC)[reply]

Scale[edit]

I agree w/ Husky that QLever and Virtuoso seem like the only real options, unless we want to migrate again in the near future. So a related Q which is not addressed here is: how much would it take to improve those in areas where they are lacking?

  • QLever: add federated queries, write-prevention, named graphs,
  • Virtuoso: add quick datastore reload (though maybe already fixed, per TallTed above + recent blog posts), improve docs and query plan explanation,
  • Both: Improve SPARQL 1.1 compatibility (which is certainly on their desired roadmaps), and dataset evaluation

Have these evals been revisited since the first pass? It might be worth pinging all 4 groups to see if they self-report changes in any of these areas [and to ping WD devs to see if anything obvious should be added to the list] Sj (talk) 19:25, 24 February 2023 (UTC)[reply]

QLever has made much progress towards full SPARQL 1.1 compatibility since the evaluation from March 2022. To mention just a few milestones:
1. SPARQL queries are now parsed and processed according to the official grammar, which got rid of QLever's (many) previous idiosyncrasies regarding the query format. This was a big step forward.
2. Support for the official SPARQL API: GET and POST queries, all major content types (query format), all major accept headers (result format), etc. (see the sketch below).
3. Support for CONSTRUCT queries and chunked transfer of large results.
4. Support for arbitrary expressions in BIND, FILTER, ORDER BY, etc. Doing this efficiently, especially when numbers are involved, was a lot of work and a big step forward.
5. Support for built-in functions like STR, STRLEN, DAY, MONTH, YEAR, RAND. Several functions are still missing, but it's now not much work to add them.
6. Support for SERVICE queries. For example, https://qlever.cs.uni-freiburg.de/wikidata/r7KAiM . This was a big step forward.
7. The one big thing still missing is UPDATE queries, but we have a clear plan and hopefully a first version running soon. To test it, it would be great to have an API to Wikidata's latest changes (the full dataset is updated only once per week).
A more detailed account can be found on [1]. If you have any question, feel free to ask anytime, here or on https://github.com/ad-freiburg/qlever . Hannah Bast (talk) 19:53, 8 March 2023 (UTC)[reply]
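As a small illustration of point 2 above (the standard SPARQL protocol support), a GET request against the public endpoint might look like the sketch below; the exact API path is my assumption, so check the QLever documentation for the current one.
<syntaxhighlight lang="python">
# Minimal sketch of a standard SPARQL-protocol GET request against QLever.
# The API URL below is an assumption about where the public Wikidata endpoint
# lives; consult the QLever UI/docs for the exact path.
import requests

ENDPOINT = "https://qlever.cs.uni-freiburg.de/api/wikidata"  # assumed API path
QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["s"]["value"])
</syntaxhighlight>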

QLever is an interesting project with novel ideas. I think I read somewhere, probably in one of Johannes' papers, that due to the way QLever's recent optimizations work to be able to scale to really big datasets, it is currently a Write-Once-Read-Many (WORM) implementation. Is there somewhere I can read about the plan to handle updates? Infrastruktur (talk) 20:47, 8 March 2023 (UTC)[reply]

Yes, it's described here [2] and here [3]. We are working on a first version of this. Since we already have an implementation for SERVICE (which poses similar challenges as UPDATE), much of the foundation work has already been laid.
The idea for UPDATE in a nutshell: For all big knowledge graphs we know of, the number of changes over time is relatively small. For example, for Wikidata it's under a million per day [4]. Then one can maintain two indexes: one for the whole dataset (which can be rebuilt once per day) and one for the changes (which can be rebuilt, say, once per second). To get up-to-date results, QLever can then combine the results from the two indexes (that is not trivial, but doable).
The setup with two indexes has an additional advantage. Many (my guess is: most) users don't mind querying a version of the dataset that is a few hours old. With the above setup, one could easily do that by simply querying the big index only and getting results at maximum speed. If you ask both indexes, there will be a performance overhead. Users can choose whether they are willing to pay that overhead on a query-by-query basis. Hannah Bast (talk) 09:44, 9 March 2023 (UTC)[reply]
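A schematic sketch of that two-index idea (plain Python sets, not QLever's actual data structures) may help make the merge step concrete:
<syntaxhighlight lang="python">
# Schematic sketch of the two-index idea above (not QLever's actual code):
# a large, rarely rebuilt base index plus a small delta index of recent
# insertions and deletions; queries can consult one or both.
base_index = {                                   # rebuilt e.g. once per day
    ("Q42", "P31", "Q5"),
    ("Q64", "P31", "Q515"),
}
delta_inserted = {("Q42", "P106", "Q36180")}     # rebuilt e.g. once per second
delta_deleted = {("Q64", "P31", "Q515")}

def match(triples, s=None, p=None, o=None):
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

def query(s=None, p=None, o=None, up_to_date=True):
    # Fast path: base index only (results may be slightly stale).
    result = match(base_index, s, p, o)
    if up_to_date:
        # Combine with the delta index for fully current results.
        result = (result - match(delta_deleted, s, p, o)) | match(delta_inserted, s, p, o)
    return result

print(query(p="P31"))                    # up to date: only the Q42 triple remains
print(query(p="P31", up_to_date=False))  # stale but faster: still includes the Q64 triple
</syntaxhighlight>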

Updates on migration?[edit]

Any updates since the March meetings? I see hints on phab of progress being made on discussions about how to split Blazegraph (within Blazegraph), but nothing about migration. I created a ticket for migrating off of Blazegraph months ago that has little engagement so far (thanks @MPham (WMF): for tagging it). Sj (talk) 11:40, 17 August 2023 (UTC)[reply]