Wikimedia Enterprise plans to provide preprocessed HTML dumps
-
- Sucker
- Posts: 1402
- Joined: Fri Jan 06, 2023 9:08 am
- Location: The Astral Plane
- Has thanked: 1467 times
- Been thanked: 294 times
Wikimedia Enterprise plans to provide preprocessed HTML dumps
"Globally banned" since September 5, 2023 for exposing harassment.
-
- Sucks Admin
- Posts: 4932
- Joined: Sat Feb 25, 2017 1:56 am
- Location: The ass-tral plane
- Has thanked: 1283 times
- Been thanked: 2025 times
Re: Wikimedia Enterprise plans to provide preprocessed HTML dumps
Ah ha ha ha ha ha
From my experience working with the Wiktionary HTML dumps I can say that the data quality
is quite poor: there are stale and missing entries
(https://phabricator.wikimedia.org/T305407).
There are also entire namespaces excluded from the dumps, and more recently there have
been issues with the dumps not getting updated.
So it depends what kind of processing you need to do–in general I find the parsing to be
much easier, hopefully they'll manage to sort out the problems.
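For anyone who wants to check the "stale and missing entries" complaint against a dump they have downloaded, here is a rough sketch, not anything taken from Wikimedia's own tooling. It assumes the dump is newline-delimited JSON and that each record carries `name`, `date_modified`, and `namespace.identifier` fields; those field names are assumptions about the schema, and `audit_dump` is a made-up helper name.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def audit_dump(lines, expected_titles, stale_before):
    """Report missing titles, stale records, and per-namespace counts.

    `lines` is any iterable of NDJSON strings. The field names used
    below ("name", "date_modified", "namespace") are ASSUMED, not
    confirmed against the real dump schema.
    """
    seen = set()
    stale = []
    namespaces = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        title = record["name"]
        seen.add(title)
        # Count records per namespace to spot namespaces that never appear.
        namespaces[record.get("namespace", {}).get("identifier")] += 1
        # Timestamps are assumed to be ISO 8601 with a trailing "Z" (UTC).
        modified = datetime.fromisoformat(
            record["date_modified"].replace("Z", "+00:00")
        )
        if modified < stale_before:
            stale.append(title)
    missing = sorted(set(expected_titles) - seen)
    return missing, sorted(stale), dict(namespaces)


# Tiny made-up sample standing in for a real dump file.
sample = [
    '{"name": "cat", "namespace": {"identifier": 0}, "date_modified": "2023-05-01T00:00:00Z"}',
    '{"name": "dog", "namespace": {"identifier": 0}, "date_modified": "2021-01-01T00:00:00Z"}',
]
cutoff = datetime(2022, 1, 1, tzinfo=timezone.utc)
missing, stale, namespaces = audit_dump(sample, {"cat", "dog", "bird"}, cutoff)
print(missing, stale, namespaces)
```

The list of expected titles has to come from somewhere else (for example a title listing from the regular dumps), and a namespace number that never shows up in the counts is a hint that it was excluded.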
-
- Sucker
- Posts: 1402
- Joined: Fri Jan 06, 2023 9:08 am
- Location: The Astral Plane
- Has thanked: 1467 times
- Been thanked: 294 times
Re: Wikimedia Enterprise plans to provide preprocessed HTML dumps
As usual - STILL NOT FIXED after over a year!
ericbarbour wrote: ↑Thu May 11, 2023 6:17 am
Ah ha ha ha ha ha
From my experience working with the Wiktionary HTML dumps I can say that the data quality
is quite poor: there are stale and missing entries
(https://phabricator.wikimedia.org/T305407).
There are also entire namespaces excluded from the dumps, and more recently there have
been issues with the dumps not getting updated.
So it depends what kind of processing you need to do–in general I find the parsing to be
much easier, hopefully they'll manage to sort out the problems.