Wikimedia Enterprise plans to provide preprocessed HTML dumps

For WMF employee / slave nonsense, developer hijinks, and MediaWiki and related software screw-ups.


Post by Bbb23sucks » Wed May 10, 2023 8:10 pm

"Globally banned" since September 5, 2023 for exposing harassment.


Re: Wikimedia Enterprise plans to provide preprocessed HTML dumps

Post by ericbarbour » Thu May 11, 2023 6:17 am

From my experience working with the Wiktionary HTML dumps I can say that the data quality
is quite poor: there are stale and missing entries
(https://phabricator.wikimedia.org/T305407).

There are also entire namespaces excluded from the dumps, and more recently there have
been issues with the dumps not getting updated.

So it depends on what kind of processing you need to do. In general I find the parsing to be
much easier; hopefully they'll manage to sort out the problems.
Ah ha ha ha ha ha


Re: Wikimedia Enterprise plans to provide preprocessed HTML dumps

Post by Bbb23sucks » Thu May 11, 2023 6:18 am

ericbarbour wrote:
Thu May 11, 2023 6:17 am
From my experience working with the Wiktionary HTML dumps I can say that the data quality
is quite poor: there are stale and missing entries
(https://phabricator.wikimedia.org/T305407).

There are also entire namespaces excluded from the dumps, and more recently there have
been issues with the dumps not getting updated.

So it depends on what kind of processing you need to do. In general I find the parsing to be
much easier; hopefully they'll manage to sort out the problems.
Ah ha ha ha ha ha
As usual - STILL NOT FIXED after over a year!
"Globally banned" since September 5, 2023 for exposing harassment.
