Document Structure Analysis Algorithms: A Literature Survey [pdf] (nih.gov)
41 points by tokai on Oct 10, 2016 | 4 comments


Funny. I ran across this survey when I had to write a document structure analysis step for a document analysis pipeline recently. I thought of this paper in particular when watching https://vimeo.com/9270320 ("What We Actually Know About Software Development"), which I ran across on the recent thread on tech talks.

Not a computer scientist (barely a scientist at all!), so I don't know if I'm being unreasonable or not, but I was a little dismayed at the literature here. Is this quality of publication common? Is it getting better? The lack of experimental methods and reproducibility seems abominable.


This specifically is a survey where it's uncommon to carry out your own "experiments". You use the numbers from the original papers and hope there's some overlap in the metrics they use.

Reproducibility is a problem, especially in the sense that source code was traditionally not published along with the paper. There are a few reasons for that, e.g.:

- The "publish" in "publish or perish" isn't talking about GitHub

- The grad student wrote the code and it's 3000 lines of FORTRAN

- The professor wrote the code and it's 6000 lines of ALGOL60

- The university owns the copyright but there's no process for OSS releases

- The authors own the copyright and can't wait to turn this into a commercial spinoff

- It's 30 lines of python, inextricably linked to the internal domain-specific toolset that's 25GB including multiple copies of test data and a lot of external source code with unclear licenses.

- The proof-of-concept is only about 20% of the work of a polished library that's fit to release

It's getting better now, with some fields establishing standards for data analysis pipelines, and some large companies with product experience doing a lot of academic work (e.g. Facebook, TensorFlow). The AI/ML community is probably a shining example, considering there are polished, ready-to-use implementations available for almost all publications.


>> This specifically is a survey where it's uncommon to carry out your own "experiments". You use the numbers from the original papers and hope there's some overlap in the metrics they use.

My comment was unclear. I wasn't criticizing the lit review; I was criticizing the literature it is reviewing. The literature surveyed does a poor job of even defining metrics.

The rest is very interesting, and heartening. Thanks for the thoughtful response.


I've found that it's common for computer science papers to reveal only about 60% of what is necessary to reproduce the result. Crucial steps of the algorithm are omitted. Numerous constants are used with no value ever given. Preprocessing steps are alluded to but never discussed. Datasets are kept private, or, worse, charged for.

Anecdotally, I'd say this seems to have been a worse problem in the 1980s and 90s, judging by papers from that era. But it's still an issue today, especially with systems papers.



