I reviewed the datasets and noted the following points:
- The list of datasets did not include any web materials. Although it covered a wide variety of born-digital materials, it did not cover web pages. Some web archives did participate in this competition, but they chose to contribute other materials. For example, the Internet Archive, the largest and oldest web archive, participated with its Films and Books collections.
- In addition to the absence of web page collections, social media did not appear either. The Library of Congress chose to participate with its "Chronicling America" newspaper collection instead of its valuable Twitter archive collection. I'm not sure whether a copyright restriction is the reason for that.
- Most of the datasets offer API access points, ranging from OAI-PMH to XML and REST services.
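
To give a feel for what an OAI-PMH access point looks like in practice, here is a minimal sketch in Python. It only builds a `ListRecords` request URL and pulls Dublin Core titles out of a response. The base URL `https://example.org/oai`, the helper names, and the sample response are my own assumptions for illustration, not taken from any of the reviewed datasets.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace used in oai_dc records.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(xml_text):
    """Return the text of every dc:title element in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(DC_NS + "title")]


# Made-up response, shaped like a real ListRecords reply, so the sketch
# can run without touching the network.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample newspaper issue</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url("https://example.org/oai"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would also need to fetch the URL and follow the `resumptionToken` that OAI-PMH uses for paging; I left that out to keep the sketch short.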
In the attached table, I tried to summarize the type of data in each collection. I may revisit the table to add information about the nature of the data (e.g., biomedical, historical, technical).