| The End of Shrinemaiden As We Know It |
| niektory:
As for me, I downloaded a copy of every thread HTML page accessible to my account, in both paginated and printable forms, plus the forum attachments linked in them, the on-site images included in them in <img> tags, and some other random on-site files linked in them. So I think I have everything important; now it's just a matter of hammering it into a more usable form. Should I do that? Just upload the raw data? None of the above? I'm asking because we probably don't need multiple people doing the same work independently. Also, in case the site dies, you can e-mail me at <my username>@tlen.pl. |
| HTFCirno2000:
--- Quote from: Infy♫ on February 17, 2020, 10:01:56 PM ---I spent the past few days scraping the site. Here is every post on shrinemaiden.org that's accessible through an account. It's all in a .csv file of about 800 MB. I hope someone else can figure out a way to make it all easy to access. --- End quote --- Wow. That csv file is extremely useful. It is the entire MoTK corpus in an easy-to-parse format, which is amazing. Well done, I was hoping someone would make a crawler that would construct a database like this. :o :o |
| LunarSpotlight:
Since it looks like at least one scraper has finished and a copy has been made available in csv format, I will speed up my script (which had an intentional time delay and is a little more than half completed atm) and get it over with. On the topic of omitting topics: assuming the others work similarly to mine, the forum topics are structured in a way that's easy to look up (topic=12345 in the URL and then a second number which increases by 30 [posts per page] for every new page). That said, the scraper(s) is/are indiscriminate, and anyone who wants topics omitted from the archive would need to know the topic number of each one. Topics are numbered in chronological order (by the date the topic was created), so a category that holds many topics will likely have the numbers for those topics jump around (they're non-sequential). This would make it difficult to omit any one section of topics. With some more analysis, it's possible to re-categorize topics into the section they came from, then pull an entire section of topics, but only after everything's already been retrieved. |
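For anyone following along, the page numbering described above can be sketched roughly like this. It is only an illustration, not anyone's actual scraper: the base URL is a placeholder, the function name is made up, and joining the topic number and the offset with a "." is an assumption about the URL format rather than something stated in the post.

```python
from typing import Iterator

POSTS_PER_PAGE = 30  # per the post above: the second number grows by 30 per page


def topic_page_urls(base_url: str, topic_id: int, total_posts: int) -> Iterator[str]:
    """Yield one URL per page of a topic (hypothetical helper).

    Assumes URLs look like `<base_url>?topic=<id>.<offset>`; the "." separator
    is an assumption, adjust it to whatever the site really uses.
    """
    # max(..., 1) makes sure an empty or unknown-size topic still yields page 0
    for offset in range(0, max(total_posts, 1), POSTS_PER_PAGE):
        yield f"{base_url}?topic={topic_id}.{offset}"


if __name__ == "__main__":
    # Placeholder base URL, not the real site
    for url in topic_page_urls("https://forum.example/index.php", 12345, 61):
        print(url)
```

A polite scraper would also sleep between requests, as the intentional time delay mentioned above did.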
| Suwako Moriya:
I mean, the only people that have any real control over what gets archived are the archivers. (In retrospect, I suppose we could have closed LJ viewership when archival was first suggested, but what's done is done.) Since that .csv is now provided, all I can do at this point is ask people to please try to respect the privacy of those who may have posted personal information in that subforum. I'm thinking in particular of one thread from years ago about potential legal name and career path changes of a former staff member (because not a day goes by where I don't think about that thread, for better or worse <_<), but that's merely one example; I'm sure there are plenty of others I can't recall or didn't read in the first place. |
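If someone republishing the .csv wants to leave a subforum out, a filter along these lines might be a starting point. This is a sketch under assumptions: the real column names in that file are not given in this thread, so the `board` field and the board name used below are hypothetical, and the file is streamed row by row because of its size.

```python
import csv
from typing import Dict, Iterable, Iterator, Set


def drop_boards(rows: Iterable[Dict[str, str]], excluded: Set[str],
                board_field: str = "board") -> Iterator[Dict[str, str]]:
    """Yield only rows whose board is not excluded (column name is assumed)."""
    for row in rows:
        if row.get(board_field) not in excluded:
            yield row


def filter_csv(src_path: str, dst_path: str, excluded: Set[str],
               board_field: str = "board") -> None:
    """Copy src_path to dst_path, skipping rows from the excluded boards."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [])
        writer.writeheader()
        # Rows are processed one at a time, so the whole file is never in memory
        for row in drop_boards(reader, excluded, board_field):
            writer.writerow(row)


if __name__ == "__main__":
    sample = [
        {"board": "Private Board", "body": "personal details"},
        {"board": "General", "body": "hello"},
    ]
    print(list(drop_boards(sample, {"Private Board"})))
```

This only removes whole boards; individual threads would still have to be identified by topic number.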
| Alcoraiden:
Well, I'm glad we were able to pull out a copy of the forum. Dang. End of an era going on here. |