Tuesday, May 16, 2017

A brief visual history of MARC cataloging at the Library of Congress.

The Library of Congress has released MARC records that I'll be doing more with over the next several months to understand the books and their classifications. As a first stab, though, I wanted to simply look at the history of how the Library digitized card catalogs to begin with.




A couple notes for the technically inclined:
1. The years are pulled from field 260c (or, if that doesn't exist or is unparseable, from field 008). Years in non-Western calendars are often not converted correctly.
2. There are obviously books from before 1770, but they aren't included.
3. By "books", I mean items in the LC's recently released retrospective (to 2014) "Books all" MARC files (http://www.loc.gov/cds/products/product.php?productID=5), not the serial, map, etc. files. The total number is just over 10 million items.
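
For the first note, here is a minimal Python sketch of the year fallback (it is not the R code behind the chart). It assumes the 260c and 008 values have already been pulled out of each record as plain strings. The function names and the four-digit pattern are illustrative choices; positions 07-10 of field 008 are where MARC stores the first date.

```python
import re

# A plausible Gregorian year: four digits starting 1xxx or 20xx,
# not embedded in a longer run of digits.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")


def parse_year(text):
    """Return the first plausible four-digit year in text, or None."""
    if not text:
        return None
    match = YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else None


def record_year(field_260c, field_008):
    """Prefer 260c; fall back to Date 1 in 008 (positions 07-10)."""
    year = parse_year(field_260c)
    if year is not None:
        return year
    if field_008 and len(field_008) >= 11:
        return parse_year(field_008[7:11])
    return None
```

A date such as "5730 [1969 or 1970]" comes out as 1969 here only because the bracketed year happens to be present; a bare non-Western year would simply fail to parse, which is one way the conversion problem in note 1 can arise.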

See after the break for the R code to create the chart and the initial version Jacob is talking about in the comments.

Friday, April 14, 2017

The history of looking at data visualizations

One of the interesting things about contemporary data visualization is that the field has a deep sense of its own history, but that "professional" historians haven't paid a great deal of attention to it yet. That's changing. I attended a conference at Columbia last weekend about the history of data visualization and data visualization as history. One of the most important strands that emerged was about the cultural conditions necessary to read data visualization. Dancing around many mentions of the canonical figures in the history of datavis (Playfair, Tukey, Tufte) were questions about the underlying cognitive apparatus with which humans absorb data visualization. What makes the designers of visualizations think that some forms of data visualization are better than others? Does that change?

There's an interesting paradox about what the history of data visualization shows. The standards for what makes a data visualization good seem to change over time. Preferred color schemes, preferred geometries, and standards about the use of things like ideograms change over time. But, although styles change, the justifications for styles are frequently cast in terms of science or objective rules. People don't say "pie charts are out this decade"; they say, "pie charts are objectively bad at displaying quantity." A lot of the most exciting work in the computer science side of information visualization is now trying to make the field finally scientific. It works to ground the field in scientific research into perception rather than mere style, like the influential and frequently acerbic work of Tableau's Robert Kosara; or to precisely identify what a visualization is supposed to do (be memorable? promote understanding?) like the work of Michelle Borkin, my colleague at Northeastern, so that the success of different elements can be measured.

I think basically everyone who's thought about it agrees that good data visualization is not simply art and not simply science, but the artful combination of both. To make a good data visualization you have to both be creative, and understand the basic perceptual limits on your viewer. So you might think that I'm just saying: the style changes, but the science of perception remains the same.

That's kind of true: but what's interesting about thinking historically about data visualization is that the science itself changes over time, so that both what's stylistically desirable and what a visualization's audience has the cognitive capacity to apprehend changes over time. Studies of perception can tap into psychological constants, but they also invariably hit on cultural conditioning. People might be bad at judging angles in general, but if you want to depict a number that runs on a scale from 1 to 60, you'll get better results by using a clock face because most people spend a lot of time looking at analog clocks and can more or less instantly determine that a hand is pointing at the 45. (Maybe this example is dated by now. But that's precisely the point. These things change; old people may be better at judging clock angles than young people.)

This reminds me of the period I studied in my dissertation, the 1920s-1950s, when advertisers and psychologists attempted to measure the graphical properties of an attention-getting advertisement. Researchers worked to understand the rules of whether babies or beauties drew more attention, whether the left or the right side of the page was more viewed; but whether a baby grabs attention depends as much on how many other babies are on the page as on how much the viewer loves to look at babies. The canniest copywriters did better following their instinct because they understood that the attention economy was always in flux, never in equilibrium.

So one of the most interesting historical (in some ways art-historical) questions here is: are the conditions of apprehension of data visualization changing? Crystal Lee gave a fascinating talk at the conference about the choices that Joseph Priestley made in his chart of history; I often use in teaching Joseph Priestley's description of his chart of biography, which uses several pages to justify the idea of a timeline. In the extensive explanation, you can clearly see Priestley pushing back at contemporaries who found the idea of time on the x-axis unclear or hard to understand.

This seems obvious: so why did Priestley take pages and pages to make the point?

That doesn't mean that "time-as-the-x-axis" was impossible for *everyone* to understand: after all, Priestley's timelines were sensations in the late 18th century. But there were some people who clearly found it very difficult to wrap their heads around, in much the same way that--for instance--I find many people have a lot of trouble today with the idea that the line charts in Google Ngrams are insensitive to the number of books published in each year because they present a ratio rather than an absolute number. (Anyone reading this may have trouble themselves believing that this is hard to understand or would require more than a word of clarification. For many, it does.) 
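
The ratio point can be made concrete with a toy Python sketch. The counts below are invented for illustration, not drawn from Google Ngrams, and `relative_frequency` is a hypothetical helper name.

```python
def relative_frequency(word_counts, total_counts):
    """Return {year: share of all words in that year taken up by the word}."""
    return {
        year: word_counts.get(year, 0) / total_counts[year]
        for year in total_counts
        if total_counts[year] > 0
    }


# Invented numbers: publishing grows tenfold, and so do uses of the word.
word_counts = {1800: 50, 1900: 500}
total_counts = {1800: 1_000_000, 1900: 10_000_000}

shares = relative_frequency(word_counts, total_counts)
```

Both years come out at the same share even though the raw count rose tenfold, which is why a line drawn from the ratio stays flat.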

That is to say: data visualizations create the conditions for their own comprehension. Lauren Klein spoke about a particularly interesting case of this, Elizabeth Peabody's mid-19th century pedagogical visualizations of history, which depict each century as a square, divided into four more squares, each divided into 25 squares, and finally divided into 9 more for a total of 900 cells.

Peabody's grid, explanation: http://shapeofhistory.net/

There's an oddly numerological aspect to this division, which structures the grid by the squares of the first three primes; Manan Ahmed suggested that it drew on a medieval manuscript tradition of magic squares.


Old manuscript from pinterest: I don't really know what this is. But wow, squares within squares!

Klein has created a fully interactive recreation of Peabody's visualization online here, with original sources. Her accompanying argument (talk form here), which I think is correct, includes the idea that Peabody deliberately engineered a "difficult" data visualization because she wanted a form that would promote reflection and investment, not something that would make structures immediately apparent without a lot of cognition.

Still, one of the things that emerged again and again in the talks was how little we know about how people historically read data visualizations. Klein's archival work demonstrates that many students had no idea what to do with Peabody's visualizations; but there's an interesting open question about whether they were easier to understand then than they are now.

The standard narrative of data visualization, insofar as there is one, is of steadily increasing capacity as data visualization forms become widespread. (The more scientific you are, I guess, the more you might also believe in constant capacity to apprehend data visualizations.) Landmark visualizations, you might think, introduce new forms that expand our capacity to understand quantities spatially. Michael Friendly's timeline of milestone visualizations, which was occasionally referenced, lays out this idea fairly clearly; first we can read maps, then we learn to read timelines, then arbitrary coordinate charts, then boxplots; finally in the 90s and 00s we get treemaps and animated bubble charts, with every step expanding our ability to interpret. These techniques help expand understanding both for experts and, through popularizers (Playfair, Tufte, Rosling), the general public.

What that story misses are the capacities, practices, and cognitive abilities that were lost. (And the roads not taken, of course; but lost practices seem particularly interesting).

So could Peabody's squares have made more sense in the 19th century? Ahmed's magic squares suggest that maybe they did. I was also struck by the similarity to a conceptual framing that some 19th-century Americans would have known well; the public land survey system that, just like Peabody's grid, divided its object (most of the new United States) into three nested series of squares.


Did Peabody's readers see her squares in terms of magic squares or public lands? It's very hard--though not impossible--to know. It's hard enough to get visualization creators nowadays to do end-user testing; to hope for archival evidence from the 19th century is a bridge too far.

But it's certainly possible to hope for evidence; and it doesn't seem crazy to me to suggest that the nested series of squares used to be a first-order visualization technique that people could understand well, that has since withered away to the point where the only related modern form is the rectangular treemap, which is not widely used and lacks the mystical regularity of the squares.

I'm emphatically not saying that 'nested squares are a powerful visualization technique professionals should use more.' Unless your audience is a bunch of Sufi mystics just thawed out of a glacier in the Elburz mountains, you're probably better off with a bar chart. I am saying that maybe they used to be; that our intuitions about how much more natural a hierarchical tree is might be just as incorrect as our intuitions about whether left-to-right or right-to-left is the better direction to organize text.

From the data visualization science side, this stuff may be interesting because it helps provide an alternative slate of subjects for visualization research. Psychometry more generally knows it has a problem with WEIRD (Western, educated, industrialized, rich and democratic) subjects. The data visualization literature has to grapple with the same problem; and since Tufte (at least) it's looked to its own history as a place to find the conditions of possibility. If it's possible to change what people are good at reading, that both suggests that "hard" dataviz might be more important than "easy" dataviz, and that experiments may not run long enough (decades?) to tell if something works. (I haven't seen this stuff in the dataviz literature, but I also haven't gone looking for it. I suspect it must exist in the medical visualization literature, where there are wars about whether it's worthwhile to replace old color schemes in, say, an MRI readout that are perceptually suboptimal but which individual doctors may be )

From the historical side, it suggests a lot of interesting alignments with the literature. The grid of the survey system or Peabody's maps is also the "grid" Foucault describes as constitutive of early modern theories of knowledge. The epistemologies of scientific image production in the 19th century are the subject of one of the most influential history of science books of the last decade, Daston and Galison's Objectivity. The intersections are rich and considerably more explored, from what I've seen, well beyond history of science into fields like communications. I'd welcome any references here, too, particularly if they're not to the established, directly relevant field of the history of cartography. (Or the equally vast field of books Tony Grafton wrote.)

That history of science perspective was well represented at Columbia, but an equally important discipline was mostly absent. These questions of aesthetics and reception in visualization feel to me a lot like art-historical questions; there's a useful analogy between understanding how a 19th century American read a population bump chart, and understanding how a thirteenth century Catholic read a stained glass window. But most of the people I know writing about visualization are exiles from studying either texts or numbers, not from art history. External excitement about the digital humanities tends to get too excited about interdisciplinarity between the humanities and sciences and not excited enough about bridging traditions inside the humanities; one of the most interesting areas in this field going forward may be bridging the newfound recognition of the significance of data visualization as a powerful form of political rhetoric and scientific debate with a richer vocabulary for talking about the history of reading images.

Friday, December 23, 2016

Some notes on corpora for diachronic word2vec

I want to post a quick methodological note on diachronic (and other forms of comparative) word2vec models.

This is a really interesting field right now. Hamilton et al have a nice paper that shows how to track changes using procrustean transformations: as the grad students in my DH class will tell you with some dismay, the web site is all humanists really need to get the gist.

Semantic shifts from Hamilton, Leskovec, and Jurafsky

I think these plots are really fascinating and potentially useful for researchers. Just like Google Ngrams lets you see how a word changed in frequency, these let you see how a word changed in *context*. That can be useful in all the ways that Ngrams is, without necessarily needing a quantitative, operationalized research question. I'm working on building this into my R package for building and exploring word2vec models: here, for example, is a visualization of how the use of the word "empire" changes across five time chunks in the words spoken on the floor of the British parliament (i.e., the Hansard Corpus). This seems to me to be a potentially interesting way of exploring a large corpus like this.
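
To give a flavor of the alignment step, here is a toy Python sketch of rotation-only Procrustes alignment in two dimensions. It is not Hamilton et al.'s code: real word2vec models have hundreds of dimensions, and the general orthogonal solution is usually computed with an SVD. The vectors and function names below are made up for illustration.

```python
import math


def best_rotation_angle(source, target):
    """Angle of the 2D rotation R that minimizes sum ||R s_i - t_i||^2."""
    dot = 0.0
    cross = 0.0
    for (sx, sy), (tx, ty) in zip(source, target):
        dot += sx * tx + sy * ty
        cross += sx * ty - sy * tx
    return math.atan2(cross, dot)


def rotate(points, angle):
    """Rotate each (x, y) point counterclockwise by angle radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# Invented 2D "embeddings" for the same three words in two periods.
early = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
late = rotate(early, 0.5)  # same geometry, arbitrarily rotated space

angle = best_rotation_angle(early, late)
aligned = rotate(early, angle)
```

Once the two spaces share an orientation, a word whose aligned vector still sits far from its later vector is a candidate for a genuine change in context rather than an artifact of training two separate models.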


Tuesday, December 20, 2016

OCR failures in 2016

This is a quick digital-humanities public service post with a few sketchy questions about OCR as performed by Google.

When I started working intentionally with computational texts in 2010 or so, I spent a while worrying about the various ways that OCR--optical character recognition--could fail.

But a lot of that knowledge seems to have become out of date with the switch to whatever post-ABBYY, post-Tesseract state of the art has emerged.

I used to think of OCR mistakes taking place inside of the standard ASCII character set, like this image from Ted Underwood I've used occasionally in slide decks for the past few years:




But as I browse through the Google-executed OCR, I'm seeing an increasing number of character-set issues that are more like this: handwritten numbers turned into a mix of numbers and Chinese characters.
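
One rough way to hunt for this second kind of error is to flag tokens that mix ASCII digits with CJK ideographs. Below is a minimal Python sketch, assuming whitespace-tokenized text that is expected to be in Latin script. The function names are hypothetical, the sample string is invented, and the check only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
ASCII_DIGITS = set("0123456789")


def has_cjk(token):
    """True if any character falls in the basic CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)


def has_ascii_digit(token):
    """True if the token contains at least one of 0-9."""
    return any(ch in ASCII_DIGITS for ch in token)


def suspicious_tokens(text):
    """Return tokens that mix ASCII digits with CJK ideographs."""
    return [
        token
        for token in text.split()
        if has_cjk(token) and has_ascii_digit(token)
    ]


# Invented sample line, not taken from any scanned volume.
flagged = suspicious_tokens("paid 18二5 dollars in 1825")
```

A token written entirely in Chinese is left alone, so a genuinely multilingual page would not be swamped with false alarms.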



Thursday, December 1, 2016

A 192-year heatmap of presidential elections with a y axis ordering you have to see to believe

Like everyone else, I've been churning over the election results all month. Setting aside the important stuff, understanding election results temporally presents an interesting challenge for visualization.

Geographical realignments are common in American history, but they're difficult to get an aggregate handle on. You can animate a map, but that makes comparison through time difficult. (One with snappy music is here). You can make a bunch of small multiple maps for every given election, but that makes it quite hard to compare a state to itself across periods. You can make a heatmap, but there's no ability to look regionally if states are in alphabetical order.

This same problem led me a while ago to try and determine the best linear ordering of US states for data visualizations. I came up with a trick for combining some research on hierarchical and traditional census regions, which yields the following order:

This keeps every census-defined region (large and small) in a block, and groups the states sensibly both within those groups and across them.

Applied to election results, this allows a visualization that can be read both at the state and regional level (like a map) but also horizontally across time. Here's what that looks like: if you know something about the candidates in the various elections, it can spark some observations. Mine are after the image. Note that red/blue (or orange/blue) here are not the *absolute* winner, but the relative winner. Although Hillary Clinton won the national popular vote, and she won New Hampshire in 2016, for example, New Hampshire is red because it was more Republican than the nation as a whole.
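
The relative coloring amounts to subtracting the national margin from each state's margin. Here is a minimal Python sketch, assuming two-party vote shares. The function names are hypothetical and the numbers are invented round figures, not actual returns.

```python
def relative_lean(state_rep, state_dem, national_rep, national_dem):
    """Positive means the state voted more Republican than the nation."""
    return (state_rep - state_dem) - (national_rep - national_dem)


def color(lean):
    """Red for a relative Republican lean, blue otherwise (ties go blue)."""
    return "red" if lean > 0 else "blue"


# Invented shares: the Democrat narrowly carries the state
# but wins the nation by a wider margin.
lean = relative_lean(
    state_rep=0.49, state_dem=0.51,
    national_rep=0.48, national_dem=0.52,
)
shade = color(lean)
```

The state goes red here even though the Democrat won it, which is the New Hampshire situation described above.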


Friday, September 9, 2016

The efficient plots hypothesis

I'm pulling this discussion out of the comments thread on Scott Enderle's blog, because it's fun. This is the formal statement of what will forever be known as the efficient plot hypothesis for plot arceology. Nobel prize in culturomics, here I come.

Monday, August 29, 2016

Language is biased. What should engineers do?

Word embedding models are kicking up some interesting debates at the confluence of ethics, semantics, computer science, and structuralism. Here I want to lay out some of the elements in one recent place where that debate has been playing out inside computer science.

I've been chewing on this paper out of Princeton and Bath on bias and word embedding algorithms. (Link is to a blog post description that includes the draft). It stands in an interesting relation to this paper out of BU and Microsoft Research, which presents many similar findings but also a debiasing algorithm similar to (but better than) the one I'd used to find "gendered synonyms" in a gender-neutralized model. (I've since gotten a chance to talk in person to the second team, so I'm reflecting primarily on the first paper here).

Wednesday, July 20, 2016

Why Digital Humanists don't need to understand algorithms, but do need to understand transformations

Debates in the Digital Humanities 2016 is now online, and includes my contribution, "Do Digital Humanists Need to Understand Algorithms?" (As well as a pretty snazzy cover image…) In it I lay out a distinction between transformations, which are about states of texts, and algorithms, which are about processes. Put briefly:
Put simply: digital humanists do not need to understand algorithms at all. They do need, however, to understand the transformations that algorithms attempt to bring about. If we do so, our practice will be more effective and more likely to be truly original.
It then moves into one case study: the Jockers-Swafford debate of 2015, large parts of which hung on whether the Fourier transform was a black box and how its use as a smoothing device might be understood. It's like a lot of what's on this blog, only better thought out and edited.
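
To make the transformation view concrete, here is a small Python sketch of low-pass smoothing with a discrete Fourier transform, written from the textbook definition. It is not the code at issue in that debate; `low_pass` and its `keep` parameter are names made up for illustration, and the input is assumed to be a plain list of real numbers.

```python
import cmath


def dft(values):
    """Discrete Fourier transform, straight from the definition (O(n^2))."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, v in enumerate(values))
        for k in range(n)
    ]


def inverse_dft(coeffs):
    """Inverse transform; returns the real part of each reconstructed value."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n)
            for k, c in enumerate(coeffs)).real / n
        for t in range(n)
    ]


def low_pass(values, keep):
    """Zero every frequency above `keep` cycles, then transform back."""
    n = len(values)
    coeffs = dft(values)
    # Keep frequencies 0..keep and their mirror images at the end of the list.
    filtered = [
        c if (k <= keep or k >= n - keep) else 0
        for k, c in enumerate(coeffs)
    ]
    return inverse_dft(filtered)
```

What matters for a reader is the state change: everything that wiggles faster than `keep` cycles over the length of the series is gone, whatever algorithm was used to compute it.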

The transformation/algorithm distinction is not a completely firm one, but I have found it extremely useful in a lot of research and teaching problems I've approached over the last year. So in addition to advertising that article for your consumption/fall syllabi production, I wanted to take the occasion to put on github a tiny little germ of a project to provide one-page, transformation-oriented introductions to basic text-analysis concepts that came out of using this thinking for a workshop on text analysis at the NIH in Bethesda, and describe what's in it. I'd love for anyone else to use it, fork it, whatever.

Monday, July 18, 2016

Plot arceology 2016: emotion and tension

Some scientists came up with a list of the 6 core story types. On the surface, this is extremely similar to Matt Jockers's work from last year. Like Jockers, they use a method for disentangling plots that is based on sentiment analysis, justify it mostly with reference to Kurt Vonnegut, and choose a method for extracting ur-shapes that naturally but opaquely produces harmonic-shaped curves. (Jockers uses the Fourier transform, and the authors here use SVD.) I started writing up some thoughts on this two weeks ago, stopped, and then got a media inquiry about the paper so thought I'd post my concerns here. These sort of ramp up from the basic but important (only about 40% of the texts they are using are actually fictional stories) to the big one that ties back into Jockers's original work: why use sentiment analysis at all? This leads back into a sort of defense of my method of topic trajectories for describing plots and some bigger requests for others working in the field.

Tuesday, July 5, 2016

Nature publishes flat-earth research paper

I usually keep my mouth shut in the face of the many hilarious errors that crop up in the burgeoning world of datasets for cultural analytics, but this one is too good to pass up. Nature has just published a dataset description paper that appears to devote several paragraphs to describing "center of population" calculations made on the basis of a flat earth.

Monday, May 30, 2016

Literary Dopplegängers and interestingness

I started this post with a few digital-humanities posturing paragraphs: if you want to read them, you'll encounter them eventually. But instead let me just get to the point: here's a trite new category of analysis that wouldn't be possible without distant reading techniques, and that produces sometimes charmingly serendipitous results.

I'll call it dopplegänger books. A dopplegänger is, for any world-historically great work of literature, a book that shares many of the same themes, subjects, and language, but is comparatively obscure, not widely read, and--most likely--of surpassingly mediocre quality.

Edit: Ryan Cordell informs me privately and regretfully that I'm wrong in some of my conclusions here. I said, "I took a grand total of one English literature class in college; does anyone expect me to be right?" But he's worried that my wrongness might reflect poorly on the field of DH, which has a history of critics straw-manning offhand blog posts into terrible representatives of the field. So let me say up front: Persons attempting to find an argument in this post will be prosecuted; persons attempting to find political advocacy in it will be banished; persons expecting me to have anything above a high-schooler's knowledge of English literature will be shot.

Take Huck Finn. In hazy recollection (I haven't read the whole book in probably 10 years), much of what seems great about it is the purely American picaresque of a vision of America. Twain's interest "is in the boy in whose mouth he puts the story, and in this boy's view of the world as it passes under his eye." Huck "is a true child of the river," and gives us a view of America seen through the eyes of "a perfect vagabond of a youngster, wandering up and down the river at his will, taking in the passing show with open mind, finding it all for to admire."

All those quotes, as you may already have guessed, are not describing Huck Finn at all, but instead come from a review of Charles Stewart's Partners of Providence (1904).


Read the book online through Hathi

The table of contents is pretty fascinatingly close to Huckleberry Finn; the reviewers note the comparison, and it's hard to imagine that the tale of a young boy's adventures up and down the river with an entertaining ethnic (here Irish) sidekick past swindlers and exhibitions and perils wasn't somehow modeled on the most famous humorist in the country.
But there are surely differences as well; I wouldn't be surprised if an in-class discussion on the racial politics of Huckleberry Finn could benefit from a brief comparison to Partners' account of "the marooning and subsequent escape of a pair of pugnacious darkies."

Across the c. 4.5 million public domain volumes in the Hathi Trust, there are a surprising number of these, many of them books that seem (based on Google searches) to languish in deserved obscurity. (I've got a set of tricks that actually make finding the pairings more feasible than running 20 trillion pairwise comparisons, but the exact mechanics of that are for another day.) But they're interesting; not in a "distant reading" way, but in that they provide some greater focus around the core texts we all read already.
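
For readers who want the brute-force baseline that those tricks avoid, here is a minimal Python sketch of vocabulary-based nearest-neighbor matching using cosine similarity over word counts. This is an assumption about the general shape of such a comparison, not the method used for this post. The titles and texts are invented and the function names are hypothetical.

```python
import math
from collections import Counter


def word_counts(text):
    """Lowercase, split on whitespace, and count the words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, library):
    """Return the title in library whose text is most similar to query."""
    query_counts = word_counts(query)
    return max(
        library,
        key=lambda title: cosine(query_counts, word_counts(library[title])),
    )


# Invented two-book "library" for illustration.
library = {
    "river": "the boy drifted down the river on a raft",
    "whale": "the captain chased the whale across the sea",
}
match = nearest("a boy and a raft on the river", library)
```

Doing this for every pair of volumes is what makes the naive version so expensive, since the number of comparisons grows with the square of the collection size.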


So let me just plug a few books in here and see what comes back. My criteria are just that the original book be canonical.

Huckleberry Finn

Twain is closest to himself; Huck Finn is closer to the later Tom Sawyer books than to Tom Sawyer itself, which should perhaps not be surprising.

But nearest-neighbor searching also reveals a deep vein of western boys literature. We know that this exists; the interesting questions here would probably involve the specific ways (especially dialect: these are mostly first person narratives in highly vernacular styles) that writers imitate Twain.

Publication years also provide a point of departure. All the books here were written substantially later than Huck Finn except for "Live boys in the Black Hills." So if I were going to pick any up, maybe I'd start there.


Moby-Dick

This has fewer straightforward imitators; but the whaling novel is a perfectly well-represented genre.
The closest match is the romance "The Red Eric; or, The whaler's last cruise. A tale" from 1883. Some elements of the contents are provocative, at least; but the similarities are less than perfect. (Red Eric's captain's "insane resolution" is to bring his daughter on a whaling cruise with him, for example).




A few other options include a collection of sea stories,


The Cruise of the Cachalot and Sea-wrack, by Frank Bullen, offer some of the more interesting comparisons. Properly shuffled, it makes sense that Moby Dick's closest companions might include not literature at all, but piecemeal miscellanea from the magazines like this ("Sea-Wrack")


Middlemarch

Middlemarch is somewhat harder to find close matches for uninteresting reasons: since the novel is so long, it was frequently chopped into 2, 3, or 4 parts; and each one of those sections ranks highly on the list.
The nearest novels are by Dinah Craik, who I don't know, but who seems well enough established as a poor man's George Eliot in the scholarly literature. (Googling quickly brought me to the online version of Sally Mitchell's monograph on the author.) "Hannah", the closest, is characterized by Mitchell as "a one-issue novel with a narrow legislative aim."

Fraternity; a romance ... (1910) is a harder nut to crack. It's a rural novel set in Wales and published by Macmillan around 1888, but the only surviving digital copy was (according to library metadata) published in the United States in 1910. (Galsworthy's 1911 novel Fraternity further muddies things here.) It's the subject of a strikingly positive review in the Boston press that explicitly casts it as a diamond in the rough.



I was going to let it go there, but then discovered a whole separate track via this book. The author is one Miss M. M. Holland Thomas, and the novel somehow attracted the intense admiration of JP Morgan for its message of social reform through benevolent patronage. (It is Morgan who paid for the American reprint in 1910.) Does this story have anything to do with a similarity to Middlemarch? Hmm. There's definitely something here about the connections between the English social novel and political intentions. But beyond that, I couldn't say.


The Education of Henry Adams

The absolute closest match is his brother's autobiography. Which should surprise no one, and I'm sure I've encountered the book before. "Early Memories" by Henry Cabot Lodge is also high on the list, which is probably a decent choice as well. But I'll pick as the dopplegänger Cambridge Sketches by Frank Preston Stearns, which hits a number of the same points.

The Souls of Black Folk

A real genre-bender of a book, even more than Moby Dick. And even less often reprinted.

The closest match is a fairly dull-seeming hagiography of Booker T. Washington. But I'll take as a shadow "Up stream: an American chronicle" by Ludwig Lewisohn. It seems to be the personal memoir of a German-born Jew who grew up in Charleston, SC before attending Columbia and (eventually) becoming a founding faculty member at Brandeis. The grounds for similarity aren't entirely clear--perhaps some odd combination of self-recognition, music, and the South?--but that's what makes it an interesting track. Some of the 


Autobiography of an ex-colored man

On the topic of great Af-Am literature. This one was suggested to me as a candidate by John Reuland. For this one I'm pasting in a longer list of matches, because we were initially very disappointed at the results. (Very little African American literature on the list).

But on looking at the list, what there is is an extraordinary amount of autobiographical self-help literature about money. So maybe there's some lesson to be gleaned there.


OK, that's enough.

Portrait of the Artist as a Young Man

Again, the matches aren't as clear; a vocabulary-based approach like mine works best with thematically distinct subjects like riverboats, not with "childhood."

There are some vaguely interesting similarities: at #3, I particularly like "What to read at Winter Entertainments," in which it appears the closest antecedent to Joyce is a stuffed-together hodgepodge of great British writers from the 19th century. Sounds about right.

But as a dopplegänger, I'll take Shaw Desmond's Gods, which seems to cover similar places in the Irish experience of the early 20th century.


On Interestingness

I've been thinking about Ted Underwood's "old-fashioned, shamelessly opinionated, 1000-word blog post" from yesterday. There are parts I wholeheartedly agree with, such as the section where he dances near to, but decorously avoids citing, Kieran Healy's magnum opus on what calls for nuance do in contemporary academic discourse. There are parts I don't; I'm increasingly convinced that efforts to apply and invent novel algorithmic practices should be fully central to the work of some humanists, and that calls to return to the primary questions of the disciplines are not just premature but somewhat misguided.*

(Roughly, although I should boil this up into a richer stew at some point: very few people outside a philosophy department think that only academic philosophers should do philosophy; very few people *inside* history departments think that only academic historians should do history. Just as we let political philosophy flourish in politics departments and cultural history flourish in art and music departments, computer programming shouldn't be the sole province of computer science departments.)

Is this interesting? I'm not sure. It's not here-I-come-PMLA interesting, for sure. But then again, I've never deliberately sought out much contemporary literary history written since 1980 or so. For a certain sort of Arnoldian prudish conception of literature, I kind of like the game. Much like my anachronism-searching blog posts, it's a field-and-context approach to literature where the whole is not treated as the object of study itself (the stated purpose of much "distant reading") but as a conveniently large wall on which to reposition the works of literature we're already interested in. What that means for literary history, I think I'm under no professional obligation to say.

Bonus links

A little bonus for those who read through to the end; a temporary link to a live interface to the engine I used for this thing, so you can play along at home. Just go to http://benschmidt.org/similarities/ and you can paste in any text you're interested in. Terms and conditions are: don't link to that page, because this may not scale; and e-mail me or post in the comments if you find any terrible bugs or interesting matches.