The BHSA corpus is the sort of thing that brings tears to your eyes: a morphologically annotated text of the Hebrew Bible, made available for free. I don't know the faith commitments of the people behind it, but they certainly qualify as linguistic saints. :-)
Two things motivate this web page. First, the documentation shifts around and is in an uneven state, so it is a little hard to get started with the data from that perspective. Second, the corpus is created using software called TextFabric, which is written in Python. I don't know Python, and although I occasionally think that I should take the plunge, I think on the whole it'd be better for me to invest more time in C++. So my goal is to get the data into a SQLite database.
The software I wrote here reads directly from the TextFabric files. Of course the original Python also reads the data files and puts them into sensible data structures. The Python API seems to change from time to time (and this has invalidated at least one previous attempt to put the data into SQLite format), whereas the file formats seem to be more stable. So I thought it best to read from the file formats.
I found it was fairly easy to install TextFabric, following the instructions on their web site. The only real complication I faced was that initially I downloaded the 32-bit version of Python rather than the 64-bit version; TextFabric requires the 64-bit version. Once you follow their instructions, you can open up a browser for the data with this command:
The installation downloads the corpus for you automatically. For me, it installed the data here:
There are several corpora available, but as we're only talking about the BHSA corpus, the data we want are here:
All of the files have a .tf extension. The data is in plain text format in those files.
otype.tf is a very important one. It contains the following data. The labels refer to types of data: word, book, chapter, clause, etc., are all different object types.
    1-426584 word
    426585-426623 book
    426624-427552 chapter
    427553-515673 clause
    515674-606361 clause_atom
    606362-651541 half_verse
    651542-904748 phrase
    904749-1172289 phrase_atom
    1172290-1236016 sentence
    1236017-1300541 sentence_atom
    1300542-1414353 subphrase
    1414354-1437566 verse
    1437567-1446799 lex
The numbers are node indices. You know a node's type by looking at its index. Nodes 1, 2, 3,...426584 have the type word. Nodes 426585, 426586, 426587,...426623 have the type book.
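The range lookup described above can be sketched in a few lines of C++. This is only an illustration, not the code from my repository: the names TypeRange, parseTypeRanges and typeOf are made up for the example, and it assumes each data line looks like the ones quoted above (a range, whitespace, a type name), with any header lines starting with @.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One data line of otype.tf: an inclusive node range and its object type.
// (Hypothetical struct, introduced for this example only.)
struct TypeRange {
    long first;
    long last;
    std::string type;
};

// Parse lines of the form "426585-426623 book".
// Assumption: header lines start with '@'; those and blank lines are skipped.
// Lines that do not match the expected shape are ignored.
std::vector<TypeRange> parseTypeRanges(const std::string& text) {
    std::vector<TypeRange> ranges;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '@') continue;
        std::istringstream fields(line);
        TypeRange r;
        char dash = 0;
        if (fields >> r.first >> dash >> r.last >> r.type && dash == '-') {
            ranges.push_back(r);
        }
    }
    return ranges;
}

// Return the object type whose range contains the node index,
// or an empty string if no range matches.
std::string typeOf(const std::vector<TypeRange>& ranges, long node) {
    for (const TypeRange& r : ranges) {
        if (node >= r.first && node <= r.last) return r.type;
    }
    return "";
}
```

With the ranges quoted above, typeOf(ranges, 1) gives word and typeOf(ranges, 1414354) gives verse. A linear scan is plenty here, since there are only thirteen ranges.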
The data for each object type is in the other .tf files. The format of a .tf file is pretty nicely documented. For our purposes, there are a few things to note. Each file is identified on the first line as one of three types.
@node files are the majority. This is where the information is.
@edge files represent connections between nodes. There are a few of these. (In the BHSA corpus, it happens that edges are not labeled.)
@config files just have metadata.
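Telling the three kinds of file apart only takes a look at the first line. Here is a minimal sketch, assuming the first line is exactly @node, @edge or @config (possibly with a Windows line ending). FeatureKind and the classify functions are names invented for this example, not part of TextFabric or of my repository.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical enum for the three file kinds described above.
enum class FeatureKind { Node, Edge, Config, Unknown };

// Decide the kind of a .tf file from the text of its first line.
FeatureKind classifyFirstLine(std::string line) {
    // Tolerate a trailing carriage return from Windows line endings.
    if (!line.empty() && line.back() == '\r') line.pop_back();
    if (line == "@node") return FeatureKind::Node;
    if (line == "@edge") return FeatureKind::Edge;
    if (line == "@config") return FeatureKind::Config;
    return FeatureKind::Unknown;
}

// Read only the first line of a stream (e.g. an opened .tf file) and classify it.
FeatureKind classifyStream(std::istream& in) {
    std::string first;
    if (!std::getline(in, first)) return FeatureKind::Unknown;
    return classifyFirstLine(first);
}

// Convenience wrapper for text that is already in memory.
FeatureKind classifyText(const std::string& text) {
    std::istringstream in(text);
    return classifyStream(in);
}
```

In a real program you would pass a std::ifstream opened on the .tf file to classifyStream and skip the file when the result is Config or Unknown.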
Each of these files represents what TextFabric calls a feature. Something that was non-intuitive to me was that different object types can have the same feature. For instance, the file book.tf has values for more than one object type, chapter among them.
The features are all documented on the BHSA web site (currently, under “Features” at the bottom of the menu on the left).
I have put my code in a GitHub repository. (It's written in C++ with the Qt framework.) In brief:
Each object type in otype.tf has its own table.
Each @node-type file is a column in the appropriate table.
Each @edge-type file also gets its own table.
@config-type files are ignored.
In each node table, the column _id is unique, and corresponds to the node index.
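To make the mapping above concrete, here is a rough sketch of the kind of SQL statements it implies. This is not the code from the repository: the three helper functions are hypothetical, and the column types (INTEGER for node indices, TEXT for feature values) are assumptions made for the example. Only the names _id, from_node and to_node come from the tables described on this page.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: one table per object type, keyed by node index.
// Identifiers are wrapped in double quotes; names containing a double
// quote are not handled in this sketch.
std::string createNodeTableSql(const std::string& objectType) {
    return "CREATE TABLE \"" + objectType + "\" (_id INTEGER PRIMARY KEY);";
}

// Hypothetical helper: one column per @node feature, added to the table
// of the object type it applies to. TEXT is an assumed column type.
std::string addFeatureColumnSql(const std::string& objectType,
                                const std::string& feature) {
    return "ALTER TABLE \"" + objectType + "\" ADD COLUMN \"" + feature + "\" TEXT;";
}

// Hypothetical helper: one table per @edge feature, with one row per
// connection between two nodes.
std::string createEdgeTableSql(const std::string& feature) {
    return "CREATE TABLE \"" + feature + "\" (from_node INTEGER, to_node INTEGER);";
}
```

For example, createNodeTableSql("word") followed by addFeatureColumnSql("word", "gloss") would give a word table with _id and gloss columns, and createEdgeTableSql("oslots") would give a table with from_node and to_node columns.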
I am still working on understanding all of the features.
The other important file is oslots.tf, which is represented in the database with the oslots table. This table contains all of the tree-like membership data.
Example: I consult the verse table and see that Genesis 1:1 has an _id of 1414354. I query the oslots table to find the constituents of the verse:
    SELECT * FROM oslots WHERE from_node='1414354' ORDER BY to_node ASC;

Result:

    from_node  to_node
    1414354    1
    1414354    2
    1414354    3
    1414354    4
    1414354    5
    1414354    6
    1414354    7
    1414354    8
    1414354    9
    1414354    10
    1414354    11
We just have to know that when it's a verse in the from_node, it will be a word in the to_node. To make a more helpful query:
    -- lots of data
    SELECT * FROM oslots
      LEFT JOIN word ON oslots.to_node=word._id
      WHERE from_node='1414354'
      ORDER BY to_node ASC;

    -- just the highlights
    SELECT word._id, g_lex_utf8, gloss, pdp FROM oslots
      LEFT JOIN word ON oslots.to_node=word._id
      WHERE from_node='1414354'
      ORDER BY to_node ASC;

Result:

    _id  g_lex_utf8  gloss  pdp
    1  בְּ  in  prep
    2  רֵאשִׁית  beginning  subs
    3  בָּרָא  create  verb
    4  אֱלֹה  god(s)  subs
    5  אֵת  <object marker>  prep
    6  הַ  the  art
    7  שָּׁמַי  heavens  subs
    8  וְ  and  conj
    9  אֵת  <object marker>  prep
    10  הָ  the  art
    11  אָרֶץ  earth  subs
Locating a specific verse requires a bigger query:
    SELECT * FROM verse LEFT JOIN oslots LEFT JOIN word
      WHERE verse._id=oslots.from_node AND oslots.to_node=word._id
        AND book='Genesis' AND chapter='1' AND verse='1'
      ORDER BY to_node ASC;

    SELECT * FROM verse LEFT JOIN oslots LEFT JOIN word
      WHERE verse._id=oslots.from_node AND oslots.to_node=word._id
        AND OSIS='Ps.143.8'
      ORDER BY to_node ASC;

    SELECT group_concat(g_word_utf8,' ') AS verseText FROM
      (SELECT g_word_utf8 FROM verse LEFT JOIN oslots LEFT JOIN word
         WHERE verse._id=oslots.from_node AND oslots.to_node=word._id
           AND OSIS='Ps.142.8'
         ORDER BY number ASC);

    SELECT group_concat(g_word_utf8,' ') AS verseText FROM
      (SELECT g_word_utf8 FROM word WHERE OSIS=? ORDER BY number ASC);
Note that the book column in the verse table has the Latin names.
All contents © 2023 Adam Baker, except where otherwise noted.