A data processing system for constructing a co-occurrence data file relating to a corpus of objects comprises a first memory and a second memory. A first module determines the size of the co-occurrence data file from an inventory of the distinct objects in the corpus. A second module divides the co-occurrence data file into blocks, each occupying a memory space at most equal to the size of a buffer block of the first memory and each matching inventoried objects with each other. A third module processes each block of the file by reading the corpus and incrementing by one a frequency count associated with objects of the block whenever those objects are grouped in the read corpus so as to satisfy a co-occurrence criterion, each such group of objects corresponding to one co-occurrence data item. At the end of each reading of the corpus, the co-occurrence data of the block and the associated non-zero frequency counts are transferred to the co-occurrence data file in the second memory.
The system exhaustively constructs, from the corpus of objects, a file able to contain a large volume of co-occurrence data. It uses a storage peripheral, as the second memory of the system, to store the file, and relies on the central memory of a processor, as the first memory, to inventory the co-occurrence data and to update the associated frequency counts. In particular, the invention enables matrix operations to be applied to data exceeding the capacity of the central memory.
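
The block-wise procedure described above can be illustrated with a minimal Python sketch. It is not the claimed implementation. It rests on several assumptions that the abstract does not fix: the corpus is a re-readable sequence of groups of string objects, the co-occurrence criterion is "two distinct objects appear in the same group", the file is modelled as an n-by-n table of pair counts of which only the upper triangle is used, and the second memory is a CSV file. The names `build_cooccurrence_file`, `read_corpus`, and `buffer_cells` are hypothetical.

```python
import csv
from itertools import combinations


def build_cooccurrence_file(read_corpus, out_path, buffer_cells):
    """Write (object, object, count) records to out_path, one block at a time.

    read_corpus  -- hypothetical callable returning a fresh iterable of groups
    buffer_cells -- assumed capacity of the in-memory buffer, in counters
    Returns the number of records written.
    """
    # Inventory of the distinct objects; it fixes the size of the file (n * n cells).
    inventory = sorted({obj for group in read_corpus() for obj in group})
    index = {obj: i for i, obj in enumerate(inventory)}
    n = len(inventory)
    if n and buffer_cells < n:
        raise ValueError("buffer must hold at least one row of n counters")

    # Divide the table into blocks of whole rows that fit in the buffer.
    rows_per_block = buffer_cells // n if n else 1
    written = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for start in range(0, n, rows_per_block):
            stop = min(start + rows_per_block, n)
            block = [[0] * n for _ in range(stop - start)]

            # One full reading of the corpus per block.
            for group in read_corpus():
                ids = sorted({index[obj] for obj in group})
                for a, b in combinations(ids, 2):  # a < b, each pair once
                    if start <= a < stop:
                        block[a - start][b] += 1

            # Transfer only the non-zero counts of this block to the file.
            for r, row in enumerate(block):
                for b, count in enumerate(row):
                    if count:
                        writer.writerow([inventory[start + r], inventory[b], count])
                        written += 1
    return written
```

In this sketch the buffer never holds more than `rows_per_block * n` counters, at the cost of re-reading the corpus once per block; a larger `buffer_cells` means fewer passes over the corpus.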