Code the magic

Friday, November 26, 2010

Column Data Store (Implemented in C language)

Overview: CDS is a column data store, where data is stored not as a collection of records but as a collection of individual columns. CDS supports all the basic operations needed for managing persistent records (select, update, and delete), along with handling of errors and memory leaks.

PROJECT COMPONENTS
Base Record
The base record is the record that we want to store in and retrieve from the CDS. Every record is given a specific name that is used throughout the software.
An example structure of a record is shown below:
struct employee {
    unsigned int emp_no;
    char emp_name[25];
    int age;
};
Initially we started our project with a fixed structure like the one above (ours was the DefenseInfo structure). After that we generalized our code so that it works for any schema file without actually writing any structure inside the cds.h file (see the code in the download below). We solved the problem by writing, in cds.h, a union of all the data types and a structure that has a member of this union type.
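The union idea can be pictured roughly as follows. This is only a sketch and not the project's actual cds.h: the names cds_type, cds_value, cds_field, cds_make_int and cds_make_string, and the 64-byte limit on strings, are all assumptions made for illustration.

```c
#include <string.h>

#define CDS_MAX_STR 64  /* assumed upper bound on a string field, not from the project */

/* The two base data types the schema file supports. */
enum cds_type { CDS_INT, CDS_STRING };

/* One value of any supported type; only one member is valid at a time. */
union cds_value {
    int  i;
    char s[CDS_MAX_STR];
};

/* A field: a tag saying which union member is valid, plus the value itself. */
struct cds_field {
    enum cds_type   type;
    union cds_value value;
};

/* Build an int field. */
static struct cds_field cds_make_int(int v)
{
    struct cds_field f;
    memset(&f, 0, sizeof f);
    f.type = CDS_INT;
    f.value.i = v;
    return f;
}

/* Build a string field; input longer than the buffer is truncated. */
static struct cds_field cds_make_string(const char *v)
{
    struct cds_field f;
    memset(&f, 0, sizeof f);            /* also guarantees a terminating '\0' */
    f.type = CDS_STRING;
    strncpy(f.value.s, v, CDS_MAX_STR - 1);
    return f;
}
```

With something like this, a record is just an array of cds_field values whose length and types come from the schema file, so no per-record struct has to be compiled in.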

The name “EMP” could be assigned to the above record structure. The first field is assumed to be the key field and is of type unsigned int.
Schema File
The schema file is a text file that describes the structure of the record. The name of the schema file is assumed to be the base record name concatenated with “.sch”. For example, if the base record name is “EMP”, then the schema file name would be “EMP.sch”. The schema file contains one row for each field of the record, showing the following information:
• Name of the field
• Size (in bytes)
• Base data type (only “int” and “string” for now)
The following is a sample EMP.sch schema file:
emp_no:4:int
emp_name:30:string
age:4:int

Note that the ‘:’ symbol is used as a delimiter.
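A row in this format can be split with a single sscanf call. The sketch below is one possible way to do it, not the project's own parser; the struct schema_field, the function parse_schema_line and the buffer sizes are assumptions for illustration.

```c
#include <stdio.h>
#include <string.h>

#define FIELD_NAME_MAX 32   /* assumed limit on field name length */

/* One row of the .sch file. */
struct schema_field {
    char name[FIELD_NAME_MAX];
    int  size;              /* size in bytes */
    char type[8];           /* "int" or "string" */
};

/* Parse one row of the form "name:size:type".
 * Returns 1 on success, 0 if the row is malformed or the type is unknown. */
static int parse_schema_line(const char *line, struct schema_field *out)
{
    /* %31[^:] reads up to 31 characters that are not the ':' delimiter. */
    if (sscanf(line, "%31[^:]:%d:%7s", out->name, &out->size, out->type) != 3)
        return 0;
    if (out->size <= 0)
        return 0;
    return strcmp(out->type, "int") == 0 || strcmp(out->type, "string") == 0;
}
```

Reading EMP.sch then comes down to calling fgets in a loop and passing each line to parse_schema_line.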
Data Files
Data in a CDS is stored in multiple binary files. For each data record type, there is one data file for each data field. In the “EMP” example, there would be 3 data files, one for each field in the base record structure. Each binary file contains multiple data field values, where each value has a fixed size.
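Because every value in a column file has the same size, the value for the n-th record sits at byte n times the field size, so it can be reached with one fseek. A minimal sketch of that arithmetic follows; the function names column_offset, column_write and column_read are assumptions and do not come from the project.

```c
#include <stdio.h>

/* Byte position of the value for record number `slot` (0-based). */
static long column_offset(long slot, size_t field_size)
{
    return slot * (long)field_size;
}

/* Write one fixed-size value into a column file. Returns 1 on success. */
static int column_write(FILE *fp, long slot, const void *value, size_t field_size)
{
    if (fseek(fp, column_offset(slot, field_size), SEEK_SET) != 0)
        return 0;
    return fwrite(value, field_size, 1, fp) == 1;
}

/* Read one fixed-size value from a column file. Returns 1 on success. */
static int column_read(FILE *fp, long slot, void *value, size_t field_size)
{
    if (fseek(fp, column_offset(slot, field_size), SEEK_SET) != 0)
        return 0;
    return fread(value, field_size, 1, fp) == 1;
}
```

Reading a whole record then means calling column_read once per column file with the same slot number.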
Data file names are obtained by appending the .dat extension to the fully qualified data field name, that is, the base record name, an underscore, and the field name as given in the schema file. For example, for the EMP base record, the data files would be:
1. EMP_emp_no.dat
2. EMP_emp_name.dat
3. EMP_age.dat
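The naming rule above can be expressed with snprintf. This is a small sketch; the helper name data_file_name is hypothetical.

```c
#include <stdio.h>

/* Build "<record>_<field>.dat" into buf.
 * Returns 1 if the whole name fit, 0 if it was truncated or an error occurred. */
static int data_file_name(char *buf, size_t buflen,
                          const char *record, const char *field)
{
    int n = snprintf(buf, buflen, "%s_%s.dat", record, field);
    return n >= 0 && (size_t)n < buflen;
}
```

Calling data_file_name(name, sizeof name, "EMP", "age") fills name with EMP_age.dat, which can then be passed to fopen.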

------------------------------------

We did this project in the C programming language. The data structures used in our project are a binary search tree (BST), hashing, and a stack. The BST is used for storing the primary key, which should be a unique integer and is generally the 1st column in your schema file (.sch). Hashing is used for storing the secondary key, which should be a string and is generally the 2nd column in your schema file; repeated secondary keys are allowed. The stack is used for storing the offset (think of it as a line number in the .dat files) of each deleted record, and this offset is reused when a new record is added.
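The stack of deleted offsets is the easiest of the three to show in a few lines. The sketch below covers only that part and is an illustration under assumptions, not the project's code: the names free_node, slot_allocator, slot_release and slot_acquire are made up, and the rule "append at the end when the stack is empty" is an assumption about how new offsets are chosen.

```c
#include <stdlib.h>

/* One entry in the stack of offsets freed by deletes. */
struct free_node {
    long offset;
    struct free_node *next;
};

struct slot_allocator {
    struct free_node *free_top;  /* top of the stack of freed offsets */
    long next_new;               /* first offset that has never been used */
};

/* Push a freed offset. Returns 1 on success, 0 if out of memory. */
static int slot_release(struct slot_allocator *a, long offset)
{
    struct free_node *n = malloc(sizeof *n);
    if (n == NULL)
        return 0;
    n->offset = offset;
    n->next = a->free_top;
    a->free_top = n;
    return 1;
}

/* Choose the offset for a new record: pop a freed one if the stack
 * is not empty, otherwise hand out the next unused offset. */
static long slot_acquire(struct slot_allocator *a)
{
    if (a->free_top != NULL) {
        struct free_node *n = a->free_top;
        long offset = n->offset;
        a->free_top = n->next;
        free(n);
        return offset;
    }
    return a->next_new++;
}
```

Reusing offsets this way keeps the .dat files from growing after every delete followed by an insert, at the cost of records no longer being stored in insertion order.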

HOW TO RUN THIS PROJECT:

We ran our project on the latest version of Ubuntu (10) with the gcc compiler.
1) Download the project (Column_data_Store_v1.0) below.
2) Extract the files into your folder (/home/CDS).
3) Now open your terminal and go to /home/CDS/Column_Data_Store_17/src.
4) Now compile main.c with gcc and run the program (add -g if you want to debug using gdb), i.e.
> gcc main.c
> ./a.out


This project was done by me and 2 of my friends as a part of the Data Management course.
Here is the download link: Column_Data_Store_v1.0
If you have any doubts, mail me at krishnateja.mattaparthi@iiitb.org
