Premise and Problem:
As engineers, one of our many tasks is to solve computing problems that affect the productivity, finances, and viability of people, projects, companies, and more. Ironically, when we solve these engineering problems we often end up solving social problems in creative and innovative ways. In the programmer community, one generally creates programs and tools to make common tasks and irritating jobs easier. One of the most common tasks that IT staff face is working with employees who do not take proper care of their documents. No matter how much one tries to educate the employees, they will invariably ignore the advice of the IT department: they work on their local hard drive, bypass backups, overwrite saves, and cause other annoyances. They will then demand that the IT staff fix their mistakes as if it were their job. This program is geared towards that painful task of retrieving older versions of documents that have been deleted, overwritten, or undesirably modified, and that were not properly backed up on offline systems.
Abstract and Objectives:
The Auto-Versioning System (hereafter referred to as AVS) is designed to be seamless to the user. If at all possible, the employees (or the user, in the case of home usage) need not be aware of it at all. It will run in the background of the operating system. Installation will be simple, and required modifications and customizations will be minimal. The system will be portable to different operating systems, file type independent, and will take up as little additional space as possible. The key points of the system are as follows:
The key word is AUTO. The system will require little to no user intervention after installation, and will run with no intrusive displays or interfaces, ideally in the background of the operating system as a service or as a daemon started at boot in a multi-user runlevel (typically 3 or 5).
The AVS will provide versioning without a massive increase in hard drive space requirements. The AVS will initially be designed for the purpose of versioning small files (< 10 MB). The AVS will scan the directories it is configured to monitor at regular intervals. It will check that there is a single backup copy of each file, and compare the last modified time of each file to that of its backup file. If no backup file is found, it will assume it is tracking a new directory or file and create a backup. This initial backup will be logged as the start of the AVS tracking. When it detects that a file has been modified, it will run an RSYNC pass on the file against the old backup version and generate Deltas of the changes made. The old backup is then deleted and a new one is created. As such, at each modification, only the Delta is conserved instead of a full version of the file. These Deltas will then be tracked by the AVS, and log files generated. Given the compact nature of the Deltas generated by the RSYNC algorithm, the increase in space for tracking beyond the initial copy of each file will be minimal.
The AVS will be file type independent. The RSYNC algorithm does not distinguish between file types. As such, the generation of Deltas will not depend on the file type, nor will the type affect their size. The tracking of changes, the user interface for managing the restoring of files, and the comparison of files will not be file type based. If necessary, an expansion of the AVS can include specific file type support to give users an increased level of access to compare changes, but this is reserved for future expansion. In addition, further compression of the Deltas might be possible to minimize space usage, such as ZIP encoding for text files, or audio/video compression for files of that nature. This step is also left as a possible future expansion.
The AVS will be portable across all major platforms. This can be accomplished either by programming in Java (which raises issues of system access levels in a Windows environment), or by using a compiled language and building against the native libraries of each system. On Windows, ideally it could run as a native service that is controlled via the Services control panel, and possibly be included in future releases of Windows. On Linux, it could run at startup in a multi-user runlevel (typically 3 or 5), and work directly with the file system, independent of the user's particular GUI.
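The scan pass described in the key points above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `scan_once`, `watch_dir`, and `backup_dir` are hypothetical, the backup directory is assumed to live outside the monitored directory, and the Delta generation step is left as a comment rather than implemented.

```python
import shutil
from pathlib import Path


def scan_once(watch_dir: Path, backup_dir: Path) -> dict:
    """Run one scan pass and return {relative path: action}.

    Actions are "new" (no backup existed, so one was created),
    "modified" (the file is newer than its backup), or "unchanged".
    All names here are hypothetical, chosen for illustration only.
    """
    actions = {}
    for src in sorted(watch_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(watch_dir)
        backup = backup_dir / rel
        if not backup.exists():
            # First sighting: assume a newly tracked file and create its backup.
            backup.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, backup)  # copy2 also copies the modification time
            actions[rel.as_posix()] = "new"
        elif src.stat().st_mtime_ns > backup.stat().st_mtime_ns:
            # A real AVS would generate and store a Delta against the old
            # backup here, before the old backup is replaced.
            shutil.copy2(src, backup)
            actions[rel.as_posix()] = "modified"
        else:
            actions[rel.as_posix()] = "unchanged"
    return actions
```

Comparing modification times keeps each pass cheap, since file contents are only read when a change is suspected. A background service would call such a function at the configured interval.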
Target Audience:
This product targets several niches of computer users. First and foremost, it targets the average home and business user who cannot be bothered with details of backup, versioning, or tracking. The AVS provides a simple, non-intrusive solution to everyday data loss issues. Next, it targets the IT staff who support employees in a medium to large business. With its logging and tracking tools and interfaces, and with remote support as an option, the AVS becomes a useful everyday tool to help recover lost data for users. Finally, the optional expansion of the AVS into a full file repository and tracking system would make it well suited to code tracking in a development environment. It provides an innovative solution for programmers and developers to help them track code changes, branches, or even graphics development, video editing, or any environment where file changes are made frequently and need to be easily indexed and retrieved. As new fields begin to grow with computer-centric solutions, the AVS will become increasingly important.
Stages of Development and Methodology:
All development will follow an engineering approach of obtaining requirements, developing a solution, implementing the solution, evaluating, revising, and repeating. All code will be written using object-oriented methodology for enhanced modularization, ease of expansion, and portability.
0. (In progress) Research, research, research. Read about VAX VMS. Read about existing solutions (CVS, SCM, RCS, etc.), problems, things they do well, things they do poorly. Consult with potential customers about desired features. Learn about file access interfaces on Win32/Linux/Unix. Learn internals of RSYNC algorithm. Research suitability of particular languages for this application.
1. Build basic shell script to track and generate copies and Deltas of files in a particular directory as they are modified in basic Linux or Win32 environment. Test. Benchmark. Repair. Revise. Document. Repeat until ready to move to stage 2.
2. Generate MD5 sums of created Deltas, and log file/event log tracking of above.
3. Create GUI to customize AVS options.
4. Create interface and write scripts to parse log files, and track changes.
5. Add restore feature for creation of older file versions based on previous deltas. (Reverse RSYNC pass)
6. Compile above into single package. Test. Benchmark. Repair. Revise. Document. Repeat until ready to move on.
7. Compile for alternate operating systems including Win32, Unix, Portables?, etc.
8. Distribute Beta for testing in user environments. Conduct bug tracking and repairs as incidents are reported from users. Modify, update, test, repeat until ready to move on to finalized product release.
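The Delta generation and restore stages above can be illustrated with a small sketch. Note that this is not the RSYNC algorithm: it swaps in Python's standard `difflib.SequenceMatcher` as a stand-in differ, purely to show the idea of keeping the newest copy plus a reverse Delta that rebuilds the previous version. The function names and the op format are assumptions made for this example.

```python
from difflib import SequenceMatcher


def make_reverse_delta(new: bytes, old: bytes) -> list:
    """Describe how to rebuild `old` from `new` (hypothetical format).

    Each op is ("copy", start, end), meaning reuse new[start:end], or
    ("data", literal) for bytes that exist only in the old version.
    """
    ops = []
    matcher = SequenceMatcher(None, new, old, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("data", old[j1:j2]))
        # "delete": bytes present only in `new` are simply not copied
    return ops


def restore(new: bytes, delta: list) -> bytes:
    """Apply a reverse Delta to the newest copy to recover the older one."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += new[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


if __name__ == "__main__":
    v1 = b"The quick brown fox"
    v2 = b"The quick red fox jumps"
    delta = make_reverse_delta(v2, v1)  # keep v2 in full, v1 only as a Delta
    print(restore(v2, delta) == v1)
```

Only the literal `"data"` segments cost extra space, so a small edit to a file produces a small Delta. Older versions can be reached by applying the stored Deltas one after another, newest first.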
Optional Stages:
1. One of the base design difficulties is the requirement to double the space allocated to all AVS-tracked files. In principle, this sounds like a lot of wasted space, but in practice, the majority of user files on current, large hard drive systems are actually small. So long as the AVS is configured to back up only user files and not system or program files, which do not require version tracking (at a basic usage level at least), the size increase will be minimal. However, AVS tracking of large user files will still not be possible initially. Therefore, a first possible expansion would be to redesign with support for large files (> 10 MB). Currently, duplication of large files would not be possible, as it would put too much load on the system and would increase time and space requirements beyond an acceptable scope. A possible implementation for large files would be to intercept all Write calls at the OS level and track them. Changes in files could then be reconstituted by applying or reversing Write calls to any file at any particular point. This design requires privileged, low-level OS access, which could be difficult to obtain. It also removes the need for RSYNC passes altogether.
2. Further compress Deltas for increased conservation of space on the system. A basic possibility is to do a quick “detection” pass on the Deltas, possibly in Perl or another regex-intensive language, to see if the data is recognized as a particular type. If it is text, for example, a rapid ZIP could then be done on the Delta and it could be flagged as such in the tracking part of the AVS. When it needs to be accessed, an UNZIP can be quickly called. If the Delta is audio, a lossless audio compression could be applied and then reversed when access is needed; a lossy format such as MP3 would not be suitable, since it cannot be reversed exactly. The difficulty is in detecting the file type and conducting the appropriate compression.
3. Expand into a backup system. Although conventional backup is not the primary purpose of the AVS, it would provide a novel solution that minimizes space requirements on existing backup systems and provides increased control over the tracking of previous versions of files. Interface with existing solutions to create seamless backup of Deltas and of the tracking and monitoring history.
4. Create a user interface for comparing changes. The current system would only show Deltas of the different files; it would not allow the user to perform the proverbial “side by side” look at the files. This optional expansion would almost definitely require specific file type support. Starting with basic file types, create a side-by-side interface where the user could read the file in any incarnation of changes along with any other incarnation of changes, or possibly even more than two at once. For text, this layout would be easy, and could even implement some form of user-customizable highlighting. Providing support for standard document and spreadsheet formats (Microsoft, Corel, OpenOffice, etc.) would be slightly more complicated, but feasible. Audio and video files would be a challenge. The easiest solution would likely be to pipe the files and Deltas into existing, possibly user-selected programs already designed for that particular file type, as opposed to designing these programs from scratch.
5. Deploy into a commercial solution. This project is intended to be shifted to a commercial environment after completion, to be headed by MekTek Solutions. Once a version is released to the public, feedback from the field will start a constant cycle of refining, updating, patching, and developing new features for customers. There is an option of contacting operating system manufacturers and promoting interest in having the product included directly in their systems. Promotion by reviewers, along with user guidelines, FAQs, and other online training, will increase user familiarity with the concept, ease of use, and usefulness of the AVS. All code will be released under the GPL, as GPL-based components will be used (Rsync). As such, intellectual property rights will not be an issue with this project.
6. The system can implement security. Deltas can be locked to a particular user, and require a password or other protection to access. In this manner, the changes can only be tracked by authorized users. Users can customize these features and others at install time, and modify them later if they wish.
7. The system can support remote management, restore, and tracking functionality. This modification expands the AVS for support in a large environment where it is managed by IT staff for numerous users simultaneously. Remote install, updating, restoring, and logging should all happen without any user intervention or, if at all possible, knowledge. This remote support would be coupled very closely with the expansion into a full backup system (optional stage 3).
8. Expand the MD5 sum tracking of Deltas to something more robust to reduce or eliminate any collision possibilities. Some of the options are longer hashes, a unique identifier proprietary to the AVS, or a user-customizable scheme based on the environment.
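Two of the optional stages above, compressing text-like Deltas and replacing MD5 with a longer hash, can be sketched together. This is a minimal sketch under assumptions: the "is it text" test (valid UTF-8 with no NUL bytes) is a crude heuristic, `zlib` (the DEFLATE method also used by ZIP) stands in for "a rapid ZIP", SHA-256 is an arbitrary choice of stronger default, and the record layout and function names are invented for illustration.

```python
import hashlib
import zlib


def looks_like_text(data: bytes) -> bool:
    """Crude heuristic (an assumption): valid UTF-8 and no NUL bytes."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def pack_delta(delta: bytes, digest: str = "sha256") -> dict:
    """Build a hypothetical tracking record for one Delta."""
    compressed = looks_like_text(delta)
    return {
        "compressed": compressed,  # flag kept so the Delta can be unpacked later
        "payload": zlib.compress(delta) if compressed else delta,
        "digest_name": digest,     # configurable per environment
        "digest": hashlib.new(digest, delta).hexdigest(),
    }


def unpack_delta(record: dict) -> bytes:
    """Reverse pack_delta and verify the Delta against its stored digest."""
    payload = record["payload"]
    data = zlib.decompress(payload) if record["compressed"] else payload
    if hashlib.new(record["digest_name"], data).hexdigest() != record["digest"]:
        raise ValueError("Delta failed its integrity check")
    return data
```

Storing the digest of the uncompressed Delta means the same check works whether or not compression was applied. Passing `digest="md5"` would reproduce the behavior of the base design where the platform still permits MD5.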
Milestones
| Date | Stage | Description of Deliverables |
| --- | --- | --- |
| Sept. 27, 2004 | 0 | Two page summary of existing solutions, pros and cons of each. One page summary of RSYNC algorithm. One page summary of Win32/Linux file system access layer. One page discussion of implementation language suitability. One page summary of customer requirements and desired features. |
| Oct. 11, 2004 | 1 | Basic shell script to track and generate copies and Deltas of files in a particular directory as they are modified in basic Linux or Win32 environment. One page summary of challenges faced, new problems discovered, and obstacles to overcome. Half page documentation of usage. |
| Oct. 18, 2004 | 2 | MD5 hash generation and tracking of deltas via log files and event logs. One page summary of challenges faced, new problems discovered, and obstacles to overcome. Half page documentation of usage. |
| Nov. 8, 2004 | 3 | GUI for user customization of AVS settings, log files, active versioning locations. One page summary of challenges faced, new problems discovered, and obstacles to overcome. Half page documentation of usage. |
| Nov. 22, 2004 | 4 | Scripts to parse log files and track and inventory changes made to AVS-monitored documents. One page summary of challenges faced, new problems discovered, and obstacles to overcome. Half page documentation of usage. |
| Dec. 13, 2004 | 5 | Restore feature for recreating older versions of files from tracked Deltas. One page summary of challenges faced, new problems discovered, and obstacles to overcome. Half page documentation of usage. |
| Jan. 10, 2005 | 6 | Single package compilation of base product for one OS. One page summary of challenges faced, new problems discovered, and obstacles to overcome. Half page documentation of usage. |
| Jan. 24, 2005 | 7 | Single package compilation of base product for Win32 and Linux. One page summary of challenges faced, new problems discovered, and obstacles to overcome. Half page documentation of usage. |
| Feb. 14, 2005 | 8 | Two page summary of feedback from customer and beta testing environments. Minor bug fixes. |
| Feb. 28, 2005 | OP | Two page discussion of possibilities for expansion to support large size files. |
| Mar. 14, 2005 | OP | Script to detect if Deltas are text-based and to enable simple zipping and unzipping of Deltas on creation/restoration. One page summary of challenges faced, new problems discovered, and obstacles to overcome. Half page documentation of usage. |
| Mar. 28, 2005 | OP | One page summary of possibility for merging with existing backup solutions such as tape drives, scheduled CD/DVD backups, and offsite servers. |
| Apr. 8, 2005 | N/A | Final deliverable is working AVS up to minimum of stage 8 on Win32 and Linux platforms, including all source code, and a minimum of five pages of documentation on installation, configuration, implementation and usage. |
Required Facilities
Win32 Operating system computer with administrator privileges.
Linux Operating system computer with root privileges.
5 Gigabytes of local hard drive space for large file support testing.