1. What do you mean by collisions in a hash table? How can they be avoided?
A collision in a hash table occurs when two different keys hash to the same index in the array. Since each index is meant to store one entry, a collision must be resolved.
Collisions cannot be avoided entirely, but they can be reduced and handled in several ways (a sketch of the first approach follows this list):
- Chaining: Store multiple elements at the same index in a linked list or another secondary structure.
- Open Addressing: Probe for the next available slot using methods such as linear or quadratic probing.
- Double Hashing: A form of open addressing that uses a second hash function to determine the probe step.
- Rehashing: Resize the hash table to reduce the load factor.
- Better Hash Functions: Use functions that distribute keys more uniformly.
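As an illustration, here is a minimal chaining hash table in Python; the class name `ChainedHashTable`, the fixed bucket count, and the use of Python's built-in `hash()` are assumptions for this sketch, not any particular library's API.

```python
class ChainedHashTable:
    def __init__(self, capacity=8):
        # One list ("chain") per slot; colliding keys share a chain.
        self.buckets = [[] for _ in range(capacity)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                # same key: overwrite in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))     # new key (possibly a collision): extend the chain

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("alice", 1)
table.put("bob", 2)       # may hash to the same bucket; the chain absorbs it
print(table.get("bob"))   # -> 2
```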
2. What are the ways to detect outliers in data?
Outliers can be detected using:
- Statistical methods (illustrated in the sketch after this list):
  - Z-score: values with |z| > 3 are often treated as outliers.
  - IQR (interquartile range): values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.
- Visualization techniques:
  - Box plots and scatter plots highlight extreme values.
- Machine learning techniques:
  - Isolation Forest
  - One-Class SVM
  - DBSCAN clustering
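A minimal sketch of the two statistical rules in pandas; the toy ages and the conventional 3 and 1.5 thresholds are assumptions for the example:

```python
import pandas as pd

# Toy ages with one extreme value (assumed for illustration).
s = pd.Series([18, 19, 20, 20, 21, 22, 45], name="age")

# Z-score rule: flag |z| > 3. With only 7 points the extreme value
# inflates the standard deviation, so this rule misses it here;
# a known weakness of z-scores on small samples.
z = (s - s.mean()) / s.std()
print(s[z.abs() > 3])  # empty

# IQR rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
print(s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)])  # flags 45
```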
3. What are different ways to deal with outliers?
Outliers can be handled in several ways depending on the context (a capping sketch follows this list):
- Remove them if they result from errors or are not relevant to the question at hand.
- Cap them at chosen percentiles (e.g., the 1st and 99th), also known as winsorizing.
- Transform the data (log, square root, etc.) to reduce their influence.
- Impute: treat an erroneous outlier like a missing value and replace it with a statistical measure such as the median.
- Use robust models, such as decision trees, that are less sensitive to outliers.
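A small sketch of percentile capping and a log transform; the toy values are assumptions, while `clip` and `log1p` are standard pandas/NumPy calls:

```python
import numpy as np
import pandas as pd

# 100 typical values plus one extreme outlier (assumed for illustration).
s = pd.Series(list(range(100, 200)) + [9000])

# Percentile capping (winsorizing): clamp to the 1st-99th percentile range.
lo, hi = s.quantile(0.01), s.quantile(0.99)
capped = s.clip(lower=lo, upper=hi)
print(capped.max())  # 199.0: the outlier is clamped to the 99th percentile

# Log transform: compresses the long right tail instead of removing points.
logged = np.log1p(s)
```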
4. What are some key skills required for a data analyst?
Core skills for a data analyst include:
- Technical skills:
  - SQL
  - Python or R
  - Excel
  - Data visualization tools (e.g., Tableau, Power BI, Matplotlib)
- Analytical skills:
  - Problem-solving
  - Critical thinking
  - Statistical knowledge
- Communication:
  - The ability to convey insights clearly to non-technical stakeholders
5. What is the data analysis process?
The typical data analysis process involves the following steps (a short pandas sketch of the cleaning and exploration steps follows this list):
- Define the objective: Understand the business problem.
- Data collection: Gather data from various sources.
- Data cleaning: Handle missing, duplicate, or incorrect values.
- Data exploration: Perform EDA (Exploratory Data Analysis) to understand patterns.
- Analysis/modeling: Use statistical or machine learning models to analyze.
- Interpretation: Derive insights and patterns.
- Visualization & reporting: Present findings using graphs, charts, or dashboards.
- Action: Make data-driven decisions or recommendations.
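As a small illustration of the cleaning and exploration steps, a first pass in pandas might look like this; the toy columns stand in for collected data, which a real project would load from files or a database:

```python
import pandas as pd

# Stand-in for collected data (values assumed for illustration).
df = pd.DataFrame({
    "region": ["N", "S", "S", "N", "N"],
    "units":  [10, 12, 12, None, 9],
    "price":  [2.5, 2.4, 2.4, 2.6, 2.5],
})

# Data cleaning: quantify missing values, drop exact duplicates.
print(df.isna().sum())
df = df.drop_duplicates()

# Data exploration (EDA): summary statistics and pairwise correlations.
print(df.describe())
print(df.corr(numeric_only=True))
```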
6. What challenges are faced during data analysis?
Data analysts often face the following challenges:
- Missing or Incomplete Data: Many datasets have null or missing values.
- Inconsistent Data Formats: Variations in how data is recorded (e.g., dates, currencies).
- Duplicate Records: Multiple entries for the same entity.
- Large Volumes of Data: Handling and processing big data efficiently.
- Data Integration: Combining data from multiple sources can be complex.
- Outliers and Noisy Data: These can skew analysis if not handled properly.
- Security and Privacy Concerns: Especially with sensitive or personal data.
7. What is data cleansing?
Data cleansing (or cleaning) is the process of identifying and correcting errors or inconsistencies in the data to improve its quality.
Key tasks in data cleansing (illustrated in the sketch after this list):
- Removing duplicate records
- Fixing inconsistent formats
- Filling or removing missing values
- Correcting spelling or data entry errors
- Validating against rules or external standards
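A minimal pandas sketch of several of these tasks on toy data; the column names and values are assumptions for the example:

```python
import pandas as pd

# Toy data with the usual problems: duplicates, inconsistent casing, a missing value.
df = pd.DataFrame({
    "name": ["Alice", " alice", "Bob", "Bob"],
    "city": ["NYC", "nyc", "LA", "LA"],
    "age":  [30, 30, None, 28],
})

# Fix inconsistent formats: trim whitespace, normalize casing.
df["name"] = df["name"].str.strip().str.title()
df["city"] = df["city"].str.upper()

# Fill missing values with a robust statistic.
df["age"] = df["age"].fillna(df["age"].median())

# Remove duplicate records (only detectable after normalization).
df = df.drop_duplicates()

# Validate against a simple rule: ages must be plausible.
assert df["age"].between(0, 120).all()
```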
8. What are the tools useful for data analysis?
Popular tools used by data analysts include:
- Spreadsheet tools: Microsoft Excel, Google Sheets
- Database tools: SQL, PostgreSQL, MySQL
- Programming languages: Python (Pandas, NumPy, Matplotlib), R
- Visualization tools: Tableau, Power BI, Looker
- Data cleaning tools: OpenRefine, Trifacta
- Big data platforms: Apache Spark, Hadoop (for large-scale analysis)
9. What is the difference between data mining and data profiling?
| Aspect | Data Mining | Data Profiling |
|---|---|---|
| Purpose | Discover hidden patterns and insights | Understand structure, quality, and content |
| Focus | Predictive analysis, pattern recognition | Data assessment and summary statistics |
| Techniques used | ML algorithms, clustering, association rules | Metadata analysis, frequency/distribution checks |
| Output | Actionable insights and trends | Data quality reports |
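For contrast, a quick profiling pass in pandas might look like the sketch below (mining would instead fit models to the same data); the toy table and column names are assumptions:

```python
import pandas as pd

# Toy customer table (values assumed for illustration).
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "country": ["US", "US", "DE", None],
    "revenue": [120.0, 85.5, 99.9, 110.0],
})

# Profiling: structure, completeness, and value distributions.
print(df.dtypes)                     # column types
print(df.isna().mean())              # fraction missing per column
print(df.describe(include="all"))    # summary statistics for every column
print(df["country"].value_counts())  # frequency check on a categorical column
```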
10. Which validation methods are employed by data analysts?
Common data validation methods include (see the sketch after this list):
- Cross-validation (e.g., k-fold): For evaluating model performance.
- Holdout method: Splitting data into training and test sets.
- Manual validation: Spot-checking data samples for accuracy.
- Data type checks: Ensuring consistency in data types (e.g., string, integer).
- Range and constraint checks: Ensuring values fall within valid ranges.
- Business rule validation: Ensuring data follows specific domain rules.
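A minimal sketch of some of these checks in Python; the toy columns, thresholds, and the linear model are assumptions, while `cross_val_score` is scikit-learn's helper for k-fold evaluation:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    "qty":   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "price": [10, 21, 28, 41, 52, 58, 72, 79, 92, 101],
})

# Data type check: the column must be numeric.
assert pd.api.types.is_numeric_dtype(df["qty"])

# Range and constraint checks: values must fall in valid ranges.
assert df["qty"].between(1, 100).all()
assert (df["price"] > 0).all()

# k-fold cross-validation: score a model on 5 train/test splits.
scores = cross_val_score(LinearRegression(), df[["qty"]], df["price"],
                         cv=5, scoring="neg_mean_absolute_error")
print(scores.mean())
```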
11. What is an outlier?
An outlier is a data point that differs significantly from other observations. It may be due to variability in the data or an error.
Example: In a dataset of student ages where most are between 18–22, a value of 45 could be an outlier.
Why outliers matter:
- They can skew statistical analyses.
- May indicate errors or rare, valuable events.
- Require careful evaluation before removal or treatment.
12. Write a SQL query to retrieve customers who made purchases on two consecutive days.
```sql
-- For each customer, pair every order date with that customer's
-- previous order date using the LAG window function.
WITH cte AS (
    SELECT
        id,
        order_date,
        LAG(order_date) OVER (
            PARTITION BY id          -- restart the window per customer
            ORDER BY order_date      -- walk orders chronologically
        ) AS prev_order_date
    FROM Orders
)
SELECT
    id,
    order_date,
    prev_order_date
FROM cte
-- A gap of exactly one day means consecutive days. DATEDIFF(DAY, ...) is
-- T-SQL syntax; MySQL would use DATEDIFF(order_date, prev_order_date) = 1.
WHERE DATEDIFF(DAY, prev_order_date, order_date) = 1;
```